Automated Browserless OAuth Authentication for Twitter

Introduction

My first impression on encountering the OAuth protocol was: bureaucracy meets the web. It’s understandable that users must approve access before third party applications can reach their information, but if I want to access my personal information from my own application, why do I need to complete all this “paperwork”?

User experience also suffers when the workflow forces you to jump out to the browser and back to your application. Mobile and desktop apps need alternatives to work around that. Twitter offers the xAuth API for desktop and mobile applications, but you have to send a request with “plenty of details” and may wait a long time for approval.
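For reference, here is a minimal sketch of what an xAuth call looks like with the oauth2 library (installed in the prerequisites below). The x_auth_* fields are Twitter’s documented xAuth parameters; the key, secret, and credentials are placeholders, and the request only succeeds once Twitter has granted xAuth access to your application:

import urllib
import oauth2 as oauth

# Placeholders: your application's key/secret and the user's credentials.
consumer = oauth.Consumer(key='YOUR_CONSUMER_KEY', secret='YOUR_CONSUMER_SECRET')
client = oauth.Client(consumer)

# xAuth exchanges the user's credentials directly for an access token.
resp, content = client.request(
  'https://api.twitter.com/oauth/access_token', 'POST',
  body=urllib.urlencode({
    'x_auth_username': 'username',
    'x_auth_password': 'password',
    'x_auth_mode': 'client_auth',
  }))
print content  # oauth_token=...&oauth_token_secret=... on success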

This article describes how to use the 3-legged OAuth protocol with a headless browser like HtmlUnit to get tokens from Twitter without user intervention.

The example uses HtmlUnit and Jython. If you want to use HtmlUnit under .NET I recommend looking at Using HtmlUnit on .NET for Headless Browser Automation (using IKVM). WP7 developers may also want to look at the .NET article to see if it could be applied to Silverlight.

Once you obtain the access token you can store it and reuse it in future calls. Be aware that tokens may expire, for example after a period of time. Ethically, the automated application should still ask users whether to allow or deny it access to their Twitter account.
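As a minimal sketch of reusing a stored token, assuming python-twitter 0.8.1 (installed in the prerequisites below); the key, secret, and token values are placeholders:

import twitter

# Tokens previously obtained by get_access_token.py (placeholders).
ACCESS_TOKEN_KEY    = 'saved-oauth-token'
ACCESS_TOKEN_SECRET = 'saved-oauth-token-secret'

api = twitter.Api(consumer_key='YOUR_CONSUMER_KEY',
                  consumer_secret='YOUR_CONSUMER_SECRET',
                  access_token_key=ACCESS_TOKEN_KEY,
                  access_token_secret=ACCESS_TOKEN_SECRET)

# Any authenticated call now works without repeating the browser dance.
print api.VerifyCredentials().screen_name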

Prerequisites

  1. JRE or JDK
  2. Download the latest Jython version, run the installer .jar, and install it in your preferred directory (e.g. /opt/jython).
  3. Download and decompress setuptools-0.6c11.tar.gz
  4. Go to the setuptools directory. Install the package under Jython with: sudo /opt/jython/bin/jython setup.py install
  5. Download and decompress python-twitter-0.8.1.tar.gz
  6. Look at the required dependencies for python-twitter and install them with Jython:
    1. http://cheeseshop.python.org/pypi/simplejson
    2. http://code.google.com/p/httplib2/
    3. http://github.com/simplegeo/python-oauth2
    4. You’ll need to change the following line in the file oauth2/__init__.py for Jython 2.5 compatibility:

from urlparse import parse_qs, parse_qsl

to:

try:
  from urlparse import parse_qsl, parse_qs
except ImportError:
  from cgi import parse_qsl, parse_qs


  7. Download the HtmlUnit compiled binaries from http://sourceforge.net/projects/htmlunit/files/ and decompress them under the python-twitter-0.8.1 directory (we are using HtmlUnit 2.8 for this example).
  8. Go to the python-twitter-0.8.1 directory and install the python-twitter package under Jython:
    1. sudo /opt/jython/bin/jython setup.py install
  9. Create a Twitter application for testing and get its consumer key and secret.

Example

get_access_token.py

The script below is based on the get_access_token.py example that ships with python-twitter; the following changes add the headless-browser automation.

Changes

  1. Replace consumer_key and consumer_secret with your application key/secret.
  2. Add the following imports and get_pincode function:
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def get_pincode(url, username, password):
  webclient = WebClient(BrowserVersion.FIREFOX_3_6)
  page = webclient.getPage(url)

  twitter_username_or_email = page.getByXPath("//input[@id='username_or_email']")[0]
  twitter_password = page.getByXPath("//input[@id='password']")[0]
  allow_button = page.getByXPath("//input[@id='allow']")[0]

  twitter_username_or_email.setValueAttribute(username)
  twitter_password.setValueAttribute(password)

  page = allow_button.click()

  code = page.getByXPath("//kbd/code")[0]

  return code.getTextContent()
  3. Replace:
pincode = raw_input('Pincode? ')

with:

  twitter_username = None # replace it with your twitter username
  twitter_password = None # replace it with your twitter password
  print "Geting pincode"
  pincode = get_pincode('%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token']),  twitter_username, twitter_password)
  print "pincode =", pincode

run.sh

#!/bin/sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" get_access_token.py

Complete source code

#!/usr/bin/python2.4
#
# Copyright 2007 The Python-Twitter Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys

# parse_qsl moved to the urlparse module in Python 2.6
try:
  from urlparse import parse_qsl
except ImportError:
  from cgi import parse_qsl

import oauth2 as oauth

# HTMLUnit related code
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def get_pincode(url, username, password):
  # Simulate a Firefox 3.6 browser with headless HtmlUnit.
  webclient = WebClient(BrowserVersion.FIREFOX_3_6)
  page = webclient.getPage(url)

  # Locate the login fields and the "Allow" button on the authorization page.
  twitter_username_or_email = page.getByXPath("//input[@id='username_or_email']")[0]
  twitter_password = page.getByXPath("//input[@id='password']")[0]
  allow_button = page.getByXPath("//input[@id='allow']")[0]

  # Fill in the user's credentials.
  twitter_username_or_email.setValueAttribute(username)
  twitter_password.setValueAttribute(password)

  # Click "Allow" and read the PIN code from the resulting page.
  page = allow_button.click()
  code = page.getByXPath("//kbd/code")[0]

  return code.getTextContent()

REQUEST_TOKEN_URL = 'https://api.twitter.com/oauth/request_token'
ACCESS_TOKEN_URL  = 'https://api.twitter.com/oauth/access_token'
AUTHORIZATION_URL = 'https://api.twitter.com/oauth/authorize'
SIGNIN_URL        = 'https://api.twitter.com/oauth/authenticate'

consumer_key    = None
consumer_secret = None
twitter_username = None # replace with your Twitter username
twitter_password = None # replace with your Twitter password

if consumer_key is None or consumer_secret is None:
  print 'You need to edit this script and provide values for the'
  print 'consumer_key and also consumer_secret.'
  print ''
  print 'The values you need come from Twitter - you need to register'
  print 'as a developer your "application".  This is needed only until'
  print 'Twitter finishes the idea they have of a way to allow open-source'
  print 'based libraries to have a token that can be used to generate a'
  print 'one-time use key that will allow the library to make the request'
  print 'on your behalf.'
  print ''
  sys.exit(1)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()
oauth_consumer             = oauth.Consumer(key=consumer_key, secret=consumer_secret)
oauth_client               = oauth.Client(oauth_consumer)

print 'Requesting temp token from Twitter'

resp, content = oauth_client.request(REQUEST_TOKEN_URL, 'GET')

if resp['status'] != '200':
  print 'Invalid response from Twitter requesting temp token: %s' % resp['status']
else:
  request_token = dict(parse_qsl(content))

  print ''
  print 'Please visit this Twitter page and retrieve the pincode to be used'
  print 'in the next step to obtain an Access Token:'
  print ''
  print '%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token'])
  print ''

  print "Geting pincode"
  pincode = get_pincode('%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token']), twitter_username, twitter_password)
  print "pincode =", pincode

#  pincode = raw_input('Pincode? ')  # replaced by the automated get_pincode call above

  token = oauth.Token(request_token['oauth_token'], request_token['oauth_token_secret'])
  token.set_verifier(pincode)

  print ''
  print 'Generating and signing request for an access token'
  print ''

  oauth_client  = oauth.Client(oauth_consumer, token)
  resp, content = oauth_client.request(ACCESS_TOKEN_URL, method='POST', body='oauth_verifier=%s' % pincode)
  access_token  = dict(parse_qsl(content))

  if resp['status'] != '200':
    print 'The request for a Token did not succeed: %s' % resp['status']
    print access_token
  else:
    print 'Your Twitter Access Token key: %s' % access_token['oauth_token']
    print '          Access Token secret: %s' % access_token['oauth_token_secret']
    print ''

Conclusion

We have seen how to get OAuth tokens with a headless browser. This approach can be applied to other services such as Facebook and LinkedIn. A partial list of other services you can experiment with is available at: http://wiki.oauth.net/w/page/12238551/ServiceProviders

Look at our previous article Web Scraping Ajax and Javascript Sites for more information about setting up and using HtmlUnit and Jython.

Sadly, the prerequisites require significant extra effort to get everything working, but once you have set up the development environment it’s plain sailing.

Resources

  1. OAuth articles from Eran Hammer-Lahav
  2. OAuth 2.0 for Android Applications
  3. OAuth Will Murder Your Children
  4. Do Facebook Oauth 2.0 Access Tokens Expire?
  5. OAuth2 for iPhone and iPad applications
  6. Movistar BlueVia’s official API for SMS

Photo taken by mariachily

Google Search NoAPI

History

Way back in 2001 I wanted to be able to query Google automatically. Since Google did not provide an official API, I developed a simple Google Search “NoAPI” scraper and published it as Googolplex. Google later launched a SOAP based API, but on December 20, 2006 they stopped accepting signups for it[1] and suspended it on August 31, 2009[2]. This shows that creating a service or product based on web APIs is a very risky business without an SLA contract. Google soon launched another API called the Google Ajax Web Search API[3] under a different license. This second API was suspended on November 1, 2010[4]. You may wonder if Google is a bipolar creature. You can see the latest post at Fall Housekeeping.

Google has undergone a lot of changes since 2001, and Googolplex and other libraries like xgoogle are now part of Internet history. A similar new library is presented in Mario Vilas’ blog post Quickpost: Using Google Search from your Python code.

It’s not clear why Google vacillates over what could be an additional source of revenue, but it is clear that we should expect Google to provide an official, easy-to-use API. There are ways Google could restrict abuse of its APIs by third parties. It’s very common to offer a free tier for low-volume searches and charge for more intensive use, as Yahoo BOSS does.

In this article we’ll examine one way of crawling information from AJAX/Javascript based sites.

Crawling Google As A Browser

If you go to Google and look at the HTML source code you’ll be astonished to see heavily obfuscated Javascript. Even after a search, the source is no clearer.

So, here is our code to get Google’s results using HtmlUnit and Jython (we don’t have any affiliation with them, we just like them!). Look at our Web Scraping Ajax and Javascript Sites article for more information.

google.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def query(q):
   # Simulate a Firefox 3.6 browser with headless HtmlUnit.
   webclient = WebClient(BrowserVersion.FIREFOX_3_6)
   url = "http://www.google.com"
   page = webclient.getPage(url)

   # Type the query into the search box and click the search button.
   query_input = page.getByXPath("//input[@name='q']")[0]
   query_input.text = q
   search_button = page.getByXPath("//input[@name='btnG']")[0]
   page = search_button.click()

   # Each result title is an h3 element inside the results list.
   results = page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']")

   c = 0
   for result in results:
      title = result.asText()
      # Extract the href attribute of the result's link.
      href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
      print title, href
      c += 1

   print c, "Results"

if __name__ == '__main__':
   query("google web search api")

run.sh

#!/bin/sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" google.py

Alternatives

The following search engines provide official APIs for search:

Homework

  1. Write a clean function/class to do Google queries and handle exceptions (a starting sketch follows this list).
  2. Modify the function to handle nested and paged results.
  3. Modify the function again, this time to include descriptions.
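
As a starting point for the first item, here is a minimal, hedged sketch; safe_query is a hypothetical helper that is not part of the original post, and it assumes the same HtmlUnit 2.8/Jython setup used above:

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion
import java.io.IOException as IOException

def safe_query(q):
   # Run a Google query and return a list of (title, href) tuples,
   # reporting network errors and unexpected page layouts instead of crashing.
   webclient = WebClient(BrowserVersion.FIREFOX_3_6)
   results = []
   try:
      page = webclient.getPage("http://www.google.com")
      query_input = page.getByXPath("//input[@name='q']")[0]
      query_input.text = q
      page = page.getByXPath("//input[@name='btnG']")[0].click()
      for result in page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']"):
         href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
         results.append((result.asText(), href))
   except IOException, e:
      print "Network error:", e
   except IndexError:
      print "Unexpected page layout: expected element not found"
   return results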

Final Notes

The approach taken by Mario Vilas is more API-like; our approach here is a defensive measure against NoAPIs. This is another good example of HtmlUnit doing its job.

BTW, the noapi.com domain is available[5].

See Also

  1. Extraction of Main Text Content Using the Google Reader NoAPI
  2. The Data Portability Fact Sheet

References

  1. Beyond the SOAP Search API
  2. A well earned retirement for the SOAP Search API
  3. Google AJAX Search API beta Version 1.0 Available
  4. Fall Housekeeping
  5. The noapi.com domain is available at the time of writing of this article. Register it now! (Disclaimer: affiliate link).

Additional Resources

  1. Google Search API?
  2. Google Deprecates Their SOAP Search API
  3. Google Search API Dropped
  4. Is this API going to be closed down?
  5. Yahoo BOSS Switching To Paid Model In Early 2011
  6. Thoughts on Yahoo! BOSS Monetization Announcement
  7. Google to Start Charging for Prediction API
  8. Update on Whitelisting (Twitter API policies discussion)
  9. From “Businesses” To “Tools”: The Twitter API ToS Changes

Web Scraping Ajax and Javascript Sites

Introduction

Most crawling frameworks used for scraping cannot handle Javascript or Ajax; their scope is limited to sites that serve their main content without scripting. One might be tempted to connect a specific crawler to a Javascript engine, but that is not easy to do: you need a fully functional browser with good DOM support, because browser behavior is too complex for a simple crawler-plus-Javascript-engine combination to work. There is a list of resources at the end of this article for exploring the alternatives in more depth.

There are several ways to scrape a site that contains Javascript:

  1. Embed a web browser within an application and simulate a normal user.
  2. Remotely connect to a web browser and automate it from a scripting language.
  3. Use special purpose add-ons to automate the browser.
  4. Use a framework/library to simulate a complete browser.

Each of these alternatives has its pros and cons. For example, using a complete browser consumes a lot of resources, especially if we need to scrape websites with many pages.

In this post we’ll give a simple example of how to scrape a web site that uses Javascript. We will use the HtmlUnit library to simulate a browser. Since HtmlUnit runs on a JVM we will use Jython, an excellent programming language that is a Python implementation for the JVM. The resulting code is very clear and focuses on solving the problem instead of on programming-language plumbing.
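
To give a flavor of the approach before the full example, here is a minimal sketch; it assumes the HtmlUnit classpath setup described below, and example.com is just a placeholder site:

import com.gargoylesoftware.htmlunit.WebClient as WebClient

# Fetch a page with the headless browser and print its title.
webclient = WebClient()
page = webclient.getPage("http://www.example.com")
print page.getTitleText()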

Setting up the environment

Prerequisites

  1. JRE or JDK.
  2. Download the latest version of Jython from http://www.jython.org/downloads.html.
  3. Run the .jar file and install it in your preferred directory (e.g. /opt/jython).
  4. Download the htmlunit compiled binaries from: http://sourceforge.net/projects/htmlunit/files/.
  5. Unzip the htmlunit binaries to your preferred directory.

Crawling example

We will scrape the Gartner Magic Quadrant pages at http://www.gartner.com/it/products/mq/mq_ms.jsp. If you look at the list of documents, the links are Javascript code instead of hyperlinks with http URLs. This may be to deter crawling, or just to open a popup window. Either way, it’s a very convenient page to illustrate the solution.

gartner.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
   webclient = WebClient(BrowserVersion.FIREFOX_3_6) # creating a new webclient object.
   url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
   page = webclient.getPage(url) # getting the url
   articles = page.getByXPath("//table[@id='mqtable']//tr/td/a") # getting all the hyperlinks

   for article in articles:
      print "Clicking on:", article
      subpage = article.click() # click on the article link
      title = subpage.getByXPath("//div[@class='title']") # get title
      summary = subpage.getByXPath("//div[@class='summary']") # get summary
      if len(title) > 0 and len(summary) > 0:
         print "Title:", title[0].asText()
         print "Summary:", summary[0].asText()
#     break # uncomment to stop after the first article while testing

if __name__ == '__main__':
   main()

run.sh

#!/bin/sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py

Final notes

This article is just a starting point for moving beyond simple crawlers and a pointer for further research. Since this is a simple page, it makes a clear example of how Javascript scraping works. You must do your homework to crawl more complex web pages or to add multithreading for better performance. In a demanding crawling scenario many more things must be taken into account, but that is a subject for future articles.

If you want to be polite, don’t forget to read the robots.txt file before crawling.
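
As a minimal sketch, the Python standard library’s robotparser module (also available in Jython 2.5) can check the rules before fetching; the URLs below reuse the Gartner example:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.gartner.com/robots.txt")
rp.read()

# Only crawl the page if robots.txt allows it for our user agent.
if rp.can_fetch("*", "http://www.gartner.com/it/products/mq/mq_ms.jsp"):
   print "Allowed to crawl"
else:
   print "Disallowed by robots.txt"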

If you like this article, you might also be interested in

  1. Distributed Scraping With Multiple Tor Circuits
  2. Precise Scraping with Google Chrome
  3. Running Your Own Anonymous Rotating Proxies
  4. Automated Browserless OAuth Authentication for Twitter

Resources

  1. HtmlUnit
  2. ghost.py is a webkit web client written in python
  3. Crowbar web scraping environment
  4. Google Chrome remote debugging shell from Python
  5. Selenium web application testing system
  6. Watir
  7. Sahi
  8. Windmill Testing Framework
  9. Internet Explorer automation
  10. jSSh Javascript Shell Server for Mozilla
  11. http://trac.webkit.org/wiki/QtWebKit
  12. Embedding Gecko
  13. Opera Dragonfly
  14. PyAuto: Python Interface to Chromium’s automation framework
  15. Related questions on Stack Overflow
  16. Scrapy
  17. EnvJS: Simulated browser environment written in Javascript
  18. Setting up Headless XServer and CutyCapt on Ubuntu
  19. CutyCapt: Capture WebKit’s rendering of a web page.
  20. Google Webmaster blog: A spider’s view of Web 2.0
  21. OpenQA
  22. Python Webkit DOM Bindings
  23. Berkelium Browser
  24. uBrowser
  25. Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
  26. Zombie.js
  27. PhantomJS
  28. PyPhantomJS
  29. CasperJS
  30. Web Inspector Remote
  31. Offscreen/Headless Mozilla Firefox (via @brutuscat)
  32. Web Scraping with Google Spreadsheets and XPath
  33. Web Scraping with YQL and Yahoo Pipes

Photo taken by xiffy