Automated Browserless OAuth Authentication for Twitter

Introduction

My first impression on encountering the OAuth protocol was: bureaucracy meets the web. It’s understandable that users must approve access before third party applications can reach their information, but if I want to access my personal information from my own application, why do I need to complete all this “paperwork”?

User experience also suffers when the workflow forces you to jump out to the browser and back to your application. Mobile and desktop apps need alternatives to work around that. Twitter offers the xAuth API for desktop and mobile applications, but you have to send a request with “plenty of details” and may wait a long time for approval.
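For reference, here is a minimal sketch of what an xAuth call looks like with the oauth2 library (installed in the prerequisites below). The x_auth_* fields are Twitter’s documented xAuth parameters; the key, secret, and credentials are placeholders, and the request only succeeds once Twitter has granted xAuth access to your application:

import urllib
import oauth2 as oauth

# Placeholders: your application's key/secret and the user's credentials.
consumer = oauth.Consumer(key='YOUR_CONSUMER_KEY', secret='YOUR_CONSUMER_SECRET')
client = oauth.Client(consumer)

# xAuth exchanges the user's credentials directly for an access token.
resp, content = client.request(
  'https://api.twitter.com/oauth/access_token', 'POST',
  body=urllib.urlencode({
    'x_auth_username': 'username',
    'x_auth_password': 'password',
    'x_auth_mode': 'client_auth',
  }))
print content  # oauth_token=...&oauth_token_secret=... on success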

This article describes how to use the 3-legged OAuth protocol with a headless browser like HtmlUnit to get tokens from Twitter without user intervention.

The example uses HtmlUnit and Jython. If you want to use HtmlUnit under .NET I recommend looking at Using HtmlUnit on .NET for Headless Browser Automation (using IKVM). WP7 developers may also want to look at the .NET article to see if it could be applied to Silverlight.

Once you obtain the access token you can store it and reuse it in future calls. Be aware that tokens may expire, for example after a period of time. Ethically, the automated application should still ask users whether to allow or deny it access to their Twitter account.
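As a minimal sketch of reusing a stored token, assuming python-twitter 0.8.1 (installed in the prerequisites below); the key, secret, and token values are placeholders:

import twitter

# Tokens previously obtained by get_access_token.py (placeholders).
ACCESS_TOKEN_KEY    = 'saved-oauth-token'
ACCESS_TOKEN_SECRET = 'saved-oauth-token-secret'

api = twitter.Api(consumer_key='YOUR_CONSUMER_KEY',
                  consumer_secret='YOUR_CONSUMER_SECRET',
                  access_token_key=ACCESS_TOKEN_KEY,
                  access_token_secret=ACCESS_TOKEN_SECRET)

# Any authenticated call now works without repeating the browser dance.
print api.VerifyCredentials().screen_name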

Prerequisites

  1. JRE or JDK
  2. Download the latest Jython version, run the installer .jar, and install it in your preferred directory (e.g. /opt/jython).
  3. Download and decompress setuptools-0.6c11.tar.gz
  4. Go to the setuptools directory. Install the package under Jython with: sudo /opt/jython/bin/jython setup.py install
  5. Download and decompress python-twitter-0.8.1.tar.gz
  6. Look at the required dependencies for python-twitter and install them with Jython:
    1. http://cheeseshop.python.org/pypi/simplejson
    2. http://code.google.com/p/httplib2/
    3. http://github.com/simplegeo/python-oauth2
    4. You’ll need to change the following line in the file oauth2/__init__.py for Jython 2.5 compatibility:

from urlparse import parse_qs, parse_qsl

to:

try:
  from urlparse import parse_qsl, parse_qs
except ImportError:
  from cgi import parse_qsl, parse_qs


  7. Download the HtmlUnit compiled binaries from http://sourceforge.net/projects/htmlunit/files/ and decompress them under the python-twitter-0.8.1 directory (we are using HtmlUnit 2.8 for this example).
  8. Go to the python-twitter-0.8.1 directory and install the python-twitter package under Jython:
    1. sudo /opt/jython/bin/jython setup.py install
  9. Create a Twitter application for testing and get its consumer key and secret.

Example

get_access_token.py

The script below is based on the get_access_token.py example that ships with python-twitter; the following changes add the headless-browser automation.

Changes

  1. Replace consumer_key and consumer_secret with your application key/secret.
  2. Add the following imports and get_pincode function:
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def get_pincode(url, username, password):
  webclient = WebClient(BrowserVersion.FIREFOX_3_6)
  page = webclient.getPage(url)

  twitter_username_or_email = page.getByXPath("//input[@id='username_or_email']")[0]
  twitter_password = page.getByXPath("//input[@id='password']")[0]
  allow_button = page.getByXPath("//input[@id='allow']")[0]

  twitter_username_or_email.setValueAttribute(username)
  twitter_password.setValueAttribute(password)

  page = allow_button.click()

  code = page.getByXPath("//kbd/code")[0]

  return code.getTextContent()
  3. Replace:
pincode = raw_input('Pincode? ')

with:

  twitter_username = None # replace it with your twitter username
  twitter_password = None # replace it with your twitter password
  print "Geting pincode"
  pincode = get_pincode('%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token']),  twitter_username, twitter_password)
  print "pincode =", pincode

run.sh

#!/bin/sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" get_access_token.py

Complete source code

#!/usr/bin/python2.4
#
# Copyright 2007 The Python-Twitter Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys

# parse_qsl moved to the urlparse module in Python 2.6
try:
  from urlparse import parse_qsl
except ImportError:
  from cgi import parse_qsl

import oauth2 as oauth

# HTMLUnit related code
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def get_pincode(url, username, password):
  # Simulate a Firefox 3.6 browser with headless HtmlUnit.
  webclient = WebClient(BrowserVersion.FIREFOX_3_6)
  page = webclient.getPage(url)

  # Locate the login fields and the "Allow" button on the authorization page.
  twitter_username_or_email = page.getByXPath("//input[@id='username_or_email']")[0]
  twitter_password = page.getByXPath("//input[@id='password']")[0]
  allow_button = page.getByXPath("//input[@id='allow']")[0]

  # Fill in the user's credentials.
  twitter_username_or_email.setValueAttribute(username)
  twitter_password.setValueAttribute(password)

  # Click "Allow" and read the PIN code from the resulting page.
  page = allow_button.click()
  code = page.getByXPath("//kbd/code")[0]

  return code.getTextContent()

REQUEST_TOKEN_URL = 'https://api.twitter.com/oauth/request_token'
ACCESS_TOKEN_URL  = 'https://api.twitter.com/oauth/access_token'
AUTHORIZATION_URL = 'https://api.twitter.com/oauth/authorize'
SIGNIN_URL        = 'https://api.twitter.com/oauth/authenticate'

consumer_key    = None
consumer_secret = None
twitter_username = None # replace with your Twitter username
twitter_password = None # replace with your Twitter password

if consumer_key is None or consumer_secret is None:
  print 'You need to edit this script and provide values for the'
  print 'consumer_key and also consumer_secret.'
  print ''
  print 'The values you need come from Twitter - you need to register'
  print 'as a developer your "application".  This is needed only until'
  print 'Twitter finishes the idea they have of a way to allow open-source'
  print 'based libraries to have a token that can be used to generate a'
  print 'one-time use key that will allow the library to make the request'
  print 'on your behalf.'
  print ''
  sys.exit(1)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()
oauth_consumer             = oauth.Consumer(key=consumer_key, secret=consumer_secret)
oauth_client               = oauth.Client(oauth_consumer)

print 'Requesting temp token from Twitter'

resp, content = oauth_client.request(REQUEST_TOKEN_URL, 'GET')

if resp['status'] != '200':
  print 'Invalid response from Twitter requesting temp token: %s' % resp['status']
else:
  request_token = dict(parse_qsl(content))

  print ''
  print 'Please visit this Twitter page and retrieve the pincode to be used'
  print 'in the next step to obtain an Access Token:'
  print ''
  print '%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token'])
  print ''

  print "Geting pincode"
  pincode = get_pincode('%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token']), twitter_username, twitter_password)
  print "pincode =", pincode

#  pincode = raw_input('Pincode? ')  # replaced by the automated get_pincode call above

  token = oauth.Token(request_token['oauth_token'], request_token['oauth_token_secret'])
  token.set_verifier(pincode)

  print ''
  print 'Generating and signing request for an access token'
  print ''

  oauth_client  = oauth.Client(oauth_consumer, token)
  resp, content = oauth_client.request(ACCESS_TOKEN_URL, method='POST', body='oauth_verifier=%s' % pincode)
  access_token  = dict(parse_qsl(content))

  if resp['status'] != '200':
    print 'The request for a Token did not succeed: %s' % resp['status']
    print access_token
  else:
    print 'Your Twitter Access Token key: %s' % access_token['oauth_token']
    print '          Access Token secret: %s' % access_token['oauth_token_secret']
    print ''

Conclusion

We have seen how to get OAuth tokens with a headless browser. This approach can be applied to other services such as Facebook and LinkedIn. A partial list of other services you can experiment with is available at: http://wiki.oauth.net/w/page/12238551/ServiceProviders

Look at our previous article Web Scraping Ajax and Javascript Sites for more information about setting up and using HtmlUnit and Jython.

Sadly, the prerequisites require significant extra effort to get everything working, but once you have set up the development environment it’s plain sailing.

Resources

  1. OAuth articles from Eran Hammer-Lahav
  2. OAuth 2.0 for Android Applications
  3. OAuth Will Murder Your Children
  4. Do Facebook Oauth 2.0 Access Tokens Expire?
  5. OAuth2 for iPhone and iPad applications
  6. Movistar BlueVia’s official API for SMS

Photo taken by mariachily

Google Search NoAPI

History

Way back in 2001 I wanted to be able to query Google automatically. Since Google did not provide an official API, I developed a simple Google Search “NoAPI” scraper and published it as Googolplex. Google later launched a SOAP based API, but on December 20, 2006 they stopped accepting signups for it[1] and suspended it on August 31, 2009[2]. This shows that creating a service or product based on web APIs is a very risky business without an SLA contract. Google soon launched another API called the Google Ajax Web Search API[3] under a different license. This second API was suspended on November 1, 2010[4]. You may wonder if Google is a bipolar creature. You can see the latest post at Fall Housekeeping.

Google has undergone a lot of changes since 2001, and Googolplex and other libraries like xgoogle are now part of Internet history. A similar new library is presented in Mario Vilas’ blog post Quickpost: Using Google Search from your Python code.

It’s not clear why Google vacillates over what could be an additional source of revenue, but it is clear that we should expect Google to provide an official, easy-to-use API. There are ways Google could restrict abuse of its APIs by third parties. It’s very common to offer a free tier for low-volume searches and charge for more intensive use, as Yahoo BOSS does.

In this article we’ll examine one way of crawling information from AJAX/Javascript based sites.

Crawling Google As A Browser

If you go to Google and look at the HTML source code you’ll be astonished to see heavily obfuscated Javascript. Even after a search, the source is no clearer.

So, here is our code to get Google’s results using HtmlUnit and Jython (we don’t have any affiliation with them, we just like them!). Look at our Web Scraping Ajax and Javascript Sites article for more information.

google.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def query(q):
   # Simulate a Firefox 3.6 browser with headless HtmlUnit.
   webclient = WebClient(BrowserVersion.FIREFOX_3_6)
   url = "http://www.google.com"
   page = webclient.getPage(url)

   # Type the query into the search box and click the search button.
   query_input = page.getByXPath("//input[@name='q']")[0]
   query_input.text = q
   search_button = page.getByXPath("//input[@name='btnG']")[0]
   page = search_button.click()

   # Each result title is an h3 element inside the results list.
   results = page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']")

   c = 0
   for result in results:
      title = result.asText()
      # Extract the href attribute of the result's link.
      href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
      print title, href
      c += 1

   print c, "Results"

if __name__ == '__main__':
   query("google web search api")

run.sh

#!/bin/sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" google.py

Alternatives

The following search engines provide official APIs for search:

Homework

  1. Write a clean function/class to do Google queries and handle exceptions (a starting sketch follows this list).
  2. Modify the function to handle nested and paged results.
  3. Modify the function again, this time to include descriptions.
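
As a starting point for the first item, here is a minimal, hedged sketch; safe_query is a hypothetical helper that is not part of the original post, and it assumes the same HtmlUnit 2.8/Jython setup used above:

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion
import java.io.IOException as IOException

def safe_query(q):
   # Run a Google query and return a list of (title, href) tuples,
   # reporting network errors and unexpected page layouts instead of crashing.
   webclient = WebClient(BrowserVersion.FIREFOX_3_6)
   results = []
   try:
      page = webclient.getPage("http://www.google.com")
      query_input = page.getByXPath("//input[@name='q']")[0]
      query_input.text = q
      page = page.getByXPath("//input[@name='btnG']")[0].click()
      for result in page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']"):
         href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
         results.append((result.asText(), href))
   except IOException, e:
      print "Network error:", e
   except IndexError:
      print "Unexpected page layout: expected element not found"
   return results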

Final Notes

The approach taken by Mario Vilas is more API-like; our approach here is a defensive measure against NoAPIs. This is another good example of HtmlUnit doing its job.

BTW, the noapi.com domain is available[5].

See Also

  1. Extraction of Main Text Content Using the Google Reader NoAPI
  2. The Data Portability Fact Sheet

References

  1. Beyond the SOAP Search API
  2. A well earned retirement for the SOAP Search API
  3. Google AJAX Search API beta Version 1.0 Available
  4. Fall Housekeeping
  5. The noapi.com domain is available at the time of writing of this article. Register it now! (Disclaimer: affiliate link).

Additional Resources

  1. Google Search API?
  2. Google Deprecates Their SOAP Search API
  3. Google Search API Dropped
  4. Is this API going to be closed down?
  5. Yahoo BOSS Switching To Paid Model In Early 2011
  6. Thoughts on Yahoo! BOSS Monetization Announcement
  7. Google to Start Charging for Prediction API
  8. Update on Whitelisting (Twitter API policies discussion)
  9. From “Businesses” To “Tools”: The Twitter API ToS Changes

Web Scraping Ajax and Javascript Sites

Introduction

Most crawling frameworks used for scraping cannot handle Javascript or Ajax; their scope is limited to sites that serve their main content without scripting. One might be tempted to connect a specific crawler to a Javascript engine, but that is not easy to do: you need a fully functional browser with good DOM support, because browser behavior is too complex for a simple crawler-plus-Javascript-engine combination to work. There is a list of resources at the end of this article for exploring the alternatives in more depth.

There are several ways to scrape a site that contains Javascript:

  1. Embed a web browser within an application and simulate a normal user.
  2. Remotely connect to a web browser and automate it from a scripting language.
  3. Use special purpose add-ons to automate the browser.
  4. Use a framework/library to simulate a complete browser.

Each of these alternatives has its pros and cons. For example, using a complete browser consumes a lot of resources, especially if we need to scrape websites with many pages.

In this post we’ll give a simple example of how to scrape a web site that uses Javascript. We will use the HtmlUnit library to simulate a browser. Since HtmlUnit runs on a JVM we will use Jython, an excellent programming language that is a Python implementation for the JVM. The resulting code is very clear and focuses on solving the problem instead of on programming-language plumbing.
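
To give a flavor of the approach before the full example, here is a minimal sketch; it assumes the HtmlUnit classpath setup described below, and example.com is just a placeholder site:

import com.gargoylesoftware.htmlunit.WebClient as WebClient

# Fetch a page with the headless browser and print its title.
webclient = WebClient()
page = webclient.getPage("http://www.example.com")
print page.getTitleText()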

Setting up the environment

Prerequisites

  1. JRE or JDK.
  2. Download the latest version of Jython from http://www.jython.org/downloads.html.
  3. Run the .jar file and install it in your preferred directory (e.g. /opt/jython).
  4. Download the htmlunit compiled binaries from: http://sourceforge.net/projects/htmlunit/files/.
  5. Unzip the htmlunit binaries to your preferred directory.

Crawling example

We will scrape the Gartner Magic Quadrant pages at http://www.gartner.com/it/products/mq/mq_ms.jsp. If you look at the list of documents, the links are Javascript code instead of hyperlinks with http URLs. This may be to deter crawling, or just to open a popup window. Either way, it’s a very convenient page to illustrate the solution.

gartner.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
   webclient = WebClient(BrowserVersion.FIREFOX_3_6) # creating a new webclient object.
   url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
   page = webclient.getPage(url) # getting the url
   articles = page.getByXPath("//table[@id='mqtable']//tr/td/a") # getting all the hyperlinks

   for article in articles:
      print "Clicking on:", article
      subpage = article.click() # click on the article link
      title = subpage.getByXPath("//div[@class='title']") # get title
      summary = subpage.getByXPath("//div[@class='summary']") # get summary
      if len(title) > 0 and len(summary) > 0:
         print "Title:", title[0].asText()
         print "Summary:", summary[0].asText()
#     break # uncomment to stop after the first article while testing

if __name__ == '__main__':
   main()

run.sh

#!/bin/sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py

Final notes

This article is just a starting point for moving beyond simple crawlers and a pointer for further research. Since this is a simple page, it makes a clear example of how Javascript scraping works. You must do your homework to crawl more complex web pages or to add multithreading for better performance. In a demanding crawling scenario many more things must be taken into account, but that is a subject for future articles.

If you want to be polite, don’t forget to read the robots.txt file before crawling.
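
As a minimal sketch, the Python standard library’s robotparser module (also available in Jython 2.5) can check the rules before fetching; the URLs below reuse the Gartner example:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.gartner.com/robots.txt")
rp.read()

# Only crawl the page if robots.txt allows it for our user agent.
if rp.can_fetch("*", "http://www.gartner.com/it/products/mq/mq_ms.jsp"):
   print "Allowed to crawl"
else:
   print "Disallowed by robots.txt"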

If you like this article, you might also be interested in

  1. Distributed Scraping With Multiple Tor Circuits
  2. Precise Scraping with Google Chrome
  3. Running Your Own Anonymous Rotating Proxies
  4. Automated Browserless OAuth Authentication for Twitter

Resources

  1. HtmlUnit
  2. ghost.py is a webkit web client written in python
  3. Crowbar web scraping environment
  4. Google Chrome remote debugging shell from Python
  5. Selenium web application testing system
  6. Watir
  7. Sahi
  8. Windmill Testing Framework
  9. Internet Explorer automation
  10. jSSh Javascript Shell Server for Mozilla
  11. http://trac.webkit.org/wiki/QtWebKit
  12. Embedding Gecko
  13. Opera Dragonfly
  14. PyAuto: Python Interface to Chromium’s automation framework
  15. Related questions on Stack Overflow
  16. Scrapy
  17. EnvJS: Simulated browser environment written in Javascript
  18. Setting up Headless XServer and CutyCapt on Ubuntu
  19. CutyCapt: Capture WebKit’s rendering of a web page.
  20. Google Webmaster blog: A spider’s view of Web 2.0
  21. OpenQA
  22. Python Webkit DOM Bindings
  23. Berkelium Browser
  24. uBrowser
  25. Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
  26. Zombie.js
  27. PhantomJS
  28. PyPhantomJS
  29. CasperJS
  30. Web Inspector Remote
  31. Offscreen/Headless Mozilla Firefox (via @brutuscat)
  32. Web Scraping with Google Spreadsheets and XPath
  33. Web Scraping with YQL and Yahoo Pipes

Photo taken by xiffy