The Data Portability Fact Sheet

Introduction

Parallego has been announced on TechCrunch: after a stealth period, it is the latest social network set to challenge Facebook and Google Plus. Its investors include big names like Sequoia Capital, Andreessen Horowitz, and Union Square Ventures, plus top angels like Ron Conway. They really love developers, so they offer an API to show their commitment to openness.

Parallego doesn’t really exist, but announcements like this appear constantly in startup news about the web and entrepreneurship. These companies emphasize their love for developers and claim to be open because they provide APIs. The truth is that when you test those APIs you usually find a number of problems:

  1. You can read the information but cannot write or modify it.
  2. You have access to certain information but other information is unavailable.
  3. The API rate limit is low, so you can only make a few calls and then must wait a certain period of time to continue.
  4. You cannot make parallel requests in a multiprocess or multithreaded application.
  5. There is no way to quickly pay for the service and access a better service. Google API Console is a step in that direction but a lot of important Google NoAPIs are unavailable.
  6. Some OAuth2 protocol implementations do not work with existing development libraries.
  7. The service says it welcomes new applications, but this is not the case for new UIs and mobile clients. See Twitter to Devs: Don’t Make Twitter Clients… Or Else [mashable.com]
  8. You cannot even export your own information. The time you have spent adding content to this service is lost once you leave it.
  9. There is no love for developers: the forums are filled with questions and there are no official answers. See Rate limit with billing enabled [google.com] and Graph API rate limit? [facebook.com]
  10. The company often changes its policies. The web mashup you built seven months ago, which attracted thousands of users, is now useless because the new API revision no longer returns the data that some specific features need. See Should facebook pay compensation for deprecated API calls and changes [facebook.com]
  11. Old content is removed without warning.

After a while you begin to doubt, close your eyes, and think again about the word “Open”. It starts to seem meaningless. If you are older you may remember that Microsoft was accused of being closed, but you may also remember that, in the worst case, you could reverse engineer its software and reach all the internals yourself. You needed advanced knowledge of tools like IDA Pro, OllyDbg, and WinDbg, of course, but it was possible. You can’t reverse engineer the cloud. You can scrape the information, but that is time consuming both in development effort and in running time.

And while “Open” is repeated in every announcement from high profile web companies, your brain no longer registers the word, just as you no longer see the ads on Google because your brain has built its own AdBlock extension.

Data Portability Classification

For all of the above reasons we think the best initiative towards transparency is adding a fact sheet to every service so we can compare them and know how “open” they really are. WikiMatrix is a good example of how comparisons could be made.
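
To make the idea concrete, here is a rough, hypothetical sketch of what a machine-readable fact sheet entry might contain. The field names are our own invention, derived directly from the list of problems above; they are not an existing standard.

# fact_sheet.py -- illustrative sketch of a fact sheet entry for a hypothetical service.
# The field names are invented for this example, based on the problems listed above.
parallego_fact_sheet = {
    'service': 'Parallego',                 # the fictional service from the introduction
    'read_api': True,                       # can you read your data through the API?
    'write_api': False,                     # can you write or modify it?
    'full_data_access': False,              # is all of your information exposed?
    'rate_limit': 'low, undocumented',      # how many calls before you must wait?
    'parallel_requests': False,             # are multiprocess/multithreaded clients allowed?
    'paid_tier': False,                     # can you pay for a better service level?
    'standard_oauth2': False,               # works with existing OAuth2 libraries?
    'third_party_clients': 'discouraged',   # are alternative UIs and mobile clients welcome?
    'data_export': False,                   # can you export your own information?
    'developer_support': 'forums without official answers',
    'api_stability': 'policies and features change frequently',
    'content_retention': 'old content removed without warning',
}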

Marco Paol from DBB has been informally collecting information about some web services and has put it in a public spreadsheet: Data Portability Comparison.

Please feel free to send us clarifications, suggestions, and fixes.

Resources

  1. Open Data and Linked Data [wikipedia.org]
  2. DataPortability project [wikipedia.org]
  3. Small data [smalldata.org]
  4. The open data manual [opendatamanual.org]
  5. Is It Open Data?
  6. Open Data mailing lists [okfn.org]
  7. Synaptic/Web
  8. Open Knowledge Foundation Blog
  9. The Friend of a Friend (FOAF) project
  10. theinfo.org: Community for Getting, Processing, and Visualizing Large Data Sets
  11. Plagiarism Today
  12. PeopleBrowsr’s case against Twitter heads back to state court after federal court ruling
  13. Archive Team archivists

Language Identification for Text Mining and NLP

Image: The Tower of Babel and ships in a large marine landscape.

Introduction

Language identification is a key task in the text mining process. Successful analysis of extracted text with natural language processing or machine learning requires a good language identification algorithm. If it fails to recognize the language, the error propagates and undermines every subsequent step. NLP algorithms must be adjusted for different corpora and for the grammar of each language, and certain NLP software is best suited to certain languages. For example, NLTK is the most popular natural language processing package for English under Python, while FreeLing is better suited to Spanish. The efficiency of language processing depends on many factors.

A very high level model for text analysis includes the following tasks (a minimal skeleton is sketched after the descriptions):

Text Extraction
Text can be extracted by: scraping a web site, importing it in a specific format, getting it from a database, or accessing it via an API.

Text Identification
Text identification is the process of separating the interesting text from other content or formatting that adds noise to the analysis. For example, a blog can include advertising, menus, and other information besides the main content.

NLP
NLP is a set of algorithms to aid in the processing of different languages. See the Resources section at the end of this article for links to NLP software packages and articles.

Machine Learning
Machine learning is a necessary step for tasks such as collaborative filtering, sentiment analysis and clustering.
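
The skeleton below shows how these stages might chain together. It is only an illustrative sketch: the helper names are placeholders invented for this example, and each stub would wrap a real scraper, boilerplate-removal step, language detector, and NLP or machine learning library.

# pipeline.py -- illustrative skeleton of the text analysis stages (stubs only).

def extract_text(source):
    """Text Extraction: scrape a site, call an API, read a database or a file."""
    raise NotImplementedError

def identify_main_text(raw_text):
    """Text Identification: keep the main content, drop menus, ads, and markup."""
    raise NotImplementedError

def detect_language(text):
    """Language identification (see the Google Language API example below)."""
    raise NotImplementedError

def analyze(text, language):
    """NLP and Machine Learning: tokenize, classify, cluster, etc.,
    using tools suited to the detected language."""
    raise NotImplementedError

def process(source):
    raw = extract_text(source)
    main_text = identify_main_text(raw)
    language = detect_language(main_text)
    return analyze(main_text, language)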

Software Alternatives

There is a lot of language identification software available on the web. NLTK uses Crúbadán, while Gate includes TextCat. At Data Big Bang, we like to use the Google Language API because it is very accurate even for a single word. It also includes a confidence measure in the response.

Sadly, Google has deprecated the Google Language API family and we have added it to our “Google NoAPI” list. It can still be used until it is shut down.

Example Including an API Key

Google highly recommends including an API key with the API request. You can get one at http://code.google.com/apis/loader/signup.html or with the new Google API Console https://code.google.com/apis/console/. Use it as follows:

language-identification.py

#!/usr/bin/python

# Language Detection using Google Language API: http://code.google.com/apis/language/translate/v2/getting_started.html
# It can handle unicode texts. You need to add your exception/errors catching.
import sys
import urllib
import urlparse
import simplejson

ENDPOINT = "https://www.googleapis.com/language/translate/v2/detect"
KEY = "" # Insert your key here. Get it from: https://code.google.com/apis/console/

def detect_language(text):
   utf8_encoded_text = text.encode('utf-8')
   query_field = urllib.urlencode({'key':KEY, 'q':utf8_encoded_text})
   parsed_url = urlparse.urlparse(ENDPOINT)
   url = urlparse.urlunparse((parsed_url[0], parsed_url[1], parsed_url[2], parsed_url[3], query_field, parsed_url[5]))

   data = simplejson.loads(urllib.urlopen(url).read())
   response = data['data']['detections'][0][0]

   return response # it answers: {'isReliable': , 'confidence': , 'language': }

if __name__ == '__main__':
   terminal_encoding = sys.stdin.encoding
   text = raw_input("Text? ")
   unicode_text = text.decode(terminal_encoding)
   response = detect_language(unicode_text)

   print response

The Google Language API for language identification is very easy to use and used to be very permissive in terms of usage limits; nowadays the rate limit status can be checked in the API console.

Benchmarking 

Different language identification algorithms can easily be benchmarked against Google’s. Testing with single words and short sentences is a good indicator, especially if the algorithm will be used on services like Twitter where messages are very short. A minimal harness for such a comparison is sketched below.
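
The sketch below is our own illustration of such a harness, not code from any of the listed packages. It treats a detector as any function that maps text to a language code and scores it on a tiny hand-labeled sample. The wrapper assumes the earlier script was saved as language_identification.py (underscores, so it can be imported), and a real benchmark would of course need a much larger labeled set.

# benchmark.py -- minimal language identification benchmark (illustrative sketch).
# A "detector" is any callable that maps a unicode string to an ISO language code.

SAMPLES = [                      # tiny hand-labeled sample; replace with real test data
    (u"the quick brown fox jumps over the lazy dog", "en"),
    (u"la casa es azul", "es"),
    (u"bonjour tout le monde", "fr"),
]

def accuracy(detector, samples=SAMPLES):
    hits = 0
    for text, expected in samples:
        try:
            if detector(text) == expected:
                hits += 1
        except Exception:        # a failed call counts as a miss
            pass
    return float(hits) / len(samples)

def google_detector(text):
    # wraps detect_language() from the earlier example, assuming it was saved
    # as language_identification.py
    from language_identification import detect_language
    return detect_language(text)['language']

if __name__ == '__main__':
    print "Google Language API accuracy: %.2f" % accuracy(google_detector)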

Resources

  1. Google Scholar search on language identification
  2. Google language detection
  3. Lingua Identify for Perl
  4. A language detection library for Java
  5. Language identification addition for NLTK
  6. Sentiment analysis and language processing tools
  7. Balie language identification
  8. Gate
  9. NLTK
  10. FreeLing
  11. TextCat and TextCat under Gate
  12. LingPipe

Automated Browserless OAuth Authentication for Twitter

Introduction

My first impression of the OAuth protocol was: bureaucracy meets the web. It’s understandable that users must approve third party applications’ access to their own information, but if I want to access my personal information from my own application, why do I need to complete all this “paperwork”?

Also, user experience suffers when you have to jump to the browser and return to your application as part of the workflow. Mobile and desktop apps need more alternatives to work around that. Twitter offers the xAuth API for desktop and mobile applications but you have to send a request with “plenty of details” and may have to wait a long time to get it.

This article describes how to use the 3-legged OAuth protocol with a headless browser like HtmlUnit to get tokens from Twitter without user intervention.

The example uses HtmlUnit and Jython. If you want to use HtmlUnit under .NET I recommend looking at Using HtmlUnit on .NET for Headless Browser Automation (using IKVM). WP7 developers may also want to look at the .NET article to see if it could be applied to Silverlight.

Once you obtain the token you can keep it and use it in future calls. Be aware that tokens may expire based on conditions such as time. Ethically, an automated application should still ask users to either allow or deny it access to Twitter.

Prerequisites

  1. JRE or JDK
  2. Download and install the latest Jython version. Run the .jar and install it in your preferred directory (e.g. /opt/jython).
  3. Download and decompress setuptools-0.6c11.tar.gz
  4. Go to the setuptools directory. Install the package under Jython with: sudo /opt/jython/bin/jython setup.py install
  5. Download and decompress python-twitter-0.8.1.tar.gz
  6. Look at the required dependencies for python-twitter and install them with Jython:
    1. http://cheeseshop.python.org/pypi/simplejson
    2. http://code.google.com/p/httplib2/
    3. http://github.com/simplegeo/python-oauth2
    4. You’ll need to change the file oauth2/__init__.py for Jython 2.5 compatibility, replacing:

from urlparse import parse_qs, parse_qsl

with:

try:
    from urlparse import parse_qsl, parse_qs
except ImportError:
    from cgi import parse_qsl, parse_qs

  7. Under the python-twitter-0.8.1 directory, download the HtmlUnit compiled binaries from http://sourceforge.net/projects/htmlunit/files/ (we are using HtmlUnit 2.8 for this example).
  8. Go to the python-twitter-0.8.1 directory and install the python-twitter package under Jython:
    1. sudo /opt/jython/bin/jython setup.py install
  9. Create a Twitter application for testing and get its key and secret.

Example

get_access_token.py

Changes

  1. Replace consumer_key and consumer_secret with your application key/secret.
  2. Add the following imports and get_pincode function:
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def get_pincode(url, username, password):
  webclient = WebClient(BrowserVersion.FIREFOX_3_6)
  page = webclient.getPage(url)

  twitter_username_or_email = page.getByXPath("//input[@id='username_or_email']")[0]
  twitter_password = page.getByXPath("//input[@id='password']")[0]
  allow_button = page.getByXPath("//input[@id='allow']")[0]

  twitter_username_or_email.setValueAttribute(username)
  twitter_password.setValueAttribute(password)

  page = allow_button.click()

  code = page.getByXPath("//kbd/code")[0]

  return code.getTextContent()
  3. Replace:
pincode = raw_input('Pincode? ')

with:

  twitter_username = None # replace it with your twitter username
  twitter_password = None # replace it with your twitter password
  print "Getting pincode"
  pincode = get_pincode('%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token']),  twitter_username, twitter_password)
  print "pincode =", pincode

run.sh

#!/bin/sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" get_access_token.py

Complete source code

#!/usr/bin/python2.4
#
# Copyright 2007 The Python-Twitter Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys

# parse_qsl moved to urlparse module in v2.6
try:
  from urlparse import parse_qsl
except:
  from cgi import parse_qsl

import oauth2 as oauth

# HTMLUnit related code
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def get_pincode(url, username, password):
  webclient = WebClient(BrowserVersion.FIREFOX_3_6)
  page = webclient.getPage(url)

  twitter_username_or_email = page.getByXPath("//input[@id='username_or_email']")[0]
  twitter_password = page.getByXPath("//input[@id='password']")[0]
  allow_button = page.getByXPath("//input[@id='allow']")[0]

  twitter_username_or_email.setValueAttribute(username)
  #password.text = password
  #password.setText(password) # HtmlPasswordInput
  twitter_password.setValueAttribute(password)

  page = allow_button.click()

  code = page.getByXPath("//kbd/code")[0]

  return code.getTextContent()

REQUEST_TOKEN_URL = 'https://api.twitter.com/oauth/request_token'
ACCESS_TOKEN_URL  = 'https://api.twitter.com/oauth/access_token'
AUTHORIZATION_URL = 'https://api.twitter.com/oauth/authorize'
SIGNIN_URL        = 'https://api.twitter.com/oauth/authenticate'

consumer_key    = None
consumer_secret = None
twitter_username = None
twitter_password = None

if consumer_key is None or consumer_secret is None:
  print 'You need to edit this script and provide values for the'
  print 'consumer_key and also consumer_secret.'
  print ''
  print 'The values you need come from Twitter - you need to register'
  print 'as a developer your "application".  This is needed only until'
  print 'Twitter finishes the idea they have of a way to allow open-source'
  print 'based libraries to have a token that can be used to generate a'
  print 'one-time use key that will allow the library to make the request'
  print 'on your behalf.'
  print ''
  sys.exit(1)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()
oauth_consumer             = oauth.Consumer(key=consumer_key, secret=consumer_secret)
oauth_client               = oauth.Client(oauth_consumer)

print 'Requesting temp token from Twitter'

resp, content = oauth_client.request(REQUEST_TOKEN_URL, 'GET')

if resp['status'] != '200':
  print 'Invalid response from Twitter requesting temp token: %s' % resp['status']
else:
  request_token = dict(parse_qsl(content))

  print ''
  print 'Please visit this Twitter page and retrieve the pincode to be used'
  print 'in the next step to obtain an Authentication Token:'
  print ''
  print '%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token'])
  print ''

  print "Getting pincode"
  pincode = get_pincode('%s?oauth_token=%s' % (AUTHORIZATION_URL, request_token['oauth_token']), twitter_username, twitter_password)
  print "pincode =", pincode

#  pincode = raw_input('Pincode? ')

  token = oauth.Token(request_token['oauth_token'], request_token['oauth_token_secret'])
  token.set_verifier(pincode)

  print ''
  print 'Generating and signing request for an access token'
  print ''

  oauth_client  = oauth.Client(oauth_consumer, token)
  resp, content = oauth_client.request(ACCESS_TOKEN_URL, method='POST', body='oauth_verifier=%s' % pincode)
  access_token  = dict(parse_qsl(content))

  if resp['status'] != '200':
    print 'The request for a Token did not succeed: %s' % resp['status']
    print access_token
  else:
    print 'Your Twitter Access Token key: %s' % access_token['oauth_token']
    print '          Access Token secret: %s' % access_token['oauth_token_secret']
    print ''
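
Once the script prints the access token key and secret, they can be plugged into python-twitter itself. The following is a minimal sketch of that step, assuming python-twitter 0.8.1 as installed in the prerequisites; the placeholder values must be replaced with your own consumer key/secret and the tokens printed above.

# use_token.py -- illustrative sketch: using the obtained access token with python-twitter.
import twitter

api = twitter.Api(consumer_key='YOUR_CONSUMER_KEY',
                  consumer_secret='YOUR_CONSUMER_SECRET',
                  access_token_key='YOUR_ACCESS_TOKEN_KEY',
                  access_token_secret='YOUR_ACCESS_TOKEN_SECRET')

user = api.VerifyCredentials()   # a simple call to check that the tokens work
print "Authenticated as:", user.screen_name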

Conclusion

We have seen how to get OAuth tokens with a headless browser. This approach can be applied to other services such as Facebook and LinkedIn. A partial list of other services you can play with is available at: http://wiki.oauth.net/w/page/12238551/ServiceProviders

Look at our previous article Web Scraping Ajax and Javascript Sites for more information about setting up and using HtmlUnit and Jython.

Sadly, the prerequisites require a significant extra effort before everything works, but once you have set up the development environment it’s plain sailing.

Resources

  1. OAuth articles from Eran Hammer-Lahav
  2. OAuth 2.0 for Android Applications
  3. OAuth Will Murder Your Children
  4. Do Facebook Oauth 2.0 Access Tokens Expire?
  5. OAuth2 for iPhone and iPad applications
  6. Movistar BlueVia’s official API for SMS

Photo taken by mariachily

Scraping vs Antiscraping

Introduction

It’s not possible to jump into the subject of scrapers without confronting antiscraping techniques. The reverse is also true: if you want to develop good antiscraping techniques you must think like a scraper developer. Similarly, real hackers need knowledge of security technologies, while a good security system benefits from simulated attacks. This kind of “game dynamics” also applies to security algorithms. For example, one of the best known public-key encryption algorithms, RSA, was invented by Ron Rivest, Adi Shamir and Leonard Adleman. Rivest and Shamir proposed new algorithms and Adleman was in charge of breaking them, until they eventually came up with RSA [1].

Antiscraping Measures and How to Pass Them

A preliminary chart:

  1. Antiscraping: The site only enables crawling by known search engine bots. Scraping: Access the search engine’s cache instead.
  2. Antiscraping: The site doesn’t allow the same IP to access a lot of pages in a short period of time. Scraping: Use Tor, a set of proxies, or a crawling service like 80legs (see the sketch after this chart).
  3. Antiscraping: The site shows a captcha if it is crawled in a massive way. Scraping: Use anti-captcha techniques, or services like Mechanical Turk where real people provide the answers. Another alternative is to request the audio captcha and use voice recognition that copes with the noise.
  4. Antiscraping: The site uses Javascript. Scraping: Use a Javascript-enabled crawler.
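
As a minimal, illustrative sketch of the proxy rotation idea from the chart (our own example, not a recipe from any particular service), the snippet below rotates each request through a list of HTTP proxies using Python’s standard urllib2 module. The proxy addresses are placeholders; a real list would point to Tor instances, a proxy provider, or your own servers.

# rotating_fetch.py -- illustrative sketch: rotate requests over a list of HTTP proxies.
import itertools
import urllib2

# Placeholder proxy addresses; replace with real proxies or local Tor/Privoxy instances.
PROXIES = ['http://127.0.0.1:8118', 'http://127.0.0.1:8119']
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    proxy = next(proxy_cycle)                               # round-robin proxy selection
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]     # a browser-like user agent
    return opener.open(url, timeout=30).read()

if __name__ == '__main__':
    print len(fetch('http://www.example.com/')), "bytes fetched"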

Many antiscraping measures are annoying for visitors. For example, if you’re a “search engine junkie” you’ll find out pretty quickly that Google starts showing you a captcha because it thinks you are a bot.

Digression

I believe the web should follow an MVC (Model View Controller) type pattern where you can access the data (the model) independently of how you interact with it. This would enable stronger connections between different sites. Linked Data is one such initiative, but there are others. Data Portability and APIs are a step towards this pattern, but when you use the APIs of large sites you realize that they impose a lot of limits. Starting a whole business based on third party APIs is very risky: you only have to look at the past to see a lot of changes in API features and policies. Facebook, Google and Twitter are good examples. API providers are afraid of losing control of their sites and the profits they generate. We need new business models which can get around this problem and benefit both API providers and consumers, and they should not be based only on advertising. One common approach is to charge for use of the API. There are other models, like the one followed by the Guardian, which distributes its ads via its API. APIs carrying advertising is a promising concept. We hope that more creative people will come up with new models for a better MVC web.

See Also

  1. Running Your Own Anonymous Rotating Proxies
  2. Distributed Scraping With Multiple Tor Circuits

References

  1. Leonard Adleman Interview

Further reading

  1. Captcha Recognition
  2. OCR Research Team
  3. Data Scraping with YQL and jQuery
  4. API Conference
  5. Google Calls Out Facebook’s Data Hypocrisy, Blocks Gmail Import
  6. Google Search NoAPI
  7. Kayak Search API is no longer supported
  8. The Guardian Open Platform
  9. Twitter Slashes API Rate Limits In Half Across The Board To Deal With Capacity Issues
  10. Facebook, you are doing it wrong
  11. Cubeduel Goes Viral Too Quickly, Stumbles Over LinkedIn API Limits
  12. Keyword Exchange Market
  13. A Union for Mechanical Turk Workers?
  14. The Long Tail Of Business Models
  15. Scraping, cleaning, and selling big data
  16. Detecting ‘stealth’ web-crawlers

Photo: Glykais Gyula fencing against Oreste Puliti. [Source]

Google Search NoAPI

History

Way back in 2001 I wanted to be able to query Google automatically. Since Google did not provide an official API, I developed a small, simple Google Search “NoAPI” scraper and published it as Googolplex. Google later launched a SOAP based API, but on December 20, 2006 they stopped accepting signups for it [1] and they suspended it on August 31, 2009 [2]. This shows that creating a service or product based on web APIs is a very risky business without an SLA contract. Google soon launched another API called the Google Ajax Web Search API [3] under a different license. This second API was suspended on November 1, 2010 [4]. You may wonder if Google is a bipolar creature. You can read the latest post on the subject at Fall Housekeeping.

Google has undergone a lot of changes since 2001, and Googolplex and other libraries like xgoogle are now part of Internet history. A similar, newer library is available from Mario Vilas’ Google Search Python blog post, Quickpost: Using Google Search from your Python code.

It’s not clear why Google vacillates over what could be an additional source of revenue, but it is clear that we should be able to expect an official and easy to use API from Google. There are ways Google could restrict abuse of its APIs by third parties. It’s very common to offer a free tier for low volume searches and charge for more intensive use, as Yahoo BOSS does.

In this article we’ll examine one way of crawling information in AJAX/Javascript based sites.

Crawling Google As A Browser

If you go to Google and look at the HTML source code, you’ll be astonished to see pure obfuscated Javascript. Even after performing a search, the source is no clearer.

So, here is our code to get Google’s results using HtmlUnit and Jython (we don’t have any affiliation with them, we just like them!). Look at our Web Scraping Ajax and Javascript Sites article for more information.

google.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def query(q):
   webclient = WebClient(BrowserVersion.FIREFOX_3_6)
   url = "http://www.google.com"
   page = webclient.getPage(url)

   query_input = page.getByXPath("//input[@name='q']")[0]   # the search box
   query_input.text = q
   search_button = page.getByXPath("//input[@name='btnG']")[0]   # the "Google Search" button
   page = search_button.click()
   results = page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']")   # result title headings

   c = 0
   for result in results:
      title = result.asText()
      href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
      print title, href
      c += 1

   print c,"Results"

if __name__ == '__main__':
   query("google web search api")

run.sh

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" google.py

Alternatives

The following search engines provide official APIs for search:

Homework

  1. Write a clean function/class to do Google queries and handle exceptions (a possible starting skeleton is sketched after this list).
  2. Modify the function to handle nested and paged results.
  3. Modify the function again, this time to include descriptions.
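
As a hedged starting point for the first exercise (our own sketch, building directly on google.py above, with the same caveat that Google’s markup changes often), the version below returns the results instead of printing them and catches HtmlUnit/Java and indexing errors:

# google_clean.py -- illustrative sketch for homework item 1: return results and handle exceptions.
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion
import java.lang.Exception as JavaException

def query(q):
    """Return a list of (title, href) tuples, or an empty list on failure."""
    results = []
    try:
        webclient = WebClient(BrowserVersion.FIREFOX_3_6)
        page = webclient.getPage("http://www.google.com")
        query_input = page.getByXPath("//input[@name='q']")[0]
        query_input.text = q
        page = page.getByXPath("//input[@name='btnG']")[0].click()
        for result in page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']"):
            title = result.asText()
            href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
            results.append((title, href))
    except (JavaException, IndexError), e:   # network errors, layout changes, etc.
        print "Query failed:", e
    return results

if __name__ == '__main__':
    for title, href in query("google web search api"):
        print title, href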

Final Notes

The approach taken by Mario Vilas is more API-like; our approach here is a defensive measure against NoAPIs. This is another good example of HtmlUnit doing its job.

By the way, the noapi.com domain is available [5].

See Also

  1. Extraction of Main Text Content Using the Google Reader NoAPI
  2. The Data Portability Fact Sheet

References

  1. Beyond the SOAP Search API
  2. A well earned retirement for the SOAP Search API
  3. Google AJAX Search API beta Version 1.0 Available
  4. Fall Housekeeping
  5. The noapi.com domain is available at the time of writing of this article. Register it now! (Disclaimer: affiliate link).

Additional Resources

  1. Google Search API?
  2. Google Deprecates Their SOAP Search API
  3. Google Search API Dropped
  4. Is this API going to be closed down?
  5. Yahoo BOSS Switching To Paid Model In Early 2011
  6. Thoughts on Yahoo! BOSS Monetization Announcement
  7. Google to Start Charging for Prediction API
  8. Update on Whitelisting (Twitter API policies discussion)
  9. From “Businesses” To “Tools”: The Twitter API ToS Changes

Web Scraping Ajax and Javascript Sites

Introduction

Most crawling frameworks used for scraping cannot handle Javascript or Ajax. Their scope is limited to sites that show their main content without using scripting. One might be tempted to connect a crawler directly to a Javascript engine, but it’s not easy to do: you need a fully functional browser with good DOM support, because browser behavior is too complex for a simple connection between a crawler and a Javascript engine to work. There is a list of resources at the end of this article for exploring the alternatives in more depth.

There are several ways to scrape a site that contains Javascript:

  1. Embed a web browser within an application and simulate a normal user.
  2. Remotely connect to a web browser and automate it from a scripting language.
  3. Use special purpose add-ons to automate the browser.
  4. Use a framework/library to simulate a complete browser.

Each one of these alternatives has its pros and cons. For example, using a complete browser consumes a lot of resources, especially if we need to scrape websites with a lot of pages.

In this post we’ll give a simple example of how to scrape a web site that uses Javascript. We will use the HtmlUnit library to simulate a browser. Since HtmlUnit runs on a JVM we will use Jython, an excellent programming language that is a Python implementation for the JVM. The resulting code is very clear and focuses on solving the problem rather than on programming language details.

Setting up the environment

Prerequisites

  1. JRE or JDK.
  2. Download the latest version of Jython from http://www.jython.org/downloads.html.
  3. Run the .jar file and install it in your preferred directory (e.g: /opt/jython).
  4. Download the htmlunit compiled binaries from: http://sourceforge.net/projects/htmlunit/files/.
  5. Unzip the htmlunit to your preferred directory.

Crawling example

We will scrape the Gartner Magic Quadrant pages at http://www.gartner.com/it/products/mq/mq_ms.jsp. If you look at the list of documents, the links are Javascript code instead of hyperlinks with HTTP URLs. This may be to reduce crawling, or just to open a popup window. It’s a very convenient page to illustrate the solution.

gartner.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
   webclient = WebClient(BrowserVersion.FIREFOX_3_6) # creating a new webclient object.
   url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
   page = webclient.getPage(url) # getting the url
   articles = page.getByXPath("//table[@id='mqtable']//tr/td/a") # getting all the hyperlinks

   for article in articles:
      print "Clicking on:", article
      subpage = article.click() # click on the article link
      title = subpage.getByXPath("//div[@class='title']") # get title
      summary = subpage.getByXPath("//div[@class='summary']") # get summary
      if len(title) > 0 and len(summary) > 0:
         print "Title:", title[0].asText()
         print "Summary:", summary[0].asText()
#     break

if __name__ == '__main__':
   main()

run.sh

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py

Final notes

This article is just a starting point for moving beyond simple crawlers and a pointer for further research. As this is a simple page, it is a good choice for a clear example of how Javascript scraping works. You must do your homework to crawl more complex web pages or add multithreading for better performance. In a demanding crawling scenario a lot of things must be taken into account, but this is a subject for future articles.

If you want to be polite don’t forget to read the robots.txt file before crawling…
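
A minimal sketch of that politeness check, using only the robotparser module from the Python standard library (the target URLs below are just examples), might look like this:

# robots_check.py -- illustrative sketch: consult robots.txt before crawling a URL.
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.gartner.com/robots.txt")   # robots.txt of the target site
rp.read()

url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
if rp.can_fetch("*", url):       # "*" means the default rules for any user agent
    print "Allowed to crawl:", url
else:
    print "robots.txt disallows crawling:", url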

If you like this article, you might also be interested in

  1. Distributed Scraping With Multiple Tor Circuits
  2. Precise Scraping with Google Chrome
  3. Running Your Own Anonymous Rotating Proxies
  4. Automated Browserless OAuth Authentication for Twitter

Resources

  1. HtmlUnit
  2. ghost.py is a webkit web client written in python
  3. Crowbar web scraping environment
  4. Google Chrome remote debugging shell from Python
  5. Selenium web application testing system, Watir, Sahi, Windmill Testing Framework
  6. Internet Explorer automation
  7. jSSh Javascript Shell Server for Mozilla
  8. http://trac.webkit.org/wiki/QtWebKit
  9. Embedding Gecko
  10. Opera Dragonfly
  11. PyAuto: Python Interface to Chromium’s automation framework
  12. Related questions on Stack Overflow
  13. Scrapy
  14. EnvJS: Simulated browser environment written in Javascript
  15. Setting up Headless XServer and CutyCapt on Ubuntu
  16. CutyCapt: Capture WebKit’s rendering of a web page.
  17. Google Webmaster blog: A spider’s view of Web 2.0
  18. OpenQA
  19. Python Webkit DOM Bindings
  20. Berkelium Browser
  21. uBrowser
  22. Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
  23. Zombie.js
  24. PhantomJS
  25. PyPhantomJS
  26. CasperJS
  27. Web Inspector Remote
  28. Offscreen/Headless Mozilla Firefox (via @brutuscat)
  29. Web Scraping with Google Spreadsheets and XPath
  30. Web Scraping with YQL and Yahoo Pipes

Photo taken by xiffy