January 3, 2012

Helping Search Engines to Find Content in the Invisible Web

Discovering Hidden Web Resources

Search engines and social networks are digital telescopes. It is extremely difficult and time consuming to find web resources outside of their lens. It’s a search craft. Our intuition tells us that there is interesting invisible information, but we can’t touch it.

IMDB contains a lot of information about users, but the site only offers sharing as a collateral feature. If we search on Google we can’t find all the users sharing their movie ratings. At the time of writing, the query site:imdb.com inurl:"user/*/ratings" returned only a few results on Google. How can we help people, through search engines, to find more web resources? This article shows the first 10 million results of the Distributed Scraping With Multiple Tor Circuits process. In a short time Google will index this article and include these new resources so everyone can find them.

In the meantime you have the great honor of seeing web resources that are invisible to search engines. This page contains the first 10 million IMDB users sharing their movie ratings. We have included a script below that fetches their ratings, taking advantage of the comma separated value export offered by IMDB.

Python Code for Exporting IMDB Ratings in Comma Separated Values

get-user-ratings.py

#!/usr/bin/python2.7

import pymongo
import urllib2

MONGODB_HOSTNAME = 'localhost'

HTML = """
<html>
<body>
{0}
</body>
</html>
"""

EXPORT_URL = "http://www.imdb.com/list/export?list_id=ratings&author_id={0}"

def main():
   conn = pymongo.Connection(MONGODB_HOSTNAME, 27017)
   db = conn.scraping
   coll = db.imdb.ratings

   items = coll.find({'last_response':200})

   links = ""

   for item in items:
      url = item['url']
      # Build the user id: 'ur' followed by the index zero-padded to 7 digits
      index = 'ur{0:07}'.format(item['index'])
      filename = 'ur{0}.csv'.format(item['index'])
      links += "<a href='{0}'>{1}</a><br>".format(url, index)

      # Save this user's ratings using IMDB's CSV export
      with open(filename, "wt") as h:
         h.write(urllib2.urlopen(EXPORT_URL.format(index)).read())

   print HTML.format(links)

if __name__ == '__main__':
   main()
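The user id and export URL handling in the script can be checked in isolation, without MongoDB or network access. The following is a minimal sketch in Python 3 (the listing above targets Python 2.7). The helper names user_id and export_url are ours, introduced only for illustration; the URL template is the one used in the script.

```python
# URL template copied from get-user-ratings.py
EXPORT_URL = "http://www.imdb.com/list/export?list_id=ratings&author_id={0}"


def user_id(index):
    """Build a user id: 'ur' plus the index zero-padded to 7 digits."""
    return "ur{0:07}".format(index)


def export_url(index):
    """Return the CSV export URL for the user with the given numeric index."""
    return EXPORT_URL.format(user_id(index))


if __name__ == "__main__":
    print(export_url(123))
```

Indexes longer than seven digits are left unpadded, because the format specifier only sets a minimum width.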

Resources

  1. Discovering URLs through User Feedback
  2. Invisible Web
  3. Deep Web Research 2012

Photo taken by gari.baldi

December 22, 2011

Running Your Own Anonymous Rotating Proxies

Rotating Proxies with HAProxy

Most web browsers and scrapers can only be configured to use one proxy per protocol. You can get around this limitation by running different instances of browsers and scrapers. Google Chrome and Firefox allow multiple profiles. However, running hundreds of browser instances is unwieldy.

A better option is to set up your own proxy to rotate among a set of Tor proxies. The Tor application implements a SOCKS proxy. Start multiple Tor instances on one or more machines and networks, then configure and run an HTTP load balancer to expose a single point of connection instead of adding the rotating logic within the client application. In the Distributed Scraping With Multiple Tor Circuits article we learned how to set up multiple Tor SOCKS proxies for web scraping and crawling. However, our sample code launched multiple threads, each of which used a different proxy. In this example we use the HAProxy load balancer with a round-robin strategy to rotate our proxies.

When you are crawling and scraping sites that rely on Javascript, using a real browser with a high performance Javascript engine like V8 may be the best approach. Just configuring our rotating proxy in the browser does the trick. Another option is using HTMLUnit, but the V8 Javascript engine parses web pages and runs Javascript more quickly. If you are using a browser you must be particularly careful to keep the scraped site from correlating your multiple requests. Try disabling cookies, local storage, and image loading, and enabling only Javascript. You should also cache as many requests as possible. If you need to support cookies, you have to run different browsers with different profiles.

Setup and Configuration

Prerequisites

  1. Tor
  2. DeleGate
  3. HAProxy

HAProxy Configuration File

rotating-tor-proxies.cfg

global
        daemon
        maxconn 256

defaults
        mode http
        timeout connect 5000ms
        timeout client 50000ms
        timeout server 50000ms

frontend rotatingproxies
        bind *:3128
        default_backend tors
        option http_proxy

backend tors
        option http_proxy
        server tor1 localhost:3129
        server tor2 localhost:3130
        server tor3 localhost:3131
        server tor4 localhost:3132
        server tor5 localhost:3133
        server tor6 localhost:3134
        server tor7 localhost:3135
        server tor8 localhost:3136
        server tor9 localhost:3137
        server tor10 localhost:3138
        balance roundrobin
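If you change the number of Tor instances, the backend section has to change with it. As a rough sketch (Python 3; the function name backend_lines is ours and not part of HAProxy), the section can be generated from the same base port the launch script uses, giving each server a distinct name from tor1 to tor10:

```python
def backend_lines(count=10, base_http_port=3129, host="localhost"):
    """Return the lines of the 'tors' backend, one server per DeleGate port."""
    lines = ["backend tors", "        option http_proxy"]
    for i in range(count):
        # Server names start at tor1; ports start at base_http_port
        lines.append(
            "        server tor{0} {1}:{2}".format(i + 1, host, base_http_port + i)
        )
    lines.append("        balance roundrobin")
    return lines


if __name__ == "__main__":
    print("\n".join(backend_lines()))
```

The output can be pasted over the backend section of rotating-tor-proxies.cfg.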

Running

Run the following script, which launches several instances of Tor, then runs one instance of delegated per Tor instance, and finally runs HAProxy to rotate among the proxy servers. We have to use DeleGate because HAProxy does not support SOCKS.

#!/bin/bash
base_socks_port=9050
base_http_port=3129 # leave 3128 for HAProxy
base_control_port=8118

# Create data directory if it doesn't exist
if [ ! -d "data" ]; then
	mkdir "data"
fi

# Launch 10 Tor/DeleGate pairs; HTTP ports 3129-3138 match the HAProxy backend
for i in {0..9}
do
	socks_port=$((base_socks_port+i))
	control_port=$((base_control_port+i))
	http_port=$((base_http_port+i))
	if [ ! -d "data/tor$i" ]; then
		echo "Creating directory data/tor$i"
		mkdir "data/tor$i"
	fi
	# Take into account that authentication for the control port is disabled. Must be used in secure and controlled environments

	echo "Running: tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword \"\" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i"

	tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword "" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i

	echo 	"Running: ./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port"

	./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port
done

haproxy -f rotating-tor-proxies.cfg
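Once HAProxy is running, a client only needs to know about the single frontend on port 3128. The following is a minimal sketch in Python 3 using only the standard library; the function names build_proxy_opener and fetch are ours. It assumes the frontend listens on localhost, and it only routes plain HTTP requests, since that is what the configuration above handles.

```python
import urllib.request

# HAProxy frontend from rotating-tor-proxies.cfg (assumed to run locally)
ROTATING_PROXY = "http://localhost:3128"


def build_proxy_opener(proxy_url=ROTATING_PROXY):
    """Create an opener that sends HTTP requests through the rotating proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, opener=None, timeout=30):
    """Fetch a URL through the proxy and return the raw response body."""
    opener = opener or build_proxy_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Successive calls to fetch should leave through different Tor instances, because HAProxy picks the next backend server for each request.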

See Also

  1. Distributed Scraping With Multiple Tor Circuits
  2. Web Scraping Ajax and Javascript Sites

Resources

  1. HAProxy The Reliable, High Performance TCP/HTTP Load Balancer
  2. DeleGate Multi-Purpose Application Level Gateway
  3. Python twisted proxyclient cascade / upstream to squid
  4. How SOPA’s ‘circumvention’ ban could put a target on Tor

September 3, 2011

HTML Cleaners and Tidiers

Tag Soup

When you are crawling a website you will come across a lot of malformed web pages. Typical problems are unclosed tags and mishandled comments or CSS styles. Modern browsers have to do a good job of cleaning HTML to build the correct DOM without ambiguities. Due to performance and scalability limitations, it is more efficient to process HTML with a parser instead of using a browser or headless browsers such as HTMLUnit or PhantomJS. If your HTML parser does not incorporate the cleaning or fixing process, you will have to use an HTML cleaner or tidier.

As in other processing pipelines if you fail to clean up malformed HTML, all subsequent processes will be stalled. It is important to choose a good HTML cleaner. Many cleaners fail to do their jobs.

HTML Cleaner List

The list of HTML cleaners is long, but the list of good ones is pretty short. In our experience the best choice is lxml.html. Other cleaners often have trouble.
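To get a feel for how much tag soup a crawl contains, you can count unclosed tags before deciding on a cleaner. The sketch below is Python 3 and uses the standard library's html.parser, not lxml.html, so it only detects problems and does not repair them. The names UnclosedTagFinder and find_unclosed_tags are ours. The check is deliberately naive: it also reports tags such as li whose end tag HTML allows you to omit.

```python
from html.parser import HTMLParser

# Elements that never have an end tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags that are currently open
        self.unclosed = []   # tags skipped over by a later end tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with no matching start tag
        # Pop back to the matching start tag; anything above it was never closed
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def find_unclosed_tags(html):
    """Return the tags that were opened but never explicitly closed."""
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    return finder.unclosed + finder.open_tags
```

For example, find_unclosed_tags("<div><p>one<b>two</p></div>") reports the b element.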

Comprehensive Resources

  1. lxml.html
  2. Beautiful Soup
  3. lxml.html vs Beautiful Soup
  4. Cleaning Word’s Nasty HTML
  5. HTML Cleaners query
  6. Tag soup

August 25, 2011

Extraction of Main Text Content Using the Google Reader NoAPI

Theo van Doesburg Dadamatinée

Introduction

In this article we will see how to extract the main text content from a blog using the Google Reader NoAPI.

Extracting the main text content from a web page is an important step in the text processing pipeline. The source code of pages in HTML is usually cluttered with advertising and other text which is not related to the main content. Formally, in the context of computer science, it is impossible for a computer to distinguish between the main content and other content on the same page. That is, no algorithm can recognize it for all possible cases. Sometimes it is even difficult for humans to distinguish it. Recognition of primary content is part of the machine learning/artificial intelligence field of study.

In practice there are many ways to recognize main content. If, for example, a blog platform includes attributes which indicate where the main content is, the process will be straightforward. Similarly, if the pages on a particular site have a well defined structure, we can also infer where the main content is by sampling a few pages. In this approach, we train the recognizer to apply patterns to additional pages. Of course purely manual work is another option. The quickest way to build an army of human recognizers is to put the job on sites like Amazon’s Mechanical Turk or similar services such as Microworkers.

For a good compilation of resources related to this subject, see the Additional Resources section at the end of this article.

Extracting the Main Content from a Blog

If the blog platform includes information about the main text content on their tags, making an XPath expression for each one will do the trick. Now imagine that you want to do it automatically, without depending on each blog platform or blog theme. In this case you can read the RSS feed, which generally only includes main text, and extract the text from there. However, not all blogs post the complete text in the feed. The TechCrunch feed, for example, shows the first part of the text, but you have to click to continue reading. In this case you can use the partial text from the feed to recognize the complete text in the HTML. A potential problem with reading RSS feeds is that they only contain the most recent articles. To get around this limitation, we can get a longer feed history from Google Reader. Google Reader has some gaps and misses some articles, but this issue is beyond the scope of this article.
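The idea of using the partial text from a feed to locate the complete text in the HTML can be sketched as follows. This is an illustrative Python 3 example, not the method used later in this article. It assumes you have already split the page into candidate text blocks (for example, one per div), and the function names normalize and find_full_text are ours.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so small markup differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_full_text(feed_snippet, candidate_blocks, prefix_chars=80):
    """Return the first candidate block containing the start of the feed snippet."""
    probe = normalize(feed_snippet)
    # Drop a trailing ellipsis such as "...", "…" or "[...]" left by truncation
    probe = re.sub(r"\s*\[?(?:\.\.\.|…)\]?$", "", probe)[:prefix_chars]
    if not probe:
        return None
    for block in candidate_blocks:
        if probe in normalize(block):
            return block
    return None
```

Only the beginning of the snippet is compared, because the end of a truncated feed entry rarely matches the page word for word.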

Getting Blog Text from Google Reader

Since Google Reader does not have a real API we will rely on the Google Reader API lib by Mauro Asprea from Wish and BAM!. He is an active reader of this blog and a friend.

We will retrieve posts by Fred Wilson, one of the most prolific VC bloggers, since he has blogged since 9/23/2003 on an almost daily basis, and includes the whole post within the feed.

Python code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
import time
from GoogleReader import  CONST
from GoogleReader.reader import GoogleReader
import lxml.html

USERNAME = '' # Replace with your Google Reader username
PASSWORD = '' # Replace with your Google Reader password. Not included in this post :-)

gr = GoogleReader()
login_info = (USERNAME, PASSWORD)
gr.identify(*login_info)
gr.login()

gr.add_subscription(url="http://feeds.feedburner.com/avc")
xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc")

COUNT = 1000
i=0

print >>sys.stderr, "page:", i
for entry in xmlfeed.get_entries():
   print entry['title'].encode('utf-8'), time.ctime(entry['published'])
   doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
   print doc.text_content().encode('utf-8')
   print "******************************************************************************************************"

continuation = xmlfeed.get_continuation()

i+=1
while continuation != None and i < COUNT:
   print >>sys.stderr, "page:", i
   xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc", continuation = continuation)

   for entry in xmlfeed.get_entries():
      print entry['title'].encode('utf-8'), time.ctime(entry['published'])
      try:
         doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
         print doc.text_content().encode('utf-8')
      except:
         print "------------------ ERROR -------------------"
         print entry['content']

      print "******************************************************************************************************"

   continuation = xmlfeed.get_continuation()
   i+=1

Notes

If you try this script you will realize that the oldest post retrieved is from 9/29/2005. The real first post however was on 9/23/2003. Why don’t we see it? I believe it is because Google Reader uses feed information from FeedBurner, which was launched in 2004 and acquired by Google in 2007, so they probably started recording feed entries then. Incidentally Union Square Ventures was one of the original FeedBurner investors.

There is an easier way to retrieve text in the specific case of Fred Wilson’s blog and other HTML5 modern sites. HTML5 provides an <article> tag, so you can just crawl the whole site and retrieve the content within the <article> tag. You’ll need an extra step to deduplicate the content since many of the crawled pages will appear more than once. For example if you follow categories like MBA Mondays you will find articles that also appear when you follow another path.
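A minimal sketch of this crawl-and-deduplicate approach is shown below. It is Python 3 and uses the standard library's html.parser instead of lxml.html to stay self-contained. The names ArticleTextExtractor, article_text and deduplicate are ours, and the crawler that produces the (url, html) pairs is assumed to exist elsewhere.

```python
import hashlib
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect the text found inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <article> elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def article_text(html):
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    # Join the chunks and collapse runs of whitespace
    return " ".join(" ".join(parser.chunks).split())


def deduplicate(pages):
    """Yield (url, text) for the first page carrying each distinct article text."""
    seen = set()
    for url, html in pages:
        text = article_text(html)
        if not text:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield url, text
```

Hashing the normalized text means the same article reached through two different paths is kept only once, even if the surrounding navigation differs.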

Lessons Learned

  • We can use Google Reader to easily extract text content from blogs.
  • Google Reader has its limitations: it doesn’t cover posts before a certain date and sometimes skips posts.
  • HTML5 adds a valuable new tag for differentiating article text from the rest of the content.

See Also

  1. Voice Recognition + Content Extraction + TTS = Innovative Web Browsing
  2. Google Search NoAPI

Additional Resources

  1. boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages
  2. Readability API
  3. HTML Content Extraction Questions on StackOverflow
  4. Google Reader Development Questions on StackOverflow

February 1, 2011

Scraping vs Antiscraping

Introduction

It’s not possible to jump into the subject of scrapers without confronting antiscraping techniques. The reverse is also true: if you want to develop good antiscraping techniques you must think like a scraper developer. Similarly, real hackers need knowledge of security technologies, while a good security system benefits from simulated attacks. This kind of “game dynamics” also applies to security algorithms. For example, one of the best known public key encryption algorithms, RSA, was invented by Ron Rivest, Adi Shamir and Leonard Adleman. Ron and Adi invented new algorithms and Adleman was in charge of breaking them. They eventually came up with RSA [1].

Antiscraping Measures and How to Pass Them

A preliminary chart:

Antiscraping techniques | Scraping techniques
The site only enables crawling by a known search engine bot. | The scraper can access the search engine cache.
The site doesn’t allow the same IP to access a lot of pages in a short period of time. | Use Tor, a set of proxies, or a crawling service like 80legs.
The site shows a captcha if it’s crawled in a massive way. | Use anti-captcha techniques or services like Mechanical Turk where real people can give the answer. Another alternative is to listen to the captcha and use voice recognition with noise.
The site uses javascript. | Use a javascript enabled crawler.
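As an illustration of the second row, spreading requests over a set of proxies can be as simple as cycling through a list on the client side. This is a rough Python 3 sketch; the function names proxy_cycle and assign_proxies are ours, and the default ports are the DeleGate HTTP ports from the Running Your Own Anonymous Rotating Proxies article, assumed to be running locally.

```python
import itertools


def proxy_cycle(count=10, base_http_port=3129, host="localhost"):
    """Return an endless round-robin iterator over the local HTTP proxies."""
    proxies = [
        "http://{0}:{1}".format(host, base_http_port + i) for i in range(count)
    ]
    return itertools.cycle(proxies)


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in the rotation."""
    return [(url, next(proxies)) for url in urls]


if __name__ == "__main__":
    rotation = proxy_cycle(count=3)
    for url, proxy in assign_proxies(["/page1", "/page2", "/page3", "/page4"], rotation):
        print(url, "via", proxy)
```

With three proxies, the fourth URL goes back to the first proxy, so each exit address sees only a fraction of the requests.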

Many antiscraping measures are annoying for visitors. For example if you’re a “search engine junkie” you’ll find pretty quickly that Google shows you a captcha thinking that you are a bot.

Digression

I believe the web should follow an MVC (Model View Controller) type pattern where you can access the data (the model) independently of how you interact with it. This would enable stronger connections between different sites. Linked Data is one such initiative, but there are others. Data Portability and APIs are a step towards this pattern, but when you are using APIs from large sites you realize that they’ve put a lot of limits in place. Starting a whole business based on third party APIs is very risky. You only have to look at the past to see a lot of changes in API features and policies. Facebook, Google and Twitter are good examples. API providers are afraid of losing control of their sites and the profits they generate. We need new business models which can get around this problem and benefit both API providers and consumers, and which are not based only on advertising. One common approach is to charge for the use of the API. There are other models, like the one followed by the Guardian, which distributes its ads via its API. APIs carrying advertising is a promising concept. We hope that more creative people will come up with new models for a better MVC web.

See Also

  1. Running Your Own Anonymous Rotating Proxies
  2. Distributed Scraping With Multiple Tor Circuits

References

  1. Leonard Adleman Interview

Further reading

  1. Captcha Recognition
  2. OCR Research Team
  3. Data Scraping with YQL and jQuery
  4. API Conference
  5. Google Calls Out Facebook’s Data Hypocrisy, Blocks Gmail Import
  6. Google Search NoAPI
  7. Kayak Search API is no longer supported
  8. The Guardian Open Platform
  9. Twitter Slashes API Rate Limits In Half Across The Board To Deal With Capacity Issues
  10. Facebook, you are doing it wrong
  11. Cubeduel Goes Viral Too Quickly, Stumbles Over LinkedIn API Limits
  12. Keyword Exchange Market
  13. A Union for Mechanical Turk Workers?
  14. The Long Tail Of Business Models
  15. Scraping, cleaning, and selling big data
  16. Detecting ‘stealth’ web-crawlers

Photo: Glykais Gyula fencing against Oreste Puliti. [Source]

January 20, 2011

Google Search NoAPI

History

Way back in 2001 I wanted to be able to query Google automatically. Since Google did not provide an official API, I developed a small, simple Google Search “NoAPI” scraper and published it as Googolplex. Google launched a SOAP based API, but on December 20, 2006 they stopped accepting signups for the API [1] and suspended it on August 31, 2009 [2]. This shows that creating a service or product based on web APIs is a very risky business without an SLA contract. Google soon launched another API called Google Ajax Web Search API [3] under a different license. This second API was suspended on November 1, 2010 [4]. You may wonder if Google is a bipolar creature. You can see the latest post at Fall Housekeeping.

Google has undergone a lot of changes since 2001, and Googolplex and other libraries like xgoogle are now part of Internet history. A similar, newer library is described in Mario Vilas’s blog post Quickpost: Using Google Search from your Python code.

It’s not clear why Google vacillates over what could be an additional source of revenue, but it is clear that we should expect Google to provide an official and easy to use API. There are ways Google could restrict abuse of their APIs by third parties. It’s very common to offer a free alternative for low volume searches and charge for more intensive uses, like Yahoo BOSS does.

In this article we’ll examine one way of crawling information in AJAX/Javascript based sites.

Crawling Google As A Browser

If you go to Google and look at the HTML source code you’ll be astonished to see pure obfuscated Javascript code. Even after searching, the source is no clearer.

So, here is our code to get Google’s results using htmlunit/jython (we don’t have any affiliation with them, we just like it!). Look at our Web Scraping Ajax and Javascript Sites article for more information.

google.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def query(q):
   webclient = WebClient(BrowserVersion.FIREFOX_3_6)
   url = "http://www.google.com"
   page = webclient.getPage(url)

   query_input = page.getByXPath("//input[@name='q']")[0]
   query_input.text = q
   search_button = page.getByXPath("//input[@name='btnG']")[0]
   page = search_button.click()
   results = page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']")

   c = 0
   for result in results:
      title = result.asText()
      href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
      print title, href
      c += 1

   print c,"Results"

if __name__ == '__main__':
   query("google web search api")

run.sh

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" google.py

Alternatives

The following search engines provide official APIs for search:

Homework

  1. Write a clean function/class to do Google queries and handle exceptions.
  2. Modify the function to handle nested and paged results.
  3. Modify the function again, this time to include descriptions.

Final Notes

The approach taken by Mario Vilas is more API-like; our approach here is a defensive measure against NoAPIs. This is another good example where HtmlUnit does its job.

BTW the noapi.com domain is available [5].

See Also

  1. Extraction of Main Text Content Using the Google Reader NoAPI
  2. The Data Portability Fact Sheet

References

  1. Beyond the SOAP Search API
  2. A well earned retirement for the SOAP Search API
  3. Google AJAX Search API beta Version 1.0 Available
  4. Fall Housekeeping
  5. The noapi.com domain is available at the time of writing of this article. Register it now! (Disclaimer: affiliate link).

Additional Resources

  1. Google Search API?
  2. Google Deprecates Their SOAP Search API
  3. Google Search API Dropped
  4. Is this API going to be closed down?
  5. Yahoo BOSS Switching To Paid Model In Early 2011
  6. Thoughts on Yahoo! BOSS Monetization Announcement
  7. Google to Start Charging for Prediction API
  8. Update on Whitelisting (Twitter API policies discussion)
  9. From “Businesses” To “Tools”: The Twitter API ToS Changes

January 11, 2011

Web Scraping Ajax and Javascript Sites

Introduction

Most crawling frameworks used for scraping cannot be used for Javascript or Ajax. Their scope is limited to those sites that show their main content without using scripting. One would also be tempted to connect a specific crawler to a Javascript engine but it’s not easy to do. You need a fully functional browser with good DOM support because the browser behavior is too complex for a simple connection between a crawler and a Javascript engine to work. There is a list of resources at the end of this article to explore the alternatives in more depth.

There are several ways to scrape a site that contains Javascript:

  1. Embed a web browser within an application and simulate a normal user.
  2. Remotely connect to a web browser and automate it from a scripting language.
  3. Use special purpose add-ons to automate the browser.
  4. Use a framework/library to simulate a complete browser.

Each one of these alternatives has its pros and cons. For  example using a complete browser consumes a lot of resources, especially if we need to scrape websites with a lot of pages.

In this post we’ll give a simple example of how to scrape a web site that uses Javascript. We will use the htmlunit library to simulate a browser. Since htmlunit runs on a JVM we will use Jython, an excellent programming language, which is a Python implementation on the JVM. The resulting code is very clear and focuses on solving the problem instead of on the aspects of programming languages.

Setting up the environment

Prerequisites

  1. JRE or JDK.
  2. Download the latest version of Jython from http://www.jython.org/downloads.html.
  3. Run the .jar file and install it in your preferred directory (e.g: /opt/jython).
  4. Download the htmlunit compiled binaries from: http://sourceforge.net/projects/htmlunit/files/.
  5. Unzip the htmlunit to your preferred directory.

Crawling example

We will scrape the Gartner Magic Quadrant pages at: http://www.gartner.com/it/products/mq/mq_ms.jsp . If you look at the list of documents, the links are Javascript code instead of hyperlinks with HTTP URLs. This may be to reduce crawling, or just to open a popup window. It’s a very convenient page to illustrate the solution.
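Before reaching for a simulated browser, it can be useful to check how many links on a page are script-driven. The sketch below is Python 3 with the standard library only. The names LinkClassifier and classify_links are ours, and the markup in the usage example is invented for illustration; it is not taken from the Gartner page.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Separate ordinary hyperlinks from links that rely on Javascript."""

    def __init__(self):
        super().__init__()
        self.plain_links = []
        self.script_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = (attrs.get("href") or "").strip()
        # A javascript: URL or an onclick handler means a plain crawler can't follow it
        if href.lower().startswith("javascript:") or "onclick" in attrs:
            self.script_links.append(href)
        elif href:
            self.plain_links.append(href)


def classify_links(html):
    """Return (plain_links, script_links) for the given HTML."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.plain_links, parser.script_links


if __name__ == "__main__":
    sample = ('<a href="javascript:openDoc(1)">Doc 1</a>'
              '<a href="/about.html">About</a>')
    print(classify_links(sample))
```

If most of the interesting links end up in the second list, a Javascript enabled crawler is probably needed.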

gartner.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
   webclient = WebClient(BrowserVersion.FIREFOX_3_6) # creating a new webclient object.
   url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
   page = webclient.getPage(url) # getting the url
   articles = page.getByXPath("//table[@id='mqtable']//tr/td/a") # getting all the hyperlinks

   for article in articles:
      print "Clicking on:", article
      subpage = article.click() # click on the article link
      title = subpage.getByXPath("//div[@class='title']") # get title
      summary = subpage.getByXPath("//div[@class='summary']") # get summary
      if len(title) > 0 and len(summary) > 0:
         print "Title:", title[0].asText()
         print "Summary:", summary[0].asText()
#     break

if __name__ == '__main__':
   main()

run.sh

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py

Final notes

This article is just a starting point to move beyond simple crawlers and point the way for further research. As this is a simple page, it is a good choice for a clear example of how Javascript scraping works. You must do your homework to learn to crawl more web pages or add multithreading for better performance. In a demanding crawling scenario a lot of things must be taken into account, but this is a subject for future articles.

If you want to be polite don’t forget to read the robots.txt file before crawling…
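Checking robots.txt does not require extra libraries. The following is a minimal sketch in Python 3 (the listings in this article are Jython) using the standard urllib.robotparser module. The function name build_robot_rules, the user agent string and the sample rules are ours, chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented sample rules; a real crawler would download the site's robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_robot_rules(SAMPLE_ROBOTS_TXT)
    for url in ("http://example.com/public", "http://example.com/private/x"):
        print(url, rules.can_fetch("mybot", url))
```

To read the rules straight from a site, call set_url with the address of its robots.txt and then read on the parser instead of parse.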

See Also

  1. Distributed Scraping With Multiple Tor Circuits
  2. Running Your Own Anonymous Rotating Proxies
  3. Automated Browserless OAuth Authentication for Twitter

Resources

  1. HtmlUnit
  2. Crowbar web scraping environment
  3. Google Chrome remote debugging shell from Python
  4. Selenium web application testing system, Watir, Sahi, Windmill Testing Framework
  5. Internet Explorer automation
  6. jSSh Javascript Shell Server for Mozilla
  7. http://trac.webkit.org/wiki/QtWebKit
  8. Embedding Gecko
  9. Opera Dragonfly
  10. PyAuto: Python Interface to Chromium’s automation framework
  11. Related questions on Stack Overflow
  12. Scrapy
  13. EnvJS: Simulated browser environment written in Javascript
  14. Setting up Headless XServer and CutyCapt on Ubuntu
  15. CutyCapt: Capture WebKit’s rendering of a web page.
  16. Google webmaster blog: A spider’s view of Web 2.0
  17. OpenQA
  18. Python Webkit DOM Bindings
  19. Berkelium Browser
  20. uBrowser
  21. Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
  22. Zombie.js
  23. PhantomJS
  24. PyPhantomJS
  25. CasperJS
  26. Web Inspector Remote
  27. Offscreen/Headless Mozilla Firefox (via @brutuscat)
  28. Web Scraping with Google Spreadsheets and XPath
  29. Web Scraping with YQL and Yahoo Pipes

Photo taken by xiffy