History
Way back in 2001 I wanted to be able to query Google automatically. Since Google did not provide an official API, I developed a small, simple Google Search “NoAPI” scraper and published it as Googolplex. Google launched a SOAP-based API, but on December 20, 2006 it stopped accepting signups for the API [1] and suspended it on August 31, 2009 [2]. This shows that building a service or product on web APIs is a very risky business without an SLA contract. Google soon launched another API, called the Google Ajax Web Search API [3], under a different license. This second API was suspended on November 1, 2010 [4]. You may wonder if Google is a bipolar creature. You can see the latest announcement in the Fall Housekeeping post.
Google has undergone a lot of changes since 2001, and Googolplex and other libraries like xgoogle are now part of Internet history. A similar, newer library is presented in Mario Vilas's blog post Quickpost: Using Google Search from your Python code.
It’s not clear why Google vacillates over what could be an additional source of revenue, but it is clear that we should expect Google to provide an official, easy-to-use API. There are ways Google could restrict abuse of its APIs by third parties. It’s very common to offer a free option for low-volume searches and charge for more intensive use, as Yahoo BOSS does.
In this article we’ll examine one way of crawling information from AJAX/JavaScript-based sites.
Crawling Google As A Browser
If you go to Google and look at the HTML source code, you’ll be astonished to see nothing but obfuscated JavaScript. Even after running a search, the source is no clearer.
So, here is our code to get Google’s results using HtmlUnit and Jython (we don’t have any affiliation with them, we just like it!). See our Web Scraping Ajax and Javascript Sites article for more information.
google.py
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def query(q):
    webclient = WebClient(BrowserVersion.FIREFOX_3_6)
    url = "http://www.google.com"
    page = webclient.getPage(url)
    query_input = page.getByXPath("//input[@name='q']")[0]
    query_input.text = q
    search_button = page.getByXPath("//input[@name='btnG']")[0]
    page = search_button.click()
    results = page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']")
    c = 0
    for result in results:
        title = result.asText()
        href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
        print title, href
        c += 1
    print c,"Results"

if __name__ == '__main__':
    query("google web search api")
run.sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" google.py
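The result-extraction step of google.py can also be tried offline, without HtmlUnit or a network connection. The following is only a minimal sketch in plain CPython 3 (not Jython), using the standard library's ElementTree on a hand-written fragment. The SAMPLE markup and the extract_results name are assumptions made for illustration; the fragment merely mimics the structure that the XPath expression in google.py expects and is not real Google output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment shaped like the markup that the
# XPath in google.py targets. It is NOT real Google output.
SAMPLE = """
<html><body>
<ol id="rso">
  <li><div><span><h3 class="r"><a href="http://example.com/a">First result</a></h3></span></div></li>
  <li><div><span><h3 class="r"><a href="http://example.com/b">Second result</a></h3></span></div></li>
</ol>
</body></html>
"""


def extract_results(markup):
    """Return a list of (title, href) tuples found in well-formed markup."""
    root = ET.fromstring(markup)
    results = []
    # Same path as in google.py, written relative to the root element
    # because ElementTree does not accept absolute paths.
    for heading in root.findall(".//ol[@id='rso']/li//span/h3[@class='r']"):
        link = heading.find("./a")
        if link is None:
            # Skip headings that carry no link instead of failing.
            continue
        title = "".join(heading.itertext()).strip()
        results.append((title, link.get("href")))
    return results


if __name__ == "__main__":
    for title, href in extract_results(SAMPLE):
        print(title, href)
```

ElementTree only parses well-formed XML and supports a small XPath subset, so this is useful for experimenting with the path expression, not for parsing live pages; that is the job HtmlUnit does in google.py.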
Alternatives
The following search engines provide official APIs for search:
Homework
- Write a clean function/class to do Google queries and handle exceptions.
- Modify the function to handle nested and paged results.
- Modify the function again, this time to include descriptions.
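As a possible starting point for these exercises, here is a rough sketch of how paging and exception handling could be separated from the scraping itself. It is written for plain CPython 3 rather than Jython, and every name in it (SearchError, search, fetch_page, fake_fetch) is hypothetical. The sketch assumes that fetch_page is a callable you write yourself, for example around the HtmlUnit calls in google.py, which returns one page of results as a list of dictionaries and an empty list when there are no more pages.

```python
class SearchError(Exception):
    """Raised when a page of results cannot be fetched or parsed."""


def search(query, fetch_page, max_pages=3):
    """Collect results over several pages.

    fetch_page is a hypothetical callable (query, page_number) -> list of
    dicts with 'title', 'href' and 'description' keys. An empty list is
    taken to mean that there are no further pages.
    """
    if not query or not query.strip():
        raise ValueError("query must not be empty")
    collected = []
    for page_number in range(max_pages):
        try:
            batch = fetch_page(query, page_number)
        except Exception as exc:
            # Wrap any low-level failure in a single, documented error type.
            raise SearchError("page %d failed: %s" % (page_number, exc)) from exc
        if not batch:
            break
        collected.extend(batch)
    return collected


def fake_fetch(query, page_number):
    """Stand-in fetcher with canned data, used only to exercise search()."""
    pages = [
        [
            {"title": "A", "href": "http://example.com/a", "description": "first"},
            {"title": "B", "href": "http://example.com/b", "description": "second"},
        ],
        [
            {"title": "C", "href": "http://example.com/c", "description": "third"},
        ],
    ]
    return pages[page_number] if page_number < len(pages) else []


if __name__ == "__main__":
    for item in search("google web search api", fake_fetch, max_pages=5):
        print(item["title"], item["href"], item["description"])
```

Passing the fetcher in as an argument keeps the paging logic testable without a browser; the real fetcher is still left as homework.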
Final Notes
The approach taken by Mario Vilas is more API-like; our approach here is a defensive measure against NoAPIs. This is another good example of HtmlUnit doing its job.
BTW, the noapi.com domain is available [5].
See Also
- Extraction of Main Text Content Using the Google Reader NoAPI
- The Data Portability Fact Sheet
References
1. Beyond the SOAP Search API
2. A well earned retirement for the SOAP Search API
3. Google AJAX Search API beta Version 1.0 Available
4. Fall Housekeeping
5. The noapi.com domain is available at the time of writing of this article. Register it now! (Disclaimer: affiliate link).
Additional Resources
- Google Search API?
- Google Deprecates Their SOAP Search API
- Google Search API Dropped
- Is this API going to be closed down?
- Yahoo BOSS Switching To Paid Model In Early 2011
- Thoughts on Yahoo! BOSS Monetization Announcement
- Google to Start Charging for Prediction API
- Update on Whitelisting (Twitter API policies discussion)
- From “Businesses” To “Tools”: The Twitter API ToS Changes