History
Way back in 2001 I wanted to be able to query Google automatically. Since Google did not provide an official API, I developed a small, simple Google Search “NoAPI” scraper and published it as Googolplex. Google later launched a SOAP-based API, but on December 20, 2006 they stopped accepting signups for it [1], and they suspended it on August 31, 2009 [2]. This shows that building a service or product on web APIs is a very risky business without an SLA contract. Google soon launched another API, the Google Ajax Web Search API [3], under a different license. This second API was suspended on November 1, 2010 [4]. You may wonder if Google is a bipolar creature. You can see the latest post at Fall Housekeeping.
Google has undergone a lot of changes since 2001, and Googolplex and other libraries like xgoogle are now part of Internet history. A similar, newer library is described in Mario Vilas’s blog post Quickpost: Using Google Search from your Python code.
It’s not clear why Google vacillates over what could be an additional source of revenue, but it is clear that we should expect Google to provide an official and easy-to-use API. There are ways Google could restrict abuse of their APIs by third parties. It’s very common to offer a free tier for low-volume searches and charge for more intensive use, as Yahoo BOSS does.
In this article we’ll examine one way of crawling information from AJAX/JavaScript-based sites.
Crawling Google As A Browser
If you go to Google and look at the HTML source code, you’ll be astonished to see nothing but obfuscated JavaScript. Even after running a search, the source is no clearer.
So, here is our code to get Google’s results using HtmlUnit and Jython (we don’t have any affiliation with them, we just like them!). Look at our Web Scraping Ajax and Javascript Sites for more information.
google.py
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def query(q):
    webclient = WebClient(BrowserVersion.FIREFOX_3_6)
    url = "http://www.google.com"
    page = webclient.getPage(url)
    query_input = page.getByXPath("//input[@name='q']")[0]
    query_input.text = q
    search_button = page.getByXPath("//input[@name='btnG']")[0]
    page = search_button.click()
    results = page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']")
    c = 0
    for result in results:
        title = result.asText()
        href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
        print title, href
        c += 1
    print c, "Results"

if __name__ == '__main__':
    query("google web search api")
run.sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" google.py
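To make the XPath step in google.py easier to follow, here is a minimal sketch that runs the same result expression against a small hand-written snippet instead of a live page. It uses Python's standard xml.etree.ElementTree rather than HtmlUnit. The SAMPLE markup and the extract_results name are assumptions made up for illustration; Google's real markup is more complex and changes over time.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup shaped to match the XPath used in google.py.
# This is NOT Google's real page source.
SAMPLE = """<html><body><ol id="rso">
<li><div><span><h3 class="r"><a href="http://example.com/a">First result</a></h3></span></div></li>
<li><div><span><h3 class="r"><a href="http://example.com/b">Second result</a></h3></span></div></li>
<li><div><h3 class="other"><a href="http://example.com/ad">Not a result</a></h3></div></li>
</ol></body></html>"""


def extract_results(markup):
    """Return a list of (title, href) tuples for every matching result heading."""
    root = ET.fromstring(markup)
    results = []
    # Same expression as in google.py, with the leading "." that ElementTree
    # requires for paths relative to the root element.
    for heading in root.findall(".//ol[@id='rso']/li//span/h3[@class='r']"):
        link = heading.find("./a")
        title = "".join(heading.itertext()).strip()
        results.append((title, link.get("href")))
    return results


if __name__ == "__main__":
    for title, href in extract_results(SAMPLE):
        print("%s %s" % (title, href))
```

The third list item is skipped because its heading is not inside a span and does not carry the class "r", which is the kind of filtering the original expression relies on.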
Alternatives
The following search engines provide official APIs for search:
- Yahoo Search API
- Blekko Search API: Ask for a key or use the RSS search feed.
- Duck Duck Go API.
- Bing Search API
- Twitter Search API
Homework
- Write a clean function/class to do Google queries and handle exceptions.
- Modify the function to handle nested and paged results.
- Modify the function again, this time to include descriptions.
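As one possible starting point for the homework, the sketch below separates paging and error handling from the actual scraping. Everything in it is hypothetical: SearchClient, SearchError and the fetch_page callable are names invented here, and fetch_page is assumed to return a list of (title, href, description) tuples for a given query and page number, with an empty list meaning there are no more pages. The HtmlUnit code would live inside fetch_page.

```python
import time


class SearchError(Exception):
    """Raised when a page of results cannot be fetched after all retries."""


class SearchClient(object):
    """Collects paged results from a user-supplied fetch_page(q, page_number).

    fetch_page is a hypothetical callable; it is expected to return a list of
    (title, href, description) tuples, or an empty list when there is nothing left.
    """

    def __init__(self, fetch_page, max_pages=3, retries=2, delay=0.0):
        self.fetch_page = fetch_page
        self.max_pages = max_pages
        self.retries = retries
        self.delay = delay

    def query(self, q):
        results = []
        for page_number in range(self.max_pages):
            page = self._fetch_with_retries(q, page_number)
            if not page:
                break  # an empty page means we ran out of results
            results.extend(page)
        return results

    def _fetch_with_retries(self, q, page_number):
        last_error = None
        for _attempt in range(self.retries + 1):
            try:
                return self.fetch_page(q, page_number)
            except Exception as error:  # scraping can fail in many ways
                last_error = error
                time.sleep(self.delay)
        raise SearchError("page %d of %r failed: %s" % (page_number, q, last_error))
```

Passing the fetcher in as a parameter keeps the class testable without network access, and a markup change on the search page then only affects the fetcher.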
Final Notes
The approach taken by Mario Vilas is more API-like; our approach here is a defensive measure against NoAPIs. This is another good example of HtmlUnit doing its job.
BTW, the noapi.com domain is available [5].
See Also
References
1. Beyond the SOAP Search API
2. A well earned retirement for the SOAP Search API
3. Google AJAX Search API beta Version 1.0 Available
4. Fall Housekeeping
5. The noapi.com domain is available at the time of writing of this article. Register it now! (Disclaimer: affiliate link).
Additional Resources
- Google Search API?
- Google Deprecates Their SOAP Search API
- Google Search API Dropped
- Is this API going to be closed down?
- Yahoo BOSS Switching To Paid Model In Early 2011
- Thoughts on Yahoo! BOSS Monetization Announcement
- Google to Start Charging for Prediction API
- Update on Whitelisting (Twitter API policies discussion)
- From “Businesses” To “Tools”: The Twitter API ToS Changes
Great example! Check out this Quora question with other alternatives ;) http://www.quora.com/What-are-the-alternatives-to-custom-search-besides-Google-CSE-Y!-BOSS-and-IndexTank
Thanks Mauro. That Quora question is specifically about crawling your own site, but the problem is that the main search engine (Google) doesn’t currently have an official API!
How can it deal with concurrent query requests?
Do you mean doing multiple searches in different threads/processes/machines?
Google replaced the AJAX API with the CSE API, which gives JSON results and provides an API that can be called from a server: http://code.google.com/apis/customsearch/v1/overview.html
I learnt from the feedback that you can search the indexed Internet by adding sites and then removing them. This is well explained in Alternative to the deprecated google REST web search API.
But the point of the article is how to retrieve information from sites with changing policies, restricted APIs, or no APIs at all. On the critical side, many companies talk about the open web, but “open” is a very devalued word.
It’s also important to note the restrictions that continue to exist when using Google CSE, such as rate limits (even in commercial agreements), differences in results from the “main” Google search engine, etc.
You can look at: http://www.google.com/support/forum/p/customsearch/thread?tid=3534cbeb56c4a26c&hl=en