Way back in 2001 I wanted to be able to query Google automatically. Since Google did not provide an official API, I developed a small, simple Google Search “NoAPI” scraper and published it as Googolplex. Google later launched a SOAP-based API, but on December 20, 2006 they stopped accepting signups for it [1] and suspended it on August 31, 2009 [2]. This shows that building a service or product on top of web APIs is a very risky business without an SLA contract. Google soon launched another API, the Google AJAX Web Search API [3], under a different license. This second API was suspended on November 1, 2010 [4]. You may wonder if Google is a bipolar creature. You can read the latest announcement in the Fall Housekeeping post.
Google has undergone a lot of changes since 2001, and Googolplex and other libraries like xgoogle are now part of Internet history. A similar, newer library is available in Mario Vilas’ blog post Quickpost: Using Google Search from your Python code.
It’s not clear why Google vacillates over what could be an additional source of revenue, but it is clear that Google should provide an official, easy-to-use API. There are ways Google could restrict abuse of its APIs by third parties: it is very common to offer a free tier for low-volume searches and charge for more intensive use, as Yahoo BOSS does.
Crawling Google As A Browser
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def query(q):
    webclient = WebClient(BrowserVersion.FIREFOX_3_6)
    url = "http://www.google.com"
    page = webclient.getPage(url)

    # getByXPath returns a java.util.List, so take the first match
    query_input = page.getByXPath("//input[@name='q']")[0]
    query_input.text = q
    search_button = page.getByXPath("//input[@name='btnG']")[0]
    page = search_button.click()

    results = page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']")
    c = 0
    for result in results:
        title = result.asText()
        href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
        print title, href
        c += 1
    print c, "Results"

if __name__ == '__main__':
    query("google web search api")
Run it with Jython, adding the HtmlUnit jars to the classpath:

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" google.py
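For comparison, the result extraction that the XPath //ol[@id='rso']/li//span/h3[@class='r'] performs above can be sketched in plain CPython 3 with only the standard library. The sample markup below is an assumption about the result layout of that era, not Google’s actual HTML:

```python
# Hypothetical sketch: pull (title, href) pairs out of saved result HTML
# using only html.parser, mirroring the HtmlUnit XPath extraction.
from html.parser import HTMLParser

class ResultExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_result = False     # inside an <h3 class="r"> element
        self.current_href = None   # href of the <a> currently open
        self.results = []          # collected (title, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and attrs.get("class") == "r":
            self.in_result = True
        elif tag == "a" and self.in_result:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        if self.in_result and self.current_href is not None:
            self.results.append((data, self.current_href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None
        elif tag == "h3":
            self.in_result = False

# Assumed sample of the old result markup (one <li> per hit):
sample = ('<ol id="rso"><li><span><h3 class="r">'
          '<a href="http://example.com/">Example result</a></h3></span></li></ol>')
extractor = ResultExtractor()
extractor.feed(sample)
print(extractor.results)  # [('Example result', 'http://example.com/')]
```

This trades HtmlUnit’s JavaScript-capable browser emulation for a zero-dependency parser, which is enough when you already have the HTML on disk.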
The following search engines provide official APIs for search:
- Yahoo Search API
- Blekko Search API: Ask for a key or use the RSS search feed.
- Duck Duck Go API.
- Bing Search API
- Twitter Search API
Further Work
- Write a clean function/class to run Google queries and handle exceptions.
- Modify the function to handle nested and paged results.
- Modify the function again, this time to include result descriptions.
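As a starting point for the paging exercise, one minimal approach is to generate one URL per result page from Google’s classic q and start parameters (start = 10 × page for 10 results per page). The parameter names are assumptions drawn from URLs visible in a browser; a real crawler would still need the HtmlUnit-style fetching and error handling shown earlier:

```python
# Hypothetical sketch: build one search URL per result page.
from urllib.parse import urlencode

def paged_query_urls(q, pages=3, per_page=10):
    """Yield a classic Google search URL for each result page.

    Assumes the q/start query parameters; adjust if the site changes.
    """
    for page in range(pages):
        params = urlencode({"q": q, "start": page * per_page})
        yield "http://www.google.com/search?" + params

for url in paged_query_urls("google web search api", pages=2):
    print(url)
```

Keeping URL construction separate from fetching makes the paging logic trivially testable without touching the network.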
The approach taken by Mario Vilas is more API-like; ours here is a defensive measure against “NoAPIs.” This is another good example of HtmlUnit doing its job.
By the way, the noapi.com domain is available [5].
1. Beyond the SOAP Search API
2. A well earned retirement for the SOAP Search API
3. Google AJAX Search API beta Version 1.0 Available
4. Fall Housekeeping
5. The noapi.com domain is available at the time of writing of this article. Register it now! (Disclaimer: affiliate link.)
Resources
- Google Search API?
- Google Deprecates Their SOAP Search API
- Google Search API Dropped
- Is this API going to be closed down?
- Yahoo BOSS Switching To Paid Model In Early 2011
- Thoughts on Yahoo! BOSS Monetization Announcement
- Google to Start Charging for Prediction API
- Update on Whitelisting (Twitter API policies discussion)
- From “Businesses” To “Tools”: The Twitter API ToS Changes