Google Search NoAPI

History

Way back in 2001 I wanted to be able to query Google automatically. Since Google did not provide an official API, I developed a small, simple Google Search “NoAPI” scraper and published it as Googolplex. Google later launched a SOAP-based API, but on December 20, 2006 it stopped accepting signups for that API¹ and then suspended it on August 31, 2009². This shows that building a service or product on web APIs is a very risky business without an SLA contract. Google soon launched another API, the Google AJAX Web Search API³, under a different license. This second API was suspended on November 1, 2010⁴. You may wonder if Google is a bipolar creature. The latest announcement is in the Fall Housekeeping post.

Google has changed a lot since 2001, and Googolplex and other libraries like xgoogle are now part of Internet history. A similar, newer library is described in Mario Vilas’ blog post Quickpost: Using Google Search from your Python code.

It’s not clear why Google vacillates over what could be an additional source of revenue, but it is clear that we should expect Google to provide an official, easy-to-use API. There are ways Google could restrict abuse of its APIs by third parties. It’s very common to offer a free tier for low-volume searches and charge for more intensive use, as Yahoo BOSS does.

In this article we’ll examine one way of crawling information from AJAX/JavaScript-based sites.

Crawling Google As A Browser

If you go to Google and look at the HTML source code, you’ll be astonished to see nothing but obfuscated JavaScript. Even after searching, the source is no clearer.

So, here is our code to get Google’s results using HtmlUnit and Jython (we don’t have any affiliation with them, we just like it!). Look at our Web Scraping Ajax and Javascript Sites for more information.

google.py

# HtmlUnit is a Java library; Jython lets us import its classes directly.
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def query(q):
   # Emulate a real browser so that Google's JavaScript is executed.
   webclient = WebClient(BrowserVersion.FIREFOX_3_6)
   url = "http://www.google.com"
   page = webclient.getPage(url)

   # Type the query into the search box and click the search button.
   query_input = page.getByXPath("//input[@name='q']")[0]
   query_input.text = q
   search_button = page.getByXPath("//input[@name='btnG']")[0]
   page = search_button.click()

   # Each result title is an <h3 class="r"> inside the "rso" ordered list.
   results = page.getByXPath("//ol[@id='rso']/li//span/h3[@class='r']")

   c = 0
   for result in results:
      title = result.asText()
      # The link target is the href attribute of the <a> inside the heading.
      href = result.getByXPath("./a")[0].getAttributes().getNamedItem("href").nodeValue
      print title, href
      c += 1

   print c,"Results"

if __name__ == '__main__':
   query("google web search api")

run.sh

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" google.py
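
The extraction step of google.py can also be tried offline, without HtmlUnit or a network connection. The following is only a sketch under stated assumptions: it is written for plain Python 3 rather than Jython, it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath and needs well-formed markup), and SAMPLE is hand-written markup that mimics the structure the XPath above expects, not a real Google response. The name extract_results is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written markup that mimics the structure targeted by the
# XPath in google.py. It is NOT a real Google response.
SAMPLE = """
<html><body>
<ol id="rso">
  <li><span><h3 class="r"><a href="http://example.com/a">First result</a></h3></span></li>
  <li><span><h3 class="r"><a href="http://example.com/b">Second result</a></h3></span></li>
</ol>
</body></html>
"""

def extract_results(markup):
    """Return a list of (title, href) pairs, one per result heading."""
    root = ET.fromstring(markup.strip())
    # Same path as in google.py, made relative (".//") as ElementTree requires.
    headings = root.findall(".//ol[@id='rso']/li//span/h3[@class='r']")
    results = []
    for h3 in headings:
        link = h3.find("a")
        if link is None:
            # Skip headings without a link instead of failing.
            continue
        title = "".join(link.itertext()).strip()
        results.append((title, link.get("href")))
    return results

if __name__ == "__main__":
    for title, href in extract_results(SAMPLE):
        print(title, href)
```

Keeping the parsing in its own function makes it possible to check the XPath against saved markup whenever the page layout changes, before touching the browser-driving code.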

Alternatives

The following search engines provide official APIs for search:

Yahoo Search API
Blekko Search API: Ask for a key or use the RSS search feed.
Duck Duck Go API
Bing Search API
Twitter Search API

Homework

Write a clean function/class to do Google queries and handle exceptions.
Modify the function to handle nested and paged results.
Modify the function again, this time to include descriptions.
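
As a starting point for the first two items, here is one possible shape for such a function. It is only a sketch, written for plain Python 3 rather than Jython, and it rests on assumptions that are not part of the code above: fetch_page is a hypothetical callable that takes a URL and returns a list of results (for example, a wrapper around the HtmlUnit calls), SearchError and page_url are hypothetical names, and paging is assumed to work through a zero-based start offset in the query string.

```python
from urllib.parse import urlencode

class SearchError(Exception):
    """Raised when a results page cannot be fetched or parsed."""

def page_url(q, page, per_page=10):
    # Assumption: results are paged with a zero-based "start" offset.
    params = urlencode({"q": q, "start": page * per_page})
    return "http://www.google.com/search?" + params

def search(q, fetch_page, pages=1):
    """Collect results from the first `pages` result pages.

    fetch_page is a hypothetical callable: it receives a URL and returns
    a list of results for that page.
    """
    results = []
    for page in range(pages):
        url = page_url(q, page)
        try:
            results.extend(fetch_page(url))
        except Exception as exc:
            # Wrap any failure so callers only need to handle SearchError.
            raise SearchError("failed to fetch %s: %s" % (url, exc)) from exc
    return results
```

Passing the fetcher in as an argument keeps the paging and error handling testable without a browser: a stub that returns canned results can stand in for HtmlUnit.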

Final Notes

The approach taken by Mario Vilas is more API-like, while our approach here is a defensive measure against NoAPIs. This is another good example of HtmlUnit doing its job.

BTW the noapi.com domain is available⁵

References

Beyond the SOAP Search API
A well earned retirement for the SOAP Search API
Google AJAX Search API beta Version 1.0 Available
Fall Housekeeping
The noapi.com domain is available at the time of writing of this article. Register it now! (Disclaimer: affiliate link).

10 thoughts on “Google Search NoAPI”

Mauro Asprea says:

January 21, 2011 at 10:12 am

Great example! Checkout this quora question with other alternatives ;) http://www.quora.com/What-are-the-alternatives-to-custom-search-besides-Google-CSE-Y!-BOSS-and-IndexTank
- Anonymous says:
  
  January 21, 2011 at 1:28 pm
  
  Thanks Mauro. That Quora question is specifically oriented to crawling your own site, but the problem is that the main search engine (Google) doesn’t currently have an official API!
BigPigeon says:

June 30, 2011 at 5:05 am

How can it deal with concurrent query requests?
- Anonymous says:
  
  June 30, 2011 at 2:05 pm
  
  Do you mean doing multiple searches in different threads/processes/machines ?
Nick says:

June 30, 2011 at 6:01 am

Google replaced the AJAX API with the CSE API, which gives JSON results and provides an API that can be called from a server: http://code.google.com/apis/customsearch/v1/overview.html
- Anonymous says:
  
  June 30, 2011 at 4:57 pm
  
  I learnt from the feedback that you can search the indexed Internet adding sites and then removing them. This is well explained on Alternative to the deprecated google REST web search API.
  
  But the point of the article is how to retrieve information from sites with changing policies, restricted or nonapis. On the critical side many companies are talking about the open web but open is a very devalued word.
  
  Also, it’s important to note the restrictions that still exist when using Google CSE, such as rate limits (even in commercial agreements), differences in results compared with the Google “main” search engine, etc.
  You can look at: http://www.google.com/support/forum/p/customsearch/thread?tid=3534cbeb56c4a26c&hl=en

Data Big Bang Blog

Creativity and Problem Solving for Data Science (whatever it may mean…) | An experimental spin-off from Nektra Advanced Computing
