Web Scraping Ajax and Javascript Sites

Introduction

Most crawling frameworks used for scraping cannot handle sites that rely on Javascript or Ajax. Their scope is limited to sites that show their main content without scripting. It is tempting to connect a crawler to a Javascript engine, but this is not easy to do: browser behavior is too complex for a simple connection between a crawler and a Javascript engine to work, so you need a fully functional browser with good DOM support. There is a list of resources at the end of this article to explore the alternatives in more depth.

There are several ways to scrape a site that contains Javascript:

Embed a web browser within an application and simulate a normal user.
Remotely connect to a web browser and automate it from a scripting language.
Use special-purpose add-ons to automate the browser.
Use a framework/library to simulate a complete browser.

Each of these alternatives has its pros and cons. For example, using a complete browser consumes a lot of resources, especially if we need to scrape websites with a lot of pages.

In this post we’ll give a simple example of how to scrape a web site that uses Javascript. We will use the htmlunit library to simulate a browser. Since htmlunit runs on the JVM, we will use Jython, an excellent Python implementation for the JVM. The resulting code is very clear and focuses on solving the problem instead of on the aspects of programming languages.

Setting up the environment

Prerequisites

JRE or JDK.
Download the latest version of Jython from http://www.jython.org/downloads.html.
Run the .jar file and install it in your preferred directory (e.g., /opt/jython).
Download the htmlunit compiled binaries from: http://sourceforge.net/projects/htmlunit/files/.
Unzip the htmlunit archive to your preferred directory.

Crawling example

We will scrape the Gartner Magic Quadrant pages at: http://www.gartner.com/it/products/mq/mq_ms.jsp. If you look at the list of documents, the links are Javascript code instead of hyperlinks with HTTP URLs. This may be to reduce crawling, or just to open a popup window. It’s a very convenient page to illustrate the solution.
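
To make the difference between the two kinds of links concrete, here is a minimal sketch in plain Python 3 (standard library only, independent of htmlunit and Jython). It sorts the anchors of a page into ordinary URLs and javascript: pseudo-links. The LinkClassifier class and the sample markup are made up for illustration and are not taken from the Gartner page.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Sort <a href> values into plain URLs and javascript: pseudo-links."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.plain_links = []
        self.script_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        # A javascript: href runs code instead of naming a document.
        if href.strip().lower().startswith("javascript:"):
            self.script_links.append(href)
        else:
            self.plain_links.append(href)


# Made-up markup for illustration; not copied from the Gartner page.
SAMPLE_HTML = """
<table id="mqtable">
  <tr><td><a href="javascript:openDoc('1234')">Report A</a></td></tr>
  <tr><td><a href="/reports/b.html">Report B</a></td></tr>
</table>
"""

classifier = LinkClassifier()
classifier.feed(SAMPLE_HTML)
print("Needs a browser engine:", classifier.script_links)
print("Plain crawler is enough:", classifier.plain_links)
```

A simple crawler can follow the second group directly. The first group only makes sense to something that can execute the script, which is the job we hand to htmlunit.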

gartner.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
   webclient = WebClient(BrowserVersion.FIREFOX_3_6) # creating a new webclient object.
   url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
   page = webclient.getPage(url) # getting the url
   articles = page.getByXPath("//table[@id='mqtable']//tr/td/a") # getting all the hyperlinks

   for article in articles:
      print "Clicking on:", article
      subpage = article.click() # click on the article link
      title = subpage.getByXPath("//div[@class='title']") # get title
      summary = subpage.getByXPath("//div[@class='summary']") # get summary
      if len(title) > 0 and len(summary) > 0:
         print "Title:", title[0].asText()
         print "Summary:", summary[0].asText()
      # break  # uncomment to stop after the first article

if __name__ == '__main__':
   main()

run.sh

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py
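
If the -J-classpath option is not available on your platform, one possible workaround is to add the jars from inside the script. The sketch below assumes that Jython resolves Java packages from .jar files appended to sys.path, and that the jars live in htmlunit-2.8/lib as in run.sh. The add_jars_to_path helper is a hypothetical name, and class-loading details may differ between Jython versions.

```python
import os
import sys


def add_jars_to_path(lib_dir):
    """Append every .jar file found in lib_dir to sys.path and return them."""
    jars = sorted(
        os.path.join(lib_dir, name)
        for name in os.listdir(lib_dir)
        if name.endswith(".jar")
    )
    for jar in jars:
        if jar not in sys.path:
            sys.path.append(jar)
    return jars


# Assumed location of the unzipped htmlunit jars; adjust to your setup.
HTMLUNIT_LIB = "htmlunit-2.8/lib"

if os.path.isdir(HTMLUNIT_LIB):
    add_jars_to_path(HTMLUNIT_LIB)
    # Under Jython, imports such as
    #   import com.gargoylesoftware.htmlunit.WebClient as WebClient
    # would go here, after the jars are on sys.path.
```

With this at the top of gartner.py, the script could be started as /opt/jython/jython gartner.py, without the classpath argument.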

Final notes

This article is just a starting point to move beyond simple crawlers and point the way for further research. As this is a simple page, it is a good choice for a clear example of how Javascript scraping works. You must do your homework to learn to crawl more web pages or add multithreading for better performance. In a demanding crawling scenario a lot of things must be taken into account, but this is a subject for future articles.
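
As a rough idea of what the multithreading step could look like, here is a sketch in plain Python 3 using concurrent.futures (which Jython 2.5 does not ship; there you would probably reach for threading or java.util.concurrent instead). The fetch_summary function is a hypothetical stand-in for the per-page work, and the URLs are placeholders. If you adapt this to htmlunit, it is probably safest to assume a WebClient should not be shared between threads and to create one per worker.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_summary(url):
    """Hypothetical stand-in for the per-page work (load, click, extract)."""
    return "summary of " + url


def crawl_concurrently(urls, max_workers=4):
    """Run fetch_summary over urls in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_summary, urls))


# Placeholder URLs, not real pages.
urls = ["http://example.com/page/%d" % n for n in range(1, 6)]
for url, summary in zip(urls, crawl_concurrently(urls)):
    print(url, "->", summary)
```

The pool size limits how many pages are in flight at once, which matters both for memory use and for how hard you hit the target site.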

If you want to be polite don’t forget to read the robots.txt file before crawling…
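
A minimal sketch of that check, in plain Python 3 with the standard urllib.robotparser module. The rules and the example-crawler user agent below are made up for illustration; a real crawler would download the target site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawler would download
# the site's own /robots.txt instead.
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="example-crawler"):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("http://example.com/index.html"))
print(is_allowed("http://example.com/private/report.html"))
```

Calling is_allowed before each page load keeps the crawler out of the paths the site owner has asked robots to avoid.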

Resources

Photo taken by xiffy

30 thoughts on “Web Scraping Ajax and Javascript Sites”

Fred1977 says:

March 19, 2011 at 2:04 am

OR, you can use Helium Scraper at http://www.heliumscraper.com
- Sebastian Wain says:
  
  March 20, 2011 at 11:31 pm
  
  Add information on how Helium Scraper compares against others in the resource list. Also if you work on Helium Scraper you must add a disclosure notice.
Goldtech says:

July 22, 2011 at 9:01 pm

Thank you for sharing this. It helped me a lot!
- Sebastian Wain says:
  
  July 23, 2011 at 4:10 am
  
  You’re welcome.
Goldtech says:

July 23, 2011 at 12:17 am

I think the html from the site first needs to be made well-formed before running xpath on it? I have not tried this great code but seems like first run tidy on it? Thanks
- Sebastian Wain says:
  
  July 23, 2011 at 4:10 am
  
  That’s a good question and the focus of a future article on HTML cleaners/tidiers.
  
  In this article we are using HtmlUnit. Since HtmlUnit simulates (or is a) browser, that means that in the end you’ll have a correct DOM even if the original HTML was malformed. So happily you don’t need to be worried about that.
Plumo says:

November 15, 2011 at 5:39 am

Thanks for sharing. What Jython version are you using? Mine (v2.5.1) does not support the -J-classpath option:

$ jython -J-classpath "htmlunit-2.8/lib/*" gartner.py
Unknown option: J-classpath
usage: …
- Sebastian Wain says:
  
  November 15, 2011 at 11:16 am
  
  For this article I was using Jython 2.5.2rc2 under Linux. I noticed the ‘$’ prompt on your side; are you also using Linux? I ask because this option doesn’t work on Windows.
  
  If you run jython --help what do you see? I see this option:
  
  -Jarg : pass argument through to Java VM (e.g. -J-Xmx512m)
Hunter Barrington says:

December 31, 2011 at 9:01 pm

So I tried this on Ubuntu 11.10 with Java 6, Jython 2.5.2 and htmlunit 2.9, and it doesn’t seem to work. I’m not sure what changed… any thoughts? Posted to
http://pastebin.com/tMhNacyy
- Sebastian Wain says:
  
  January 3, 2012 at 12:37 pm
  
  Can you describe your issue in more detail? For example:
  
  i) The goal of your script
  ii) What you expect and which part is failing
  iii) If you run our example does it work?
REC says:

February 13, 2012 at 8:15 pm

Hi, thanks for this, sorry for a possibly obvious question, but I am trying to run this and I am getting an error:

C:Python27optjython>jython -J-classpath "htmlunit-2.9/lib/*" JavaScriptLinkTest.py

File "", line None
SyntaxError: Non-ASCII character in file 'htmlunit-2.9/lib/commons-codec-1.4.jar', but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
REC says:

February 15, 2012 at 12:16 pm

Please delete my previous comment. I was able to get this to work fairly well, but I had to load the jars in my Python file. I was also having problems with Unicode errors when using .getText(), but that is probably a different topic.

Thanks for posting this, it helped quite a bit :)
Gabm says:

February 28, 2012 at 7:40 pm

your blog rocks! thanks for the detailed infos and resources!
- Sebastian Wain says:
  
  February 29, 2012 at 1:05 pm
  
  Thank you. Feel free to propose topics that you are interested in.
Junior says:

April 2, 2012 at 3:47 pm

Thank you for sharing this code.

But I have one problem. Can you help me?
I use Ubuntu and when I run this example, I got the following errors:

Apr 2, 2012 3:42:35 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Apr 2, 2012 3:42:36 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Apr 2, 2012 3:42:36 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Expected content type of 'application/javascript' or 'application/ecmascript' for remotely loaded JavaScript element at 'http://www.gartner.com/js/optionsArray.jsp', but got 'text/html'.
Apr 2, 2012 3:42:37 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
Apr 2, 2012 3:42:37 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
Traceback (most recent call last):
  File "gartner.py", line 22, in
    main()
  File "gartner.py", line 7, in main
    page = webclient.getPage(url) # getting the url
Exception class=[net.sourceforge.htmlunit.corejs.javascript.WrappedException]
com.gargoylesoftware.htmlunit.ScriptException: Wrapped com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot find function Q in object [object Object]. (https://apis.google.com/_/apps-static/_/js/gapi/gcm_ppb,googleapis_client,plusone/rt=j/ver=Qfr2nBcLP04.pt_BR./sv=1/am=!Ze6NnRS0VYCICGRMrA/d=1/cb=gapi.loaded0#3) (gartner.py#11)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:595)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:499)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:973)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:349)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$1.execute(HtmlScript.java:230)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:240)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:598)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:556)
    at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1142)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1044)
    at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
    at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:329)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3018)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2005)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:908)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:789)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:225)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
com.gargoylesoftware.htmlunit.ScriptException: com.gargoylesoftware.htmlunit.ScriptException: Wrapped com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot find function Q in object [object Object]. (https://apis.google.com/_/apps-static/_/js/gapi/gcm_ppb,googleapis_client,plusone/rt=j/ver=Qfr2nBcLP04.pt_BR./sv=1/am=!Ze6NnRS0VYCICGRMrA/d=1/cb=gapi.loaded0#3) (gartner.py#11)
- Sebastian Wain says:
  
  April 4, 2012 at 12:10 am
  
  The page in the example has changed, and htmlunit also does not properly handle complex obfuscated Javascript (in this case a .js file from Google) that was added after the article was released.
  
  For another example take a look at: http://databigbang1.wpenginepowered.com/automated-browserless-oauth-authentication-for-twitter/. It illustrates the same point but with OAuth/OAuth2 authentication.
Drake082012 says:

June 15, 2012 at 11:00 am

Great read!
I’m curious to know if you can scrape webpages with infinite scroll?
This has been a problem I’ve been trying to solve.
Any hints/suggestions ?
- Sebastian Wain says:
  
  June 15, 2012 at 1:34 pm
  
  Yes you can. Can you describe the context of your problem to help you better?
  - Drake082012 says:
    
    June 15, 2012 at 10:02 pm
    
    I’ve come across certain websites (e.g. image sites) where, on loading the page, only a select few images are loaded. When I scroll down to the bottom, more images are loaded. If I use, for example, mechanize to pull the page, I don’t get all the images. I presume this is because of the Javascript.
    
    How would you go about it with htmlunit?
snape says:

July 24, 2012 at 9:51 am

I have used HtmlUnit the way you have written, but it is super slow.
Can you suggest something to reduce the time taken by htmlunit?
- Sebastian Wain says:
  
  July 24, 2012 at 1:40 pm
  
  Yes. The best way I found to scrape sites with javascript is using a Google Chrome extension for the scraping. I prefer this way to using phantomjs, but phantomjs is faster than htmlunit.
Surender says:

November 14, 2012 at 11:45 am

A webpage has a drop-down list. If we select any item from the list, it shows values below, and we need to scrape this data using WebClient. The selection just calls the onchange function, which submits the form.
akhter says:

April 10, 2013 at 3:37 am

Why use Jython? Don’t you think we should use Java directly with htmlunit?
- Sebastian Wain says:
  
  April 10, 2013 at 9:30 am
  
  Searching on Google I found answers like http://stackoverflow.com/questions/96922/why-use-jython-when-you-could-just-use-java and I agree with many of the answers.
Reynold PJ says:

May 7, 2014 at 12:48 pm

Awesome !! thanks a ton :)
Kishore kumar says:

May 27, 2015 at 3:27 am

Hi, I ran the script but the output comes out blank:

C:jython2.7.0>jython.jar -J-classpath "C:jython2.7.0htmlunit-2.16-binhtmlunit-2.16lib*" gartner.py
anaya says:

June 11, 2015 at 6:26 pm

How can I use it in java?