It is odd that the base libraries for most programming languages do not allow you to search for regular expressions and substrings in streams or partial reads. We have modified the KMP algorithm so that it accepts virtually infinite partial strings. The code is implemented in Haxe, so it can generate code in multiple programming languages.

Streams are important when working with data that does not fit in main memory, such as large files, or with data which is being transferred. There are a few implementations of regular expressions and substrings matching. One is the Jakarta Regexp, now retired and resting in the Apache Attic. The Jakarta Regexp library “match” method in the RE class uses a CharacterIterator as a parameter. In C++, Boost.Regex implements partial matches.

Our code is implemented in Haxe so the same code can target Javascript, ActionScript, Flash SWF, NekoVM, PHP, C++, C#, and Java. We really like the concept of writing one code and expanding it to a variety of platforms with minimum effort. There are excellent libraries in specific environments that can work perfectly in other environments. Porting libraries from one programming language to another is tedious. For example, the amazing NetworkX graph library implemented in Python can be easily ported to C# to benefit a broader audience.

Code

Prerequisites

Haxe (tested on version 2.10)
For C++: hxcpp (run haxelib install hxcpp)
For Java: hxjava (run haxelib install hxjava)
For Mono/C#: jxcs (run haxelib install hxcs)

Source code available on github.

Resources

Introduction

Most crawling frameworks used for scraping cannot be used for Javascript or Ajax. Their scope is limited to those sites that show their main content without using scripting. One would also be tempted to connect a specific crawler to a Javascript engine but it’s not easy to do. You need a fully functional browser with good DOM support because the browser behavior is too complex for a simple connection between a crawler and a Javascript engine to work. There is a list of resources at the end of this article to explore the alternatives in more depth.

There are several ways to scrape a site that contains Javascript:

Embed a web browser within an application and simulate a normal user.
Remotely connect to a web browser and automate it from a scripting language.
Use special purpose add-ons to automate the browser
Use a framework/library to simulate a complete browser.

Each one of these alternatives has its pros and cons. For example using a complete browser consumes a lot of resources, especially if we need to scrape websites with a lot of pages.

In this post we’ll give a simple example of how to scrape a web site that uses Javascript. We will use the htmlunit library to simulate a browser. Since htmlunit runs on a JVM we will use Jython, an [excellent] programming language,which is a Python implementation in the JVM. The resulting code is very clear and focuses on solving the problem instead of on the aspects of programming languages.

Setting up the environment

Prerequisites

JRE or JDK.
Download the latest version of Jython from http://www.jython.org/downloads.html.
Run the .jar file and install it in your preferred directory (e.g: /opt/jython).
Download the htmlunit compiled binaries from: http://sourceforge.net/projects/htmlunit/files/.
Unzip the htmlunit to your preferred directory.

Crawling example

We will scrape the Gartner Magic Quadrant pages at: http://www.gartner.com/it/products/mq/mq_ms.jsp . If you look at the list of documents, the links are Javascript code instead of hyperlinks with http urls. This is may be to reduce crawling, or just to open a popup window. It’s a very convenient page to illustrate the solution.

gartner.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
   webclient = WebClient(BrowserVersion.FIREFOX_3_6) # creating a new webclient object.
   url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
   page = webclient.getPage(url) # getting the url
   articles = page.getByXPath("//table[@id='mqtable']//tr/td/a") # getting all the hyperlinks

   for article in articles:
      print "Clicking on:", article
      subpage = article.click() # click on the article link
      title = subpage.getByXPath("//div[@class='title']") # get title
      summary = subpage.getByXPath("//div[@class='summary']") # get summary
      if len(title) > 0 and len(summary) > 0:
         print "Title:", title[0].asText()
         print "Summary:", summary[0].asText()
#     break

if __name__ == '__main__':
   main()

run.sh

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py

Final notes

This article is just a starting point to move ahead of simple crawlers and point the way for further research. As this is a simple page, it is a good choice for a clear example of how Javascript scraping works.You must do your homework to learn to crawl more web pages or add multithreading for better performance. In a demanding crawling scenario a lot of things must be taken into account, but this is a subject for future articles.

If you want to be polite don’t forget to read the robots.txt file before crawling…

If you like this article, you might also be interested in

Resources

Photo taken by xiffy

Data Big Bang Blog

Creativity and Problem Solving for Data Science (whatever it may mean…) | An experimental spin-off from Nektra Advanced Computing

Menu

Tag Archives: java

Searching for Substrings in Streams: a Slight Modification of the Knuth-Morris-Pratt Algorithm in Haxe

Code

Prerequisites

See Also

Resources

Web Scraping Ajax and Javascript Sites

Introduction

Setting up the environment

Prerequisites

Crawling example

gartner.py

run.sh

Final notes

If you like this article, you might also be interested in

Resources