It is odd that the standard libraries of most programming languages do not let you search for regular expressions or substrings in streams or partial reads. We have modified the Knuth-Morris-Pratt (KMP) algorithm so that it accepts its input as a virtually unbounded sequence of partial strings. The code is written in Haxe, so it can be compiled to multiple target languages.
Streams are important when working with data that does not fit in main memory, such as large files, or with data that is still being transferred. A few libraries do support regular expression or substring matching on streams. One is Jakarta Regexp, now retired and resting in the Apache Attic, whose “match” method in the RE class takes a CharacterIterator as a parameter. In C++, Boost.Regex implements partial matches.
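To make the idea concrete, here is a minimal Python sketch of KMP matching over a stream delivered in chunks. It is not the Haxe implementation mentioned above; the names build_failure_table, StreamMatcher and feed are made up for this illustration. The point it shows is that KMP only needs to remember how many pattern characters are currently matched, so that state can be carried from one partial read to the next.

```python
def build_failure_table(pattern):
    """Classic KMP prefix function: table[i] is the length of the longest
    proper prefix of pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


class StreamMatcher(object):
    """Hypothetical helper (not from the original code): runs KMP over
    successive chunks, keeping the match state between calls."""

    def __init__(self, pattern):
        if not pattern:
            raise ValueError("pattern must not be empty")
        self.pattern = pattern
        self.table = build_failure_table(pattern)
        self.matched = 0  # pattern characters matched so far
        self.offset = 0   # absolute number of stream characters consumed

    def feed(self, chunk):
        """Consume one chunk and return the absolute start offsets of
        every match that ends inside this chunk."""
        hits = []
        for ch in chunk:
            # Fall back through the failure table on a mismatch.
            while self.matched > 0 and ch != self.pattern[self.matched]:
                self.matched = self.table[self.matched - 1]
            if ch == self.pattern[self.matched]:
                self.matched += 1
            self.offset += 1
            if self.matched == len(self.pattern):
                hits.append(self.offset - len(self.pattern))
                # Keep going so overlapping matches are also reported.
                self.matched = self.table[self.matched - 1]
        return hits


matcher = StreamMatcher("abab")
first = matcher.feed("xxab")   # the match starts here but is not complete yet
second = matcher.feed("abab")  # completes matches that span the chunk boundary
```

Because no earlier input is ever re-read, memory use depends only on the pattern length, not on the size of the stream.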
- Haxe (tested on version 2.10)
- For C++: hxcpp (run haxelib install hxcpp)
- For Java: hxjava (run haxelib install hxjava)
- For Mono/C#: hxcs (run haxelib install hxcs)
Source code available on GitHub.
- Parsing S-Expressions in C# using OMeta
- Esoteric Queue Scheduling Disciplines
- Knuth-Morris-Pratt string matching
- Text Searching: Theory and Practice
- Boyer–Moore–Horspool algorithm
- Rabin–Karp algorithm
- Aho–Corasick string matching algorithm
- Lexicographically minimal string rotation
- Efficient way to search a stream for a string
- Embed a web browser within an application and simulate a normal user.
- Remotely connect to a web browser and automate it from a scripting language.
- Use special purpose add-ons to automate the browser
- Use a framework/library to simulate a complete browser.
Each of these alternatives has its pros and cons. For example, using a complete browser consumes a lot of resources, especially if we need to scrape websites with a lot of pages.
Setting up the environment
- JRE or JDK.
- Download the latest version of Jython from http://www.jython.org/downloads.html.
- Run the .jar file and install it in your preferred directory (e.g. /opt/jython).
- Download the HtmlUnit compiled binaries from: http://sourceforge.net/projects/htmlunit/files/.
- Unzip HtmlUnit to your preferred directory.
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
    webclient = WebClient(BrowserVersion.FIREFOX_3_6) # creating a new webclient object.
    url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
    page = webclient.getPage(url) # getting the url
    articles = page.getByXPath("//table[@id='mqtable']//tr/td/a") # getting all the hyperlinks

    for article in articles:
        print "Clicking on:", article
        subpage = article.click() # click on the article link
        title = subpage.getByXPath("//div[@class='title']") # get title
        summary = subpage.getByXPath("//div[@class='summary']") # get summary
        # getByXPath returns a list, so take the first match before calling asText()
        if len(title) > 0 and len(summary) > 0:
            print "Title:", title[0].asText()
            print "Summary:", summary[0].asText()

if __name__ == '__main__':
    main()
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py
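To see what the first XPath expression in the script selects without fetching the live page, the following offline sketch applies the same expression to a small, made-up HTML fragment using Python's standard xml.etree.ElementTree module. The sample markup and its links are assumptions for illustration only, not the real Gartner page, and ElementTree supports only a subset of XPath (hence the leading "." in the path), whereas HtmlUnit evaluates full XPath.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (an assumption, not the real page).
SAMPLE = """
<html><body>
  <table id="mqtable">
    <tr><td><a href="/article1">First article</a></td></tr>
    <tr><td><a href="/article2">Second article</a></td></tr>
  </table>
  <table id="other">
    <tr><td><a href="/ignored">Not selected</a></td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same expression as in the script, made relative for ElementTree:
# every <a> inside a <td> of a <tr> within the table whose id is 'mqtable'.
links = root.findall(".//table[@id='mqtable']//tr/td/a")
hrefs = [a.get("href") for a in links]
texts = [a.text for a in links]
```

Only the links inside the table with id 'mqtable' end up in hrefs; the link in the second table is skipped.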
If you want to be polite, don’t forget to read the robots.txt file before crawling…
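As a rough sketch of that politeness check, Python's standard robots.txt parser can tell you whether a URL may be fetched. The rules, the "mybot" user agent and the example.com URLs below are made up for illustration, and is_allowed is a hypothetical helper name. The import fallback is there because the module is called robotparser on Python 2 and Jython 2.x.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2 / Jython 2.x

# Made-up rules for illustration; a real crawler would download the
# site's own /robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "mybot", "http://example.com/public/page")
private_ok = is_allowed(ROBOTS_TXT, "mybot", "http://example.com/private/page")
```

Against a live site you would normally call set_url() with the address of the site's robots.txt and then read() on the parser instead of parsing a string, and check each URL with can_fetch() before requesting it.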
If you like this article, you might also be interested in
- Distributed Scraping With Multiple Tor Circuits
- Precise Scraping with Google Chrome
- Running Your Own Anonymous Rotating Proxies
- Automated Browserless OAuth Authentication for Twitter
- ghost.py is a webkit web client written in python
- Crowbar web scraping environment
- Google Chrome remote debugging shell from Python
- Selenium web application testing system – Watir – Sahi – Windmill Testing Framework
- Internet Explorer automation
- Embedding Gecko
- Opera Dragonfly
- PyAuto: Python Interface to Chromium’s automation framework
- Related questions on Stack Overflow
- Setting up Headless XServer and CutyCapt on Ubuntu
- CutyCapt: Capture WebKit’s rendering of a web page.
- Google webmaster blog: A spider’s view of Web 2.0
- Python Webkit DOM Bindings
- Berkelium Browser
- Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
- Web Inspector Remote
- Offscreen/Headless Mozilla Firefox (via @brutuscat)
- Web Scraping with Google Spreadsheets and XPath
- Web Scraping with YQL and Yahoo Pipes
Photo taken by xiffy