- Embed a web browser within an application and simulate a normal user.
- Remotely connect to a web browser and automate it from a scripting language.
- Use special-purpose add-ons to automate the browser.
- Use a framework/library to simulate a complete browser.
Each of these alternatives has its pros and cons. For example, driving a complete browser consumes a lot of resources, especially when we need to scrape websites with many pages.
Setting up the environment
- Install a JRE or JDK.
- Download the latest version of Jython from http://www.jython.org/downloads.html.
- Run the .jar file and install it in your preferred directory (e.g., /opt/jython).
- Download the HtmlUnit compiled binaries from http://sourceforge.net/projects/htmlunit/files/.
- Unzip HtmlUnit into your preferred directory.
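Before going further, it can help to confirm that the pieces from the steps above are where you expect them. The following is a minimal sketch, not part of the original setup: the function names `check_install` and `has_java` are hypothetical, the paths are just the example locations used in this article, and it assumes a Unix-like layout where the Jython launcher is a file named `jython` inside the install directory. It is meant to be run with a regular Python 3 interpreter, not with Jython.

```python
import glob
import os
import shutil


def has_java():
    """Return True if a 'java' executable can be found on the PATH."""
    return shutil.which("java") is not None


def check_install(jython_home, htmlunit_lib):
    """Return a list of problems found; an empty list means both paths look usable.

    jython_home  -- directory where Jython was installed (assumed layout)
    htmlunit_lib -- directory holding the HtmlUnit .jar files
    """
    problems = []
    launcher = os.path.join(jython_home, "jython")
    if not os.path.isfile(launcher):
        problems.append("Jython launcher not found: " + launcher)
    if not glob.glob(os.path.join(htmlunit_lib, "*.jar")):
        problems.append("no .jar files found in: " + htmlunit_lib)
    return problems


if __name__ == "__main__":
    # Example locations taken from this article; adjust to your own setup.
    if not has_java():
        print("no 'java' executable on PATH")
    for problem in check_install("/opt/jython", "htmlunit-2.8/lib"):
        print(problem)
```

If nothing is printed, Java, the Jython launcher, and the HtmlUnit jars were all found at the given locations.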
    import com.gargoylesoftware.htmlunit.WebClient as WebClient
    import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

    def main():
        # create a new WebClient that emulates Firefox 3.6
        webclient = WebClient(BrowserVersion.FIREFOX_3_6)
        url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
        page = webclient.getPage(url)  # load the page
        # collect all the hyperlinks inside the table
        articles = page.getByXPath("//table[@id='mqtable']//tr/td/a")

        for article in articles:
            print "Clicking on:", article
            subpage = article.click()  # click on the article link

            # getByXPath returns a list of matching nodes
            title = subpage.getByXPath("//div[@class='title']")
            summary = subpage.getByXPath("//div[@class='summary']")

            if len(title) > 0 and len(summary) > 0:
                # take the first match; the list itself has no asText()
                print "Title:", title[0].asText()
                print "Summary:", summary[0].asText()
            # break

    if __name__ == '__main__':
        main()
Save the script as gartner.py and run it with Jython, putting the HtmlUnit jars on the classpath:

    /opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py
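One detail of the script is easy to trip over: `getByXPath` returns a list of matching nodes rather than a single node, so `asText()` has to be called on an element of that list. A small helper can make this explicit. The sketch below is an illustration only: `first_text` is a hypothetical name, and `FakeNode` is a stand-in class that lets the helper be tried without HtmlUnit. It assumes, as the script above does, that Jython lets you use `len()` and indexing on the returned Java list.

```python
def first_text(nodes):
    """Return the text of the first node in an XPath result list, or None if it is empty."""
    if len(nodes) == 0:
        return None
    return nodes[0].asText()


class FakeNode(object):
    """Hypothetical stand-in for an HtmlUnit node; only provides asText()."""

    def __init__(self, text):
        self._text = text

    def asText(self):
        return self._text


# With no matches the helper returns None instead of raising an error.
empty_result = first_text([])
# With several matches only the first node's text is returned.
first_result = first_text([FakeNode("Magic Quadrant"), FakeNode("ignored")])
```

Inside the scraping loop this would read `title = first_text(subpage.getByXPath("//div[@class='title']"))`, followed by a check for `None` before printing.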
If you want to be polite, don’t forget to read the robots.txt file before crawling…
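Checking robots.txt can be automated with the parser that ships with Python. The following is a minimal sketch: the helper name `allowed`, the user agent `mybot`, and the sample rules are made up for illustration, and the rules are parsed from a string so the example runs offline. The module is called `urllib.robotparser` on Python 3 and `robotparser` on Python 2, which is the name to expect under Jython 2.x.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2 / Jython 2.x


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration; a real site publishes its own at /robots.txt.
SAMPLE_RULES = """User-agent: *
Disallow: /private/
"""

blocked = allowed(SAMPLE_RULES, "mybot", "http://example.com/private/report.html")
permitted = allowed(SAMPLE_RULES, "mybot", "http://example.com/index.html")
```

To check a live site instead, call `set_url()` with the address of the site's robots.txt and then `read()` on the parser before asking `can_fetch()`.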
If you like this article, you might also be interested in
- Distributed Scraping With Multiple Tor Circuits
- Precise Scraping with Google Chrome
- Running Your Own Anonymous Rotating Proxies
- Automated Browserless OAuth Authentication for Twitter
- ghost.py is a webkit web client written in python
- Crowbar web scraping environment
- Google Chrome remote debugging shell from Python
- Selenium web application testing system – Watir – Sahi – Windmill Testing Framework
- Internet Explorer automation
- Embedding Gecko
- Opera Dragonfly
- PyAuto: Python Interface to Chromium’s automation framework
- Related questions on Stack Overflow
- Setting up Headless XServer and CutyCapt on Ubuntu
- CutyCapt: Capture WebKit’s rendering of a web page.
- Google Webmaster blog: A spider’s view of Web 2.0
- Python Webkit DOM Bindings
- Berkelium Browser
- Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
- Web Inspector Remote
- Offscreen/Headless Mozilla Firefox (via @brutuscat)
- Web Scraping with Google Spreadsheets and XPath
- Web Scraping with YQL and Yahoo Pipes
Photo taken by xiffy