Precise Scraping with Google Chrome

Developers often search the vast corpus of scraping tools for one that can simulate a full browser. The search is unnecessary: a full browser with extension capabilities is itself a great scraping tool. Among browser extensions, Google Chrome’s are by far the easiest to develop, while Mozilla’s APIs are less restrictive. Google offers a second way to control Chrome, the Debugger protocol, but unfortunately it is pretty slow.

The Google Chrome extension API is an excellent choice for writing an up-to-date scraper that uses a full browser with the latest HTML5 features and performance improvements. In a previous article, we described how to scrape the Microsoft TechNet App-V forum. Now we focus on VMware’s ThinApp forum. This time, we develop a Google Chrome extension instead of a Python script.

Procedure

  1. You will need Google Chrome, Python 2.7, and the lxml library (for its lxml.html module)
  2. Download the code from GitHub
  3. Install the Google Chrome extension
  4. Enter the VMware ThinApp: Discussion Forum
  5. The scraper starts automatically
  6. Once it stops, open the Google Chrome console and copy and paste the results, which are in JSON format, into the thinapp.json file
  7. Run the thinapp_parser.py to generate the thinapp.csv file with the results
  8. Open the thinapp.csv file with a spreadsheet
  9. To rank the results, add a column that divides each thread's number of views by its age in days.
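
The last steps (JSON to CSV, then ranking by views per day) can be sketched in a few lines of Python. This is only an illustration, not the thinapp_parser.py script from the repository: the field names title, views, and created are assumptions about the JSON layout, the sample threads are made up, and the sketch targets Python 3.7 or later rather than Python 2.7.

```python
import csv
import io
import json
from datetime import date


def rank_threads(threads, today):
    """Return the threads sorted by views per day, highest first.

    Each thread is a dict with the hypothetical keys "title", "views",
    and "created" (an ISO date string); the real JSON may differ.
    """
    ranked = []
    for thread in threads:
        created = date.fromisoformat(thread["created"])
        # Treat threads created today as one day old to avoid dividing by zero.
        age_days = max((today - created).days, 1)
        ranked.append({
            "title": thread["title"],
            "views": thread["views"],
            "days": age_days,
            "views_per_day": round(thread["views"] / age_days, 2),
        })
    ranked.sort(key=lambda row: row["views_per_day"], reverse=True)
    return ranked


def to_csv(rows):
    """Serialize the ranked rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["title", "views", "days", "views_per_day"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample data standing in for the contents of thinapp.json.
sample = json.loads("""[
    {"title": "Example thread A", "views": 1000, "created": "2013-01-01"},
    {"title": "Example thread B", "views": 300, "created": "2013-03-01"}
]""")

ranked = rank_threads(sample, today=date(2013, 3, 11))
print(to_csv(ranked))
```

Dividing by the thread's age keeps old threads with many accumulated views from crowding out recent threads that are attracting attention quickly.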

Our Results: Top Twenty Threads

  1. Registry Isolation…
  2. Thinapp Internet Explorer 10
  3. Process (ifrun60.exe) remains active (Taskmanager) after closing thinapp under windows7 (xp works)
  4. Google Chrome browser
  5. File association not passing file to thinapp package
  6. Adobe CS3 Design Premium and FlexNET woes…
  7. How to thinapp Office 2010?
  8. Size limit of .dat file?
  9. ThinApp Citrix Receiver 3.2
  10. Visio 2010 Thinapp – Licensing issue
  11. Thinapp Google Chrome
  12. Thinapp IE7 running on Windows 7
  13. Adobe CS 6
  14. Failed to open, find, or create Sandbox directory
  15. Microsoft Project and Office issues
  16. No thinapp in thinapp factory + unable to create workpool
  17. IE8 Thinapp crashing with IE 10 installed natively
  18. ThinApp MS project and MS Visio 2010
  19. Difference between ESXi and vSphere and VMware view ??
  20. ThinAPP with AppSense

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

  • This approach can be used successfully to scrape JavaScript- and AJAX-heavy sites
  • Instead of copying the JSON data from the Chrome console, you can use the FileSystem API to write the results to a file
  • You can also write the CSV directly from Chrome instead of using an extra script

If you like this article, you might also be interested in

  1. Scraping for Semi-automatic Market Research
  2. Application Virtualization Benchmarking: Microsoft App-V Vs. Symantec
  3. Web Scraping Ajax and Javascript Sites [using HTMLUnit]
  4. Distributed Scraping With Multiple Tor Circuits
  5. VMWare ThinApp vs. Symantec Workspace

Resources

  1. Application Virtualization Smackdown
  2. Application Virtualization Market Report
