Developers often search the vast corpus of scraping tools for one capable of simulating a full browser. The search is unnecessary: a full browser with extension support is itself a great scraping tool. Among extension platforms, Google Chrome’s is by far the easiest to develop for, while Mozilla’s APIs are less restrictive. Google offers a second way to control Chrome: the Debugger protocol. Unfortunately, the Debugger protocol is pretty slow.
The Google Chrome extension API is an excellent choice for writing an up-to-date scraper that uses a full browser with the latest HTML5 features and performance improvements. In a previous article, we described how to scrape the Microsoft TechNet App-V forum. Now we will focus on VMware’s ThinApp forum. In this case, we develop a Google Chrome extension instead of a Python script.
Procedure
- You will need Google Chrome, Python 2.7, and lxml.html
- Download the code from GitHub
- Install the Google Chrome extension
- Enter the VMware ThinApp: Discussion Forum
- The scraper starts automatically
- Once it stops, open the Google Chrome console, then copy and paste the results, in JSON format, into the thinapp.json file
- Run thinapp_parser.py to generate the thinapp.csv file with the results
- Open the thinapp.csv file in a spreadsheet
- To rank the results, add a column that divides the number of views by the number of days.
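The ranking in the last step can also be computed in code rather than in a spreadsheet. The following is a minimal sketch, not the thinapp_parser.py from the repository. It assumes Python 3.7 or later (the article's setup uses Python 2.7) and assumes that each scraped thread is a JSON object with the hypothetical fields `title`, `views`, and `posted` (an ISO date). The JSON actually produced by the extension may use different names, and "number of days" is interpreted here as days since the thread was posted.

```python
import csv
import json
import sys
from datetime import date
import io

# Column order for the output CSV; all names are illustrative assumptions.
FIELDS = ["title", "views", "days", "views_per_day"]


def rank_threads(threads, today):
    """Return one row per thread, sorted by views per day, highest first."""
    ranked = []
    for thread in threads:
        # "posted" is a hypothetical field holding an ISO date such as 2013-01-08.
        posted = date.fromisoformat(thread["posted"])
        # Treat threads posted today as one day old to avoid dividing by zero.
        days = max((today - posted).days, 1)
        ranked.append({
            "title": thread["title"],
            "views": thread["views"],
            "days": days,
            "views_per_day": round(thread["views"] / days, 2),
        })
    ranked.sort(key=lambda row: row["views_per_day"], reverse=True)
    return ranked


def write_csv(rows, out):
    """Write the ranked rows to an open text stream as CSV with a header."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Inline sample standing in for the contents of thinapp.json.
    sample = json.loads(
        '[{"title": "Example thread A", "views": 100, "posted": "2013-01-01"},'
        ' {"title": "Example thread B", "views": 90, "posted": "2013-01-08"}]'
    )
    write_csv(rank_threads(sample, date(2013, 1, 11)), sys.stdout)
```

Passing the reference date in explicitly, instead of calling `date.today()` inside the function, keeps the ranking reproducible for a given scrape.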
Our Results: Top Twenty Threads
- Registry Isolation…
- Thinapp Internet Explorer 10
- Process (ifrun60.exe) remains active (Taskmanager) after closing thinapp under windows7 (xp works)
- Google Chrome browser
- File association not passing file to thinapp package
- Adobe CS3 Design Premium and FlexNET woes…
- How to thinapp Office 2010?
- Size limit of .dat file?
- ThinApp Citrix Receiver 3.2
- Visio 2010 Thinapp – Licensing issue
- Thinapp Google Chrome
- Thinapp IE7 running on Windows 7
- Adobe CS 6
- Failed to open, find, or create Sandbox directory
- Microsoft Project and Office issues
- No thinapp in thinapp factory + unable to create workpool
- IE8 Thinapp crashing with IE 10 installed natively
- ThinApp MS project and MS Visio 2010
- Difference between ESXi and vSphere and VMware view ??
- ThinAPP with AppSense
Acknowledgments
Matias Palomera from Nektra Advanced Computing wrote the code.
Notes
- This approach can be used successfully to scrape JavaScript- and AJAX-heavy sites
- Instead of copying the JSON data from the Chrome console, you can use the FileSystem API to write the results to a file
- You can also write the CSV directly from Chrome instead of using an extra script
If you like this article, you might also be interested in
- Scraping for Semi-automatic Market Research
- Application Virtualization Benchmarking: Microsoft App-V Vs. Symantec
- Web Scraping Ajax and Javascript Sites [using HTMLUnit]
- Distributed Scraping With Multiple Tor Circuits
- VMWare ThinApp vs. Symantec Workspace