Web Scraping for Semi-automatic Market Research

It is easy to web scrape Microsoft TechNet Forums (look at the xml output here: http://social.technet.microsoft.com/Forums/en-US/mdopappv/threads?outputAs=xml)and normalize the resulting information to have a better idea of each thread’s rank based on views and initial publication date. Knowing how issues are ranked can help a company choose what to focus on.

This code was used to scrape Microsoft TechNet’s forums. In the example below we scraped the App-V forum, since it is one of the application virtualization market’s leaders along with VMware ThinApp, and Symantec Workspace Virtualization.

These are the top ten threads for the App-V forum:

  1. “Exception has been thrown by the target of an invocation”
  2. Office 2010 KMS activation Error: 0xC004F074
  3. App-V 5 Hotfix 1
  4. Outlook 2010 Search Not Working
  5. Java 1.6 update 17 for Kronos (webapp)
  6. Word 2010 There was a problem sending the command to the program
  7. Utility to quickly install/remove App-V packages
  8. SAP GUI 7.1
  9. The dreaded: “The Application Virtualization Client could not launch the application”
  10. Sequencing Chrome with 4.6 SP1 on Windows 7 x64

The results show how frequently customers have issues with virtualizing Microsoft Office, Key Management Services (KMS), SAP, and Java. App-V competitors like Symantec Workspace Virtualization and VMWare ThinApp have similar problems. Researching markets this way gives you a good idea of areas where you can contribute solutions.

The scraper stores all the information in a SQLite database. The database can be exported using the csv_App-V.py script to an UTF-8 CSV file. We imported the file with Microsoft Excel and then normalized the ranking of the threads. To normalize it we divided the number of views by the age of the thread so threads with more views per day rank higher. Again, the scraper can be used on any Microsoft forum on Social TechNet. Try it out on your favorite forum.


Prerequisites: lxml.html

The code is available at microsoft-technet-forums-scraping [github] . It was written by Matias Palomera from Nektra Advanced Computing, who received valuable support from Victor Gonzalez.


  1. Run scrapper-App-V.py
  2. Then run csv_App-V.py
  3. The results are available in the App-V.csv file


Matias Palomera from Nektra Advanced Computing wrote the code.


  1. This is a single thread code. You can take a look at our discovering web resources code to optimize it with multithreading.
  2. Microsoft has given scrapers a special gift: it is possible to use the outputAs variable in the URL to get the structured information as XML instead of parsing HTML web pages.
  3. Our articles Distributed Scraping With Multiple Tor Circuits and Running Your Own Anonymous Rotating Proxies show how to Implement your own rotating proxies infrastructure with Tor.

If you liked this article, you might also like:

  1. Nektra and VMware are Collaborating to Simplify Application Virtualization Packaging
  2. Automated Discovery of Social Media Identities

HTML Cleaners and Tidiers

Tag Soup

When you are crawling a website you will come across a lot of malformed web pages. Some typical problems are unclosed tags, mishandling of comments or of css styles. Modern browsers have to do a good job of cleaning HTML to build the correct DOM without ambiguities. Due to performance and scalability limitations, it is more efficient to process HTML with a parser instead of using a browser or headless browsers such as HTMLUnit or PhantomJS. If your HTML parser does not incorporate the cleaning or fixing process, you will have to use an HTML cleaner or tidier.

As in other processing pipelines if you fail to clean up malformed HTML, all subsequent processes will be stalled. It is important to choose a good HTML cleaner. Many cleaners fail to do their jobs.

HTML Cleaner List

The list of HTML cleaners is long, but the list of good ones is pretty short. In our experience the best choice is lxml.html. Other cleaners often have trouble.

Comprehensive Resources

  1. lxml.html
  2. Beautiful Soup
  3. lxml.html vs Beautiful Soup
  4. Cleaning Word’s Nasty HTML
  5. HTML Cleaners query
  6. Tag soup