Scraping Web Sites which Dynamically Load Data

Preface

More and more sites update their content dynamically, adding new items as the user scrolls down. Twitter is one of them: it displays only a limited number of news items initially and loads more on demand. How can sites with this behavior be scraped?

In the previous article we used a Google Chrome extension to scrape a forum that depends on JavaScript and XMLHttpRequest. Here we use the same technique to retrieve a given number of news items for a given search. A list of alternative approaches is available in the Web Scraping Ajax and Javascript Sites article.

Code

Instructions

  1. Download the code from GitHub
  2. Load the extension in Google Chrome: Settings => Extensions => check “Developer mode” => “Load unpacked extension”
  3. An “eye” icon now appears on the Google Chrome toolbar
  4. Go to Twitter’s search page https://twitter.com/search-home and enter your search keywords
  5. Click the “eye” icon and then the Start button
  6. The scraping output is displayed on the console as JSON

Customization

  1. To change the number of news items scraped, open the file inject.js and replace the argument in the scrollBottom(100); line with the number of items you want (e.g., scrollBottom(200);)
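
The contents of inject.js are not reproduced in this article, so the following is only a rough sketch of how a "scroll until enough items are loaded" loop could be written. It is not the extension's actual scrollBottom implementation. The function name scrollUntil, the helper names countItems, scrollToBottom and wait, and the ".item" selector in the usage comment are all assumptions made for illustration.

```javascript
// Hypothetical sketch, NOT the extension's real code.
// Scrolls until at least `target` items are loaded, or until
// `maxIdleRounds` scrolls in a row fail to load anything new.
// `page` bundles three caller-supplied functions (assumed names):
//   countItems()     -> number of items currently in the page
//   scrollToBottom() -> triggers the site's "load more" behavior
//   wait(ms)         -> Promise that resolves after ms milliseconds
async function scrollUntil(target, page, maxIdleRounds = 3, delayMs = 1000) {
  let idleRounds = 0;
  let count = page.countItems();
  while (count < target && idleRounds < maxIdleRounds) {
    page.scrollToBottom();
    await page.wait(delayMs); // give the site time to fetch more items
    const newCount = page.countItems();
    // Reset the idle counter whenever new items appeared.
    idleRounds = newCount > count ? 0 : idleRounds + 1;
    count = newCount;
  }
  return count; // may be below `target` if the site ran out of items
}

// In a content script the helpers could be wired to the DOM, for example:
// scrollUntil(200, {
//   countItems: () => document.querySelectorAll(".item").length, // assumed selector
//   scrollToBottom: () => window.scrollTo(0, document.body.scrollHeight),
//   wait: (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
// }).then((n) => console.log("loaded", n, "items"));
```

Passing the helpers in as parameters keeps the loop independent of any particular site's markup. The idle-round limit prevents an endless loop when a search returns fewer items than requested.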

Acknowledgments

This source code was written by Matias Palomera from Nektra Advanced Computing.

If you like this article, you might also be interested in

Further Reading

Precise Scraping with Google Chrome

Developers often search the vast corpus of scraping tools for one that is capable of simulating a full browser. Their search is pointless: full browsers with extension capabilities are themselves great scraping tools. Among extensions, Google Chrome’s are by far the easiest to develop, while Mozilla has less restrictive APIs. Google offers a second way to control Chrome: the Debugger protocol. Unfortunately, the Debugger protocol is fairly slow.

The Google Chrome extension API is an excellent choice for writing an up-to-date scraper that uses a full browser with the latest HTML5 features and performance improvements. In a previous article, we described how to scrape the Microsoft TechNet App-V forum. Now, we will focus on VMware’s ThinApp. In this case, we develop a Google Chrome extension instead of a Python script.

Procedure

  1. You will need Google Chrome, Python 2.7, and lxml.html
  2. Download the code from GitHub
  3. Install the Google Chrome extension
  4. Go to the VMware ThinApp: Discussion Forum
  5. The scraper starts automatically
  6. Once it stops, go to the Google Chrome console and copy and paste the results in JSON format into the thinapp.json file
  7. Run thinapp_parser.py to generate the thinapp.csv file with the results
  8. Open the thinapp.csv file in a spreadsheet application
  9. To rank the results, add a column which divides the number of views by the number of days
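
The ranking in step 9 can also be computed in code rather than in a spreadsheet. The sketch below is only an illustration. It assumes each scraped record has title, views and created fields, and that "number of days" means the age of the thread. Neither the field names nor that interpretation is confirmed by the extension's actual output.

```javascript
// Hypothetical sketch: rank threads by views per day.
// Assumed record shape: { title, views, created }, where `created`
// is a date string that `new Date()` can parse (e.g. ISO 8601).
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function rankByViewsPerDay(threads, now = new Date()) {
  return threads
    .map((t) => {
      // Clamp to one day so very recent threads do not divide by ~0.
      const days = Math.max(1, (now - new Date(t.created)) / MS_PER_DAY);
      return { ...t, viewsPerDay: t.views / days };
    })
    .sort((a, b) => b.viewsPerDay - a.viewsPerDay); // highest rate first
}

// Example with made-up data:
const ranked = rankByViewsPerDay(
  [
    { title: "older thread", views: 100, created: "2013-01-01T00:00:00Z" },
    { title: "newer thread", views: 50, created: "2013-01-09T00:00:00Z" },
  ],
  new Date("2013-01-11T00:00:00Z")
);
console.log(ranked.map((t) => t.title));
```

Dividing by the thread's age favors threads that attract attention quickly over threads that have simply been around for a long time.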

Our Results: Top Twenty Threads

  1. Registry Isolation…
  2. Thinapp Internet Explorer 10
  3. Process (ifrun60.exe) remains active (Taskmanager) after closing thinapp under windows7 (xp works)
  4. Google Chrome browser
  5. File association not passing file to thinapp package
  6. Adobe CS3 Design Premium and FlexNET woes…
  7. How to thinapp Office 2010?
  8. Size limit of .dat file?
  9. ThinApp Citrix Receiver 3.2
  10. Visio 2010 Thinapp – Licensing issue
  11. Thinapp Google Chrome
  12. Thinapp IE7 running on Windows 7
  13. Adobe CS 6
  14. Failed to open, find, or create Sandbox directory
  15. Microsoft Project and Office issues
  16. No thinapp in thinapp factory + unable to create workpool
  17. IE8 Thinapp crashing with IE 10 installed natively
  18. ThinApp MS project and MS Visio 2010
  19. Difference between ESXi and vSphere and VMware view ??
  20. ThinAPP with AppSense

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

  • This approach can be used successfully to scrape JavaScript- and AJAX-heavy sites
  • Instead of copying the JSON data from the Chrome console, you can use the FileSystem API to write the results to a file
  • You can also write the CSV directly from Chrome instead of using an extra script
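
As a minimal sketch of the last note, the CSV text can be built in JavaScript with a small helper. The names csvField and toCsv and the column names in the example are hypothetical. The commented-out download step uses a Blob and an object URL, which is a simpler alternative to the FileSystem API mentioned above and not the same technique.

```javascript
// Hypothetical helpers for building CSV text from scraped records.
// Quote a field only when it contains a comma, quote, or line break,
// doubling any embedded quotes (the convention described in RFC 4180).
function csvField(value) {
  const text = value == null ? "" : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// `rows` is an array of plain objects; `columns` lists the keys to export.
function toCsv(rows, columns) {
  const lines = [columns.map(csvField).join(",")];
  for (const row of rows) {
    lines.push(columns.map((c) => csvField(row[c])).join(","));
  }
  return lines.join("\r\n") + "\r\n";
}

// Example with made-up data and assumed column names:
const csv = toCsv(
  [{ title: "Size limit of .dat file?", views: 10 }],
  ["title", "views"]
);
console.log(csv);

// In the browser, the text could then be offered as a download:
// const blob = new Blob([csv], { type: "text/csv" });
// const link = document.createElement("a");
// link.href = URL.createObjectURL(blob);
// link.download = "thinapp.csv";
// link.click();
```

Thread titles often contain commas and quotes, so quoting fields correctly matters more here than it would for purely numeric data.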

If you like this article, you might also be interested in

  1. Scraping for Semi-automatic Market Research
  2. Application Virtualization Benchmarking: Microsoft App-V Vs. Symantec
  3. Web Scraping Ajax and Javascript Sites [using HTMLUnit]
  4. Distributed Scraping With Multiple Tor Circuits
  5. VMWare ThinApp vs. Symantec Workspace

Resources

  1. Application Virtualization Smackdown
  2. Application Virtualization Market Report