July 29, 2013

Scraping Web Sites which Dynamically Load Data

Preface

More and more sites update their content dynamically, adding new items as the user scrolls down. Twitter is one of them: it displays only a limited number of news items initially and loads additional ones on demand. How can sites with this behavior be scraped?

In the previous article we played with Google Chrome extensions to scrape a forum that depends on JavaScript and XMLHttpRequest. Here we use the same technique to retrieve a specific number of news items for a given search. A list of additional alternatives is available in the Web Scraping Ajax and Javascript Sites article.

Code

Instructions

  1. Download the code from GitHub
  2. Load the extension in Google Chrome: settings => extensions => check “developer mode” => load unpacked extension
  3. An “eye” icon now appears on the Google Chrome bar
  4. Go to Twitter’s search page https://twitter.com/search-home and enter your search keywords
  5. Now press the “eye” icon and then the start button
  6. The scraping output is displayed on the console as JSON
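
The extension's source is not reproduced in this article, so the following is only a minimal sketch of the last step: collecting the text of the loaded items and printing it as JSON. It assumes the items match the XPath expression quoted in the comments below (//p[@class='js-tweet-text tweet-text']), which may no longer match Twitter's markup. The names collectTexts and TWEET_XPATH are hypothetical and are not taken from the extension.

```javascript
// Hypothetical helper: not taken from the extension's source.
// Same numeric value as XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in browsers.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// XPath quoted in the comment thread; Twitter's markup may have changed since.
const TWEET_XPATH = "//p[@class='js-tweet-text tweet-text']";

// Returns the trimmed text of every node in `doc` that matches `xpath`.
function collectTexts(doc, xpath) {
  const snapshot = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const texts = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    texts.push(snapshot.snapshotItem(i).textContent.trim());
  }
  return texts;
}

// In a page context (for example a content script), print the items as JSON.
if (typeof document !== "undefined") {
  console.log(JSON.stringify(collectTexts(document, TWEET_XPATH), null, 2));
}
```

Passing the document as a parameter keeps the helper independent of the page it runs in, so the same function can be tried against any object that exposes an evaluate method.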

Customization

  1. To change the number of news items to be scraped, open the file inject.js and replace the argument in the scrollBottom(100); line with the number of items you would like (e.g. scrollBottom(200);)
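
To illustrate what a call such as scrollBottom(200) has to achieve, here is a rough sketch of a scroll-until-enough-items loop. It is not the extension's actual scrollBottom implementation: the names scrollUntilCount and browserPage, the CSS selector, the one-second delay and the idle-round limit are all assumptions made for the example.

```javascript
// Hypothetical sketch: NOT the extension's actual scrollBottom implementation.
// Scrolls until at least `target` items are loaded, or until `maxIdleRounds`
// consecutive scrolls load nothing new. Resolves to the final item count.
async function scrollUntilCount(target, page, maxIdleRounds = 5) {
  let idleRounds = 0;
  let count = page.countItems();
  while (count < target && idleRounds < maxIdleRounds) {
    page.scrollToBottom();
    await page.waitForLoad();
    const newCount = page.countItems();
    // Count consecutive rounds in which scrolling loaded nothing new.
    idleRounds = newCount > count ? 0 : idleRounds + 1;
    count = newCount;
  }
  return count;
}

// Browser adapter. The selector and delay are assumptions; adjust to the page.
function browserPage(itemSelector, delayMs) {
  return {
    countItems: () => document.querySelectorAll(itemSelector).length,
    scrollToBottom: () => window.scrollTo(0, document.body.scrollHeight),
    waitForLoad: () => new Promise((resolve) => setTimeout(resolve, delayMs)),
  };
}

// Example use inside a page (assumed selector):
// scrollUntilCount(200, browserPage("p.js-tweet-text", 1000))
//   .then((n) => console.log(n + " items loaded"));
```

The idle-round limit is there so the loop ends when a search has fewer results than requested, instead of scrolling forever.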

Acknowledgments

This source code was written by Matias Palomera from Nektra Advanced Computing.

  • whatever

    Perhaps an example of the havoc dynamic content plays on browsers, I notice that however Disqus loads comments into Chrome, somehow it makes it impossible/difficult to search for text contained in the comment using Chrome’s Ctrl-F search function.

    The search function will show the search string exists, but is unable to position the browser to show the string.

    At least that was the case prior to last week’s Disqus mods. Darn, it seems to be working now, at least here.

  • Boomy

    Awesome post, thank you!

    What would be the best way to extract XPaths of elements of interest in other websites (such as the “//p[@class='js-tweet-text tweet-text']”)?

    • Motyar

      Try

      Visual Web Scraper

      http://webscrapemaster.com/try/

      • http://blog.databigbang.com/ Sebastian Wain

        @Motyar can you tell us more about this tool?

    • http://blog.databigbang.com/ Sebastian Wain

      Can you elaborate a little on your question so we can help you better?

  • H.a.w.k P.h.i.l

    I would just use phantom.js and casper.js to keep things simple

  • seoelixir

    What an interesting post. Thanks for sharing!
    http://homesteadroad.com

  • George

    I just read your post and installed the extension, but when I entered a word in Twitter search and clicked the “Start” button it didn’t scrape the Twitter search results. How can I fix this?

    • Matias Palomera

      Did you click the start button before or after searching for that word?

      • George

        Well, I tried both but to no avail

        • Matias Palomera

          Press F12 and go to “console”. What kind of error does it show?