Scraping Web Sites which Dynamically Load Data

Preface

More and more sites are implementing dynamic updates of their contents. New items are added as the user scrolls down. Twitter is one of these sites. Twitter only displays a certain number of news items initially, loading additional ones on demand. How can sites with this behavior be scraped?

In the previous article we used a Google Chrome extension to scrape a forum that depends on JavaScript and XMLHttpRequest. Here we use the same technique to retrieve a specific number of news items for a given search. Additional alternatives are listed in the Web Scraping Ajax and Javascript Sites article.

Code

Instructions

Download the code from GitHub
Load the extension in Google Chrome: Settings => Extensions => check “Developer mode” => Load unpacked extension
An “eye” icon now appears in the Google Chrome toolbar
Go to Twitter’s search page, https://twitter.com/search-home, and enter your search keywords
Press the “eye” icon and then the Start button
The scraping output is displayed on the console as JSON
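
The last step can be pictured with a small sketch. This is not the extension's actual code, and the article does not show the real output format: the function name itemsToJson, the field names index and text, and the CSS selector in the trailing comment (borrowed from the XPath a reader quotes in the comments below) are all assumptions made for illustration.

```javascript
// Hypothetical sketch, not the extension's real code.
// Converts a list of DOM-like nodes into a JSON string for the console.
// The output fields ("index", "text") are assumed, not the real format.
function itemsToJson(nodes) {
  const items = Array.from(nodes, (node, index) => ({
    index, // position of the node in the original list
    // Trim the text and collapse runs of whitespace into single spaces.
    text: (node.textContent || "").trim().replace(/\s+/g, " "),
  })).filter((item) => item.text.length > 0); // drop empty entries
  return JSON.stringify(items, null, 2);
}

// In a content script this might be called as follows (assumed selector):
// console.log(itemsToJson(document.querySelectorAll("p.js-tweet-text.tweet-text")));
```

Because the function only reads textContent, it can be tried outside the browser with plain objects before wiring it into a content script.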

Customization

To change the number of news items scraped, open the file inject.js and replace the argument in the scrollBottom(100); line with the number of items you would like (e.g., scrollBottom(200);).
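
The body of scrollBottom is not reproduced in this article, so the following is only a sketch of how a "scroll until N items are loaded" loop could work, not the extension's implementation. The names scrollUntil and browserPage, the page adapter object, the delayMs and maxIdleRounds options, and the selector in the usage note are all hypothetical.

```javascript
// Hypothetical sketch, not the extension's actual inject.js.
// `page` is an assumed adapter object with three methods:
//   countItems()     -> number of items currently in the DOM
//   scrollToBottom() -> scrolls so the site loads its next batch
//   wait(ms)         -> Promise that resolves after ms milliseconds
async function scrollUntil(target, page, options = {}) {
  const { delayMs = 1000, maxIdleRounds = 3 } = options;
  let idleRounds = 0; // consecutive scrolls that loaded nothing new
  let count = page.countItems();

  // Stop when enough items are loaded, or when the page appears exhausted.
  while (count < target && idleRounds < maxIdleRounds) {
    page.scrollToBottom();
    await page.wait(delayMs); // give the site time to fetch more items
    const newCount = page.countItems();
    idleRounds = newCount > count ? 0 : idleRounds + 1;
    count = newCount;
  }
  return count; // may be below target if the site ran out of items
}

// Browser adapter built on standard DOM calls (the selector is an assumption).
function browserPage(selector) {
  return {
    countItems: () => document.querySelectorAll(selector).length,
    scrollToBottom: () => window.scrollTo(0, document.body.scrollHeight),
    wait: (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  };
}
```

In a content script, a call such as scrollUntil(200, browserPage("p.tweet-text")) would keep scrolling until roughly 200 matching elements exist. The idle-round limit is a design choice to avoid looping forever when a search returns fewer results than requested.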

Acknowledgments

This source code was written by Matias Palomera from Nektra Advanced Computing.

22 thoughts on “Scraping Web Sites which Dynamically Load Data”

whatever says:

July 29, 2013 at 7:58 pm

Perhaps this is an example of the havoc dynamic content wreaks on browsers: I notice that however Disqus loads comments into Chrome, it makes it difficult or impossible to search for text contained in a comment using Chrome’s Ctrl-F search function.

The search function will show the search string exists, but is unable to position the browser to show the string.

At least that was the case prior to last week’s disqus mods. Darn, it seems to be working now at least here.
Boomy says:

July 29, 2013 at 11:29 pm

Awesome post, thank you!

What would be the best way to extract XPaths of elements of interest in other websites (such as “//p[@class='js-tweet-text tweet-text']”)?
- Motyar says:
  
  July 30, 2013 at 1:08 am
  
  Try
  
  Visual Web Scraper
  
  http://webscrapemaster.com/try/
  - Sebastian Wain says:
    
    July 31, 2013 at 12:26 pm
    
    @Motyar can you tell us more about this tool?
    - Motyar says:
      
      April 21, 2017 at 12:36 pm
      
      Sure. You can put any url into it and click on the element you want to fetch.
      
      for example we are putting https://isup.pro here
      
      try the visual web scraper http://motyar.info/webscrapemaster/try/?url=http%3A%2F%2Fisup.pro
- Sebastian Wain says:
  
  July 30, 2013 at 10:33 pm
  
  Can you elaborate a little on your question so we can help you better?
H.a.w.k P.h.i.l says:

July 31, 2013 at 4:03 am

I would just use phantom.js and casper.js to keep things simple
seoelixir says:

August 1, 2013 at 3:30 am

What an interesting post it is... thanks for sharing!
http://homesteadroad.com
George says:

August 10, 2013 at 3:17 am

I just read your post and I installed the extension but when I entered a word in twitter search and clicked the “Start” button it didn’t scrape the twitter search results. How to fix this?
- Matias Palomera says:
  
  August 11, 2013 at 1:27 pm
  
  Did you click the start button before or after searching for that word?
  - George says:
    
    August 15, 2013 at 7:23 am
    
    Well, I tried both but to no avail
    - Matias Palomera says:
      
      August 15, 2013 at 1:14 pm
      
      Press F12 and go to “console”. What kind of error does it show?
Angelina Fomina says:

September 24, 2014 at 8:42 pm

Hey there. Great posts! I’d love to get your feedback (good or bad) on a tool we are currently building – a visual data extractor specifically made for dynamic sites. http://www.parsehub.com. It’s a work in progress and any feedback will be much appreciated. :)
Dre Peters says:

December 12, 2014 at 6:42 am

CasperJS is the way to go. Forget about Chrome.
John Frey says:

December 31, 2014 at 4:39 am

The scraper I used for extracting data is similar to this one. But when I used the Amazon product scraper, I found it simple and nice as well.

For More Info : http://www.youtube.com/watch?v=lNkz8Cu5ORA
Abhay says:

March 31, 2015 at 9:07 am

I am very new to this web scraping world. I am a Python programmer and am able to scrape STATIC web pages using BeautifulSoup. But I want to know how to parse dynamically loaded web pages in Python (BeautifulSoup only sees the data in the page's view-source HTML).
Any help is appreciated.
- Jatin Grover says:
  
  April 30, 2015 at 11:45 pm
  
  use selenium with bs4
alison white says:

June 8, 2015 at 3:04 pm

guyuy
Hamman Samuel says:

June 19, 2015 at 2:33 pm

I tried this but there’s no JSON output being sent to the console. Any fix?
Alyk says:

June 15, 2020 at 11:39 am

Nice way of scraping dynamically loaded data. Thanks for your tips!
- "e-Scraper" Data Extracting says:
  
  June 6, 2022 at 10:08 am
  
  Check out the new page https://e-scraper.com/facebook/
alex says:

September 23, 2021 at 12:05 pm

I am a car dealer, and I also research private ads on the FB Marketplace and then resell the cars, or repair and resell them. e-scraper.com is an awesome on-demand web scraping service which helped me in my case.
Maybe it helps somebody else too.