Web Scraping 101: Pulling Stories from Hacker News

This is a guest post by Hartley Brody, whose book “The Ultimate Guide to Web Scraping” goes into much more detail on web scraping best practices. You can follow him on Twitter; it’ll make his day! Thanks for contributing, Hartley!

Hacker News is a treasure trove of information on the hacker zeitgeist. There are all sorts of cool things you could do with the information once you pull it, but first you need to scrape a copy for yourself.

Hacker News is actually a bit tricky to scrape since the site’s markup isn’t all that semantic, meaning the HTML elements and attributes don’t do a great job of explaining the content they contain. Everything on the HN homepage is in two tables, and there aren’t that many classes or ids to help us home in on the particular HTML elements that hold stories. Instead, we’ll have to rely more on patterns and on counting elements as we go.

Pull up the web inspector in Chrome and try zooming up and down the DOM tree. You’ll see that the markup is pretty basic. There’s an outer table that’s basically just used to keep things centered (85% of the screen width) and then an inner table that holds the stories.

Debugging Hacker News Page

If you look inside the inner table, you’ll see that the rows come in groups of three: the first row in each group contains the headlines and story links, the second row contains the metadata about each story — like who posted it and how many points it has — and the third row is empty and adds a bit of padding between stories. This should be enough information for us to get started, so let’s dive into the code.

I’m going to try and avoid the religious tech wars and just say that I’m using Python and my trusty standby libraries — requests and BeautifulSoup — although there are many other great options out there. Feel free to use your HTTP requests library and HTML parsing library of choice.

In its purest form, web scraping is two simple steps: 1. Make a request to a website that generates HTML, and 2. Pull the content you want out of the HTML that’s returned.

As the programmer, all you need to do is a bit of pattern recognition to find the URLs to request and the DOM elements to parse, and then you can let your libraries do the heavy lifting. Our code will just glue the two functions together to pull out just what we need.

import requests

from BeautifulSoup import BeautifulSoup
# make a single request to the homepage
r = requests.get("https://news.ycombinator.com/")
# convert the plaintext HTML markup into a DOM-like structure that we can search
soup = BeautifulSoup(r.text)
# parse through the outer and inner tables, then find the rows
outer_table = soup.find("table")
inner_table = outer_table.findAll("table")[1]
rows = inner_table.findAll("tr")
# create an empty list for holding stories
stories = []
# helps us iterate over the table
rows_per_story = 3
for row_num in range(0, len(rows)-rows_per_story, rows_per_story):
	# grab the 1st & 2nd rows and create an array of their cells
	story_pieces = rows[row_num].findAll("td")
	meta_pieces = rows[row_num + 1].findAll("td")
	# create our story dictionary
	story = { "current_position": story_pieces[0].string, "link": story_pieces[2].find("a")["href"], "title": story_pieces[2].find("a").string, }
	try:
		story["posted_by"] = meta_pieces[1].findAll("a")[0].string
	except IndexError:
		continue # this is a job posting, not a story
	stories.append(story)

import json
print json.dumps(stories, indent=1)

You’ll notice that inside the for loop, when we’re iterating over the rows in the table three at a time, we’re parsing out the individual pieces of content (link, title, etc.) by skipping to a particular number in the list of <td> elements returned. Generally, you want to avoid using magic numbers in your code, but without more semantic markup, this is what we’re left to work with.

This obviously makes the scraping code brittle: if the site is ever redesigned or the elements on the page move around at all, this code will no longer work as designed. But I’m guessing from the consistently minimalistic, retro look that HN isn’t getting a facelift any time soon. ;)
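
One way to soften the brittleness is to give the magic numbers names and keep the row grouping in a small helper. The sketch below is not part of the original script: the constant names and the group_rows helper are made up for illustration, and plain strings stand in for the <tr> elements so that it runs without an HTML parser.

```python
# Hypothetical names for the positions the scraper relies on.
ROWS_PER_STORY = 3   # headline row, metadata row, spacer row
STORY_ROW = 0        # offset of the headline row within a group
META_ROW = 1         # offset of the metadata row within a group


def group_rows(rows, rows_per_story=ROWS_PER_STORY):
    """Return (story_row, meta_row) pairs, ignoring an incomplete trailing group."""
    groups = []
    for start in range(0, len(rows) - rows_per_story + 1, rows_per_story):
        groups.append((rows[start + STORY_ROW], rows[start + META_ROW]))
    return groups


# Plain strings stand in for the <tr> elements returned by the HTML parser.
sample_rows = ["story 1", "meta 1", "spacer",
               "story 2", "meta 2", "spacer",
               "more link"]
for story_row, meta_row in group_rows(sample_rows):
    print(story_row, "|", meta_row)
```

If the layout changes, only the three constants at the top need to be revisited.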

Extension Ideas

Running this script top-to-bottom will print out a list of all the current stories on HN. But if you really want to do something interesting, you’ll probably want to grab snapshots of the homepage and the newest page fairly regularly. Maybe even every minute.

A number of projects have already built cool extensions and visualizations from (I presume) data scraped from Hacker News, such as:

  • http://hnrankings.info/
  • http://api.ihackernews.com/
  • https://www.hnsearch.com/

It’d be a good idea to set this up using crontab on your web server. Run crontab -e to pull up a vim editor and edit your machine’s cron jobs, and add a line that looks like this:

* * * * * python /path/to/hn_scraper.py

Then save it and exit (<esc> + “:wq”) and you should be good to go. Obviously, printing things to the command line doesn’t do you much good from a cron job, so you’ll probably want to change the script to write each snapshot of stories into your database of choice for later retrieval.
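
As a rough sketch of what writing each snapshot to a database could look like, the example below uses SQLite from the Python standard library. SQLite, the table layout, and the save_snapshot and load_snapshots helpers are assumptions made for illustration; any database would do.

```python
import json
import sqlite3
import time


def save_snapshot(conn, stories, taken_at=None):
    """Store one snapshot of the stories list as a JSON string with a timestamp."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots (taken_at REAL, stories TEXT)"
    )
    conn.execute(
        "INSERT INTO snapshots (taken_at, stories) VALUES (?, ?)",
        (time.time() if taken_at is None else taken_at, json.dumps(stories)),
    )
    conn.commit()


def load_snapshots(conn):
    """Return (taken_at, stories) tuples, oldest first."""
    rows = conn.execute(
        "SELECT taken_at, stories FROM snapshots ORDER BY taken_at"
    )
    return [(taken_at, json.loads(stories)) for taken_at, stories in rows]


# An in-memory database keeps the example self-contained; a cron job would
# pass a file path (for example "hn_snapshots.db", a made-up name) instead.
conn = sqlite3.connect(":memory:")
save_snapshot(conn, [{"current_position": "1.", "title": "Example story"}], taken_at=1.0)
print(load_snapshots(conn))
```

In the scraper itself, the final print call would be replaced by a call to save_snapshot with the stories list.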

Basic Web Scraping Etiquette

If you’re going to be scraping any site regularly, it’s important to be a good web scraping citizen so that your script doesn’t ruin the experience for the rest of us… aw who are we kidding, you’ll definitely get blocked before your script causes any noticeable site degradation for other users on Hacker News. But still, it’s good to keep these things in mind whenever you’re making frequent scrapes on the same site.

Your HTTP requests library probably lets you set headers like User-Agent and Accept-Encoding. You should set your user agent to something that identifies you and provides some contact information in case any site admins want to get in touch.

You also want to ensure you’re asking for the gzipped version of the site, so that you’re not hogging bandwidth with uncompressed page requests. Use the Accept-Encoding request header to tell the server your client can accept gzipped responses. The Python requests library automagically unzips those gzipped responses for you.

You might want to modify the requests.get call above to look more like this:

headers = { "User-Agent": "HN Scraper / Contact me: ", "Accept-Encoding": "gzip", }
r = requests.get("https://news.ycombinator.com/", headers=headers)
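
To get a feel for why the gzipped version matters, the small sketch below compresses some repetitive, table-heavy markup with the standard gzip module and compares the sizes. The sample markup is made up, and real savings depend on the page.

```python
import gzip

# Made-up markup that mimics a page built from many similar table rows.
row = "<tr><td>1.</td><td><a href='https://example.com/'>Example story</a></td></tr>\n"
html = (row * 200).encode("utf-8")

compressed = gzip.compress(html)
print("uncompressed bytes:", len(html))
print("gzipped bytes:", len(compressed))

# Decompressing gives back exactly the original bytes, which is the step
# the requests library performs for you on gzipped responses.
restored = gzip.decompress(compressed)
```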

Note that if you were doing the scraping with some sort of headless browser or something like Selenium which actually downloads all the resources on the page and renders them, you’d also want to make sure you’re caching the stylesheet and images to avoid unnecessary extra requests.

Scraping Web Sites which Dynamically Load Data

Preface

More and more sites are implementing dynamic updates of their contents. New items are added as the user scrolls down. Twitter is one of these sites. Twitter only displays a certain number of news items initially, loading additional ones on demand. How can sites with this behavior be scraped?

In the previous article we played with Google Chrome extensions to scrape a forum that depends on Javascript and XMLHttpRequest. Here we use the same technique for retrieving a specific number of news items based on a specific search. A list of additional alternatives is available in the Web Scraping Ajax and Javascript Sites article.

Code

Instructions

  1. Download the code from github
  2. Load the extension in Google Chrome: settings => extensions => check “developer mode” => load unpacked extension
  3. An “eye” icon now appears on the Google Chrome bar
  4. Go to Twitter’s search page https://twitter.com/search-home and enter your search keywords
  5. Now press the “eye” and then the start button
  6. The scraping output is displayed on the console as JSON

Customization

  1. To modify the number of news items to be scraped, open the file inject.js and change the argument in the scrollBottom(100); line to the number of items you would like (e.g. scrollBottom(200);)

Acknowledgments

This source code was written by Matias Palomera from Nektra Advanced Computing.

Web Scraping for Semi-automatic Market Research

It is easy to scrape Microsoft TechNet Forums (look at the XML output here: http://social.technet.microsoft.com/Forums/en-US/mdopappv/threads?outputAs=xml) and normalize the resulting information to get a better idea of each thread’s rank based on views and initial publication date. Knowing how issues are ranked can help a company choose what to focus on.

This code was used to scrape Microsoft TechNet’s forums. In the example below we scraped the App-V forum, since it is one of the application virtualization market’s leaders along with VMware ThinApp, and Symantec Workspace Virtualization.

These are the top ten threads for the App-V forum:

  1. “Exception has been thrown by the target of an invocation”
  2. Office 2010 KMS activation Error: 0xC004F074
  3. App-V 5 Hotfix 1
  4. Outlook 2010 Search Not Working
  5. Java 1.6 update 17 for Kronos (webapp)
  6. Word 2010 There was a problem sending the command to the program
  7. Utility to quickly install/remove App-V packages
  8. SAP GUI 7.1
  9. The dreaded: “The Application Virtualization Client could not launch the application”
  10. Sequencing Chrome with 4.6 SP1 on Windows 7 x64

The results show how frequently customers have issues with virtualizing Microsoft Office, Key Management Services (KMS), SAP, and Java. App-V competitors like Symantec Workspace Virtualization and VMware ThinApp have similar problems. Researching markets this way gives you a good idea of areas where you can contribute solutions.

The scraper stores all the information in a SQLite database. The database can be exported to a UTF-8 CSV file using the csv_App-V.py script. We imported the file with Microsoft Excel and then normalized the ranking of the threads. To normalize it we divided the number of views by the age of the thread, so threads with more views per day rank higher. Again, the scraper can be used on any Microsoft forum on Social TechNet. Try it out on your favorite forum.
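
The normalization step can also be done in a few lines of Python instead of Excel. The sketch below is only an illustration: the thread records are made-up sample data, the field names are hypothetical, and threads created on the reference day are counted as one day old to avoid dividing by zero.

```python
from datetime import date


def views_per_day(views, created, today):
    """Divide the view count by the thread's age in days (at least one day)."""
    age_days = max((today - created).days, 1)
    return views / age_days


# Made-up sample data; the real values come from the scraper's CSV export.
threads = [
    {"title": "Thread A", "views": 9000, "created": date(2013, 1, 1)},
    {"title": "Thread B", "views": 3000, "created": date(2013, 4, 1)},
    {"title": "Thread C", "views": 500, "created": date(2013, 4, 11)},
]
today = date(2013, 4, 11)

ranked = sorted(
    threads,
    key=lambda t: views_per_day(t["views"], t["created"], today),
    reverse=True,
)
for thread in ranked:
    rate = views_per_day(thread["views"], thread["created"], today)
    print(thread["title"], round(rate, 1))
```

Note how the oldest thread has the most views in total but ranks last once its age is taken into account.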

Code

Prerequisites: lxml.html

The code is available at microsoft-technet-forums-scraping [github] . It was written by Matias Palomera from Nektra Advanced Computing, who received valuable support from Victor Gonzalez.

Usage

  1. Run scrapper-App-V.py
  2. Then run csv_App-V.py
  3. The results are available in the App-V.csv file

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

  1. This code is single-threaded. You can take a look at our discovering web resources code to optimize it with multithreading.
  2. Microsoft has given scrapers a special gift: it is possible to use the outputAs variable in the URL to get the structured information as XML instead of parsing HTML web pages.
  3. Our articles Distributed Scraping With Multiple Tor Circuits and Running Your Own Anonymous Rotating Proxies show how to implement your own rotating proxy infrastructure with Tor.
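
As a small illustration of the outputAs trick from the notes above, the sketch below adds outputAs=xml to a forum URL with the standard urllib.parse module. The as_xml_url helper is a made-up name, and the sketch only builds the URL; it does not fetch or parse the XML.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def as_xml_url(url):
    """Return the same URL with the outputAs query variable set to xml."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["outputAs"] = "xml"
    return urlunsplit(parts._replace(query=urlencode(query)))


forum_url = "http://social.technet.microsoft.com/Forums/en-US/mdopappv/threads"
print(as_xml_url(forum_url))
```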

Distributed Scraping With Multiple Tor Circuits

Multiple Circuit Tor Solution

When you rapidly fetch different web pages from a single IP address you risk getting stuck in the middle of the scraping. Some sites completely ban scrapers, while others follow a rate limit policy. For example, if you automate Google searches, Google will require you to solve captchas. Google is confused by many people using the same IP, and by search junkies. It used to be costly to get enough IPs to build a good scraping infrastructure. Now there are alternatives: cheap rotating proxies and Tor. Other options include specialized crawling and scraping services like 80legs, or even running Tor on AWS EC2 instances. The advantage of running Tor is its widespread network coverage. Tor is also free of charge. Unfortunately, Tor does not allow you to control the bandwidth and latency.

All navigation performed when you start a session on Tor will be associated with the same exit point and its IP addresses. To renew these IP addresses you must restart Tor, send a newnym signal, or, as in our case study, run multiple Tor instances at the same time, assigning a different port to each one. Many SOCKS proxies will then be ready for use. It is possible for more than one instance to share the same circuit, but that is beyond the scope of this article.
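
For readers who prefer the newnym route, here is a minimal sketch of sending that signal over a Tor control port. AUTHENTICATE and SIGNAL NEWNYM are commands from Tor's control protocol; the empty password only works when control port authentication is disabled. The helper names, the host, and the default port 8118 (the base control port used by the launcher script in this article) are assumptions for illustration.

```python
import socket


def send_newnym(sock):
    """Request new circuits over an already connected control socket."""
    # Empty password: only accepted when control port authentication is disabled.
    sock.sendall(b'AUTHENTICATE ""\r\n')
    if not sock.recv(1024).startswith(b"250"):
        raise RuntimeError("Tor control port rejected the authentication")
    sock.sendall(b"SIGNAL NEWNYM\r\n")
    # Tor answers "250 OK" when the signal was accepted.
    return sock.recv(1024).startswith(b"250")


def renew_identity(host="127.0.0.1", control_port=8118):
    """Open a control connection, send NEWNYM, and close the connection."""
    sock = socket.create_connection((host, control_port), timeout=10)
    try:
        return send_newnym(sock)
    finally:
        sock.close()
```

Calling renew_identity() requires a running Tor instance with its control port open, so the sketch defines the helpers without calling them.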

IMDB: A Case Study

If you like movies, Internet Movie Database is omnipresent in your daily life. IMDB users have always been able to share their movies and lists. Recently, however, the site turned previously shared public movie ratings private by default. Useful movie ratings disappeared from the Internet with this change, and most of those that were manually set back to public are not indexed by search engines. All links that previously pointed to user ratings are now broken since the URLs have changed. How can you find all the public ratings available on IMDB?

If you follow IMDB’s scraping policy it will take years, since the site contains tens of millions of user pages. Distributed scraping is the best way to solve this issue and quickly discover which users are sharing their ratings. Our method just retrieves the HTTP response code to find out whether a user is sharing their ratings.

Our code sample has three elements:

  1. Multiple Tor instances listening on different ports. The result is many SOCKS proxies available for use with different Tor circuits.
  2. A Python script that launches multiple workers in different threads. Each worker uses a different SOCKS port.
  3. MongoDB to persist the state of the scraping if the process fails or if you want to stop the process and continue later.
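
The response code check at the heart of the method can be sketched in a few lines. The URL pattern and the handling of codes mirror the script in this article (200 marks a user with public ratings, 200 and 404 are treated as settled, anything else is tried again); the helper names are made up for illustration.

```python
URL_FORMAT = "http://www.imdb.com/user/ur{0}/ratings"

# Codes that settle a user page; any other code means "try again later".
FINAL_CODES = (200, 404)


def user_url(index):
    """Build the ratings URL for a numeric user index."""
    return URL_FORMAT.format(index)


def is_public(code):
    """A 200 response means the ratings page is publicly visible."""
    return code == 200


def needs_retry(code):
    """Everything except 200 and 404 is left for another pass."""
    return code not in FINAL_CODES


for code in (200, 404, 503, 0):
    print(code, "public" if is_public(code) else "not public",
          "retry" if needs_retry(code) else "done")
```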

Shell Script and Source Code

Prerequisites

  1. Tor
  2. MongoDB
  3. PyMongo
  4. SocksiPy
  5. Python

Multiple Tor Launcher

You must run the following script before running the Python script. To adjust the number of Tor instances just change the interval in the loop.

#!/bin/bash

base_socks_port=9050
base_control_port=8118

# Create data directory if it doesn't exist
if [ ! -d "data" ]; then
	mkdir "data"
fi

#for i in {0..10}
for i in {0..80}

do
	j=$((i+1))
	socks_port=$((base_socks_port+i))
	control_port=$((base_control_port+i))
	if [ ! -d "data/tor$i" ]; then
		echo "Creating directory data/tor$i"
		mkdir "data/tor$i"
	fi
	# Take into account that authentication for the control port is disabled. Must be used in secure and controlled environments

	echo "Running: tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword \"\" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i"

	tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword "" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i
done
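
The port arithmetic in the launcher above can be mirrored in Python to check that every worker gets its own Tor instance. This is only a sketch: the function names are made up, and the base ports are taken from the launcher script.

```python
BASE_SOCKS_PORT = 9050    # same base as in the launcher script
BASE_CONTROL_PORT = 8118  # same base as in the launcher script


def tor_ports(instances):
    """Return one (socks_port, control_port) pair per Tor instance."""
    return [(BASE_SOCKS_PORT + i, BASE_CONTROL_PORT + i) for i in range(instances)]


def worker_socks_ports(n_workers, instances):
    """Give each worker its own SOCKS port; fail if there are too few instances."""
    if n_workers > instances:
        raise ValueError("more workers than Tor instances")
    return [socks_port for socks_port, _ in tor_ports(instances)[:n_workers]]


# The loop {0..80} in the launcher starts 81 instances.
ports = tor_ports(81)
print(ports[0], ports[-1])
print(len(worker_socks_ports(71, 81)), "workers have a dedicated SOCKS port")
```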

Python Script

The script below stores its results in MongoDB, in the “scraping” database under the “imdb.ratings” collection. To change the number of simultaneous workers, adjust the “Discovery.NWorkers” variable. Note that the number of workers must be equal to or less than the number of Tor instances.

#!/usr/bin/python

import httplib
import socks
import urllib2
from Queue import Queue
from threading import Thread, Condition, Lock
from threading import active_count as threading_active_count

import time
from pymongo import Connection
import pymongo

url_format = 'http://www.imdb.com/user/ur{0}/ratings'

http_codes_counter = {}

MONGODB_HOSTNAME = '192.168.0.118'

"""
https://gist.github.com/869791

SocksiPy + urllib handler

version: 0.2
author: e

This module provides a Handler which you can use with urllib2 to allow it to tunnel your connection through a socks.sockssocket socket, without monkey patching the original socket...
"""

class SocksiPyConnection(httplib.HTTPConnection):
    def __init__(self, proxytype, proxyaddr, proxyport = None, rdns = True, username = None, password = None, *args, **kwargs):
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        httplib.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))

class SocksiPyHandler(urllib2.HTTPHandler):
    def __init__(self, *args, **kwargs):
        self.args = args
        self.kw = kwargs
        urllib2.HTTPHandler.__init__(self)

    def http_open(self, req):
        def build(host, port=None, strict=None, timeout=0):
            conn = SocksiPyConnection(*self.args, host=host, port=port, strict=strict, timeout=timeout, **self.kw)
            return conn
        return self.do_open(build, req)

class Monitor(Thread):
	def __init__(self, queue, discovery):
		Thread.__init__(self)
		self.queue = queue
		self.discovery = discovery
		self.finish_signal = False

	def finish(self):
		self.finish_signal = True

	def run(self):
		while not self.finish_signal:
			time.sleep(5)
			print "Elements in Queue:", self.queue.qsize(), "Active Threads:", threading_active_count(), "Exceptions Counter:", self.discovery.exception_counter

class Worker(Thread):
	def __init__(self, queue, discovery, socks_proxy_port):
		Thread.__init__(self)
		self.queue = queue
		self.discovery = discovery
		self.socks_proxy_port = socks_proxy_port
		self.opener = urllib2.build_opener(SocksiPyHandler(socks.PROXY_TYPE_SOCKS4, 'localhost', self.socks_proxy_port))
		self.conn = Connection(MONGODB_HOSTNAME, 27017)
		self.db = self.conn.scraping
		self.coll = self.db.imdb.ratings

	def get_url(self, url):
		try:
			#h = urllib2.urlopen(url)
			h = self.opener.open(url)

			return h.getcode()

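		except urllib2.HTTPError, e:
			# HTTP-level errors (404, 503, ...) carry a status code
			return e.code

		except urllib2.URLError, e:
			# Connection-level errors have no status code; 0 keeps the item in the retry set
			return 0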

	def run(self):
		while True:
			index = self.queue.get()

			if index is None:
				self.queue.put(None) # Notify the next worker
				break

			try:
				url = url_format.format(index)

				code = self.get_url(url)

				self.coll.update({'index':index}, {'$set': {'last_response':code}})

			except (socks.Socks4Error, httplib.BadStatusLine), e:
				# TypeError: 'Socks4Error' object is not callable
				print e
				self.discovery.exception_counter_lock.acquire()
				self.discovery.exception_counter += 1
				self.discovery.exception_counter_lock.release()
				# last_response is unchanged, so this element is picked up again in the next cycle

			finally:
				# Count the element as handled even after an error; otherwise
				# records_to_process never reaches 0 and the Croupier waits forever
				self.discovery.lock.acquire()
				self.discovery.records_to_process -= 1
				if self.discovery.records_to_process == 0:
					self.discovery.lock.notify()
				self.discovery.lock.release()

			time.sleep(1.5)

class Croupier(Thread):
	Base = 0
	Top = 25000000
	#Top = 1000
	def __init__(self, queue, discovery):
		Thread.__init__(self)
		self.conn = Connection(MONGODB_HOSTNAME, 27017)
		self.db = self.conn.scraping
		self.coll = self.db.imdb.ratings
		self.finish_signal = False
		self.queue = queue
		self.discovery = discovery
		self.discovery.records_to_process = 0

	def run(self):
		# Look if imdb collection is empty. Only if its empty we create all the items
		c = self.coll.count()
		if c == 0:
			print "Inserting items"
			self.coll.ensure_index([('index', pymongo.ASCENDING), ('last_response', pymongo.ASCENDING)])
			for i in xrange(Croupier.Base, Croupier.Top):
				self.coll.insert({'index':i, 'url': url_format.format(i), 'last_response': 0})

		else:
			print "Using #", c, " persisted items"

		while True:
			#items = self.coll.find({'last_response': {'$ne': 200}})
			items = self.coll.find({'$and': [{'last_response': {'$ne': 200}}, {'last_response' : {'$ne': 404}}]}, timeout = False)

			self.discovery.records_to_process = items.count()

			if self.discovery.records_to_process == 0:
				break

			for item in items:
				self.queue.put(item['index'])

			# Wait until the last item is updated on the db
			self.discovery.lock.acquire()
			while self.discovery.records_to_process != 0:
				self.discovery.lock.wait()
			self.discovery.lock.release()

#			time.sleep(5)

		# Send a 'signal' to workers to finish
		self.queue.put(None)

	def finish(self):
		self.finish_signal = True

class Discovery:
	NWorkers = 71
	SocksProxyBasePort = 9050
	Contention = 10000

	def __init__(self):
		self.queue = Queue(Discovery.Contention)
		self.workers = []
		self.lock = Condition()
		self.exception_counter_lock = Lock()
		self.records_to_process = 0
		self.exception_counter = 0

	def start(self):
		croupier = Croupier(self.queue, self)
		croupier.start()

		for i in range(Discovery.NWorkers):
			worker = Worker(self.queue, self, Discovery.SocksProxyBasePort + i)
			self.workers.append(worker)

		for w in self.workers:
			w.start()

		monitor = Monitor(self.queue, self)
		monitor.start()

		for w in self.workers:
			w.join()

		croupier.join()

		print "Queue finished with:", self.queue.qsize(), "elements"

		monitor.finish()

def main():
	discovery = Discovery()
	discovery.start()

if __name__ == '__main__':
	main()

#
# MISC NOTES
#
# - How many IMDB ratings pages are currently indexed by Google? query: inurl:www.imdb.com/user/*/ratings
# - [pymongo] cursor id '239432858681488351' not valid at server Options: http://groups.google.com/group/mongodb-user/browse_thread/thread/4ed6e3d77fb1c2cf?pli=1
#     That error generally means that the cursor timed out on the server -
#     this could be the case if you are performing a long running operation
#     while iterating over the cursor. The best bet is probably to turn off
#     the timeout by passing "timeout=False" in your call to find:
#

Once the script has run, you can list the users with public ratings using the following MongoDB query: db.imdb.ratings.find({'last_response': 200})
Try exporting the movie ratings. This is the easiest part because it is now a comma-separated value file and you don’t need an XPath query.

Additional observations

  1. We are not just using MongoDB because it is fancy, but also because it is very practical for quickly prototyping and persisting data along the way. The well-known “global lock” limitation on MongoDB (and many other databases) does not significantly affect its ability to efficiently store data.
  2. We use SocksiPy to allow us to use different proxies at the same time.
  3. If you are serious about using Tor to build a distributed infrastructure you might consider running Tor proxies on AWS EC2 instances as needed.
  4. Do not forget to run Tor instances in a secure environment since the control port is open to everyone without authentication.
  5. Our solution is easily scalable.
  6. If you get many 503 return codes, try balancing the quantity of proxies and delaying each worker’s activity.
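
One possible way to delay a worker after repeated 503 responses is exponential backoff, sketched below. This is not what the script above does (it sleeps a fixed 1.5 seconds); the next_delay helper and the 60 second cap are assumptions for illustration.

```python
def next_delay(consecutive_503s, base_delay=1.5, max_delay=60.0):
    """Double the pause for every consecutive 503, up to max_delay seconds."""
    return min(max_delay, base_delay * (2 ** consecutive_503s))


# A worker would reset the counter to 0 after any response other than 503.
for failures in range(6):
    print(failures, next_delay(failures))
```

In a worker, the fixed time.sleep(1.5) call would become time.sleep(next_delay(failures)), with failures tracked per worker.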

See Also

  1. Running Your Own Anonymous Rotating Proxies
  2. Web Scraping Ajax and Javascript Sites
  3. Luminati Distributed Web Crawling

Resources

  1. An Improved Algorithm for Tor Circuit Scheduling
  2. How Much Anonymity does Network Latency Leak?
  3. StackOverflow Tor Questions
  4. New IMDB Ratings Breaks Everything
  5. Distributed Harvesting and Scraping