Web Scraping 101: Pulling Stories from Hacker News

This is a guest post by Hartley Brody, whose book “The Ultimate Guide to Web Scraping” goes into much more detail on web scraping best practices. You can follow him on Twitter; it’ll make his day! Thanks for contributing, Hartley!

Hacker News is a treasure trove of information on the hacker zeitgeist. There are all sorts of cool things you could do with the information once you pull it, but first you need to scrape a copy for yourself.

Hacker News is actually a bit tricky to scrape since the site’s markup isn’t all that semantic, meaning the HTML elements and attributes don’t do a great job of explaining the content they contain. Everything on the HN homepage is in two tables, and there aren’t that many classes or ids to help us home in on the particular HTML elements that hold stories. Instead, we’ll have to rely more on patterns and on counting elements as we go.

Pull up the web inspector in Chrome and try zooming up and down the DOM tree. You’ll see that the markup is pretty basic. There’s an outer table that’s basically just used to keep things centered (85% of the screen width) and then an inner table that holds the stories.

Debugging Hacker News Page

If you look inside the inner table, you’ll see that the rows come in groups of three: the first row in each group contains the headlines and story links, the second row contains the metadata about each story — like who posted it and how many points it has — and the third row is empty and adds a bit of padding between stories. This should be enough information for us to get started, so let’s dive into the code.

I’m going to try and avoid the religious tech wars and just say that I’m using Python and my trusty standby libraries — requests and BeautifulSoup — although there are many other great options out there. Feel free to use your HTTP requests library and HTML parsing library of choice.

In its purest form, web scraping is two simple steps: 1. Make a request to a website that generates HTML, and 2. Pull the content you want out of the HTML that’s returned.

As the programmer, all you need to do is a bit of pattern recognition to find the URLs to request and the DOM elements to parse, and then you can let your libraries do the heavy lifting. Our code will just glue the two functions together to pull out just what we need.

import requests
from bs4 import BeautifulSoup

# make a single request to the homepage
r = requests.get("https://news.ycombinator.com/")

# convert the plaintext HTML markup into a DOM-like structure that we can search
soup = BeautifulSoup(r.text, "html.parser")

# parse through the outer and inner tables, then find the rows
outer_table = soup.find("table")
inner_table = outer_table.find_all("table")[1]
rows = inner_table.find_all("tr")

stories = []  # create an empty list for holding stories
rows_per_story = 3  # helps us iterate over the table

for row_num in range(0, len(rows) - rows_per_story, rows_per_story):
    # grab the 1st & 2nd rows and create an array of their cells
    story_pieces = rows[row_num].find_all("td")
    meta_pieces = rows[row_num + 1].find_all("td")

    # create our story dictionary
    story = {
        "current_position": story_pieces[0].string,
        "link": story_pieces[2].find("a")["href"],
        "title": story_pieces[2].find("a").string,
    }

    try:
        story["posted_by"] = meta_pieces[1].find_all("a")[0].string
    except IndexError:
        continue  # this is a job posting, not a story

    stories.append(story)

import json
print(json.dumps(stories, indent=1))

You’ll notice that inside the for loop, when we’re iterating over the rows in the table three at a time, we’re parsing out the individual pieces of content (link, title, etc.) by skipping to a particular number in the list of <td> elements returned. Generally, you want to avoid using magic numbers in your code, but without more semantic markup, this is what we’re left to work with.
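One way to soften the magic-number problem is to give each index a name and keep the row-grouping logic in one place. The sketch below works on plain Python lists standing in for table rows and cells; the constant names and the `group_story_rows` helper are illustrative assumptions, not part of the original script.

```python
# Named positions for the cells we care about (assumed names; the values
# mirror the indices used in the scraper above).
POSITION_CELL = 0   # first <td>: the story's rank
TITLE_CELL = 2      # third <td>: holds the headline link
ROWS_PER_STORY = 3  # title row, metadata row, spacer row


def group_story_rows(rows, rows_per_story=ROWS_PER_STORY):
    """Yield (title_row, meta_row) pairs, skipping each group's spacer row."""
    for start in range(0, len(rows) - rows_per_story + 1, rows_per_story):
        yield rows[start], rows[start + 1]


# Stand-in data: each row is a list of cell strings instead of parsed <td>s.
rows = [
    ["1.", "", "First story"], ["", "42 points by alice"], [],
    ["2.", "", "Second story"], ["", "7 points by bob"], [],
]

for title_row, meta_row in group_story_rows(rows):
    print(title_row[POSITION_CELL], title_row[TITLE_CELL], "|", meta_row[1])
```

Note that this helper yields every complete group of three, whereas the loop in the scraper stops before the final rows of the table.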

This obviously makes the scraping code brittle: if the site is ever redesigned or the elements on the page move around at all, this code will no longer work as designed. But I’m guessing from the consistently minimalistic, retro look that HN isn’t getting a facelift any time soon. ;)

Extension Ideas

Running this script top-to-bottom will print out a list of all the current stories on HN. But if you really want to do something interesting, you’ll probably want to grab snapshots of the homepage and the newest page fairly regularly. Maybe even every minute.

A number of projects have already built cool extensions and visualizations from (I presume) scraped Hacker News data, such as:

  • http://hnrankings.info/
  • http://api.ihackernews.com/
  • https://www.hnsearch.com/

To take those regular snapshots, it’d be a good idea to set the script up using crontab on your web server. Run crontab -e to open your machine’s cron jobs in an editor (vim on many systems), and add a line that looks like this:

* * * * * python /path/to/hn_scraper.py

Then save it and exit (<esc> + “:wq”) and you should be good to go. Obviously, printing things to the command line doesn’t do you much good from a cron job, so you’ll probably want to change the script to write each snapshot of stories into your database of choice for later retrieval.
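As a rough sketch of the "write each snapshot into a database" idea, here is one way it could look with SQLite from the standard library. The table layout and the `save_snapshot` / `load_snapshots` helpers are assumptions made for this example; any database would do.

```python
import json
import sqlite3
import time


def save_snapshot(conn, stories, taken_at=None):
    """Store one scrape of the homepage as a JSON blob keyed by timestamp."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots (taken_at REAL, stories TEXT)"
    )
    taken_at = time.time() if taken_at is None else taken_at
    conn.execute(
        "INSERT INTO snapshots (taken_at, stories) VALUES (?, ?)",
        (taken_at, json.dumps(stories)),
    )
    conn.commit()


def load_snapshots(conn):
    """Return all snapshots as (taken_at, stories) tuples, oldest first."""
    cursor = conn.execute(
        "SELECT taken_at, stories FROM snapshots ORDER BY taken_at"
    )
    return [(taken_at, json.loads(blob)) for taken_at, blob in cursor]


# A cron job would connect to a file on disk instead; an in-memory
# database keeps this example self-contained.
conn = sqlite3.connect(":memory:")
save_snapshot(conn, [{"current_position": "1.", "title": "Example story"}])
print(load_snapshots(conn))
```

Storing the whole list as one JSON value keeps the sketch short; a table with one row per story would make later queries by title or rank easier.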

Basic Web Scraping Etiquette

If you’re going to be scraping any site regularly, it’s important to be a good web scraping citizen so that your script doesn’t ruin the experience for the rest of us… aw who are we kidding, you’ll definitely get blocked before your script causes any noticeable site degradation for other users on Hacker News. But still, it’s good to keep these things in mind whenever you’re making frequent scrapes on the same site.

Your HTTP requests library probably lets you set headers like User-Agent and Accept-Encoding. You should set your User-Agent to something that identifies you and provides some contact information in case any site admins want to get in touch.

You also want to ensure you’re asking for the gzipped version of the site, so that you’re not hogging bandwidth with uncompressed page requests. Use the Accept-Encoding request header to tell the server your client can accept gzipped responses. The Python requests library automagically unzips those gzipped responses for you.

You might want to modify the request in the script above to look more like this:

headers = {
    "User-Agent": "HN Scraper / Contact me: ",
    "Accept-Encoding": "gzip",
}
r = requests.get("https://news.ycombinator.com/", headers=headers)
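To make the "automagic" part concrete, here is roughly what a client has to do when it negotiates gzip itself. This sketch uses only the standard library instead of requests; the `build_request` and `decode_body` helpers and the User-Agent value are made up for illustration.

```python
import gzip
import urllib.request

HEADERS = {
    # Placeholder identity: substitute your own project name and contact details.
    "User-Agent": "HN Scraper (contact: you@example.com)",
    "Accept-Encoding": "gzip",
}


def build_request(url, headers=HEADERS):
    """Create a request object carrying our headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=headers)


def decode_body(raw_bytes, content_encoding):
    """Undo gzip compression if the server says it applied it."""
    if (content_encoding or "").lower() == "gzip":
        raw_bytes = gzip.decompress(raw_bytes)
    return raw_bytes.decode("utf-8")


# Against a live site the two helpers would be combined like this (not run here):
#   with urllib.request.urlopen(build_request("https://news.ycombinator.com/")) as resp:
#       html = decode_body(resp.read(), resp.headers.get("Content-Encoding"))

# Offline demonstration with a hand-compressed body.
print(decode_body(gzip.compress(b"<html>hello</html>"), "gzip"))
```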

Note that if you were doing the scraping with some sort of headless browser or something like Selenium which actually downloads all the resources on the page and renders them, you’d also want to make sure you’re caching the stylesheet and images to avoid unnecessary extra requests.


Voice Recognition + Content Extraction + TTS = Innovative Web Browsing

An Interesting Opportunity

Voice recognition and text to speech technologies make a good combo for future user interfaces. Imagine browsing the web by voice and listening to blogs like you listen to podcasts. Your mobile phone will be able to provide these features in the near future. Voice web browsing is not only useful for the visually impaired, but also for users who wish to do other things while surfing the web. Voice recognition and text to speech feature prominently on the iPhone 4S, but voice web browsing is not incorporated.

This blog post includes code which allows you to extract the main text content from a web page and convert it to a playable audio file. The process is triggered by simple voice commands such as “receive hacker news”, “read article 7” or “save article 19”. The resulting audio file can also be synchronized to other services such as Dropbox, Amazon Cloud Drive, and Apple iCloud to be played later. We use IKVM to run the boilerpipe library, which is written in Java, on .NET. We chose .NET because it comes with “batteries included”: ready-to-use voice recognition and text to speech capabilities.
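To illustrate how small the command vocabulary is, the sketch below turns a recognized phrase into an action and an optional article number. It is written in Python rather than the C# of the actual project, and the regular expression and the `parse_command` helper are assumptions for this example; the phrases themselves come from the article and its code listing.

```python
import re

# Spoken phrases used by the project; the pattern and names are assumptions.
COMMAND_PATTERN = re.compile(r"^(read|save) article (\d+)$")
SIMPLE_COMMANDS = {"receive hacker news", "stop", "test"}


def parse_command(text):
    """Turn recognized speech into an (action, article_number) tuple.

    Returns None for phrases outside the small command grammar.
    """
    text = text.strip().lower()
    if text in SIMPLE_COMMANDS:
        return (text, None)
    match = COMMAND_PATTERN.match(text)
    if match:
        return (match.group(1), int(match.group(2)))
    return None


for phrase in ["receive hacker news", "read article 7",
               "save article 19", "open the pod bay doors"]:
    print(phrase, "->", parse_command(phrase))
```

Restricting recognition to a fixed grammar like this is what lets a general-purpose recognizer work acceptably without long training: it only has to tell a few dozen phrases apart.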

The present article demonstrates an application of main text content extraction (MTCE). For MTCE methods, see our article: Extraction of Main Text Content Using the Google Reader NoAPI. For further exploration of the voice recognition and text to speech capabilities of .NET, consult the following links:

Good voice recognition and text to speech systems are expensive and require training. Companies which provide these services do not offer a more granular way to access a web service or run a local engine. For example, AT&T Natural Voice for TTS sounds good, but their licensing terms are prohibitive for small companies and startups. On the voice recognition side of the equation, Nuance has been accumulating patents to strengthen their market position, making it difficult for others to compete. Although many other companies offer voice recognition, most of them actually use Nuance’s technology. See, for example, “Siri, Do You Use Nuance Technology? Siri: I’m Sorry, I Can’t Answer That.” Voice recognition systems require training to improve their accuracy. SpinVox, a Nuance subsidiary, used “conversion experts”: teams which listened to audio messages and manually converted them to text. If you want to use this approach, you’ll need to wait for Amazon’s Mechanical Turk to offer micro jobs in real time.

What is missing here? None of the leading companies offers good quality voice recognition on a pay-per-use basis. Google seems to be actively researching voice recognition, and has achieved impressive results with “experimental” speech recognition technology. Sadly, Google’s voice recognition and text to speech APIs cannot be used to develop arbitrary desktop and server applications. Their use is restricted to Android phones, Chrome’s beta HTML5 support, and Chrome extensions. It would be nice for Google to remove this restriction and include this service in their APIs Console.

Our code demonstrates that it is possible to use voice recognition and text to speech while avoiding the licensing, patent and API conundrum.

Using VR and TTS under .NET


If you use our VoiceWebBrowsing code, also available on GitHub, you will just need Microsoft Visual Studio 2010. However, if there is a new version of boilerpipe, you will have to generate the boilerpipe .NET assemblies yourself as follows:

  1. Have Microsoft Visual Studio 2010
  2. Download boilerpipe from http://code.google.com/p/boilerpipe/
  3. Download and install IKVM from http://www.ikvm.net/
  4. Run boilerpipe library and dependencies through ikvmc: ikvmc -nojni -target:library  boilerpipe-1.2.0.jar lib\nekohtml-1.9.13.jar lib\xerces-2.9.1.jar
  5. Use the resulting boilerpipe-1.2.0.dll .NET assembly from ikvmc.


using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Speech.Recognition;
using System.Speech.Synthesis;
using System.Speech.AudioFormat;
using System.Net;
using System.IO;
using System.Xml;

namespace VoiceWebBrowsing
{
    public partial class MainForm : Form
    {
        #region Constants
        //const string _bogusRSSFeed = "<items><item><title>First title</title><link>http://</link></item><item><title>Second title</title><link>http://</link></item></items>";
        const string _bogusRSSFeed = null;
        const string downloadPath = @"..\..\..\..\Download";
        #endregion

        #region Private Variables
        SpeechRecognizer _speechRecognizer = new SpeechRecognizer();
        SpeechSynthesizer _ttsVoice = new SpeechSynthesizer();
        SpeechAudioFormatInfo formatInfo = new SpeechAudioFormatInfo(8000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
        Queue<string> _queue = new Queue<string>();
        List<string> articleList = new List<string>();
        HashSet<SpeechSynthesizer> tts2FileTasks = new HashSet<SpeechSynthesizer>();
        #endregion

        #region Private Methods
        private void InitGrammar()
        {
            GrammarBuilder readGrammar = new Choices(new string[] { "read article" });
            Choices articleChoice = new Choices();
            for (int i = 1; i <= 30; i++)
            {
                // ...
            }

            GrammarBuilder saveGrammar = new Choices(new string[] { "save article" });

            GrammarBuilder otherGrammar = new Choices(new string[] { "receive hacker news", "stop", "test" });

            //GrammarBuilder commands = new Choices(new string[] { "receive hacker news", "stop", "test" });
            Choices commands = new Choices();
            commands.Add(new Choices(new GrammarBuilder[] { readGrammar, saveGrammar, otherGrammar }));

            var grammar = new Grammar(commands);
        }

        private void Say(string text)
        {
            // ...
        }

        private void ReadHackerNewsFeed()
        {
            string hackerNewsRSSUrl = "http://news.ycombinator.com/rss";

            using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
            {
                // an app.config is added to suppress: The server committed a protocol violation. Section=ResponseStatusLine
                string rssXmlStr = null;
                if (_bogusRSSFeed == null)
                {
                    rssXmlStr = client.DownloadString(hackerNewsRSSUrl);
                }
                else
                {
                    rssXmlStr = _bogusRSSFeed;
                }
                XmlDocument xmlDoc = new XmlDocument();

                XmlNodeList items = xmlDoc.SelectNodes("//item");

                int counter = 1;
                foreach (XmlNode item in items)
                {
                    string title = item.SelectSingleNode("title").InnerText;
                    string link = item.SelectSingleNode("link").InnerText;
                    Say("article " + counter.ToString() + " " + title);
                    // ...
                }
            }
        }

        private void ReceiveHackerNewsButton_Click(object sender, EventArgs e)
        {
            // ...
        }

        private void SaveArticle(string link, string article)
        {
            SpeechSynthesizer tts2File = new SpeechSynthesizer();
            tts2File.SpeakStarted += new EventHandler<SpeakStartedEventArgs>(tts2File_SpeakStarted);
            tts2File.SpeakCompleted += new EventHandler<SpeakCompletedEventArgs>(tts2File_SpeakCompleted);
            System.Security.Cryptography.SHA1Managed hashAlgorithm = new System.Security.Cryptography.SHA1Managed();
            byte[] buffer = Encoding.UTF8.GetBytes(link);
            byte[] hash = hashAlgorithm.ComputeHash(buffer);
            string fileName = BitConverter.ToString(hash).Replace("-", string.Empty) + ".wav";
            string executionPath = System.Reflection.Assembly.GetExecutingAssembly().Location;
            string fullPath = Path.Combine(executionPath, downloadPath, fileName);
            tts2File.SetOutputToWaveFile(fullPath, formatInfo);
            // ...
        }
        #endregion

        #region Constructor
        public MainForm()
        {
            _speechRecognizer.Enabled = true;
            _speechRecognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(_speechRecognizer_SpeechRecognized);
        }
        #endregion

        #region Events
        void _ttsVoice_SpeakCompleted(object sender, SpeakCompletedEventArgs e)
        {
            this._ttsVoice.SetOutputToNull(); // Needed for flushing file buffers.
        }

        void _speechRecognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            string command = e.Result.Text;
            CommandTextBox.Text = command;

            if (command == "test")
            {
                // ...
            }

            if (command == "stop")
            {
                // ...
            }

            if (command == "receive hacker news")
            {
                // ...
            }

            if (command.Contains("read article"))
            {
                string[] words = command.Split(' ');
                // ...
            }

            if (command.Contains("save article"))
            {
                string[] words = command.Split(' ');
                // ...
            }
        }

        private void MainForm_Load(object sender, EventArgs e)
        {
            // ...
        }

        void tts2File_SpeakStarted(object sender, SpeakStartedEventArgs e)
        {
            // ...
        }

        void tts2File_SpeakCompleted(object sender, SpeakCompletedEventArgs e)
        {
            SpeechSynthesizer tts2File = (SpeechSynthesizer)sender;
            // ...
        }

        private void StopButton_Click(object sender, EventArgs e)
        {
            // ...
        }

        private void ReadArticleButton_Click(object sender, EventArgs e)
        {
            // ...
        }

        private void SaveArticleButton_Click(object sender, EventArgs e)
        {
            // ...
        }

        private void BackgroundWorker_DoWork(object sender, DoWorkEventArgs e)
        {
            while (true)
            {
                // ...
                {
                    if (this._queue.Count > 0)
                    {
                        string cmd = this._queue.Dequeue();

                        if (cmd != null)
                        {
                            System.Uri uri = new Uri(cmd);
                            if (uri.Scheme == "voicewebbrowsing")
                            {
                                if (uri.Host == "receivehackernews")
                                {
                                    // ...
                                }
                                else if (uri.Host == "stop")
                                {
                                    // ...
                                }
                                else if (uri.Host == "readarticle" || uri.Host == "savearticle")
                                {
                                    string articleNumberStr = System.IO.Path.GetFileName(uri.AbsolutePath);
                                    int articleNumber = int.Parse(articleNumberStr);

                                    if (articleNumber > articleList.Count)
                                    {
                                        Say("please retrieve hacker news articles first");
                                    }
                                    else
                                    {
                                        articleNumber--; // 0-based index
                                        string link = articleList[articleNumber];

                                        java.net.URL url = new java.net.URL(link);
                                        string article = de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE.getText(url);
                                        if (uri.Host == "readarticle")
                                        {
                                            // ...
                                        }
                                        else if (uri.Host == "savearticle")
                                        {
                                            SaveArticle(link, article);
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        #endregion

        #region Commands
        private void ReceiveHackerNewsCommand()
        {
            lock (this)
            {
                // ...
            }
        }

        private void StopCommand()
        {
            lock (this)
            {
                // ...
            }
        }

        private void ReadArticleCommand(Decimal article)
        {
            lock (this)
            {
                this._queue.Enqueue(String.Format("voicewebbrowsing://readarticle/{0}", article.ToString()));
            }
        }

        private void SaveArticleCommand(decimal article)
        {
            lock (this)
            {
                this._queue.Enqueue(String.Format("voicewebbrowsing://savearticle/{0}", article.ToString()));
            }
        }
        #endregion
    }
}
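Two ideas from the listing are easy to try outside .NET: commands travel through the queue as small URIs of the form voicewebbrowsing://action/number, and each saved audio file is named after the SHA-1 hash of the article link. The following Python sketch mirrors both; the helper names are assumptions for this example and it is not a port of the C# code.

```python
import hashlib
from collections import deque
from urllib.parse import urlparse

SCHEME = "voicewebbrowsing"  # same scheme string as in the C# listing


def make_command(action, article_number=None):
    """Encode a command as a URI string, e.g. voicewebbrowsing://readarticle/7."""
    uri = "{0}://{1}".format(SCHEME, action)
    if article_number is not None:
        uri += "/{0}".format(article_number)
    return uri


def parse_command_uri(uri):
    """Decode a queued URI back into (action, article_number or None)."""
    parts = urlparse(uri)
    if parts.scheme != SCHEME:
        raise ValueError("not a voicewebbrowsing command: " + uri)
    number = parts.path.strip("/")
    return (parts.netloc, int(number) if number else None)


def wav_name_for(link):
    """Derive a stable file name from the article URL via its SHA-1 digest."""
    digest = hashlib.sha1(link.encode("utf-8")).hexdigest().upper()
    return digest + ".wav"


queue = deque()
queue.append(make_command("readarticle", 7))
queue.append(make_command("stop"))
while queue:
    print(parse_command_uri(queue.popleft()))
print(wav_name_for("http://example.com/post"))
```

The digest is upper-cased because BitConverter.ToString produces upper-case hex pairs separated by dashes, which the C# code then strips. Hashing the link gives every article a fixed-length, filesystem-safe name, so saving the same article twice overwrites one file instead of creating duplicates.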

See Also

  1. Extraction of Main Text Content Using the Google Reader NoAPI

Where can you go from here?

  1. You can write a continuous Hacker News front page reader which constantly checks the feed and reads you new titles.
  2. You can write a voice-oriented mobile web browser for Windows Phone, Google Android, and Google Chrome using their APIs. To write a mobile browser for iPhones, you will have to wait for an iOS Siri API.
  3. You can write a cloud service that converts web pages to podcasts, stores the audio files in the cloud, and automatically converts Google Reader’s starred items.
  4. Finally, you can research how to improve TTS and VR technologies.


  1. Dragon Speech Recognition Software
  2. Publications by Googlers in Speech Processing
  3. Patent case seeks to silence Nuance voice recognition
  4. Nuance Loses First Patent Fight with Vlingo, Others to Follow
  5. eSpeak: Open Source Text to Speech
  6. CMU Sphinx: Speech Recognition Open Source Toolkit
  7. SAM:  The First Commercial voice synthesis program for Commodore 64, Apple and Atari computers.
  8. Siri
  9. Microsoft Tellme
  10. The Mobile Challenge: My Personal Rants
  11. List of speech recognition software
  12. What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?
  13. Siri for everyone, with Pioneer’s Zypr API
  14. Reverse Engineering and Cracking Apple Siri with SiriProxy

Ideas and Execution Magic Chart

Ideas vs Execution

There is an endless discussion in the startup community about the value of ideas versus the importance of execution. Here is a timeline showing Hacker News community submissions with the idea(s) keyword in the title:

I am no prophet, but I believe the future will most likely lean towards ideas because the cost of creating and operating a web company has been dramatically reduced. Soon marketing and sales services will be more affordable, making it easier to resolve the business puzzle. On the other hand, although big companies have an advantage because they have more resources (following Joseph Schumpeter’s thinking), they often prefer to follow the acquisition route after the market’s natural selection instead of building risky projects from scratch. Entrepreneurs therefore benefit from reduced competition in the initial phase of product development.

Magic Chart

This is an exercise: you must be objective when filling in your chart, and you will have to dabble in the black art of time estimation. The idea of the magic chart is to fill in a scatter plot. The x axis shows the time you expect it to take to execute the idea (you can limit it to development time at first), and the y axis shows the potential of the idea. You can easily add other dimensions, like cost, by varying the size or color of the plotted points. Add a vertical line to the chart at the outside time limit that is feasible for you.
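Before drawing anything, the same exercise can be done on plain data. The sketch below splits a list of ideas at the time limit and ranks each side by potential; the ideas, their estimates and the `split_by_time_limit` helper are all made up for illustration.

```python
from collections import namedtuple

# Hypothetical ideas with made-up estimates, purely for illustration.
Idea = namedtuple("Idea", ["name", "months_to_execute", "potential"])

IDEAS = [
    Idea("browser extension", 1, 3),
    Idea("voice web browser", 9, 8),
    Idea("search engine", 36, 10),
]


def split_by_time_limit(ideas, limit_months):
    """Separate ideas left of the vertical time-limit line from those beyond it.

    Each group is sorted by potential, highest first.
    """
    by_potential = sorted(ideas, key=lambda idea: idea.potential, reverse=True)
    feasible = [i for i in by_potential if i.months_to_execute <= limit_months]
    beyond = [i for i in by_potential if i.months_to_execute > limit_months]
    return feasible, beyond


feasible, beyond = split_by_time_limit(IDEAS, limit_months=12)
print("feasible:", [i.name for i in feasible])
print("beyond the limit:", [i.name for i in beyond])
```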

Here is my magic chart:


As you can see, it’s difficult to come up with ideas which can be executed in a short time, and many of the ideas fall into an area of uncertainty beyond some point in time. If you think that having a minimum viable product is key, then you must think very hard about how to reduce your product execution time, and this is an art more than a science. The need to generate profit is a serious constraint. Your idea may be excellent and your software may be used by millions of people, but you may lack a business model.

What does your own ideas and execution magic chart look like?

HNSearch Script

Here is the Python script for retrieving Hacker News posts with the words idea and ideas in the title. It includes a legal hack (what else?) to bypass the 1000-item limit imposed by ThriftDB’s HNSearch API.

# -*- coding: utf-8 -*-

# Done under Visual Studio 2010 using the excellent Python Tools for Visual Studio http://pytools.codeplex.com/

import urllib2
import json
from datetime import datetime
from time import mktime
import csv
import codecs
import cStringIO

class CSVUnicodeWriter: # http://docs.python.org/library/csv.html
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
def get_hackernews_articles_with_idea_in_the_title():
    endpoint = 'http://api.thriftdb.com/api.hnsearch.com/items/_search?filter[fields][title]=idea&start={0}&limit={1}&sortby=map(ms(create_ts),{2},{3},4294967295000)%20asc'

    incomplete_iso_8601_format = '%Y-%m-%dT%H:%M:%SZ'

    items = {}
    start = 0
    limit = 100
    begin_range = 0
    end_range = 0

    url = endpoint.format(start, limit, begin_range, str(int(end_range)))
    response = urllib2.urlopen(url).read()
    data = json.loads(response)

    prev_timestamp = datetime.fromtimestamp(0)

    results = data['results']

    while results:
        for e in data['results']:
            _id = e['item']['id']
            title = e['item']['title']
            points = e['item']['points']
            num_comments = e['item']['num_comments']
            timestamp = datetime.strptime(e['item']['create_ts'], incomplete_iso_8601_format)

            #if timestamp < prev_timestamp: # The results are not correctly sorted. We can't rely on this one.
            if _id in items: # If the circle is complete.
                return items

            prev_timestamp = timestamp

            items[_id] = {'id':_id, 'title':title, 'points':points, 'num_comments':num_comments, 'timestamp':timestamp}
            title_utf8 = title.encode('utf-8')
            print title_utf8, timestamp, _id, points, num_comments

        start += len(results)
        if start + limit > 1000:
            start = 0
            end_range = mktime(timestamp.timetuple())*1000

        url = endpoint.format(start, limit, begin_range, str(int(end_range))) # if not str(int(x)) then a float gives in the sci math form: '1.24267528e+12'
        response = urllib2.urlopen(url).read()
        data = json.loads(response)
        results = data['results']

    return items

if __name__ == '__main__':
    items = get_hackernews_articles_with_idea_in_the_title()

    with open('hn-articles.csv', 'wb') as f:
        hn_articles = CSVUnicodeWriter(f)

        hn_articles.writerow(['ID', 'Timestamp', 'Title', 'Points', '# Comments'])

        for k,e in items.items():
            hn_articles.writerow([str(e['id']), str(e['timestamp']), e['title'], str(e['points']), str(e['num_comments'])])

# It returns 3706 articles where the query says that they are 3711... find the bug...
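The paging trick can be hard to see inside the full script, so here is a small simulation of it against an in-memory list. `fake_search` is a stand-in written for this example (it is not ThriftDB's real interface), `MAX_WINDOW` plays the role of the 1000-item cap, and remapping already-covered timestamps to a large constant imitates the map(...) sort expression in the endpoint URL.

```python
BIG = 4294967295000   # same "push to the end" constant as in the endpoint URL
MAX_WINDOW = 10       # stands in for the API's 1000-item cap


def fake_search(items, start, limit, begin_range, end_range):
    """Imitate the API: sort ascending by timestamp, but send timestamps in
    [begin_range, end_range] to the back, and refuse to page past the cap."""
    if start + limit > MAX_WINDOW:
        raise ValueError("window exceeds the cap")

    def sort_key(item):
        ts = item["ts"]
        return BIG if begin_range <= ts <= end_range else ts

    return sorted(items, key=sort_key)[start:start + limit]


def fetch_all(items, limit=5):
    """Collect every item by restarting the window after each capped run."""
    seen = {}
    start, end_range = 0, 0
    while True:
        page = fake_search(items, start, limit, 0, end_range)
        if not page:
            return seen
        for item in page:
            if item["id"] in seen:   # wrapped around: the circle is complete
                return seen
            seen[item["id"]] = item
        start += len(page)
        if start + limit > MAX_WINDOW:
            # Restart at offset 0, hiding everything up to the last timestamp.
            start = 0
            end_range = page[-1]["ts"]


items = [{"id": n, "ts": 1000 + n} for n in range(37)]
print(len(fetch_all(items)))
```

One property of this sketch is worth knowing: the test data uses unique timestamps. If several items share the timestamp at which a window is cut, the ones not yet fetched are pushed to the back together with the fetched one and can be missed.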



  1. Are Ideas Getting Harder to Find? (2016)
  2. Science as Art
  3. Thinking Skills Instruction: Concepts and Techniques (Anthology)
  4. De Bono’s Lateral Thinking
  5. TRIZ
  6. Schumpeter’s Creative Destruction: A Review of the Evidence
  7. Google Query: “ideas vs execution” OR “execution vs ideas”
  8. Google Query: site:news.ycombinator.com AND (intitle:idea OR intitle:ideas)
  9. Startup Ideas We’d Like to Fund
  10. My list of ideas, if you’re looking for inspiration by Jacques Mattheij
  11. Startup Ideas We’d Like to Fund by Paul Graham.
  12. “Ideas don’t make you rich. The correct execution of ideas does”, an excerpt from Felix Dennis’s book.
  13. Ideas suck by Chris Prescott.
  14. Execution Matters, Ideas Don’t by Fred Wilson.
  15. What Is Twitter’s Problem? No, It’s Not the Product
  16. 1000 results limit? (HNSearch NoAPI limits, bonus hack included in this article).
  17. Year 2038 problem
  18. How to use time > year 2038 on official Windows Python 2.5
  19. Solr FunctionQuery
  20. HackerNews Ideas Articles
  21. Execution Is An Order Of Magnitude Easier Than Opportunity