Egont Part II

(part I here)


Egont is a shared space where users mashup personal information.
Its top goals are:
  • Discovering and curating new information in a personalized and dynamic way.
  • Promoting emergent behavior in a shared programming environment.
  • Facilitating serendipity.

Egont is a personalization environment where users can connect to, import, expose, and index data from their web services. They can also apply functions to build mashups around their personal interests, as in a spreadsheet. On Egont, users can combine and exchange information. For example, users can connect their Egont accounts to a variety of services, such as movie rankings, and merge rankings from their social networks. If they want to find independent films, they can filter out blockbusters. When users from their social networks update their rankings, these updates are processed and the result is automatically recalculated. The same idea can be applied to streams from Twitter or blog posts. One user can apply a filter to those streams to curate information apart from mainstream trends and recommendation systems, while other users can build new filters using this user’s data. Third parties can take advantage of the data flowing in this shared environment by developing new information functions.

Egont has a simple programming language where experienced users can access other users’ variable namespaces and set security granularity to enable or restrict the flow of information. Less experienced users personalize their Egont experience using a simpler web interface.


Egont is composed of the following elements:
  1. A data flow engine
  2. A data store where cell values are persisted.
  3. A web application
  4. A simple programming language

Data Flow Engine

The data flow engine works like a spreadsheet. Some cells may be dependent on others, and when a value changes the engine recalculates only the cells affected by that change. For example, one cell may contain a function to retrieve new tweets, while another cell takes those tweets and uses a second function to extract named entities like places or proper names. Users can personalize the vast flow of information from many sources to process, aggregate, and filter information.

The key feature of the engine is its ability to apply functions to a set of shared cells from other users. Another important feature is the handling of security settings. Users can configure which cells are shared with which users at a very granular level.
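The spreadsheet behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Egont's engine: the `Sheet` class, its method names, and the `capitalized_words` stand-in for named entity extraction are all hypothetical, and the sketch ignores sharing and security. It also recalculates dependents eagerly and in no particular topological order, so a cell reached through two paths would be recomputed twice.

```python
class Sheet:
    """Toy spreadsheet: cells hold constants or formulas over other cells."""

    def __init__(self):
        self._formulas = {}     # cell name -> (function, input cell names)
        self._values = {}       # cell name -> cached value
        self._dependents = {}   # cell name -> names of cells that read it
        self.recalculated = []  # log of formula cells recomputed, in order

    def set_value(self, name, value):
        """Store a constant and refresh only the cells downstream of it."""
        self._values[name] = value
        self._refresh_dependents(name)

    def set_formula(self, name, func, inputs):
        """Define a cell computed from existing cells and compute it once."""
        self._formulas[name] = (func, list(inputs))
        for input_name in inputs:
            self._dependents.setdefault(input_name, set()).add(name)
        self._recalculate(name)

    def get(self, name):
        return self._values[name]

    def _recalculate(self, name):
        func, inputs = self._formulas[name]
        self._values[name] = func(*(self._values[i] for i in inputs))
        self.recalculated.append(name)
        self._refresh_dependents(name)

    def _refresh_dependents(self, name):
        for dependent in sorted(self._dependents.get(name, ())):
            self._recalculate(dependent)


def capitalized_words(tweets):
    """Crude stand-in for named entity extraction: keep capitalized words."""
    return [word for tweet in tweets for word in tweet.split() if word[0].isupper()]


sheet = Sheet()
sheet.set_value("tweets", ["Hello from Buenos Aires"])
sheet.set_value("unrelated", 1)
sheet.set_formula("entities", capitalized_words, ["tweets"])
sheet.set_formula("entity_count", len, ["entities"])

sheet.set_value("unrelated", 2)                       # nothing depends on it
sheet.set_value("tweets", ["Greetings from Lima"])    # cascades to both formulas
```

Changing `unrelated` triggers no work at all, while changing `tweets` recomputes `entities` and then `entity_count`, which is the cascade the engine is meant to provide.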

Web Application

The web application has two important parts. One is the editor where advanced users can use the browser to edit their Egont scripts. The other is a simpler user interface where users are able to define their sources of information and apply functions to them more easily.

Programming Language

The goal of Egont is to simplify the building of personalization and mashups, so its programming language is oriented to quickly orchestrating user information.

This is a rough example of how an advanced user could use the Egont programming language to merge friends’ movie rankings:

friends <- [egont.users.alice, egont.users.bob, me] # list of friends.
movies_ranking <- imdb.ranking("swain-4") # persist my ranking on movies_ranking from my user on IMDB.
movies_average <- average(apply(friends, 'movies_ranking')) # calculate the average of movies rankings from my specified friends. It only changes when rankings are updated
egont.feeds <- movies_average # expose the results as a feed in the web application.

Whenever any of the above users modifies a movie’s ranking, Egont recalculates that movie’s score.
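To make the merge step concrete, here is a minimal Python sketch of what an `average` over friends' rankings could compute. It is not Egont code: the function name, the dictionary layout, and the sample movies and scores are assumptions made up for illustration.

```python
from statistics import mean


def average_rankings(rankings_by_user):
    """Average each movie's score over the friends who ranked it."""
    scores = {}
    for ranking in rankings_by_user.values():
        for movie, score in ranking.items():
            scores.setdefault(movie, []).append(score)
    return {movie: mean(values) for movie, values in scores.items()}


# Hypothetical data standing in for each friend's movies_ranking cell.
friends = {
    "alice": {"Movie A": 8, "Movie B": 6},
    "bob": {"Movie A": 6},
    "me": {"Movie B": 9},
}
movies_average = average_rankings(friends)
```

A movie that only some friends have ranked is averaged over those friends only, which is one possible design choice among several.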

With Egont, we will have a place where we can discover new resources, research our interests, and create a community capable of sifting through the ever more vast sea of data available on today’s web.

See Also

  1. Parsing S-Expressions in C# using OMeta


  1. A Brief History of Spreadsheets
  2. Kahn process networks
  3. Directed acyclic graph
  4. Advances in IC-Scheduling Theory: Scheduling Expansive and Reductive Dags and Scheduling Dags via Duality
  5. Pregel: A System for Large-Scale Graph Processing
  6. Grzegorz Malewicz’s Google Research page
  7. CIEL: a universal execution engine for distributed data-flow computing
  8. Bloom Programming Language (via ComingThoughts)

The Python POPO’s Way to Integrate PayPal Instant Payment Notification

Pompeo Massani: The Money Counter

Python PayPal IPN

PayPal is the fastest, but not the best, way to incorporate payments on your web site and reach a worldwide audience. If you are searching for a Plain Old Python Object (POPO) way to integrate with the Python programming language, you are on your own. The Instant Payment Notification (IPN) page only incorporates ASP, .NET, ColdFusion, Java, Perl and PHP samples. A web search will bring up a ton of Python code. Most of this code will be for frameworks such as Django. The rest will not be specifically for connecting Python with IPN: there will be a lot of extra code you do not need. Here is a translation of the PHP sample code into Python.


also available on GitHub.


# PHP to Python translation from:

import urllib
import cgi
import cgitb
import socket, ssl, pprint
import pickle
import sys
import json


form = cgi.FieldStorage()

req = 'cmd=_notify-validate'
for k in form.keys():
	v = form[k]
	value = urllib.quote(v.value.decode('string_escape')) #
	req = req + '&{0}={1}'.format(k, value)

header = 'POST /cgi-bin/webscr HTTP/1.0\r\n'
header += 'Content-Type: application/x-www-form-urlencoded\r\n'
header += 'Content-Length: ' + str(len(req)) + '\r\n\r\n'

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ssl_sock = ssl.wrap_socket(s)
ssl_sock.connect(('', 443)) # Use this for sandbox testing
# ssl_sock.connect(('', 443)) # Use this for production

ssl_sock.write(header + req)

data = ssl_sock.read()
VERIFIED = False
while len(data) > 0:
	if 'VERIFIED' in data:
		VERIFIED = True
		break
	elif 'INVALID' in data:
		VERIFIED = False
		break

	data = ssl_sock.read()

ssl_sock.close()


if not VERIFIED:
	print "Content-type: text/plain"
	print
	print "Not Verified"
	sys.exit(0)

fields = {	'item_name': None,
		'item_number': None,
		'payment_status': None,
		'mc_gross': None,
		'mc_currency': None,
		'txn_id': None,
		'receiver_email': None,
		'payer_email': None,
		'custom': None}

for k in fields.keys():
	if k in form:
		fields[k] = form[k].value

item_name = fields['item_name']
item_number = fields['item_number']
payment_status = fields['payment_status']
payment_amount = fields['mc_gross']
payment_currency = fields['mc_currency']
txn_id = fields['txn_id']
receiver_email = fields['receiver_email']
payer_email = fields['payer_email']

# check the payment_status is Completed
# check that txn_id has not been previously processed
#  check that receiver_email is your Primary PayPal email
# check that payment_amount/payment_currency are correct
# process payment

print "Content-type: text/plain"
print
print "Verified"
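The listing above targets Python 2 and leaves the final checks as comments. As a rough Python 3 sketch of those two pieces, the following builds the `cmd=_notify-validate` postback body and runs the four checks against the received fields. The function names, the `expected` dictionary, and the sample values are assumptions for illustration, not part of PayPal's sample code. Note that `urlencode` encodes spaces as `+` where the listing's `urllib.quote` uses `%20`; both are valid form encodings.

```python
from urllib.parse import urlencode


def build_validation_body(ipn_fields):
    """Prefix the received fields with cmd=_notify-validate, keeping their order."""
    return urlencode([("cmd", "_notify-validate")] + list(ipn_fields.items()))


def payment_problems(fields, processed_txn_ids, expected):
    """Return the checks that failed; an empty list means safe to process."""
    problems = []
    if fields.get("payment_status") != "Completed":
        problems.append("payment_status is not Completed")
    if fields.get("txn_id") in processed_txn_ids:
        problems.append("txn_id was already processed")
    if fields.get("receiver_email") != expected["receiver_email"]:
        problems.append("receiver_email is not the expected account")
    if fields.get("mc_gross") != expected["mc_gross"]:
        problems.append("mc_gross does not match the expected amount")
    if fields.get("mc_currency") != expected["mc_currency"]:
        problems.append("mc_currency does not match the expected currency")
    return problems


# Hypothetical notification and expectations, for illustration only.
notification = {
    "payment_status": "Completed",
    "txn_id": "TXN-1",
    "receiver_email": "seller@example.com",
    "mc_gross": "9.99",
    "mc_currency": "USD",
}
expected = {"receiver_email": "seller@example.com", "mc_gross": "9.99", "mc_currency": "USD"}
body = build_validation_body(notification)
problems = payment_problems(notification, processed_txn_ids=set(), expected=expected)
```

Amounts are compared as strings here because that is how they arrive in the form; a real integration would probably parse them with `decimal.Decimal` before comparing.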


  1. PayPal Developer Network
  2. GitHub projects related to PayPal written in Python

Ideas: Egont, A Web Orchestration Language


Human curiosity goes beyond limited web applications, recommendation systems and search engines. People collect lists of things on the web. Things like music playlists, movie rankings or visited places are populating our web culture, but this information is spread out in different places and we need search engines, social networks, and recommendation systems to leverage it. The real-time web also offers transformation opportunities which are only limited by the imagination.

How can we adjust all this information to our personal or organizational needs? The semantic web could play an important role here, but the web is not organized semantically yet. However, it is possible today to give people tools to manipulate information at a personal and social level. Spreadsheets have hundreds of functions which are used by people with limited computer and mathematical skills. What if we could transform information in a similar way? What if a new stimulus, like a new tweet or a newly ranked movie, could trigger a cascade of processes?

People and organizations are sharing a record amount of data, but current web platforms tightly dictate the limits to its use. For example, Twitter’s API has very low rate limits for the general public. Most Twitter applications cannot retrieve more than one or two degrees of a user’s social network without working around these API limitations. Examples of API limitations abound, undermining the opportunities to leverage the data’s potential.

The inspiration for Egont came from the idea of a social operating system. People do not only share data, they also share data transformations. Egont is a platform for writing simple code snippets, while allowing others to reuse them to extract new information. It is a shared pipeline which is focused on connecting people’s data and processes. It can be thought of as a living operating system: when a state changes, the dependent processes are recalculated. Although Egont has clear security controls, it is primarily oriented to data that can be shared, even providing tools for exporting information to be analyzed offline. The shift is from a perspective where users accept platforms’ applications to a perspective where users generate not only data but also processes. Users and third parties will be free to write new functions to extend Egont’s capabilities.

(continue to part ii)

Voice Recognition + Content Extraction + TTS = Innovative Web Browsing

An Interesting Opportunity

Voice recognition and text to speech technologies make a good combo for future user interfaces. Imagine browsing the web by voice and listening to blogs like you listen to podcasts. Your mobile phone will be able to provide these features in the near future. Voice web browsing is not only useful for the visually impaired, but also for users who wish to do other things while surfing the web. Voice recognition and text to speech feature prominently on the iPhone 4S, but voice web browsing is not incorporated.

This blog post includes code which allows you to extract the main text content from a web page and convert it to a playable audio file. The process is triggered by simple voice commands such as “receive hacker news”, “read article 7” or “save article 19”. The resulting audio file can also be synchronized to other services such as DropBox, Amazon Cloud Drive, and Apple iCloud to be played later. We use IKVM to allow us to run the boilerpipe library, which is written in Java, over .NET. We choose .NET because it comes with “batteries included”: ready to use voice recognition and text to speech capabilities.

The present article demonstrates the application of main text content extraction. For methods of MTCE see our article: Extraction of Main Text Content Using the Google Reader NoAPI. For further exploration of voice recognition and text to speech .NET capabilities consult the following links:

Good voice recognition and text to speech systems are expensive and require training. Companies which provide these services do not offer a more granular way to access a web service or run a local engine. For example, AT&T Natural Voice for TTS sounds good, but their licensing terms are prohibitive for small companies and startups. On the voice recognition side of the equation, Nuance has been accumulating patents to strengthen their market position, making it difficult for others to compete. Although many other companies offer voice recognition, most of them actually use Nuance’s technology. See for example: Siri, Do You Use Nuance Technology? Siri: I’m Sorry, I Can’t Answer That. Voice recognition systems require training to improve their accuracy. SpinVox, a Nuance subsidiary, used “conversion experts”. They built teams which listened to audio messages and manually converted them to text. If you want to use this approach, you’ll need to wait for Amazon’s Mechanical Turk to offer micro jobs in real time.

What is missing here? None of the leading companies offer good quality voice recognition on a charge per use basis. Google seems to be actively researching voice recognition, and has achieved impressive results with “experimental” speech recognition technology. Sadly, Google’s voice recognition and text to speech APIs cannot be used to develop arbitrary desktop and server applications. Their use is restricted to Android phones, Chrome’s beta HTML5 support, and Chrome’s extensions. It would be nice for Google to remove this restriction and include this service on their web APIs Console.

Our code demonstrates that it is possible to use voice recognition and text to speech while avoiding the licensing, patent and API conundrum.

Using VR and TTS under .NET


If you use our VoiceWebBrowsing code, also available on GitHub, you will just need Microsoft Visual Studio 2010. However if there is a new version of boilerpipe you will have to generate the boilerpipe .NET assemblies yourself as follows:

  1. Have Microsoft Visual Studio 2010
  2. Download boilerpipe from
  3. Download and install IKVM from
  4. Run boilerpipe library and dependencies through ikvmc: ikvmc -nojni -target:library  boilerpipe-1.2.0.jar lib\nekohtml-1.9.13.jar lib\xerces-2.9.1.jar
  5. Use the resulting boilerpipe-1.2.0.dll .NET assembly from ikvmc.


using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Speech.Recognition;
using System.Speech.Synthesis;
using System.Speech.AudioFormat;
using System.Net;
using System.IO;
using System.Xml;

namespace VoiceWebBrowsing
    public partial class MainForm : Form
        #region Constants
        //const string _bogusRSSFeed = "<items><item><title>First title</title><link>http://</link></item><item><title>Second title</title><link>http://</link></item></items>";
        const string _bogusRSSFeed = null;
        const string downloadPath = @"..\..\..\..\Download";

        #region Private Variables
        SpeechRecognizer _speechRecognizer = new SpeechRecognizer();
        SpeechSynthesizer _ttsVoice = new SpeechSynthesizer();
        SpeechAudioFormatInfo formatInfo = new SpeechAudioFormatInfo(8000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
        Queue<string> _queue = new Queue<string>();
        List<string> articleList = new List<string>();
        HashSet<SpeechSynthesizer> tts2FileTasks = new HashSet<SpeechSynthesizer>();

        #region Private Methods
        private void InitGrammar()
            GrammarBuilder readGrammar = new Choices(new string[] { "read article" });
            Choices articleChoice = new Choices();
            for (int i = 1; i <= 30; i++)

            GrammarBuilder saveGrammar = new Choices(new string[] { "save article" });

            GrammarBuilder otherGrammar = new Choices(new string[] { "receive hacker news", "stop", "test" });

            //GrammarBuilder commands = new Choices(new string[] { "receive hacker news", "stop", "test" });
            Choices commands = new Choices();
            commands.Add(new Choices(new GrammarBuilder[] { readGrammar, saveGrammar, otherGrammar }));

            var grammar = new Grammar(commands);

        private void Say(string text)

        private void ReadHackerNewsFeed()
            string hackerNewsRSSUrl = "";

            using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
                // an app.config is added to suppress: The server committed a protocol violation. Section=ResponseStatusLine
                string rssXmlStr = null;
                if (_bogusRSSFeed == null)
                    rssXmlStr = client.DownloadString(hackerNewsRSSUrl);
                    rssXmlStr = _bogusRSSFeed;
                XmlDocument xmlDoc = new XmlDocument();

                XmlNodeList items = xmlDoc.SelectNodes("//item");

                int counter = 1;
                foreach (XmlNode item in items)
                    string title = item.SelectSingleNode("title").InnerText;
                    string link = item.SelectSingleNode("link").InnerText;
                    Say("article " + counter.ToString() + " " + title);


        private void ReceiveHackerNewsButton_Click(object sender, EventArgs e)

        private void SaveArticle(string link, string article)
            SpeechSynthesizer tts2File = new SpeechSynthesizer();
            tts2File.SpeakStarted += new EventHandler<SpeakStartedEventArgs>(tts2File_SpeakStarted);
            tts2File.SpeakCompleted += new EventHandler<SpeakCompletedEventArgs>(tts2File_SpeakCompleted);
            System.Security.Cryptography.SHA1Managed hashAlgorithm = new System.Security.Cryptography.SHA1Managed();
            byte[] buffer = Encoding.UTF8.GetBytes(link);
            byte[] hash = hashAlgorithm.ComputeHash(buffer);
            string fileName = BitConverter.ToString(hash).Replace("-", string.Empty) + ".wav";
            string executionPath = System.Reflection.Assembly.GetExecutingAssembly().Location;
            string fullPath = Path.Combine(executionPath, downloadPath, fileName);
            tts2File.SetOutputToWaveFile(fullPath, formatInfo);


        #region Constructor
        public MainForm()
            _speechRecognizer.Enabled = true;
            _speechRecognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(_speechRecognizer_SpeechRecognized);

        #region Events
        void _ttsVoice_SpeakCompleted(object sender, SpeakCompletedEventArgs e)
            this._ttsVoice.SetOutputToNull(); // Needed for flushing file buffers.

        void _speechRecognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
            string command = e.Result.Text;
            CommandTextBox.Text = command;

            if (command == "test")

            if (command == "stop")


            if (command == "receive hacker news")


            if (command.Contains("read article"))
                string[] words = command.Split(' ');



            if (command.Contains("save article"))
                string[] words = command.Split(' ');



        private void MainForm_Load(object sender, EventArgs e)

        void tts2File_SpeakStarted(object sender, SpeakStartedEventArgs e)

        void tts2File_SpeakCompleted(object sender, SpeakCompletedEventArgs e)
            SpeechSynthesizer tts2File = (SpeechSynthesizer)sender;


        private void StopButton_Click(object sender, EventArgs e)
        private void ReadArticleButton_Click(object sender, EventArgs e)

        private void SaveArticleButton_Click(object sender, EventArgs e)

        private void BackgroundWorker_DoWork(object sender, DoWorkEventArgs e)
            while (true)

                    if(this._queue.Count > 0)
                        string cmd = this._queue.Dequeue();

                        if (cmd != null)
                            System.Uri uri = new Uri(cmd);
                            if (uri.Scheme == "voicewebbrowsing")
                                if (uri.Host == "receivehackernews")
                                else if (uri.Host == "stop")
                                else if (uri.Host == "readarticle" || uri.Host == "savearticle")
                                    string articleNumberStr = System.IO.Path.GetFileName(uri.AbsolutePath);
                                    int articleNumber = int.Parse(articleNumberStr);

                                    if (articleNumber > articleList.Count)
                                        Say("please retrieve hacker news articles first");
                                        articleNumber--; // 0-based index
                                        string link = articleList[articleNumber];

                                        java.net.URL url = new java.net.URL(link);
                                        string article = de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE.getText(url);
                                        if (uri.Host == "readarticle")
                                        else if (uri.Host == "savearticle")
                                            SaveArticle(link, article);

        #region Commands
        private void ReceiveHackerNewsCommand()
            lock (this)
        private void StopCommand()
            lock (this)

        private void ReadArticleCommand(Decimal article)
            lock (this)
                this._queue.Enqueue(String.Format("voicewebbrowsing://readarticle/{0}", article.ToString()));

        private void SaveArticleCommand(decimal article)
            lock (this)
                this._queue.Enqueue(String.Format("voicewebbrowsing://savearticle/{0}", article.ToString()));
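The command plumbing in the listing above is easier to follow in isolation. The following Python sketch mirrors three ideas from the C# code: spoken commands are turned into `voicewebbrowsing://` URIs, the worker decodes the URI host and article number, and saved audio files are named after the SHA-1 of the article link. The Python function names are hypothetical and nothing here touches speech recognition or text to speech.

```python
import hashlib
import re
from urllib.parse import urlparse

ARTICLE_COMMAND = re.compile(r"^(read|save) article (\d+)$")


def command_to_uri(command):
    """Translate a recognized phrase into the URI that gets queued."""
    if command in ("receive hacker news", "stop"):
        return "voicewebbrowsing://" + command.replace(" ", "")
    match = ARTICLE_COMMAND.match(command)
    if match:
        return "voicewebbrowsing://{0}article/{1}".format(match.group(1), match.group(2))
    return None  # unknown phrase


def parse_uri(uri):
    """Return (action, article number or None), or None for a foreign scheme."""
    parts = urlparse(uri)
    if parts.scheme != "voicewebbrowsing":
        return None
    number = parts.path.lstrip("/")
    return (parts.hostname, int(number) if number else None)


def audio_file_name(link):
    """Upper-case hex SHA-1 of the link plus .wav, as in SaveArticle."""
    return hashlib.sha1(link.encode("utf-8")).hexdigest().upper() + ".wav"


uri = command_to_uri("read article 7")
action = parse_uri(uri)
```

Encoding commands as URIs keeps the queue a plain list of strings, and hashing the link gives each saved article a stable file name without having to sanitize the URL.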

See Also

  1. Extraction of Main Text Content Using the Google Reader NoAPI

Where can you go from here?

  1. You can write a continuous Hacker News front page reader which constantly checks the feed and reads you new titles.
  2. You can write a voice oriented mobile web browser for Windows Phone, Google Android, and for Google Chrome using their APIs. To write a mobile browser for iPhones, you will have to wait for iOS Siri API.
  3. You can write a cloud service to provide a web page to a podcast converter service, store the audio file in the cloud, and automatically convert Google Reader’s starred items.
  4. Finally, you can research how to improve TTS and VR technologies.


  1. Dragon Speech Recognition Software
  2. Publications by Googlers in Speech Processing
  3. Patent case seeks to silence Nuance voice recognition
  4. Nuance Loses First Patent Fight with Vlingo, Others to Follow
  5. eSpeak: Open Source Text to Speech
  6. CMU Sphinx: Speech Recognition Open Source Toolkit
  7. SAM:  The First Commercial voice synthesis program for Commodore 64, Apple and Atari computers.
  8. Siri
  9. Microsoft Tellme
  10. The Mobile Challenge: My Personal Rants
  11. List of speech recognition software
  12. What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?
  13. Siri for everyone, with Pioneer’s Zypr API
  14. Reverse Engineering and Cracking Apple Siri with SiriProxy

Ideas and Execution Magic Chart

Ideas vs Execution

There is an endless discussion in the startup community about the value of ideas versus the importance of execution. Here is a timeline showing Hacker News community submissions with the idea(s) keyword in the title:

I am no prophet, but I believe the future will most likely lean towards ideas because the cost of creating and operating a web company has been dramatically reduced. Soon marketing and sales services will be more affordable, making it easier to resolve the business puzzle. On the other hand, following Joseph Schumpeter’s thinking, big companies have an advantage because they have more resources. Even so, they often prefer to acquire companies after the market’s natural selection has run its course instead of building risky projects from scratch, so entrepreneurs benefit from reduced competition in the initial phase of product development.

Magic Chart

This is an exercise: you must be objective when filling in your chart, and you will have to dabble in the black art of time estimation. The idea of the magic chart is to fill in a scatter plot. The x axis shows the time you expect it to take to execute the idea (you can limit it to development time at first), and the y axis shows the potential of the idea. You can easily add other dimensions, like cost, to the graph by using the size or color of the plotted points. Add a vertical asymptote to the chart at the outside time limit which is feasible for you.

Here is my magic chart:


As you can see, it’s difficult to come up with ideas which can be executed in a short time, and many of the ideas fall into uncertainty beyond some point in time. If you think that having a minimum viable product is key, then you must think very hard about how to reduce your product execution time, and this is an art more than a science. The need to generate profit is a serious constraint. Your idea may be excellent and your software may be used by millions of people, but you may lack a business model.

What does your ideas and execution magic chart look like?

HNSearch Script

Here is the Python script for retrieving Hacker News posts with the words idea and ideas in the title. It includes a legal hack (what else?) to bypass the ThriftDB’s HNSearch API imposed limit of 1000 items.

# -*- coding: utf-8 -*-

# Done under Visual Studio 2010 using the excellent Python Tools for Visual Studio

import urllib2
import json
from datetime import datetime
from time import mktime
import csv
import codecs
import cStringIO

class CSVUnicodeWriter: #
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

def get_hackernews_articles_with_idea_in_the_title():
    endpoint = '[fields][title]=idea&start={0}&limit={1}&sortby=map(ms(create_ts),{2},{3},4294967295000)%20asc'

    incomplete_iso_8601_format = '%Y-%m-%dT%H:%M:%SZ'

    items = {}
    start = 0
    limit = 100
    begin_range = 0
    end_range = 0

    url = endpoint.format(start, limit, begin_range, str(int(end_range)))
    response = urllib2.urlopen(url).read()
    data = json.loads(response)

    prev_timestamp = datetime.fromtimestamp(0)

    results = data['results']

    while results:
        for e in data['results']:
            _id = e['item']['id']
            title = e['item']['title']
            points = e['item']['points']
            num_comments = e['item']['num_comments']
            timestamp = datetime.strptime(e['item']['create_ts'], incomplete_iso_8601_format)

            #if timestamp < prev_timestamp: # The results are not correctly sorted. We can't rely on this one.
            if _id in items: # If the circle is complete.
                return items

            prev_timestamp = timestamp

            items[_id] = {'id':_id, 'title':title, 'points':points, 'num_comments':num_comments, 'timestamp':timestamp}
            title_utf8 = title.encode('utf-8')
            print title_utf8, timestamp, _id, points, num_comments

        start += len(results)
        if start + limit > 1000:
            start = 0
            end_range = mktime(timestamp.timetuple())*1000

        url = endpoint.format(start, limit, begin_range, str(int(end_range))) # if not str(int(x)) then a float gives in the sci math form: '1.24267528e+12'
        response = urllib2.urlopen(url).read()
        data = json.loads(response)
        results = data['results']

    return items

if __name__ == '__main__':
    items = get_hackernews_articles_with_idea_in_the_title()

    with open('hn-articles.csv', 'wb') as f:
        hn_articles = CSVUnicodeWriter(f)

        hn_articles.writerow(['ID', 'Timestamp', 'Title', 'Points', '# Comments'])

        for k,e in items.items():
            hn_articles.writerow([str(e['id']), str(e['timestamp']), e['title'], str(e['points']), str(e['num_comments'])])

# It returns 3706 articles where the query says that they are 3711... find the bug...
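The trick for getting past the 1000 item limit is easier to see without the HTTP and Solr details. Here is a simplified Python 3 sketch of the same idea: page normally until the next page would cross the cap, then restart at offset zero with the window moved to the last timestamp seen, deduplicating by id. It is not the HNSearch API: `fetch_page`, `make_fake_source`, and the `id`/`ts` fields are hypothetical, and the fake source filters by timestamp instead of using the sort function from the script.

```python
def fetch_all(fetch_page, page_size=100, cap=1000):
    """Collect every item from an API that refuses offsets beyond `cap`."""
    items = {}
    start, since = 0, 0
    while True:
        page = fetch_page(start, page_size, since)
        if not page:
            break
        for item in page:
            items[item["id"]] = item  # duplicates at the window edge collapse
        start += len(page)
        if start + page_size > cap:
            new_since = page[-1]["ts"]
            if new_since == since:
                break  # the window cannot advance; stop rather than loop forever
            start, since = 0, new_since
    return items


def make_fake_source(total, cap=1000):
    """In-memory stand-in for the remote API, sorted by ascending timestamp."""
    rows = [{"id": i, "ts": i // 3} for i in range(total)]

    def fetch_page(start, limit, since):
        if start + limit > cap:
            raise ValueError("the fake API refuses to page past the cap")
        window = [row for row in rows if row["ts"] >= since]
        return window[start:start + limit]

    return fetch_page


articles = fetch_all(make_fake_source(2500))
```

Because the window restarts at the last timestamp seen, items sharing that timestamp are fetched twice, which is why results are keyed by id.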



  1. Are Ideas Getting Harder to Find? (2016)
  2. Science as Art
  3. Thinking Skills Instruction: Concepts and Techniques (Anthology)
  4. De Bono’s Lateral Thinking
  5. TRIZ
  6. Schumpeter’s Creative Destruction: A Review of the Evidence
  7. Google Query: “ideas vs execution” OR “execution vs ideas”
  8. Google Query: AND (intitle:idea OR intitle:ideas)
  9. Startup Ideas We’d Like to Fund
  10. My list of ideas, if you’re looking for inspiration by Jacques Mattheij
  11. Startup Ideas We’d Like to Fund by Paul Graham.
  12. Ideas don’t make you rich. The correct execution of ideas does excerpt from Felix Dennis book.
  13. Ideas suck by Chris Prescott.
  14. Execution Matters, Ideas Don’t by Fred Wilson.
  15. What Is Twitter’s Problem? No, It’s Not the Product
  16. 1000 results limit? (HNSearch NoAPI limits, bonus hack included in this article).
  17. Year 2038 problem
  18. How to use time > year 2038 on official Windows Python 2.5
  19. Solr FunctionQuery
  20. HackerNews Ideas Articles
  21. Execution Is An Order Of Magnitude Easier Than Opportunity

Articles Summary

This is a summary of all the Data Big Bang blog articles by subject.


A summary of information retrieval stages and current data science articles.


  1. Distributed Scraping With Multiple Tor Circuits
  2. Running Your Own Anonymous Rotating Proxies


  1. HTML Cleaners and Tidiers


Handling of Active Content

  1. Web Scraping Ajax and Javascript Sites

Main Content Extraction

  1. Extraction of Main Text Content Using the Google Reader NoAPI
  2. Voice Recognition + Content Extraction + TTS = Innovative Web Browsing

Language Identification

  1. Language Identification for Text Mining and NLP


  1. Automated Browserless OAuth Authentication for Twitter
  2. The Python POPO’s Way to Integrate PayPal Instant Payment Notification

APIs and NoAPIs

  1. Google Search NoAPI
  2. Exporting StackOverflow users blogs to Excel Hyperlinks
  3. Extraction of Main Text Content Using the Google Reader NoAPI
  4. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on

Policies and Data Issues

  1. Scraping vs Antiscraping
  2. The Data Portability Fact Sheet


  1. Ideas and Execution Magic Chart
  2. Ideas: Egont, A Web Orchestration Language
  3. Egont Part II

Marketing and Sales

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website


  1. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on

Big Data Stack

  1. Using Queues in Web Crawling and Analysis Infrastructure
  2. Persisting Native Python Queues
  3. Adding Acknowledgement Semantics to a Persistent Queue
  4. Esoteric Queue Scheduling Disciplines


  1. Running Microsoft Windows Console Applications with Invisible Windows



  1. Data Science Resources
Digital Art by Don Relyea

HTML Cleaners and Tidiers

Tag Soup

When you are crawling a website you will come across a lot of malformed web pages. Some typical problems are unclosed tags and mishandling of comments or CSS styles. Modern browsers have to do a good job of cleaning HTML to build the correct DOM without ambiguities. Due to performance and scalability limitations, it is more efficient to process HTML with a parser instead of using a browser or headless browsers such as HTMLUnit or PhantomJS. If your HTML parser does not incorporate the cleaning or fixing process, you will have to use an HTML cleaner or tidier.

As in other processing pipelines, if you fail to clean up malformed HTML, all subsequent processes will be stalled. It is important to choose a good HTML cleaner. Many cleaners fail to do their jobs.

HTML Cleaner List

The list of HTML cleaners is long, but the list of good ones is pretty short. In our experience the best choice is lxml.html. Other cleaners often have trouble.
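To show what the simplest part of cleaning involves, here is a toy tag balancer built on Python's standard `html.parser`. It is not lxml.html and not a substitute for it: it only closes unclosed tags and drops stray closing tags, it discards comments, and it knows nothing about HTML's implicit closing rules. The class and function names are made up for this sketch.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalancer(HTMLParser):
    """Re-emit the markup while keeping a stack of currently open tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            ' {0}="{1}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<{0}{1}>".format(tag, rendered))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append("</{0}>".format(open_tag))
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance_tags(html_text):
    parser = TagBalancer()
    parser.feed(html_text)
    parser.close()
    while parser.open_tags:  # close whatever is still open at the end
        parser.out.append("</{0}>".format(parser.open_tags.pop()))
    return "".join(parser.out)


cleaned = balance_tags("<ul><li><b>bold</li></ul></i>")
```

A production cleaner has to handle far more than this, including scripts, character encodings, and browser-compatible recovery rules, which is the reason to reach for an established library.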

Comprehensive Resources

  1. lxml.html
  2. Beautiful Soup
  3. lxml.html vs Beautiful Soup
  4. Cleaning Word’s Nasty HTML
  5. HTML Cleaners query
  6. Tag soup

Exporting StackOverflow User Blogs to Excel

It’s simpler than you think

Do you find yourself with the need to automatically convert URLs that you have imported in your cells to hyperlinks? If you search on Google there are many solutions but the top ones add complexity using macros or VBA solutions.

I needed to do it quickly in the context of retrieving information from the StackOverflow API. I was searching for all sites from users in a specific country to keep track of their blogs. So I added

   Location like '%Buenos Aires%' or Location like '%Argentina%'
   Reputation DESC

on  and exported it to CSV, but when the results were imported into Microsoft Excel I had to convert the URLs to hyperlinks manually. The results are ordered by StackOverflow reputation, highest first.

I found this solution: just use the Microsoft Excel hyperlink function. Say you have URLs as plain text in column D, starting at cell D2. In cell E2 add something like =hyperlink(D2), then fill the formula down column E, and this will do the trick. You can now click on all the cells!
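If you would rather not type the formula by hand, the same idea can be applied to the exported CSV before opening it. The following is a minimal sketch, assuming the URLs end up in spreadsheet column D and that Excel evaluates formula text when it opens a CSV file, which may depend on how you import it. The function name, the column names in the usage example and the "Link" header are made up for this illustration.

```python
import csv
import io


def add_hyperlink_column(csv_text, url_column="D", link_header="Link"):
    """Append a column of =HYPERLINK(<cell>) formulas to CSV text.

    url_column is the spreadsheet letter of the column holding the URLs.
    The first CSV row is assumed to be a header row.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for index, row in enumerate(rows):
        if index == 0:
            writer.writerow(row + [link_header])
        else:
            # Spreadsheet rows are 1-based and the header occupies row 1.
            formula = "=HYPERLINK({}{})".format(url_column, index + 1)
            writer.writerow(row + [formula])
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "Id,DisplayName,Reputation,WebsiteUrl\n"
        "1,alice,100,http://example.com\n"
    )
    print(add_hyperlink_column(sample))
```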

Please don’t tell recruiters about this. Hopefully they don’t develop web mashups.

See Also

  1. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on
  2. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website


  1. Most influential github users by location
  2. Implement OData API for StackOverflow
  3. Creative Commons Stack Overflow Data Dump

Extraction of Main Text Content Using the Google Reader NoAPI

Theo van Doesburg Dadamatinée


In this article we will see how to extract the main text content from a blog using the Google Reader NoAPI.

Extracting the main text content from a web page is an important step in the text processing pipeline. The source code of pages in HTML is usually cluttered with advertising and other text which is not related to the main content. Formally, in the context of computer science, it is impossible for a computer to distinguish between the main content and other content on the same page. That is, no algorithm can recognize it for all possible cases. Sometimes it is even difficult for humans to distinguish it. Recognition of primary content is part of the machine learning/artificial intelligence field of study.

In practice there are many ways to recognize main content. If, for example, a blog platform includes attributes which indicate where the main content is, the process will be straightforward. Similarly, if the pages on a particular site have a well defined structure, we can also infer where the main content is by sampling a few pages. In this approach, we train the recognizer to apply patterns to additional pages. Of course purely manual work is another option. The quickest way to build an army of human recognizers is to put the job on sites like Amazon’s Mechanical Turk or similar services such as Microworkers.

For a good compilation of resources related to this subject you can see:

Extracting the Main Content from a Blog

If the blog platform includes information about the main text content in its tags, writing an XPath expression for each platform will do the trick. Now imagine that you want to do it automatically, without depending on each blog platform or blog theme. In this case you can read the RSS feed, which generally only includes main text, and extract the text from there. However, not all blogs post the complete text in the feed. The TechCrunch feed, for example, shows the first part of the text, but you have to click to continue reading. In this case you can use the partial text from the feed to recognize the complete text in the HTML. A potential problem with reading RSS feeds is that they only contain the most recent articles. To get around this limitation, we can get a longer feed history from Google Reader. Google Reader has some gaps and misses some articles, but this issue is beyond the scope of this article.
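Using the partial text from the feed to locate the complete text can be sketched as follows. This is only one possible approach, assuming you have already split the page into candidate text blocks. It compares the feed excerpt against the beginning of each block with difflib from the standard library. The function names are made up for this illustration.

```python
import difflib


def normalize(text):
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return " ".join(text.split()).lower()


def find_full_text(excerpt, candidate_blocks):
    """Return (block, score) for the block whose start best matches the excerpt.

    score is a similarity ratio between 0.0 and 1.0. The caller decides
    which score is high enough to trust the match.
    """
    excerpt = normalize(excerpt)
    best_block, best_score = None, 0.0
    for block in candidate_blocks:
        # Compare against a prefix of the same length as the excerpt,
        # because the feed only carries the first part of the article.
        prefix = normalize(block)[:len(excerpt)]
        score = difflib.SequenceMatcher(None, excerpt, prefix).ratio()
        if score > best_score:
            best_block, best_score = block, score
    return best_block, best_score
```

A block that starts with exactly the excerpt scores 1.0, while navigation or advertising blocks score much lower, so a threshold on the score is a reasonable way to reject pages where no block matches.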

Getting Blog Text from Google Reader

Since Google Reader does not have a real API we will rely on the Google Reader API lib by Mauro Asprea from Wish and BAM!. He is an active reader of this blog and a friend.

We will retrieve posts by Fred Wilson, one of the most prolific VC bloggers. He has blogged on an almost daily basis since 9/23/2003 and includes the whole post within the feed.

Python code

# -*- coding: utf-8 -*-
# Python 2 script (it uses print statements).

import sys
import time
from GoogleReader import  CONST
from GoogleReader.reader import GoogleReader
import lxml.html

USERNAME = '' # Replace with your Google Reader username
PASSWORD = '' # Replace with your Google Reader password. Not included in this post :-)

gr = GoogleReader()
login_info = (USERNAME, PASSWORD)
# Assumption: the library authenticates with identify() followed by login().
gr.identify(*login_info)
if not gr.login():
    raise Exception("Can't login")

xmlfeed = gr.get_feed(url="")

COUNT = 1000 # Maximum number of feed pages to fetch
i = 1        # Current page number

# First page of entries
print >>sys.stderr, "page:", i
for entry in xmlfeed.get_entries():
    print entry['title'].encode('utf-8'), time.ctime(entry['published'])
    doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
    print doc.text_content().encode('utf-8')
    print "******************************************************************************************************"

continuation = xmlfeed.get_continuation()

# Follow the continuation token to fetch older pages
while continuation is not None and i < COUNT:
    i += 1
    print >>sys.stderr, "page:", i
    xmlfeed = gr.get_feed(url="", continuation=continuation)

    for entry in xmlfeed.get_entries():
        print entry['title'].encode('utf-8'), time.ctime(entry['published'])
        try:
            doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
            print doc.text_content().encode('utf-8')
        except Exception:
            # Show the raw content when it cannot be parsed
            print "------------------ ERROR -------------------"
            print entry['content']

        print "******************************************************************************************************"

    continuation = xmlfeed.get_continuation()


If you try this script you will realize that the oldest post retrieved is from 9/29/2005. The real first post however was on 9/23/2003. Why don’t we see it? I believe it is because Google Reader uses feed information from FeedBurner, which was launched in 2004 and acquired by Google in 2007, so they probably started recording feed entries then. Incidentally Union Square Ventures was one of the original FeedBurner investors.

There is an easier way to retrieve text in the specific case of Fred Wilson’s blog and other modern HTML5 sites. HTML5 provides an <article> tag, so you can just crawl the whole site and retrieve the content within the <article> tag. You’ll need an extra step to deduplicate the content, since many articles will be reached through more than one crawled page. For example, if you follow categories like MBA Mondays you will find articles that also appear when you follow another path.
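The <article> approach, including the deduplication step, can be sketched with the standard library alone. This is a minimal illustration, assuming each crawled page holds a single article and that duplicates are exact copies of the same text. The class and function names are made up for this illustration.

```python
import hashlib
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect the text found inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside an <article>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def article_text(page_html):
    """Return the text inside the page's <article> tag, or '' if there is none."""
    parser = ArticleText()
    parser.feed(page_html)
    parser.close()
    return " ".join(parser.chunks)


def unique_articles(pages):
    """Return article texts from crawled pages, dropping exact duplicates."""
    seen = set()
    unique = []
    for page_html in pages:
        text = article_text(page_html)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing the extracted text rather than the whole page is the point of the design: the same article reached through a category page and through the home page is wrapped in different navigation, but its <article> text is identical.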

Lessons Learned

  • We can use Google Reader to easily extract text content from blogs.
  • Google Reader has its limitations: it doesn’t cover posts before a certain date and sometimes skips posts.
  • HTML5 adds a valuable new tag for differentiating article text from the rest of the content.

See Also

  1. Voice Recognition + Content Extraction + TTS = Innovative Web Browsing
  2. Google Search NoAPI

Additional Resources

  1. Newspaper: News, full-text, and article metadata extraction in Python 3
  2. boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages
  3. Readability API
  4. HTML Content Extraction Questions on StackOverflow
  5. Google Reader Development Questions on StackOverflow

Data Science Resources

Big Data, Big Trend

Big Data is an important megatrend:

Companies such as Google, Facebook, Twitter and LinkedIn are using their vast information to discover things that are definitely not obvious, and may even challenge our common sense. Some initiatives like Recorded Future or StreamBase try to predict the future, while events like a plane crash were first reported on Twitter. One of the funniest blogs about big data is the OKCupid blog, which mines information about relationship matching and can discover connections between orgasms and exercise.

Sharing Data Science Blogs

Following is an OPML file with 228 data science related blogs: Data Science Blogs in OPML format.
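If you want to load the list programmatically instead of importing it into a feed reader, an OPML file is plain XML in which each subscription is an outline element with an xmlUrl attribute. Here is a minimal sketch using the standard library. The function name and the sample data in the usage note are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def feeds_from_opml(opml_text):
    """Return (title, feed URL) pairs for every outline that has an xmlUrl."""
    root = ET.fromstring(opml_text)
    feeds = []
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:
            title = outline.get("title") or outline.get("text") or ""
            feeds.append((title, url))
    return feeds
```

Outlines without an xmlUrl are usually folders that group other outlines, so the function skips them and only returns the feeds nested inside.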