Searching for Substrings in Streams: a Slight Modification of the Knuth-Morris-Pratt Algorithm in Haxe

It is odd that the base libraries of most programming languages do not let you search for regular expressions or substrings in streams or partial reads. We have modified the KMP algorithm so that it accepts input as a virtually infinite sequence of partial strings. The code is implemented in Haxe, so the same source can generate code for multiple programming languages.

Streams are important when working with data that does not fit in main memory, such as large files, or with data that is being transferred. There are a few implementations of regular expression and substring matching over streams. One is Jakarta Regexp, now retired and resting in the Apache Attic; its "match" method in the RE class takes a CharacterIterator as a parameter. In C++, Boost.Regex implements partial matches.
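To make the idea concrete, here is a minimal Python sketch (not the article's Haxe implementation) of a KMP matcher that keeps its automaton state between partial reads, so a stream can be fed chunk by chunk. The class and method names are our own.

import re  # only used in the doc comment sense; the matcher below needs no regex support

class StreamKMP:
    """KMP matcher whose state survives across partial reads of a stream."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.failure = self._build_failure(pattern)  # classic KMP failure function
        self.state = 0    # number of pattern characters currently matched
        self.offset = 0   # absolute position in the stream consumed so far

    @staticmethod
    def _build_failure(pattern):
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k
        return failure

    def feed(self, chunk):
        """Consume one partial read and yield the absolute start index of every match."""
        for i, ch in enumerate(chunk):
            while self.state > 0 and ch != self.pattern[self.state]:
                self.state = self.failure[self.state - 1]
            if ch == self.pattern[self.state]:
                self.state += 1
            if self.state == len(self.pattern):
                yield self.offset + i - len(self.pattern) + 1
                self.state = self.failure[self.state - 1]
        self.offset += len(chunk)

# Usage: matches survive chunk boundaries because the automaton state is kept.
matcher = StreamKMP("abab")
for chunk in ["xxab", "abx", "abab"]:
    for position in matcher.feed(chunk):
        print(position)  # prints 2 and 7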

Our code is implemented in Haxe so the same code can target JavaScript, ActionScript, Flash SWF, NekoVM, PHP, C++, C#, and Java. We really like the concept of writing code once and deploying it to a variety of platforms with minimum effort. There are excellent libraries tied to specific environments that would work perfectly well in others, but porting libraries from one programming language to another is tedious. For example, the amazing NetworkX graph library, implemented in Python, could be ported to C# to benefit a broader audience.

Code

Prerequisites

  1. Haxe (tested on version 2.10)
  2. For C++: hxcpp (run haxelib install hxcpp)
  3. For Java: hxjava (run haxelib install hxjava)
  4. For Mono/C#: hxcs (run haxelib install hxcs)

Source code is available on GitHub.

See Also

  1. Parsing S-Expressions in C# using OMeta
  2. Esoteric Queue Scheduling Disciplines

Resources

  1. Knuth-Morris-Pratt string matching
  2. Text Searching: Theory and Practice
  3. Boyer–Moore–Horspool algorithm
  4. Rabin–Karp algorithm
  5. Aho–Corasick string matching algorithm
  6. Lexicographically minimal string rotation
  7. Efficient way to search a stream for a string

Esoteric Queue Scheduling Disciplines

New Challenges Require New Tools

Big Data challenges current message-oriented middleware (MOM) applications. MOM usually works with FIFO and priority scheduling disciplines. What happens if there is a large list of URLs ready to be crawled, but you want to give URLs at the end of the list a chance of being crawled earlier? This concept comes from genetics and is used in genetic algorithm selection schemes. The last URLs may contain interesting new resources in spite of their order or priority. With a FIFO queue it takes a long time to reach them. Priority scheduling is more helpful, but it is not possible to know a priori how useful a URL will turn out to be in the quest for new Internet resources. Why not add a chance factor to URL selection by using roulette wheel scheduling and an efficient algorithm?

Data flows follow an order of execution based on task dependencies. One task cannot start until the preceding tasks have finished. This is the way a spreadsheet works: a change in a cell triggers a series of recalculations that are carried out in topological order. Why not add task dependencies to MOM applications? An item could be consumed from the queue only once its preceding tasks have completed, as sketched below. We provide some data flow resources at the end of the Egont Part II article. However, a new queue scheduling discipline could be used in place of a separate framework. Ideally, the new queue discipline would include features such as persistence and transactions.
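As a rough illustration of such a discipline, the following Python sketch gates consumption on task dependencies. The class and method names are hypothetical and not tied to any particular MOM.

class DependencyQueue:
    """Items can be consumed only once all tasks they depend on are acknowledged."""

    def __init__(self):
        self.pending = {}      # task name -> set of unfinished dependencies
        self.finished = set()

    def put(self, task, depends_on=()):
        self.pending[task] = set(depends_on) - self.finished

    def get(self):
        # Return any task whose dependencies have all been acknowledged.
        for task, deps in self.pending.items():
            if not deps:
                del self.pending[task]
                return task
        return None  # nothing is ready yet

    def done(self, task):
        # Acknowledge completion and unblock the tasks that were waiting on it.
        self.finished.add(task)
        for deps in self.pending.values():
            deps.discard(task)

q = DependencyQueue()
q.put("fetch")
q.put("extract", depends_on=["fetch"])
print(q.get())   # 'fetch'
print(q.get())   # None: 'extract' is blocked until 'fetch' is acknowledged
q.done("fetch")
print(q.get())   # 'extract'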

Roulette Wheel Scheduling Algorithm Design

To the best of our knowledge, there are currently no Internet resources about using a roulette wheel scheduling discipline for a queue.

The external interface of a roulette wheel queue is the same as that of a typical queue, with "get" and "put" methods, except that "put" takes an additional probability parameter. Probabilities can be expressed as integer weights. When a consumer requests an item, a random number is generated to decide which item is selected. Items with higher probabilities have a greater chance of being retrieved, but even items with low probabilities can be consumed.
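A naive Python sketch of this interface might look as follows. It rebuilds the running totals on every "get", which costs O(n) per selection, exactly the cost the rest of this section discusses how to avoid. The class name and weights are illustrative.

import bisect
import itertools
import random

class RouletteQueue:
    """Queue whose get() favors higher-weight items but can return any item."""

    def __init__(self):
        self.items = []     # queued items
        self.weights = []   # integer probability parameter of each item

    def put(self, item, weight):
        self.items.append(item)
        self.weights.append(weight)

    def get(self):
        # Running totals of the weights; drawing a point in [1, total] and
        # finding where it lands implements the roulette wheel.
        # Assumes at least one item is queued and all weights are positive.
        totals = list(itertools.accumulate(self.weights))
        point = random.randint(1, totals[-1])
        index = bisect.bisect_left(totals, point)
        self.weights.pop(index)
        return self.items.pop(index)

queue = RouletteQueue()
queue.put("http://example.com/popular", 90)
queue.put("http://example.com/long-tail", 10)
print(queue.get())   # usually the first URL, sometimes the second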

Implementing an efficient roulette wheel queue is not easy. Genetic algorithms use roulette wheel selection to choose among a small set of alternatives, but a queue used for crawling can contain a huge number of URLs, and the question is how to find, add, and remove items efficiently at that scale.

Finding an item in a roulette wheel data structure is O(n) with trivial traversal and O(log n) with a binary search over the running totals. Adding an item is trivial: it is appended at the end, and the new total is the previous total of the probability parameters plus the new item's parameter. Removing an item is harder. The trivial, but not the best, solution is to recalculate all the partial sums after the element being removed. A better solution is to use a heap data structure or one of its variants.

An alternative that merits further study is the use of Fenwick trees. In 1994, Peter M. Fenwick showed how to find, add, and modify items, and how to compute their running subtotals, efficiently. Since the Fenwick tree works over a fixed range of items, item keys must be preallocated.
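As a rough illustration of that idea, the following Python sketch uses a Fenwick tree over a preallocated range of slots, so updating a weight and locating the slot that a random point falls into are both O(log n). The class and method names are our own, not from Fenwick's paper.

import random

class FenwickRoulette:
    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)   # 1-based Fenwick tree of partial sums

    def update(self, slot, delta):
        # Add delta to the weight stored in slot (1-based). Removing an item
        # amounts to adding a negative delta equal to its current weight.
        while slot <= self.size:
            self.tree[slot] += delta
            slot += slot & -slot

    def total(self):
        s, slot = 0, self.size
        while slot > 0:
            s += self.tree[slot]
            slot -= slot & -slot
        return s

    def find(self, point):
        # Return the first slot whose cumulative weight reaches point.
        slot, mask = 0, 1
        while mask * 2 <= self.size:
            mask *= 2
        while mask > 0:
            nxt = slot + mask
            if nxt <= self.size and self.tree[nxt] < point:
                point -= self.tree[nxt]
                slot = nxt
            mask //= 2
        return slot + 1

    def pick(self):
        return self.find(random.randint(1, self.total()))

wheel = FenwickRoulette(size=8)   # slots are preallocated, as noted above
wheel.update(1, 90)               # slot 1 holds a weight of 90
wheel.update(2, 10)               # slot 2 holds a weight of 10
print(wheel.pick())               # 1 about 90% of the time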

See Also

  1. Using Queues in Web Crawling and Analysis Infrastructure
  2. Persisting Native Python Queues
  3. Adding Acknowledgement Semantics to a Persistent Queue
  4. Ideas: Egont, A Web Orchestration Language

Resources

  1. A New Data Structure for Cumulative Frequency Tables
  2. Select random k elements from a list whose elements have weights and the roulette wheel answer
  3. A comparative analysis of selection schemes used in genetic algorithms
  4. A Framework for Alternate Queueing: Towards Traffic Management by PC-UNIX Based Routers
  5. Stack and Queue Layouts of Directed Acyclic Graphs
  6. Dynamic Data Structures for Taskgraph Scheduling Policies with Applications in OpenCL Accelerators

[Photo by Kristofer Björkman]

Using Queues in Web Crawling and Analysis Infrastructure

Message-oriented middleware (MOM) is a key technology for implementing a custom pipeline and analyzing unstructured data. The pipeline from crawling web pages to part-of-speech tagging (PoST) and beyond is long. It requires a variety of processes which are implemented in several different programming languages and operating systems. For example, boilerpipe is an excellent Java library for extracting main text content, while PoST libraries such as NLTK or FreeLing are used from Python.

One might be tempted to integrate the different technologies using web services, but web services alone have many weak points. If the pipeline has ten processes and, say, the last one fails, the intermediate results can be lost if they have not been persisted, and a higher-level mechanism must be in place to resume the pipeline processing. MOMs ensure message persistence until a consumer acknowledges that a specific process has finished.

There are a lot of MOMs to choose from, including commercial and free open source variants. Some features are present in almost all of them while others are not. Contention management is an important feature if you are dealing, as is likely, with a high ratio of messages produced to messages consumed at any one time. For example, a web crawler can fetch web pages at an incredibly high speed while processes like content extraction take longer. Running a message queue without contention management under these circumstances will exhaust the machine’s memory.
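The following Python sketch illustrates the backpressure idea behind contention management, independently of any particular MOM. The queue size and timings are arbitrary placeholders.

import queue
import threading
import time

pages = queue.Queue(maxsize=100)   # the bound applies backpressure to the producer

def crawler():
    # The producer is fast: put() blocks whenever the queue is full,
    # instead of letting fetched pages pile up in memory.
    for i in range(1000):
        pages.put("page-%d" % i)

def extractor():
    # The consumer is slow, e.g. main text extraction and tagging.
    while True:
        page = pages.get()
        time.sleep(0.01)
        pages.task_done()

threading.Thread(target=extractor, daemon=True).start()
crawler()
pages.join()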

While MOMs are important for uniting heterogeneous technologies, the different processes must also know which queues to use to consume their input and produce the output for the next phases. A new wave of frameworks like NServiceBus, Resque, Celery, and Octobot has emerged to handle this.

In conclusion, MOMs help to connect heterogeneous technologies and bring robustness, and they are very useful in the context of unstructured information processing such as text analysis. Many MOMs are available, but no single one has a complete feature set. However, some of the missing features can be supplied by frameworks such as NServiceBus, Resque, Celery, and Octobot.

See Also

  1. Esoteric Queue Scheduling Disciplines
  2. Persisting Native Python Queues
  3. Adding Acknowledgement Semantics to a Persistent Queue

Resources

  1. Message Queues vs Web Services
  2. Message Queue Evaluation Notes
  3. The Hadoop Map-Reduce Capacity Scheduler
  4. Contention Management in the WSA
  5. Message Queuing Architectures

Egont Part II

(part I here)

Description

Egont is a shared space where users mash up personal information.
Its top goals are:
  • Discovering and curating new information in a personalized and dynamic way.
  • Promoting emergent behavior in a shared programming environment.
  • Facilitating serendipity.

Egont is a personalization environment where users can connect to, import, expose, and index data from their web services. They can also apply functions to build mashups around their personal interests, much as in a spreadsheet. On Egont, users can combine and exchange information. For example, users can connect their Egont accounts to a variety of services such as movie rankings and merge the rankings from their social networks. If they want to find independent films, they can filter out blockbusters. When users in their social networks update their rankings, these updates are processed and the result is automatically recalculated. The same idea can be applied to Twitter streams or blog posts. One user can apply a filter to those streams to curate information apart from mainstream trends and recommendation systems, while other users can build new filters on top of this user's data. Third parties can take advantage of the data flowing through this shared environment by developing new information functions.

Egont has a simple programming language in which experienced users can access other users' variable namespaces and set fine-grained security settings to enable or restrict the flow of information. Less experienced users personalize their Egont experience through a simpler web interface.

Summary

Egont is composed of the following elements:
  1. A data flow engine.
  2. A data store where cell values are persisted.
  3. A web application.
  4. A simple programming language.

Data Flow Engine

The data flow engine works like a spreadsheet. Some cells may be dependent on others, and values are recalculated only when necessary. For example, one cell may contain a function that retrieves new tweets, while another cell takes those tweets and uses a second function to extract named entities such as places or proper names. Users can personalize the vast flow of information from many sources to process, aggregate, and filter information. The data flow engine limits recalculation to the affected cells only.
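To make the spreadsheet analogy concrete, here is a minimal Python sketch of dependency-driven recalculation. The API and cell names are hypothetical; this is not Egont's actual engine.

from collections import defaultdict

class DataFlowEngine:
    def __init__(self):
        self.formulas = {}                    # cell name -> (function, dependency names)
        self.values = {}                      # cached cell values
        self.dependents = defaultdict(set)    # cell -> cells that read its value

    def define(self, name, func, deps=()):
        self.formulas[name] = (func, tuple(deps))
        for dep in deps:
            self.dependents[dep].add(name)
        self._recalculate(name)

    def set_value(self, name, value):
        self.values[name] = value
        for cell in self._affected(name):
            self._recalculate(cell)

    def _affected(self, name):
        # Downstream cells in topological order (reverse DFS post-order).
        order, seen = [], set()
        def visit(cell):
            for dependent in self.dependents[cell]:
                if dependent not in seen:
                    seen.add(dependent)
                    visit(dependent)
                    order.append(dependent)
        visit(name)
        return reversed(order)

    def _recalculate(self, name):
        func, deps = self.formulas[name]
        self.values[name] = func(*[self.values[d] for d in deps])

engine = DataFlowEngine()
engine.set_value("tweets", ["PyCon is in Santa Clara", "Rain in London"])
engine.define("places", lambda tweets: [t.split(" in ")[-1] for t in tweets], deps=["tweets"])
engine.set_value("tweets", ["Snow in Oslo"])
print(engine.values["places"])   # ['Oslo'] -- only the affected cell was recalculated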

The key feature of the engine is its ability to apply functions to a set of shared cells from other users. Another important feature is the handling of security settings. Users can configure which cells are shared with which users at a very granular level.

Web Application

The web application has two important parts. One is the editor where advanced users can use the browser to edit their Egont scripts. The other is a simpler user interface where users are able to define their sources of information and apply functions to them more easily.

Programming Language

The goal of Egont is to simplify the building of personalization and mashups, so its programming language is oriented to quickly orchestrating user information.

This is a rough example of how an advanced user could use the Egont programming language to merge friends' movie rankings.

friends <- [egont.users.alice, egont.users.bob, me] # list of friends.
movies_ranking <- imdb.ranking("swain-4") # persist my ranking on movies_ranking from my user on IMDB.
movies_average <- average(apply(friends, 'movies_ranking')) # calculate the average of the movie rankings from my specified friends. It only changes when rankings are updated.
egont.feeds <- movies_average # expose the results as a feed in the web application.

Whenever any of the above users modifies a movie's ranking, Egont recalculates that movie's score.

With Egont, we will have a place where we can discover new resources, research our interests, and create a community capable of sifting through the ever more vast sea of data available on today's web.

See Also

  1. Parsing S-Expressions in C# using OMeta

Resources

  1. A Brief History of Spreadsheets
  2. Kahn process networks
  3. Directed acyclic graph
  4. Advances in IC-Scheduling Theory: Scheduling Expansive and Reductive Dags and Scheduling Dags via Duality
  5. Pregel: A System for Large-Scale Graph Processing
  6. Grzegorz Malewicz’s Google Research page
  7. CIEL: a universal execution engine for distributed data-flow computing
  8. Bloom Programming Language (via ComingThoughts)

Extraction of Main Text Content Using the Google Reader NoAPI

[Image: Theo van Doesburg Dadamatinée]

Introduction

In this article we will see how to extract the main text content from a blog using the Google Reader NoAPI.

Extracting the main text content from a web page is an important step in the text-processing pipeline. The HTML source code of pages is usually cluttered with advertising and other text that is not related to the main content. Formally, in the context of computer science, it is impossible for a computer to distinguish the main content from the other content on a page in every case; that is, no algorithm can recognize it for all possible pages. Sometimes it is even difficult for humans. Recognition of primary content belongs to the machine learning/artificial intelligence field of study.

In practice there are many ways to recognize main content. If, for example, a blog platform includes attributes which indicate where the main content is, the process is straightforward. Similarly, if the pages on a particular site have a well-defined structure, we can infer where the main content is by sampling a few pages; in this approach, we train a recognizer to apply the learned patterns to additional pages. Purely manual work is, of course, another option. The quickest way to build an army of human recognizers is to post the job on sites like Amazon's Mechanical Turk or similar services such as Microworkers.

For a good compilation of resources related to this subject, see the Additional Resources section at the end of this article.

Extracting the Main Content from a Blog

If the blog platform marks the main text content with specific tags or attributes, writing an XPath expression for each platform will do the trick. Now imagine that you want to do it automatically, without depending on each blog platform or blog theme. In this case you can read the RSS feed, which generally includes only the main text, and extract the text from there. However, not all blogs publish the complete text in the feed. The TechCrunch feed, for example, shows only the first part of the text, and you have to click through to continue reading. In this case you can use the partial text from the feed to recognize the complete text in the HTML. A potential problem with reading RSS feeds is that they only contain the most recent articles. To get around this limitation, we can retrieve a longer feed history from Google Reader. Google Reader has some gaps and misses some articles, but this issue is beyond the scope of this article.
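For the simple case where the platform marks the content, a Python sketch along these lines is enough. The "post-content" class and the URL are hypothetical examples, not a real blog theme.

import lxml.html

# Fetch and parse the page, then keep only the node the theme uses for the post body.
doc = lxml.html.parse("http://blog.example.com/some-post").getroot()
nodes = doc.xpath("//div[@class='post-content']")
if nodes:
    print(nodes[0].text_content())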

Getting Blog Text from Google Reader

Since Google Reader does not have a real API, we will rely on the Google Reader API lib by Mauro Asprea from Wish and BAM!. He is an active reader of this blog and a friend.

We will retrieve posts by Fred Wilson, one of the most prolific VC bloggers: he has blogged on an almost daily basis since 9/23/2003 and includes the whole post within the feed.

Python code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
import time
from GoogleReader import CONST
from GoogleReader.reader import GoogleReader
import lxml.html

USERNAME = '' # Replace with your Google Reader username
PASSWORD = '' # Replace with your Google Reader password. Not included in this post :-)

gr = GoogleReader()
login_info = (USERNAME, PASSWORD)
gr.identify(*login_info)
gr.login()

gr.add_subscription(url="http://feeds.feedburner.com/avc")
xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc")

COUNT = 1000
i=0

print >>sys.stderr, "page:", i
for entry in xmlfeed.get_entries():
   print entry['title'].encode('utf-8'), time.ctime(entry['published'])
   doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
   print doc.text_content().encode('utf-8')
   print "******************************************************************************************************"

continuation = xmlfeed.get_continuation()

i+=1
while continuation is not None and i < COUNT:
   print >>sys.stderr, "page:", i
   xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc", continuation = continuation)

   for entry in xmlfeed.get_entries():
      print entry['title'].encode('utf-8'), time.ctime(entry['published'])
      try:
         doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
         print doc.text_content().encode('utf-8')
      except Exception:
         print "------------------ ERROR -------------------"
         print entry['content']

      print "******************************************************************************************************"

   continuation = xmlfeed.get_continuation()
   i+=1

Notes

If you try this script you will realize that the oldest post retrieved is from 9/29/2005. The real first post, however, was on 9/23/2003. Why don't we see it? I believe it is because Google Reader uses feed information from FeedBurner, which was launched in 2004 and acquired by Google in 2007, so they probably started recording feed entries then. Incidentally, Union Square Ventures was one of the original FeedBurner investors.

There is an easier way to retrieve the text in the specific case of Fred Wilson's blog and other modern HTML5 sites. HTML5 provides an <article> tag, so you can just crawl the whole site and retrieve the content within the <article> tag. You will need an extra step to deduplicate the content, since many of the crawled pages will appear more than once. For example, if you follow categories like MBA Mondays you will find articles that also appear when you follow another path.
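A rough Python sketch of this approach, with a placeholder URL and deduplication by hashing the extracted text, might look like this:

import hashlib
import lxml.html

seen = set()   # hashes of article texts we have already emitted

def extract_article(url):
    # Parse the crawled page and keep only what is inside the <article> tag.
    doc = lxml.html.parse(url).getroot()
    articles = doc.xpath("//article")
    if not articles:
        return None
    text = articles[0].text_content()
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    if digest in seen:   # the same post reached through another category page
        return None
    seen.add(digest)
    return text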

Lessons Learned

  • We can use Google Reader to easily extract text content from blogs.
  • Google Reader has its limitations: it doesn't cover posts before a certain date and it sometimes skips posts.
  • HTML5 adds a valuable new tag for differentiating article text from the rest of the content.

See Also

  1. Voice Recognition + Content Extraction + TTS = Innovative Web Browsing
  2. Google Search NoAPI

Additional Resources

  1. Newspaper: News, full-text, and article metadata extraction in Python 3
  2. boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages
  3. Readability API
  4. HTML Content Extraction Questions on StackOverflow
  5. Google Reader Development Questions on StackOverflow

Data Science Resources

Big Data, Big Trend

Big Data is an important megatrend:

Companies such as Google, Facebook, Twitter, and LinkedIn are using their vast stores of information to discover things that are far from obvious and may even challenge our common sense. Initiatives like Recorded Future or StreamBase try to predict the future, while events such as plane crashes have been reported first on Twitter. One of the funniest blogs about big data is OKCupid's, which mines relationship-matching information and has found connections such as the one between orgasms and exercise.

Sharing Data Science Blogs

Following is an OPML file with 228 data-science-related blogs: Data Science Blogs in OPML format.