content-extraction | Data Big Bang Blog

An Interesting Opportunity

Voice recognition and text to speech technologies make a good combo for future user interfaces. Imagine browsing the web by voice and listening to blogs like you listen to podcasts. Your mobile phone will be able to provide these features in the near future. Voice web browsing is not only useful for the visually impaired, but also for users who wish to do other things while surfing the web. Voice recognition and text to speech feature prominently on the iPhone 4S, but voice web browsing is not incorporated.

This blog post includes code which allows you to extract the main text content from a web page and convert it to a playable audio file. The process is triggered by simple voice commands such as “receive hacker news”, “read article 7” or “save article 19”. The resulting audio file can also be synchronized to other services such as DropBox, Amazon Cloud Drive, and Apple iCloud to be played later. We use IKVM to allow us to run the boilerpipe library, which is written in Java, over .NET. We choose .NET because it comes with “batteries included”: ready to use voice recognition and text to speech capabilities.

The present article demonstrates the application of main text content extraction. For methods of MTCE see our article: Extraction of Main Text Content Using the Google Reader NoAPI. For further exploration of voice recognition and text to speech .NET capabilities consult the following links:

Good voice recognition and text to speech systems are expensive and require training. Companies which provide these services do not offer a more granular way to access a web service or run a local engine. For example, AT&T Natural Voice for TTS sounds good, but their licensing terms are prohibitive for small companies and startups. On the voice recognition side of the equation, Nuance has been accumulating patents to strengthen their market position, making it difficult for others to compete. Although many other companies offer voice recognition, most of them actually use Nuance’s technology. See for example: Siri, Do You Use Nuance Technology? Siri: I’m Sorry, I Can’t Answer That. Voice recognition systems require training to improve their accuracy. SpinVox, a Nuance subsidiary, used “conversion experts”. They built teams which listened to audio messages and manually converted them to text. If you want to use this approach, you’ll need to wait for Amazon’s Mechanical Turk to offer micro jobs in real time.

What is missing here? None of the leading companies offer good quality voice recognition on a charge per use basis. Google seems to be actively researching voice recognition, and has achieved impressive results with “experimental” speech recognition technology. Sadly, Google’s voice recognition and text to speech APIs can not be used to develop all desktop and server applications. Their use is restricted to Android phones, Chrome’s beta html5 support, and Chrome’s extensions. It would be nice for Google to remove this restriction and include this service on their web APIs Console.

Our code demonstrates that it is possible to use voice recognition and text to speech while avoiding the licensing, patent and API conundrum.

Using VR and TTS under .NET

Prerequisites

If you use our VoiceWebBrowsing code, also available on GitHub, you will just need Microsoft Visual Studio 2010. However if there is a new version of boilerpipe you will have to generate the boilerpipe .NET assemblies yourself as follows:

Have Microsoft Visual Studio 2010
Download boilerpipe from http://code.google.com/p/boilerpipe/
Download and install IKVM from http://www.ikvm.net/
Run boilerpipe library and dependencies through ikvmc: ikvmc -nojni -target:library boilerpipe-1.2.0.jar lib\nekohtml-1.9.13.jar lib\xerces-2.9.1.jar
Use the resulting boilerpipe-1.2.0.dll .NET assembly from ikvmc.

Code

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Speech.Recognition;
using System.Speech.Synthesis;
using System.Speech.AudioFormat;
using System.Net;
using System.IO;
using System.Xml;

namespace VoiceWebBrowsing
{
    public partial class MainForm : Form
    {
        #region Constants
        //const string _bogusRSSFeed = "<items><item><title>First title</title><link>http://</link></item><item><title>Second title</title><link>http://</link></item></items>";
        const string _bogusRSSFeed = null;
        const string downloadPath = @"..\..\..\..\Download";
        #endregion

        #region Private Variables
        SpeechRecognizer _speechRecognizer = new SpeechRecognizer();
        SpeechSynthesizer _ttsVoice = new SpeechSynthesizer();
        SpeechAudioFormatInfo formatInfo = new SpeechAudioFormatInfo(8000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
        Queue<string> _queue = new Queue<string>();
        List<string> articleList = new List<string>();
        HashSet<SpeechSynthesizer> tts2FileTasks = new HashSet<SpeechSynthesizer>();
        #endregion

        #region Private Methods
        private void InitGrammar()
        {
            GrammarBuilder readGrammar = new Choices(new string[] { "read article" });
            Choices articleChoice = new Choices();
            for (int i = 1; i <= 30; i++)
            {
                articleChoice.Add(i.ToString());
            }
            readGrammar.Append(articleChoice);

            GrammarBuilder saveGrammar = new Choices(new string[] { "save article" });
            saveGrammar.Append(articleChoice);

            GrammarBuilder otherGrammar = new Choices(new string[] { "receive hacker news", "stop", "test" });

            //GrammarBuilder commands = new Choices(new string[] { "receive hacker news", "stop", "test" });
            Choices commands = new Choices();
            commands.Add(new Choices(new GrammarBuilder[] { readGrammar, saveGrammar, otherGrammar }));

            var grammar = new Grammar(commands);
            this._speechRecognizer.LoadGrammar(grammar);
        }

        private void Say(string text)
        {
            this._ttsVoice.SpeakAsync(text);
        }

        private void ReadHackerNewsFeed()
        {
            string hackerNewsRSSUrl = "http://news.ycombinator.com/rss";

            using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
            {
                //
                // an app.config is added to surpress: The server committed a protocol violation. Section=ResponseStatusLine
                //
                string rssXmlStr = null;
                if (_bogusRSSFeed == null)
                    rssXmlStr = client.DownloadString(hackerNewsRSSUrl);
                else
                    rssXmlStr = _bogusRSSFeed;
                XmlDocument xmlDoc = new XmlDocument();
                xmlDoc.LoadXml(rssXmlStr);

                XmlNodeList items = xmlDoc.SelectNodes("//item");

                int counter = 1;
                articleList.Clear();
                foreach (XmlNode item in items)
                {
                    string title = item.SelectSingleNode("title").InnerText;
                    string link = item.SelectSingleNode("link").InnerText;
                    articleList.Add(link);
                    Say("article " + counter.ToString() + " " + title);
                    System.Diagnostics.Debug.WriteLine(title);

                    counter++;
                }
            }
        }

        private void ReceiveHackerNewsButton_Click(object sender, EventArgs e)
        {
            ReceiveHackerNewsCommand();
        }

        private void SaveArticle(string link, string article)
        {
            SpeechSynthesizer tts2File = new SpeechSynthesizer();
            tts2File.SpeakStarted += new EventHandler<SpeakStartedEventArgs>(tts2File_SpeakStarted);
            tts2File.SpeakCompleted += new EventHandler<SpeakCompletedEventArgs>(tts2File_SpeakCompleted);
            System.Security.Cryptography.SHA1Managed hashAlgorithm = new System.Security.Cryptography.SHA1Managed();
            hashAlgorithm.Initialize();
            byte[] buffer = Encoding.UTF8.GetBytes(link);
            byte[] hash = hashAlgorithm.ComputeHash(buffer);
            string fileName = BitConverter.ToString(hash).Replace("-", string.Empty) + ".wav";
            string executionPath = System.Reflection.Assembly.GetExecutingAssembly().Location;
            string fullPath = Path.Combine(executionPath, downloadPath, fileName);
            hashAlgorithm.Clear();
            tts2File.SetOutputToWaveFile(fullPath, formatInfo);
            tts2File.SpeakAsync(article);
            this.tts2FileTasks.Add(tts2File);
        }

        #endregion

        #region Constructor
        public MainForm()
        {
            InitializeComponent();
            _speechRecognizer.Enabled = true;
            _speechRecognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(_speechRecognizer_SpeechRecognized);
        }
        #endregion

        #region Events
        void _ttsVoice_SpeakCompleted(object sender, SpeakCompletedEventArgs e)
        {
            this._ttsVoice.SetOutputToNull(); // Needed for flushing file buffers.
        }

        void _speechRecognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            string command = e.Result.Text;
            CommandTextBox.Text = command;

            if (command == "test")
            {
                return;
            }

            if (command == "stop")
            {
                StopCommand();

                return;
            }

            if (command == "receive hacker news")
            {
                ReceiveHackerNewsCommand();

                return;
            }

            if (command.Contains("read article"))
            {
                string[] words = command.Split(' ');

                ReadArticleCommand(Decimal.Parse(words[2]));

                return;
            }

            if (command.Contains("save article"))
            {
                string[] words = command.Split(' ');

                SaveArticleCommand(Decimal.Parse(words[2]));

                return;
            }
        }

        private void MainForm_Load(object sender, EventArgs e)
        {
            InitGrammar();
            BackgroundWorker.RunWorkerAsync();
        }

        void tts2File_SpeakStarted(object sender, SpeakStartedEventArgs e)
        {
        }

        void tts2File_SpeakCompleted(object sender, SpeakCompletedEventArgs e)
        {
            SpeechSynthesizer tts2File = (SpeechSynthesizer)sender;

            tts2File.SetOutputToNull();
            this.tts2FileTasks.Remove(tts2File);
        }

        private void StopButton_Click(object sender, EventArgs e)
        {
            StopCommand();
        }
        private void ReadArticleButton_Click(object sender, EventArgs e)
        {
            ReadArticleCommand(ArticleNumberUpDown.Value);
        }

        private void SaveArticleButton_Click(object sender, EventArgs e)
        {
            SaveArticleCommand(ArticleNumberUpDown.Value);
        }

        private void BackgroundWorker_DoWork(object sender, DoWorkEventArgs e)
        {
            while (true)
            {
                System.Threading.Thread.Sleep(125);

                lock(this)
                {
                    if(this._queue.Count > 0)
                    {
                        string cmd = this._queue.Dequeue();

                        if (cmd != null)
                        {
                            System.Uri uri = new Uri(cmd);
                            if (uri.Scheme == "voicewebbrowsing")
                            {
                                if (uri.Host == "receivehackernews")
                                {
                                    ReadHackerNewsFeed();
                                }
                                else if (uri.Host == "stop")
                                {
                                    this._ttsVoice.SpeakAsyncCancelAll();
                                }
                                else if (uri.Host == "readarticle" || uri.Host == "savearticle")
                                {
                                    string articleNumberStr = System.IO.Path.GetFileName(uri.AbsolutePath);
                                    int articleNumber = int.Parse(articleNumberStr);

                                    if (articleNumber > articleList.Count)
                                    {
                                        Say("please retrieve hacker news articles first");
                                    }
                                    else
                                    {
                                        articleNumber--; // 0-based index
                                        string link = articleList[articleNumber];

                                        java.net.URL url = new java.net.URL(link);
                                        string article = de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE.getText(url);
                                        if (uri.Host == "readarticle")
                                        {
                                            Say(article);
                                        }
                                        else if (uri.Host == "savearticle")
                                        {
                                            SaveArticle(link, article);
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        #endregion

        #region Commands
        private void ReceiveHackerNewsCommand()
        {
            StopCommand();
            lock (this)
            {
                this._queue.Enqueue("voicewebbrowsing://receivehackernews");
            }
        }
        private void StopCommand()
        {
            lock (this)
            {
                this._queue.Enqueue("voicewebbrowsing://stop");
            }
        }

        private void ReadArticleCommand(Decimal article)
        {
            StopCommand();
            lock (this)
            {
                this._queue.Enqueue(String.Format("voicewebbrowsing://readarticle/{0}", article.ToString()));
            }
        }

        private void SaveArticleCommand(decimal article)
        {
            lock (this)
            {
                this._queue.Enqueue(String.Format("voicewebbrowsing://savearticle/{0}", article.ToString()));
            }
        }
        #endregion
    }
}

Where can you go from here?

You can write a continuous Hacker News front page reader which constantly checks the feed and reads you new titles.
You can write a voice oriented mobile web browser for Windows Phone, Google Android, and for Google Chrome using their APIs. To write a mobile browser for iPhones, you will have to wait for iOS Siri API.
You can write a cloud service to provide a web page to a podcast converter service, store the audio file in the cloud, and automatically convert Google Reader’s starred items.
Finally, you can research how to improve TTS and VR technologies.

Resources

Dragon Speech Recognition Software
Publications by Googlers in Speech Processing
Patent case seeks to silence Nuance voice recognition
Nuance Loses First Patent Fight with Vlingo, Others to Follow
eSpeak: Open Source Text to Speech
CMU Sphinx: Speech Recognition Open Source Toolkit
SAM: The First Commercial voice synthesis program for Commodore 64, Apple and Atari computers.
Siri
Microsoft Tellme
The Mobile Challenge: My Personal Rants
List of speech recognition software
What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?
Siri for everyone, with Pioneer’s Zypr API
Reverse Engineering and Cracking Apple Siri with SiriProxy

Introduction

In this article we will see how to extract the main text content from a blog using the Google Reader NoAPI.

Extracting the main text content from a web page is an important step in the text processing pipeline. The source code of pages in HTML is usually cluttered with advertising and other text which is not related to the main content. Formally, in the context of computer science, it is impossible for a computer to distinguish between the main content and other content on the same page. That is, no algorithm can recognize it for all possible cases. Sometimes it is even difficult for humans to distinguish it. Recognition of primary content is part of the machine learning/artificial intelligence field of study.

In practice there are many ways to recognize main content. If, for example, a blog platform includes attributes which indicate where the main content is, the process will be straightforward. Similarly, If the pages on a particular site have a well defined structure, we can also infer where the main content is by sampling a few pages. In this approach, we train the recognizer to apply patterns to additional pages. Of course purely manual work is another option. The quickest way to build an army of human recognizers is to put the job on sites like Amazon’s Mechanical Turk or similar services such as Microworkers.

For a good compilation of resources related to this subject you can see:

Extracting the Main Content from a Blog

If the blog platform includes information about the main text content on their tags, making an XPath expression for each one will do the trick. Now imagine that you want to do it automatically, without depending on each blog platform or blog theme. In this case you can read the RSS feed, which generally only includes main text, and extract the text from there. However, not all blogs post the complete text in the feed. The TechCrunch feed, for example, shows the first part of the text, but you have to click to continue reading. In this case you can use the partial text from the feed to recognize the complete text in the HTML. A potential problem with reading RSS feeds is that they only contain the most recent articles. To get around this limitation, we can get a longer feed history from Google Reader. Google Reader has some gaps and misses some articles, but this issue is beyond the scope of this article.

Getting Blog Text from Google Reader

Since Google Reader does not have a real API we will rely on the Google Reader API lib by Mauro Asprea from Wish and BAM!. He is an active reader of this blog and a friend.

We will retrieve posts by Fred Wilson, one of the most prolific VC bloggers, since he has blogged since 9/23/2003 on an almost daily basis, and includes the whole post within the feed.

Python code

#!/usr/bin/python
# *-* coding: utf-8 *-*

import sys
import time
from GoogleReader import  CONST
from GoogleReader.reader import GoogleReader
import lxml.html

USERNAME = '' # Replace with your Google Reader username
PASSWORD = '' # Replace with your Google Reader password. Not included in this post :-)

gr = GoogleReader()
login_info = (USERNAME, PASSWORD)
gr.identify(*login_info)
gr.login()

gr.add_subscription(url="http://feeds.feedburner.com/avc")
xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc")

COUNT = 1000
i=0

print >>sys.stderr, "page:", i
for entry in xmlfeed.get_entries():
   print entry['title'].encode('utf-8'), time.ctime(entry['published'])
   doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
   print doc.text_content().encode('utf-8')
   print "******************************************************************************************************"

continuation = xmlfeed.get_continuation()

i+=1
while continuation != None and i < COUNT:
   print >>sys.stderr, "page:", i
   xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc", continuation = continuation)

   for entry in xmlfeed.get_entries():
      print entry['title'].encode('utf-8'), time.ctime(entry['published'])
      try:
         doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
         print doc.text_content().encode('utf-8')
      except:
         print "------------------ ERROR -------------------"
         print entry['content']

      print "******************************************************************************************************"

   continuation = xmlfeed.get_continuation()
   i+=1

Notes

If you try this script you will realize that the oldest post retrieved is from 9/29/2005. The real first post however was on 9/23/2003. Why don’t we see it? I believe it is because Google Reader uses feed information from FeedBurner, which was launched in 2004 and acquired by Google in 2007, so they probably started recording feed entries then. Incidentally Union Square Ventures was one of the original FeedBurner investors.

There is an easier way to retrieve text in the specific case of Fred Wilson’s blog and other HTML5 modern sites. HTML5 provides an <article> tag, so you can just crawl the whole site and retrieve the content within the <article> tag. You’ll need an extra step to deduplicate the content since many of the crawled pages will appear more than once. For example if you follow categories like MBA Mondays you will find articles that also appear when you follow another path.

Lessons Learned

We can use Google Reader to easily extract text content from blogs.
Google Reader has its limitations: it doesn’t cover posts before a certain data and sometimes skips posts.
HTML5 adds a valuable new tag for differentiating article text from the rest of the content.

Data Big Bang Blog

Creativity and Problem Solving for Data Science (whatever it may mean…) | An experimental spin-off from Nektra Advanced Computing

Menu

Tag Archives: content-extraction

Voice Recognition + Content Extraction + TTS = Innovative Web Browsing