Exporting StackOverflow User Blogs to Excel

It’s more simple than what you think

Do you find yourself with the need to automatically convert URLs that you have imported in your cells to hyperlinks? If you search on Google there are many solutions but the top ones add complexity using macros or VBA solutions.

I needed to do it quickly in the context of retrieving information from the StackOverflow API. I was searching for all sites from users in a specific country to keep track of their blogs. So I added

SELECT
   Id,
   DisplayName,
   Reputation,
   WebsiteUrl
FROM
   Users
WHERE
   Location like '%Buenos Aires%' or Location like '%Argentina%'
ORDER BY
   Reputation DESC

on http://data.stackexchange.com/stackoverflow/q/110548/  and exported it to CSV but when they were imported to Microsoft Excel I needed to convert the URLs manually. The results are ordered by StackOverflow ranking.

I found this solution, you can just use the Microsoft Excel hyperlink function. Say you have URLs as plain text in the cell D2, then on column E2 add something like =hyperlink(D2) and this will do the trick. You can now click on all the cells!

Please don’t tell recruiters about this. Hopefully they don’t develop web mashups.

See Also

  1. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on
  2. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website

Resources

  1. Most influential github users by location
  2. Implement OData API for StackOverflow
  3. Creative Commons Stack Overflow Data Dump

Extraction of Main Text Content Using the Google Reader NoAPI

Theo van Doesburg Dadamatinée

Introduction

In this article we will see how to extract the main text content from a blog using the Google Reader NoAPI.

Extracting the main text content from a web page is an important step in the text processing pipeline. The source code of pages in HTML is usually cluttered with advertising and other text which is not related to the main content. Formally, in the context of computer science, it is impossible for a computer to distinguish between the main content and other content on the same page. That is, no algorithm can recognize it for all possible cases. Sometimes it is even difficult for humans to distinguish it. Recognition of primary content is part of the machine learning/artificial intelligence field of study.

In practice there are many ways to recognize main content. If, for example, a blog platform includes attributes which indicate where the main content is, the process will be straightforward. Similarly, If the pages on a particular site have a well defined structure, we can also infer where the main content is by sampling a few pages. In this approach, we train the recognizer to apply patterns to additional pages. Of course purely manual work is another option. The quickest way to build an army of human recognizers is to put the job on sites like Amazon’s Mechanical Turk or similar services such as Microworkers.

For a good compilation of resources related to this subject you can see:

Extracting the Main Content from a Blog

If the blog platform includes information about the main text content on their tags, making an XPath expression for each one will do the trick. Now imagine that you want to do it automatically, without depending on each blog platform or blog theme. In this case you can read the RSS feed, which generally only includes main text, and extract the text from there. However, not all blogs post the complete text in the feed. The TechCrunch feed, for example, shows the first part of the text, but you have to click to continue reading. In this case you can use the partial text from the feed to recognize the complete text in the HTML. A potential problem with reading RSS feeds is that they only contain the most recent articles. To get around this limitation, we can get a longer feed history from Google Reader. Google Reader has some gaps and misses some articles, but this issue is beyond the scope of this article.

Getting Blog Text from Google Reader

Since Google Reader does not have a real API we will rely on the Google Reader API lib by Mauro Asprea from Wish and BAM!. He is an active reader of this blog and a friend.

We will retrieve posts by Fred Wilson, one of the most prolific VC bloggers, since he has blogged since 9/23/2003 on an almost daily basis, and includes the whole post within the feed.

Python code

#!/usr/bin/python
# *-* coding: utf-8 *-*

import sys
import time
from GoogleReader import  CONST
from GoogleReader.reader import GoogleReader
import lxml.html

USERNAME = '' # Replace with your Google Reader username
PASSWORD = '' # Replace with your Google Reader password. Not included in this post :-)

gr = GoogleReader()
login_info = (USERNAME, PASSWORD)
gr.identify(*login_info)
gr.login()

gr.add_subscription(url="http://feeds.feedburner.com/avc")
xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc")

COUNT = 1000
i=0

print >>sys.stderr, "page:", i
for entry in xmlfeed.get_entries():
   print entry['title'].encode('utf-8'), time.ctime(entry['published'])
   doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
   print doc.text_content().encode('utf-8')
   print "******************************************************************************************************"

continuation = xmlfeed.get_continuation()

i+=1
while continuation != None and i < COUNT:
   print >>sys.stderr, "page:", i
   xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc", continuation = continuation)

   for entry in xmlfeed.get_entries():
      print entry['title'].encode('utf-8'), time.ctime(entry['published'])
      try:
         doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
         print doc.text_content().encode('utf-8')
      except:
         print "------------------ ERROR -------------------"
         print entry['content']

      print "******************************************************************************************************"

   continuation = xmlfeed.get_continuation()
   i+=1

Notes

If you try this script you will realize that the oldest post retrieved is from 9/29/2005. The real first post however was on 9/23/2003. Why don’t we see it? I believe it is because Google Reader uses feed information from FeedBurner, which was launched in 2004 and acquired by Google in 2007, so they probably started recording feed entries then. Incidentally Union Square Ventures was one of the original FeedBurner investors.

There is an easier way to retrieve text in the specific case of Fred Wilson’s blog and other HTML5 modern sites. HTML5 provides an <article> tag, so you can just crawl the whole site and retrieve the content within the <article> tag. You’ll need an extra step to deduplicate the content since many of the crawled pages will appear more than once. For example if you follow categories like MBA Mondays you will find articles that also appear when you follow another path.

Lessons Learned

  • We can use Google Reader to easily extract text content from blogs.
  • Google Reader has its limitations: it doesn’t cover posts before a certain data and sometimes skips posts.
  • HTML5 adds a valuable new tag for differentiating article text from the rest of the content.

See Also

  1. Voice Recognition + Content Extraction + TTS = Innovative Web Browsing
  2. Google Search NoAPI

Additional Resources

  1. Newspaper: News, full-text, and article metadata extraction in Python 3
  2. boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages
  3. Readability API
  4. HTML Content Extraction Questions on StackOverflow
  5. Google Reader Development Questions on StackOverflow

Data Science Resources

Big Data, Big Trend

Big Data is an important megatrend:

Companies such as Google, Facebook, Twitter and LinkedIn are using their vast information to discover things that are definitely not obvious, and may even challenge our common sense. Some initiatives like Recorded Future or StreamBase try to predict the future while events like a plane crash were first pointed out on Twitter. One of the funniest blogs about big data is on OKCupid which mines information about relationship matching and can discover connections between orgasms and exercise.

Sharing Data Science Blogs

Following is an OPML file with 228 data science related blogs: Data Science Blogs in OPML format.

The Data Portability Fact Sheet

Introduction

Parallego has been announced on TechCrunch after a stealth period as the latest social network that will challenge Facebook and Google Plus. Their investors include big names like Sequoia Capital, Andreessen Horowitz and Union Square Ventures, and they have top angels like Ron Conway. They really love developers, so they offer an API to show their commitment to openness.

Parallego doesn’t really exist, but announcements like this are part of startup breaking news about the web and entrepreneurship. These companies emphasize their love for developers and claim to be open because they provide APIs. The truth is that when you test their APIs you usually find a number of problems:

  1. You can read the information but cannot write or modify it.
  2. You have access to certain information but other information is unavailable.
  3. The rate of API calls is low, so you can only make a few calls and must wait a certain period of time to continue.
  4. You cannot make parallel requests in a multiprocess or multithreaded application.
  5. There is no way to quickly pay for the service and access a better service. Google API Console is a step in that direction but a lot of important Google NoAPIs are unavailable.
  6. Some OAuth2 protocol implementation does not work with the existing development libraries.
  7. The service says it welcomes new applications, but this is not the case for new UIs and mobile clients. See Twitter to Devs: Don’t Make Twitter Clients… Or Else [mashable.com]
  8. You cannot even export your own information. The time you have spent adding content to this service is lost once you leave it.
  9. There is no love for developers: the forums are filled with questions and there are no official answers. See Rate limit with billing enabled [google.com] and Graph API rate limit? [facebook.com]
  10. The company often changes its policies. The web mashup that you did seven months ago that attracted thousands of users is useless because the new API revision does not give you the data that you need for some specific features. See Should facebook pay compensation for deprecated API calls and changes [facebook.com]
  11. Old content is removed without warning.

After a while, you begin to doubt, close your eyes and rethink again about the word “Open”. It seems somewhat meaningless. If you are older you may remember that Microsoft was accused of being closed, but you may also remember that in the worst case you could reverse engineer and access all the internals yourself. You need advanced knowledge of tools like IDA Pro, OllyDbg, and WinDbg of course, but it was possible. You can’t reverse engineer the cloud, however you can scrape the information, but this is time consuming both in terms of development and running time.

And while “Open” is repeated in every announcement from high profile web companies, your brain does not register the word anymore just like you do not see any of the ads on Google because your brain made has made its own AdBlock extension.

Data Portability Classification

For all of the above reasons we think the best initiative towards transparency is adding a fact sheet to every service so we can compare them and know how “open” they really are. WikiMatrix is a good example of how comparisons could be made.

Marco Paol from DBB has been informally collecting information about some web services and has put it in a public spreadsheet on Data Portability Comparison

Please feel free to send us clarifications, suggestions, and fixes.

Resources

  1. Open Data and Linked Data [wikipedia.org]
  2. DataPortability project [wikipedia.org]
  3. Small data [smalldata.org]
  4. The open data manual [opendatamanual.org]
  5. Is It Open Data?
  6. Open Data mailing lists [okfn.org]
  7. Synaptic/Web
  8. Open Knowledge Foundation Blog
  9. The Friend of a Friend (FOAF) project
  10. theinfo.org: Community for Getting, Processing, and Visualizing Large Data Sets
  11. Plagiarism Today
  12. PeopleBrowsr’s case against Twitter heads back to state court after federal court ruling
  13. Archive Team archivists