Challenging Google’s Search Engine

Google is the undisputed search leader, with an 88% market share in the US1. It is ahead of its competitors not only in the quality of its search results, its science-fiction-grade infrastructure, and its computer science research, but also in how quickly it applies that research to its products.

How can Google be dethroned? Sure, there are other search engines, and newcomers like Blekko and Duck Duck Go make headlines from time to time. However, when you look more closely at those other search engines, you find that they cannot seriously compete with Google.

Benchmarking Search Engines

A search for “reverse engineering” on Blekko returns a hundred thousand results, while the same search on Google returns approximately two million. If it is so difficult for Blekko to compete at the crawling level, imagine what happens in the rest of the search engine pipeline. Just looking at Google’s search quality reports tells you that Page Rank was only the catalyst for much more sophisticated algorithms.

Duck Duck Go manages to attract a geeky audience by highlighting features like putting privacy first. If we search for “reverse engineering” on Duck Duck Go, the results seem wacky: the second result is http://reverseengineeringinc.com/, a content-poor site which simply has the right domain name.

Google appears to be in a league by itself. It currently seems unlikely that they could lose significant market share due to an engineering weakness. In order to outdo Google, we must think holistically and try to guess how the web as a whole will evolve over the next ten years.

A Holistic Approach

Duck Duck Go created a two-level search engine for sites like Wikipedia or YouTube. DDG offers the DuckDuckGo Instant Answer API to incorporate the search engines of third parties. In order to take advantage of DDG and other two-tiered search engines, sites will have to improve their local search. Currently, if you search using the local site search on Stack Overflow, for example, the results are of much lower quality than those of the same query run on Google restricted to stackoverflow.com. When each site understands its own data better than Google, its internal search results will surpass Google’s. Google will no doubt continue to provide better global results, but two-tiered search would decentralize the effort to improve the algorithms. It is important to note that this solution does not need to be distributed: sites can share their local indexes and ranking algorithms with the routing search engine.
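
Purely as an illustration, here is a minimal Python sketch of such a routing layer, assuming hypothetical per-site search endpoints that return JSON results (the endpoint URLs, response fields, and merge heuristic are all invented for this example):

#!/usr/bin/python

import json
import urllib
import urllib2

# Hypothetical site-local search endpoints; a real routing engine would use
# whatever index or API each site actually exposes.
SITE_ENDPOINTS = {
    'stackoverflow.com': 'http://stackoverflow.example/search?q=',
    'wikipedia.org': 'http://wikipedia.example/search?q=',
}

def route_query(query):
    """Send the query to every site-local engine and merge the answers."""
    merged = []
    for site, endpoint in SITE_ENDPOINTS.items():
        url = endpoint + urllib.quote_plus(query)
        try:
            results = json.loads(urllib2.urlopen(url).read())
        except Exception:
            continue # Skip sites whose local search is unavailable.
        for r in results:
            # Each site ranks its own data; the router only interleaves.
            merged.append((site, r['title'], r['url'], r['local_rank']))
    # Naive merge: trust each site's local ranking equally.
    return sorted(merged, key=lambda item: item[3])

if __name__ == '__main__':
    for hit in route_query('reverse engineering')[:10]:
        print hit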

The fact that a small number of sites receive the majority of Internet traffic means that optimizing the top sites for two-tiered search would make a big difference.

Notes

  1. Search Engine Market Share

Additional Resources

  1. To Break Google’s Monopoly on Search, Make Its Index Public
  2. Can a small search engine take on Google?
  3. Google’s Weakness, AltaVista’s Strength [2002]
  4. Is the Internet driving competition or market monopolization?
  5. Media and Internet concentration in Canada

See Also

  1. Reverse Engineering and The Cloud
  2. Egont, A Web Orchestration Language
  3. Helping Search Engines to Find Content in the Invisible Web
  4. The Data Portability Fact Sheet

Enriching a List of URLs with Google Page Rank

Dealing with a large body of web resources can be daunting. You make a list of hundreds of blogs, but how do you share or recall those resources later? You must somehow organize your list. Many people do this with tags, but this is not necessarily the best option. Manual organization is also tedious, so tools for enriching data automatically come in handy. The relevance of different resources changes over time. What we originally tagged as “breakthrough” may become insignificant.

Last week I saw a friend who had recently started a new job and wanted my opinion about current and future technological trends. I wanted to give him links to the thousands of resources that I have been accumulating over the years, but organized in such a way that he would not have to view them one at a time. This triggered an avalanche of ideas about how to enrich lists of links. My first thought was to rank my list of sites about venture capital and data science using Google Page Rank. I also considered adding the number of tweets, likes, and “+1” for each site, but these are generally awarded to individual articles, not whole sites. I ended up adding the Google Page Rank with the pagerank project.
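
As a rough sketch of the enrichment step (the get_pagerank helper below is a hypothetical stand-in for whatever the pagerank project actually exposes):

#!/usr/bin/python

import csv

def get_pagerank(url):
    # Hypothetical stand-in for the lookup performed by the pagerank project;
    # replace with a real Page Rank query.
    return 0

def enrich(urls, output_path):
    """Annotate each URL with its Google Page Rank and sort by it."""
    ranked = []
    for url in urls:
        try:
            rank = get_pagerank(url)
        except Exception:
            rank = -1 # Unknown rank goes to the bottom of the list.
        ranked.append((rank, url))
    ranked.sort(reverse=True)

    with open(output_path, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(['Page Rank', 'URL'])
        for rank, url in ranked:
            writer.writerow([rank, url])

if __name__ == '__main__':
    enrich(['http://avc.com', 'http://databigbang.com'], 'ranked-urls.csv')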

The most interesting ideas to explore, though, are in another direction: how to boost items that are in the long tail. The best music may not make the Top 40, and so remains invisible. Algorithms better at recognizing value in the long tail would revolutionise the economy.

The code is available on GitHub. Two examples of the output are available on data-science-bundle and venture-capital-bundle.

See Also

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website
  2. Exporting StackOverflow users blogs to Excel
  3. Data Science Resources

Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on


Introduction

Google Analytics and AdWords are essential marketing and sales tools. They can be integrated with the ubiquitous Microsoft Excel via the Google Data API. Data Big Bang’s Nicolas Papagna has developed an Excel add-on, which can be downloaded here. This plugin enables Excel users to quickly retrieve Google Analytics data using the available Google Analytics metrics and dimensions, and the results can be sorted by the user’s criteria. One advantage of our solution is that Excel accesses the Google Analytics API directly instead of going through a Data Big Bang server. Other solutions need access to your information, which exposes your private data to third parties.

Installation and Usage

  1. Download GoogleAnalyticsToExcel.AddInSetup_1.0.20.0.exe.
  2. Install it.
  3. Run Microsoft Excel.
  4. Configure your Google credentials by clicking on “Settings” under the “Google Analytics to Excel Addin” ribbon tab.
  5. Customize your query and retrieve your Google Analytics data by clicking the “Query Google Analytics” button.

Development Notes

Data Big Bang’s research team has also developed an OData web service that can be consumed using applications such as PowerPivot, Tableau, and LINQPad. This web service doesn’t require any add-ons. However, since unfortunately neither PowerPivot nor Tableau offers a query builder for interacting with OData providers, users must know how to craft the OData URL query themselves. The most interesting part of this project was developing a Google Data Protocol to Open Data Protocol .NET class that offers an IQueryable interface to convert LINQ queries to GData. LINQ queries add a lot of expressive power beyond GData.
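
For reference, OData queries are expressed through system query options such as $filter, $orderby, and $top in the URL. Here is a hedged Python sketch of crafting such a query; the service root and field names are hypothetical, not the actual Data Big Bang endpoint:

#!/usr/bin/python

import urllib
import urllib2

# Hypothetical OData service root; the real endpoint and entity sets may differ.
SERVICE_ROOT = 'http://example.databigbang.com/odata/AnalyticsData'

def build_odata_query(filter_expr, orderby, top):
    """Compose an OData URL with standard system query options."""
    options = {
        '$filter': filter_expr,
        '$orderby': orderby,
        '$top': str(top),
        '$format': 'json',
    }
    return SERVICE_ROOT + '?' + urllib.urlencode(options)

if __name__ == '__main__':
    # Example: top 10 rows for a date range, ordered by visits.
    url = build_odata_query(
        "Date ge datetime'2011-01-01' and Date le datetime'2011-01-31'",
        'Visits desc',
        10)
    print url
    print urllib2.urlopen(url).read()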

See Also

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website
  2. Integrating Dropbox with Microsoft Outlook
  3. Exporting StackOverflow users blogs to Excel Hyperlinks

Egont Part II

(part I here)

Description

Egont is a shared space where users mash up personal information.
Its top goals are:
  • Discovering and curating new information in a personalized and dynamic way.
  • Promoting emergent behavior in a shared programming environment.
  • Facilitating serendipity.

Egont is a personalization environment where users can connect to, import, expose, and index data from their web services. They can also apply functions to build mashups around their personal interests, as in a spreadsheet. On Egont, users can combine and exchange information. For example, users can connect their Egont accounts to a variety of services, like movie rankings, and merge rankings from their social networks. If they want to find independent films, they can filter out blockbusters. When users in their social networks update their rankings, these updates are processed and the result is automatically recalculated. The same idea can be applied to streams from Twitter or blog posts. One user can apply a filter to those streams to curate information apart from mainstream trends and recommendation systems, while other users can build new filters using this user’s data. Third parties can take advantage of the data flowing through this shared environment by developing new information functions.

Egont has a simple programming language where experienced users can access other users’ variable namespaces and manage granular security settings to enable or restrict the flow of information. Less experienced users personalize their Egont experience using a simpler web interface.

Summary

Egont is composed of the following elements:
  1. A data flow engine.
  2. A data store where cell values are persisted.
  3. A web application.
  4. A simple programming language.

Data Flow Engine

The data flow engine works like a spreadsheet. Some cells may be dependent on others. Values are recalculated only when necessary. For example, one cell may contain a function that retrieves new tweets, while another cell takes those tweets and uses a second function to extract named entities like places or proper names. Users can personalize the vast flow of information from many sources, processing, aggregating, and filtering it. The data flow engine limits recalculation to the affected cells only.

The key feature of the engine is its ability to apply functions to a set of shared cells from other users. Another important feature is the handling of security settings. Users can configure which cells are shared with which users at a very granular level.
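
To make the spreadsheet analogy concrete, here is a toy sketch of dependency-driven recalculation (not Egont’s actual implementation): only the cells downstream of a change are recomputed.

#!/usr/bin/python

class Engine(object):
    """Toy data flow engine: cells hold values or functions of other cells."""

    def __init__(self):
        self.values = {}      # cell name -> cached value
        self.formulas = {}    # cell name -> (function, dependency names)
        self.dependents = {}  # cell name -> set of cells that depend on it

    def set_value(self, name, value):
        self.values[name] = value
        self._recalculate(name)

    def set_formula(self, name, func, deps):
        self.formulas[name] = (func, deps)
        for dep in deps:
            self.dependents.setdefault(dep, set()).add(name)
        self._recalculate_cell(name)

    def _recalculate_cell(self, name):
        func, deps = self.formulas[name]
        self.values[name] = func(*[self.values[d] for d in deps])
        self._recalculate(name)

    def _recalculate(self, changed):
        # Only cells downstream of the changed cell are recomputed.
        for cell in self.dependents.get(changed, set()):
            self._recalculate_cell(cell)

if __name__ == '__main__':
    e = Engine()
    e.set_value('alice.movies_ranking', [8, 9])
    e.set_value('bob.movies_ranking', [6, 7])
    e.set_formula('me.movies_average',
                  lambda a, b: [sum(pair) / 2.0 for pair in zip(a, b)],
                  ['alice.movies_ranking', 'bob.movies_ranking'])
    print e.values['me.movies_average']   # [7.0, 8.0]
    e.set_value('bob.movies_ranking', [10, 10])
    print e.values['me.movies_average']   # [9.0, 9.5]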

Web Application

The web application has two important parts. One is the editor where advanced users can use the browser to edit their Egont scripts. The other is a simpler user interface where users are able to define their sources of information and apply functions to them more easily.

Programming Language

The goal of Egont is to simplify the building of personalization and mashups, so its programming language is oriented to quickly orchestrating user information.

This is a rough example of how an advanced user could use the Egont programming language to merge friends’ movie rankings.

friends <- [egont.users.alice, egont.users.bob, me] # list of friends.
movies_ranking <- imdb.ranking("swain-4") # persist my ranking on movies_ranking from my user on IMDB.
movies_average <- average(apply(friends, 'movies_ranking')) # calculate the average of movies rankings from my specified friends. It only changes when rankings are updated
egont.feeds <- movies_average # expose the results as a feed in the web application.

Whenever any of the above users modifies a movie’s ranking, Egont recalculates that movie’s average.

With Egont, we will have a place where we can discover new resources, research our interests, and create a community capable of sifting through the ever more vast sea of data available on today’s web.

See Also

  1. Parsing S-Expressions in C# using OMeta

Resources

  1. A Brief History of Spreadsheets
  2. Kahn process networks
  3. Directed acyclic graph
  4. Advances in IC-Scheduling Theory: Scheduling Expansive and Reductive Dags and Scheduling Dags via Duality
  5. Pregel: A System for Large-Scale Graph Processing
  6. Grzegorz Malewicz’s Google Research page
  7. CIEL: a universal execution engine for distributed data-flow computing
  8. Bloom Programming Language (via ComingThoughts)

The Python POPO’s Way to Integrate PayPal Instant Payment Notification

Pompeo Massani: The Money Counter

Python PayPal IPN

PayPal is the fastest, but not the best, way to incorporate payments on your web site and reach a worldwide audience. If you are searching for a Plain Old Python Object (POPO) way to integrate it with the Python programming language, you are on your own. The Instant Payment Notification (IPN) page only includes ASP, .NET, ColdFusion, Java, Perl, and PHP samples. A web search will bring up a ton of Python code, but most of it will be for frameworks such as Django, and the rest will not be specifically about connecting Python with IPN: there will be a lot of extra code you do not need. Here is a translation of the PHP sample code into Python.

Code

The code is also available on GitHub.

#!/usr/bin/python

# PHP to Python translation from: https://cms.paypal.com/cms_content/US/en_US/files/developer/IPN_PHP_41.txt

import urllib
import cgi
import cgitb
import socket, ssl
import sys

cgitb.enable(logdir='../logs/')

form = cgi.FieldStorage()

req = 'cmd=_notify-validate'
for k in form.keys():
	v = form[k]
	value = urllib.quote(v.value.decode('string_escape')) # http://stackoverflow.com/questions/13454/python-version-of-phps-stripslashes
	req = req + '&{0}={1}'.format(k, value)

header = 'POST /cgi-bin/webscr HTTP/1.0\r\n'
header += 'Content-Type: application/x-www-form-urlencoded\r\n'
header += 'Content-Length: ' + str(len(req)) + '\r\n\r\n'

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ssl_sock = ssl.wrap_socket(s)
ssl_sock.connect(('www.sandbox.paypal.com', 443)) # Use this for sandbox testing
# ssl_sock.connect(('www.paypal.com', 443)) # Use this for production

ssl_sock.write(header + req)

data = ssl_sock.read()
VERIFIED = False
while len(data) > 0:
	if 'VERIFIED' in data:
		VERIFIED = True
		break
	elif 'INVALID' in data:
		VERIFIED = False
		break

	data = ssl_sock.read()

ssl_sock.close()

if not VERIFIED:
	print "Content-type: text/plain"
	print
	print "Not Verified"
	sys.exit(1)

fields = {	'item_name': None,
		'item_number': None,
		'payment_status': None,
		'mc_gross': None,
		'mc_currency': None,
		'txn_id': None,
		'receiver_email': None,
		'payer_email': None,
		'custom': None,
	}

for k in fields.keys():
	if k in form:
		fields[k] = form[k].value

item_name = fields['item_name']
item_number = fields['item_number']
payment_status = fields['payment_status']
payment_amount = fields['mc_gross']
payment_currency = fields['mc_currency']
txn_id = fields['txn_id']
receiver_email = fields['receiver_email']
payer_email = fields['payer_email']

# check the payment_status is Completed
# check that txn_id has not been previously processed
#  check that receiver_email is your Primary PayPal email
# check that payment_amount/payment_currency are correct
# process payment

print "Content-type: text/plain"
print
print "Verified"

Resources

  1. PayPal Developer Network
  2. GitHub projects related to PayPal written in Python

Ideas and Execution Magic Chart

Ideas vs Execution

There is an endless discussion in the startup community about the value of ideas versus the importance of execution. Here is a timeline showing Hacker News community submissions with the idea(s) keyword in the title:

I am no prophet, but I believe the future will most likely lean towards ideas, because the cost of creating and operating a web company has been dramatically reduced. Soon marketing and sales services will become more affordable, making it easier to solve the business puzzle. On the other hand, although big companies have an advantage in resources, following Joseph Schumpeter’s thinking they often prefer the acquisition route, letting the market’s natural selection run its course instead of building risky projects from scratch. Entrepreneurs therefore benefit from reduced competition in the initial phase of product development.

Magic Chart

This is an exercise: to fill in your chart you must be objective and dabble in the black art of time estimation. The idea of the magic chart is to fill in a scatter plot. The x axis shows the time you expect the idea to take to execute (you can limit it to development time at first), and the y axis shows the potential of the idea. You can easily add other dimensions, like cost, by using the size or color of the plotted points. Add a vertical asymptote at the longest time frame that is feasible for you.
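
Here is a minimal sketch of such a chart using matplotlib; the ideas and their estimates are made-up placeholders:

#!/usr/bin/python

import matplotlib.pyplot as plt

# Made-up ideas: (name, months to execute, potential 1-10, cost in $k).
ideas = [
    ('Browser add-on', 2, 4, 5),
    ('Scraping service', 6, 7, 20),
    ('Two-tiered search', 18, 9, 120),
    ('Excel integration', 3, 5, 8),
]

months = [i[1] for i in ideas]
potential = [i[2] for i in ideas]
cost = [i[3] * 10 for i in ideas] # scale cost into marker sizes

plt.scatter(months, potential, s=cost)
for name, x, y, _ in ideas:
    plt.annotate(name, (x, y))

plt.axvline(x=12, linestyle='--') # the outside time limit feasible for you
plt.xlabel('Expected execution time (months)')
plt.ylabel('Potential of the idea')
plt.title('Ideas and Execution Magic Chart')
plt.show()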

Here is my magic chart:


As you can see, it is difficult to come up with ideas which can be executed in a short time, and many of the ideas fall into uncertainty beyond a certain time horizon. If you think that having a minimum viable product is key, then you must think very hard about how to reduce your product’s execution time, and this is more an art than a science. The need to generate profit is a serious constraint: your idea may be excellent and your software may be used by millions of people, yet you may still lack a business model.

What does your ideas and execution magic chart landscape look like?

HNSearch Script

Here is the Python script for retrieving Hacker News posts with the words idea and ideas in the title. It includes a legal hack (what else?) to bypass the 1000-item limit imposed by ThriftDB’s HNSearch API.

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Done under Visual Studio 2010 using the excellent Python Tools for Visual Studio http://pytools.codeplex.com/

import urllib2
import json
from datetime import datetime
from time import mktime
import csv
import codecs
import cStringIO

class CSVUnicodeWriter: # http://docs.python.org/library/csv.html
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

def get_hackernews_articles_with_idea_in_the_title():
    endpoint = 'http://api.thriftdb.com/api.hnsearch.com/items/_search?filter[fields][title]=idea&start={0}&limit={1}&sortby=map(ms(create_ts),{2},{3},4294967295000)%20asc'

    incomplete_iso_8601_format = '%Y-%m-%dT%H:%M:%SZ'

    items = {}
    start = 0
    limit = 100
    begin_range = 0
    end_range = 0

    url = endpoint.format(start, limit, begin_range, str(int(end_range)))
    response = urllib2.urlopen(url).read()
    data = json.loads(response)

    prev_timestamp = datetime.fromtimestamp(0)

    results = data['results']

    while results:
        for e in data['results']:
            _id = e['item']['id']
            title = e['item']['title']
            points = e['item']['points']
            num_comments = e['item']['num_comments']
            timestamp = datetime.strptime(e['item']['create_ts'], incomplete_iso_8601_format)

            #if timestamp < prev_timestamp: # The results are not correctly sorted. We can't rely on this one.
            if _id in items: # If the circle is complete.
                return items
            prev_timestamp = timestamp

            items[_id] = {'id':_id, 'title':title, 'points':points, 'num_comments':num_comments, 'timestamp':timestamp}
            title_utf8 = title.encode('utf-8')
            print title_utf8, timestamp, _id, points, num_comments

        start += len(results)

        if start + limit > 1000:
            start = 0
            end_range = mktime(timestamp.timetuple())*1000

        url = endpoint.format(start, limit, begin_range, str(int(end_range))) # if not str(int(x)) then a float gives in the sci math form: '1.24267528e+12'
        response = urllib2.urlopen(url).read()
        data = json.loads(response)
        results = data['results']

    return items

if __name__ == '__main__':
    items = get_hackernews_articles_with_idea_in_the_title()

    with open('hn-articles.csv', 'wb') as f:
        hn_articles = CSVUnicodeWriter(f)

        hn_articles.writerow(['ID', 'Timestamp', 'Title', 'Points', '# Comments'])

        for k,e in items.items():
            hn_articles.writerow([str(e['id']), str(e['timestamp']), e['title'], str(e['points']), str(e['num_comments'])])

# It returns 3706 articles while the query says there are 3711... find the bug...


Resources

  1. Are Ideas Getting Harder to Find? (2016)
  2. Science as Art
  3. Thinking Skills Instruction: Concepts and Techniques (Anthology)
  4. De Bono’s Lateral Thinking
  5. TRIZ
  6. Schumpeter’s Creative Destruction: A Review of the Evidence
  7. Google Query: “ideas vs execution” OR “execution vs ideas”
  8. Google Query: site:news.ycombinator.com AND (intitle:idea OR intitle:ideas)
  9. Startup Ideas We’d Like to Fund
  10. My list of ideas, if you’re looking for inspiration by Jacques Mattheij
  11. Startup Ideas We’d Like to Fund by Paul Graham.
  12. Ideas don’t make you rich. The correct execution of ideas does excerpt from Felix Dennis book.
  13. Ideas suck by Chris Prescott.
  14. Execution Matters, Ideas Don’t by Fred Wilson.
  15. What Is Twitter’s Problem? No, It’s Not the Product
  16. 1000 results limit? (HNSearch NoAPI limits, bonus hack included in this article).
  17. Year 2038 problem
  18. How to use time > year 2038 on official Windows Python 2.5
  19. Solr FunctionQuery
  20. HackerNews Ideas Articles
  21. Execution Is An Order Of Magnitude Easier Than Opportunity

Exporting StackOverflow User Blogs to Excel

It’s simpler than you think

Do you ever need to automatically convert URLs that you have imported into your cells into hyperlinks? If you search on Google there are many solutions, but the top ones add complexity with macros or VBA.

I needed to do it quickly in the context of retrieving information from the StackOverflow API. I was searching for all the websites of users from a specific country in order to keep track of their blogs. So I ran

SELECT
   Id,
   DisplayName,
   Reputation,
   WebsiteUrl
FROM
   Users
WHERE
   Location like '%Buenos Aires%' or Location like '%Argentina%'
ORDER BY
   Reputation DESC

on http://data.stackexchange.com/stackoverflow/q/110548/ and exported the results to CSV, but when they were imported into Microsoft Excel I had to convert the URLs manually. The results are ordered by StackOverflow reputation.

I found this solution: you can just use the Microsoft Excel HYPERLINK function. Say you have a URL as plain text in cell D2; then in cell E2 add something like =HYPERLINK(D2) and this will do the trick. You can now click on all the cells!

Please don’t tell recruiters about this. Hopefully they don’t develop web mashups.

See Also

  1. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on
  2. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website

Resources

  1. Most influential github users by location
  2. Implement OData API for StackOverflow
  3. Creative Commons Stack Overflow Data Dump

Extraction of Main Text Content Using the Google Reader NoAPI

Theo van Doesburg Dadamatinée

Introduction

In this article we will see how to extract the main text content from a blog using the Google Reader NoAPI.

Extracting the main text content from a web page is an important step in the text processing pipeline. The source code of pages in HTML is usually cluttered with advertising and other text which is not related to the main content. Formally, in the context of computer science, it is impossible for a computer to distinguish between the main content and other content on the same page. That is, no algorithm can recognize it for all possible cases. Sometimes it is even difficult for humans to distinguish it. Recognition of primary content is part of the machine learning/artificial intelligence field of study.

In practice there are many ways to recognize main content. If, for example, a blog platform includes attributes which indicate where the main content is, the process will be straightforward. Similarly, if the pages on a particular site have a well-defined structure, we can also infer where the main content is by sampling a few pages. In this approach, we train the recognizer to apply patterns to additional pages. Of course, purely manual work is another option. The quickest way to build an army of human recognizers is to put the job on sites like Amazon’s Mechanical Turk or similar services such as Microworkers.

For a good compilation of resources related to this subject, see the Additional Resources section at the end of this article.

Extracting the Main Content from a Blog

If the blog platform includes information about the main text content in its tags, writing an XPath expression for each one will do the trick. Now imagine that you want to do it automatically, without depending on each blog platform or blog theme. In this case you can read the RSS feed, which generally only includes main text, and extract the text from there. However, not all blogs post the complete text in the feed. The TechCrunch feed, for example, shows the first part of the text, but you have to click to continue reading. In this case you can use the partial text from the feed to recognize the complete text in the HTML. A potential problem with reading RSS feeds is that they only contain the most recent articles. To get around this limitation, we can get a longer feed history from Google Reader. Google Reader has some gaps and misses some articles, but this issue is beyond the scope of this article.
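
As a sketch of that last idea, the following assumes we already have the page HTML and the partial text from the feed, and simply returns the smallest element whose text contains that snippet; it is an illustration, not a robust extractor:

#!/usr/bin/python

import lxml.html

def find_main_content(html, feed_snippet):
    """Return the smallest element whose text contains the feed snippet."""
    doc = lxml.html.fromstring(html)
    best = None
    for element in doc.iter():
        if not isinstance(element.tag, basestring):
            continue # Skip comments and processing instructions.
        text = element.text_content()
        if feed_snippet in text:
            if best is None or len(text) < len(best.text_content()):
                best = element # A smaller match encloses the content more tightly.
    return best

if __name__ == '__main__':
    html = '<html><body><div id="ads">Buy now!</div>' \
           '<div id="post"><p>TechCrunch-style intro paragraph...</p>' \
           '<p>Rest of the article.</p></div></body></html>'
    element = find_main_content(html, 'intro paragraph...Rest of the article')
    print element.get('id')        # post
    print element.text_content()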

Getting Blog Text from Google Reader

Since Google Reader does not have a real API we will rely on the Google Reader API lib by Mauro Asprea from Wish and BAM!. He is an active reader of this blog and a friend.

We will retrieve posts by Fred Wilson, one of the most prolific VC bloggers, who has blogged on an almost daily basis since 9/23/2003 and includes the whole post within his feed.

Python code

#!/usr/bin/python
# *-* coding: utf-8 *-*

import sys
import time
from GoogleReader import  CONST
from GoogleReader.reader import GoogleReader
import lxml.html

USERNAME = '' # Replace with your Google Reader username
PASSWORD = '' # Replace with your Google Reader password. Not included in this post :-)

gr = GoogleReader()
login_info = (USERNAME, PASSWORD)
gr.identify(*login_info)
gr.login()

gr.add_subscription(url="http://feeds.feedburner.com/avc")
xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc")

COUNT = 1000
i=0

print >>sys.stderr, "page:", i
for entry in xmlfeed.get_entries():
   print entry['title'].encode('utf-8'), time.ctime(entry['published'])
   doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
   print doc.text_content().encode('utf-8')
   print "******************************************************************************************************"

continuation = xmlfeed.get_continuation()

i+=1
while continuation != None and i < COUNT:
   print >>sys.stderr, "page:", i
   xmlfeed = gr.get_feed(url="http://feeds.feedburner.com/avc", continuation = continuation)

   for entry in xmlfeed.get_entries():
      print entry['title'].encode('utf-8'), time.ctime(entry['published'])
      try:
         doc = lxml.html.fromstring(entry['content']) # Thanks lxml.html for handling incomplete HTML documents!
         print doc.text_content().encode('utf-8')
      except:
         print "------------------ ERROR -------------------"
         print entry['content']

      print "******************************************************************************************************"

   continuation = xmlfeed.get_continuation()
   i+=1

Notes

If you try this script you will realize that the oldest post retrieved is from 9/29/2005. The real first post however was on 9/23/2003. Why don’t we see it? I believe it is because Google Reader uses feed information from FeedBurner, which was launched in 2004 and acquired by Google in 2007, so they probably started recording feed entries then. Incidentally Union Square Ventures was one of the original FeedBurner investors.

There is an easier way to retrieve text in the specific case of Fred Wilson’s blog and other modern HTML5 sites. HTML5 provides an <article> tag, so you can just crawl the whole site and retrieve the content within the <article> tag. You’ll need an extra step to deduplicate the content, since many of the crawled pages will appear more than once. For example, if you follow categories like MBA Mondays you will find articles that also appear when you follow another path.
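
A minimal sketch of that approach, assuming the page URLs have already been collected by a crawler (the URL list below is just a placeholder):

#!/usr/bin/python

import hashlib
import urllib2
import lxml.html

# Placeholder list; a real crawler would discover these URLs itself.
urls = ['http://avc.com/', 'http://avc.com/mba-mondays/']

seen = set()
for url in urls:
    doc = lxml.html.fromstring(urllib2.urlopen(url).read())
    for article in doc.xpath('//article'):
        text = article.text_content().strip().encode('utf-8')
        digest = hashlib.sha1(text).hexdigest()
        if digest in seen:
            continue # Same article reached through a different path.
        seen.add(digest)
        print text
        print '*' * 80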

Lessons Learned

  • We can use Google Reader to easily extract text content from blogs.
  • Google Reader has its limitations: it doesn’t cover posts before a certain date and sometimes skips posts.
  • HTML5 adds a valuable new tag for differentiating article text from the rest of the content.

See Also

  1. Voice Recognition + Content Extraction + TTS = Innovative Web Browsing
  2. Google Search NoAPI

Additional Resources

  1. Newspaper: News, full-text, and article metadata extraction in Python 3
  2. boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages
  3. Readability API
  4. HTML Content Extraction Questions on StackOverflow
  5. Google Reader Development Questions on StackOverflow

The Data Portability Fact Sheet

Introduction

Parallego has been announced on TechCrunch after a stealth period as the latest social network that will challenge Facebook and Google Plus. Their investors include big names like Sequoia Capital, Andreessen Horowitz and Union Square Ventures, and they have top angels like Ron Conway. They really love developers, so they offer an API to show their commitment to openness.

Parallego doesn’t really exist, but announcements like this are typical of the breaking news about the web and entrepreneurship in the startup world. These companies emphasize their love for developers and claim to be open because they provide APIs. The truth is that when you test their APIs you usually find a number of problems:

  1. You can read the information but cannot write or modify it.
  2. You have access to certain information but other information is unavailable.
  3. The rate of API calls is low, so you can only make a few calls and must wait a certain period of time to continue.
  4. You cannot make parallel requests in a multiprocess or multithreaded application.
  5. There is no way to quickly pay for the service and access a better service. Google API Console is a step in that direction but a lot of important Google NoAPIs are unavailable.
  6. Some OAuth2 protocol implementations do not work with the existing development libraries.
  7. The service says it welcomes new applications, but this is not the case for new UIs and mobile clients. See Twitter to Devs: Don’t Make Twitter Clients… Or Else [mashable.com]
  8. You cannot even export your own information. The time you have spent adding content to this service is lost once you leave it.
  9. There is no love for developers: the forums are filled with questions and there are no official answers. See Rate limit with billing enabled [google.com] and Graph API rate limit? [facebook.com]
  10. The company often changes its policies. The web mashup that you built seven months ago and that attracted thousands of users is now useless because the new API revision does not give you the data you need for some specific features. See Should facebook pay compensation for deprecated API calls and changes [facebook.com]
  11. Old content is removed without warning.

After a while, you begin to have doubts, close your eyes, and think again about the word “Open”. It starts to seem somewhat meaningless. If you are older you may remember that Microsoft was accused of being closed, but you may also remember that in the worst case you could reverse engineer its products and access all the internals yourself. You needed advanced knowledge of tools like IDA Pro, OllyDbg, and WinDbg, of course, but it was possible. You can’t reverse engineer the cloud. You can scrape the information, but this is time consuming both in terms of development and running time.

And while “Open” is repeated in every announcement from high-profile web companies, your brain no longer registers the word, just as you do not see any of the ads on Google because your brain has built its own AdBlock extension.

Data Portability Classification

For all of the above reasons we think the best initiative towards transparency is adding a fact sheet to every service so we can compare them and know how “open” they really are. WikiMatrix is a good example of how comparisons could be made.

Marco Paol from DBB has been informally collecting information about some web services and has put it in a public spreadsheet on Data Portability Comparison.

Please feel free to send us clarifications, suggestions, and fixes.

Resources

  1. Open Data and Linked Data [wikipedia.org]
  2. DataPortability project [wikipedia.org]
  3. Small data [smalldata.org]
  4. The open data manual [opendatamanual.org]
  5. Is It Open Data?
  6. Open Data mailing lists [okfn.org]
  7. Synaptic/Web
  8. Open Knowledge Foundation Blog
  9. The Friend of a Friend (FOAF) project
  10. theinfo.org: Community for Getting, Processing, and Visualizing Large Data Sets
  11. Plagiarism Today
  12. PeopleBrowsr’s case against Twitter heads back to state court after federal court ruling
  13. Archive Team archivists

Scraping vs Antiscraping

Introduction

It’s not possible to jump into the subject of scrapers without confronting antiscraping techniques. The reverse is also true: if you want to develop good antiscraping techniques, you must think like a scraper developer. Similarly, real hackers need knowledge of security technologies, while a good security system benefits from simulated attacks. This kind of “game dynamics” also applies to security algorithms. For example, one of the best known public-key encryption algorithms, RSA, was invented by Ron Rivest, Adi Shamir, and Leonard Adleman. Rivest and Shamir invented new algorithms and Adleman was in charge of breaking them, until they eventually came up with RSA1.

Antiscraping Measures and How to Pass Them

A preliminary chart:

  • Antiscraping: The site only enables crawling by known search engine bots.
    Scraping: The scraper can access the search engine cache.
  • Antiscraping: The site doesn’t allow the same IP to access a lot of pages in a short period of time.
    Scraping: Use Tor, a set of proxies, or a crawling service like 80legs (see the sketch after this chart).
  • Antiscraping: The site shows a captcha if it is crawled too aggressively.
    Scraping: Use anti-captcha techniques or services like Mechanical Turk, where real people provide the answers. Another alternative is to listen to the audio captcha and use noise-tolerant voice recognition.
  • Antiscraping: The site uses javascript.
    Scraping: Use a javascript-enabled crawler.
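
Here is a minimal sketch of the proxy-rotation idea with urllib2, assuming you already have a list of HTTP proxies (the addresses and the target URL below are placeholders):

#!/usr/bin/python

import itertools
import urllib2

# Placeholder proxy addresses; replace with your own proxies or local Tor instances.
PROXIES = ['http://127.0.0.1:8118', 'http://10.0.0.2:3128', 'http://10.0.0.3:3128']

proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    """Fetch a URL, routing each request through the next proxy in the pool."""
    proxy = next(proxy_cycle)
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    return opener.open(url, timeout=30).read()

if __name__ == '__main__':
    for page in range(1, 4):
        html = fetch('http://example.com/listing?page={0}'.format(page))
        print len(html)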

Many antiscraping measures are annoying for visitors. For example, if you’re a “search engine junkie”, you’ll quickly find that Google shows you a captcha, thinking you are a bot.

Digression

I believe the web should follow an MVC (Model View Controller) type of pattern, where you can access the data (the model) independently of how you interact with it. This would enable stronger connections between different sites. Linked Data is one such initiative, but there are others. Data Portability and APIs are a step towards this pattern, but when you use the APIs of large sites you realize that they have put a lot of limits in place. Starting a whole business based on third-party APIs is very risky; you only have to look at the past to see many changes in API features and policies. Facebook, Google, and Twitter are good examples. API providers are afraid of losing control of their sites and the profits they generate. We need new business models which can get around this problem and benefit both API providers and consumers, and not only models based on advertising. One common approach is to charge for the use of the API. There are other models, like the Guardian’s, which distributes its ads via its API. APIs carrying advertising are a promising concept. We hope that more creative people will come up with new models for a better MVC web.

See Also

  1. Running Your Own Anonymous Rotating Proxies
  2. Distributed Scraping With Multiple Tor Circuits

References

  1. Leonard Adleman Interview

Further reading

  1. Captcha Recognition
  2. OCR Research Team
  3. Data Scraping with YQL and jQuery
  4. API Conference
  5. Google Calls Out Facebook’s Data Hypocrisy, Blocks Gmail Import
  6. Google Search NoAPI
  7. Kayak Search API is no longer supported
  8. The Guardian Open Platform
  9. Twitter Slashes API Rate Limits In Half Across The Board To Deal With Capacity Issues
  10. Facebook, you are doing it wrong
  11. Cubeduel Goes Viral Too Quickly, Stumbles Over LinkedIn API Limits
  12. Keyword Exchange Market
  13. A Union for Mechanical Turk Workers?
  14. The Long Tail Of Business Models
  15. Scraping, cleaning, and selling big data
  16. Detecting ‘stealth’ web-crawlers

Photo: Glykais Gyula fencing against Oreste Puliti. [Source]