January 3, 2012

Helping Search Engines to Find Content in the Invisible Web

Discovering Hidden Web Resources

Search engines and social networks are digital telescopes. It is extremely difficult and time consuming to find web resources outside of their lens. It’s a search craft. Our intuition knows that there are interesting invisible information but we can’t touch it.

IMDB contains a lot of information about users but the site only offers sharing as a collateral feature. If we search on Google we can’t find all the users sharing their movie rankings. At the time of writing of this article the query: site:imdb.com inurl:”user/*/ratings” was returning a few results on Google. How we can help people, through search engines, to find more web resources? This article shows the first 10 million results of the Distributed Scraping With Multiple Tor Circuits process. In a short time Google will index this article and include these new resources so everyone can find them.

In the meantime you have the great honor to see web resources that are invisible for search engines. These page contains the first 10 million of IMDB users sharing their movie’s ratings. We have included a script below to get their ratings taking advantage of the comma separated value export offered by IMDB.

Python Code for Exporting IMDB Ratings in Comma Separated Values

get-user-ratings.py

#!/usr/bin/python2.7

import pymongo
import urllib2

MONGODB_HOSTNAME = 'localhost'

HTML = """
<html>
<body>
{0}
</body>
</html>
"""

EXPORT_URL = "http://www.imdb.com/list/export?list_id=ratings&author_id={0}"

def main():
   conn = pymongo.Connection(MONGODB_HOSTNAME, 27017)
   db = conn.scraping
   coll = db.imdb.ratings

   items = coll.find({'last_response':200})

   links = ""

   i = 0
   for item in items:
      url = item['url']
      index = 'ur{0:07}'.format(item['index'])
      filename = 'ur{0}.csv'.format(item['index'])
      links += "<a href='{0}'>{1}</a><br>".format(url, index)

     with open(filename, "wt") as h:
        h.write(urllib2.urlopen(EXPORT_URL.format(index)).read())

   print HTML.format(links)

if __name__ == '__main__':
   main()

Resources

  1. Discovering URLs through User Feedback
  2. Invisible Web
  3. Deep Web Research 2012

Photo taken by gari.baldi