Discovering Hidden Web Resources
Search engines and social networks are digital telescopes. Finding web resources outside their lens is extremely difficult and time consuming; it is a search craft of its own. Intuition tells us that interesting information is out there, invisible, but we can't reach it.
IMDB holds a lot of information about its users, but the site offers sharing only as a secondary feature. A Google search does not surface all the users who share their movie ratings: at the time of writing, the query site:imdb.com inurl:"user/*/ratings" returned only a few results. How can we help people find more web resources through search engines? This article shows the first 10 million results of the Distributed Scraping With Multiple Tor Circuits process. Google will index this article shortly and include these new resources, so that everyone can find them.
In the meantime, you have the honor of seeing web resources that are invisible to search engines. This page contains the first 10 million IMDB users sharing their movie ratings. We have included a script below that downloads their ratings, taking advantage of the comma-separated values export offered by IMDB.
Python Code for Exporting IMDB Ratings in Comma Separated Values
get-user-ratings.py
#!/usr/bin/python2.7
import pymongo
import urllib2

MONGODB_HOSTNAME = 'localhost'

HTML = """
<html>
<body>
{0}
</body>
</html>
"""

EXPORT_URL = "http://www.imdb.com/list/export?list_id=ratings&author_id={0}"


def main():
    # Connect to the MongoDB instance that holds the scraping results.
    conn = pymongo.Connection(MONGODB_HOSTNAME, 27017)
    db = conn.scraping
    coll = db.imdb.ratings
    # Select the entries whose last response was 200.
    items = coll.find({'last_response': 200})
    links = ""
    for item in items:
        url = item['url']
        # Format the user id as 'ur' plus the index zero-padded to seven digits.
        index = 'ur{0:07}'.format(item['index'])
        filename = 'ur{0}.csv'.format(item['index'])
        links += "<a href='{0}'>{1}</a><br>".format(url, index)
        # Download this user's ratings through the CSV export and save them.
        with open(filename, "wt") as h:
            h.write(urllib2.urlopen(EXPORT_URL.format(index)).read())
    # Print an HTML page linking to every user found.
    print HTML.format(links)


if __name__ == '__main__':
    main()
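For readers without a MongoDB instance at hand, the export step of the script above can be sketched on its own. The following is a minimal Python 3 sketch, not the original script: the function names user_id, export_url, parse_export and fetch_ratings are hypothetical, the export is assumed to be UTF-8 text with a header row, and the export URL is assumed to still respond as it did at the time of writing.

```python
import csv
import io
import urllib.request

# Export URL pattern taken from the script above.
EXPORT_URL = "http://www.imdb.com/list/export?list_id=ratings&author_id={0}"


def user_id(index):
    """Format a numeric index as 'ur' plus the index zero-padded to 7 digits."""
    return "ur{0:07}".format(index)


def export_url(index):
    """Build the CSV export URL for the user with the given numeric index."""
    return EXPORT_URL.format(user_id(index))


def parse_export(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row.

    Assumption: the first line of the export is a header row.
    """
    return list(csv.DictReader(io.StringIO(csv_text)))


def fetch_ratings(index, timeout=30):
    """Download and parse one user's ratings.

    Assumptions: the export is UTF-8 encoded and the URL is still reachable.
    """
    with urllib.request.urlopen(export_url(index), timeout=timeout) as response:
        return parse_export(response.read().decode("utf-8"))
```

Calling fetch_ratings(42) would request the export for user ur0000042. Keeping the URL construction and the parsing in separate functions lets both be checked without touching the network.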
Resources
Photo taken by gari.baldi