Multiple Circuit Tor Solution
When you rapidly fetch different web pages from a single IP address, you risk getting blocked in the middle of the scraping. Some sites completely ban scrapers, while others enforce a rate limit policy. For example, if you automate Google searches, Google will require you to solve captchas: it cannot tell the difference between many people sharing the same IP and a single search junkie. It used to be costly to get enough IPs to build a good scraping infrastructure. Now there are alternatives: cheap rotating proxies and Tor. Other options include specialized crawling and scraping services like 80legs, or even running Tor on AWS EC2 instances. The advantage of running Tor is its widespread network coverage, and it is free of charge. Unfortunately, Tor does not allow you to control bandwidth and latency.
All navigation performed during a session on Tor is associated with the same exit node and its IP addresses. To renew these IP addresses you must restart Tor, send it a NEWNYM signal, or, as in our case study, run multiple Tor instances at the same time, assigning a different port to each one. Many SOCKS proxies will then be ready for use. It is possible for more than one instance to share the same circuit, but that is beyond the scope of this article.
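For completeness, this is roughly what sending the NEWNYM signal looks like. It is a minimal sketch assuming a Tor instance started like the launcher script below, with control port authentication disabled and the control port at 8118; adjust the port to your configuration.

import socket

def renew_identity(control_port=8118):
    # Connect to Tor's control port and request fresh circuits.
    s = socket.create_connection(('127.0.0.1', control_port))
    s.sendall('AUTHENTICATE ""\r\n')   # blank password, as configured in the launcher script below
    if not s.recv(128).startswith('250'):
        raise RuntimeError('control port authentication failed')
    s.sendall('SIGNAL NEWNYM\r\n')     # ask Tor for a new identity
    if not s.recv(128).startswith('250'):
        raise RuntimeError('NEWNYM signal rejected')
    s.close()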
IMDB: A Case Study
If you like movies, the Internet Movie Database is omnipresent in your daily life. IMDB users have always been able to share their movie ratings and lists. Recently, however, the site turned previously shared public movie ratings private by default. Useful movie ratings disappeared from the Internet with this change, and most of those that were manually set back to public are not indexed by search engines. All links that previously pointed to user ratings are now broken, since the URLs have changed. How can you find all the public ratings available on IMDB?
If you follow IMDB’s scraping policy it will take years, since the site contains tens of millions of user pages. Distributed scraping is the best way to solve this issue and quickly discover which users are sharing their ratings. Our method just retrieves the HTTP response code of each user’s ratings page to find out whether the ratings are public.
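The check itself is just a status code lookup. Here is a minimal single-threaded sketch (Python 2, like the full script below; the mapping of 200 to public and 404 to private or nonexistent is the one the script’s MongoDB queries rely on):

import urllib2

def ratings_status(user_index):
    url = 'http://www.imdb.com/user/ur{0}/ratings'.format(user_index)
    try:
        return urllib2.urlopen(url).getcode()  # 200: the user's ratings are public
    except urllib2.HTTPError, e:
        return e.code                          # e.g. 404: private or nonexistent user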
Our code sample has three elements:
- Multiple Tor instances listening on different ports. The result is many SOCKS proxies, each available for use with a different Tor circuit.
- A Python script that launches multiple workers in different threads. Each worker uses a different SOCKS port.
- MongoDB to persist the state of the scraping if the process fails or if you want to stop the process and continue later (see the sample document after this list).
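For reference, the Croupier class below seeds one small MongoDB document per candidate user index, and workers update last_response in place. The persisted state looks like this (the index value is illustrative):

{ 'index': 1287402, 'url': 'http://www.imdb.com/user/ur1287402/ratings', 'last_response': 0 }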
Shell Script and Source Code
Prerequisites
To run the code below you need Tor, MongoDB, and Python 2 with the pymongo and SocksiPy modules installed.
Multiple Tor Launcher
You must run the following script before running the Python script. To adjust the number of Tor instances, just change the range of the loop.
#!/bin/bash

base_socks_port=9050
base_control_port=8118

# Create data directory if it doesn't exist
if [ ! -d "data" ]; then
	mkdir "data"
fi

#for i in {0..10}
for i in {0..80}
do
	socks_port=$((base_socks_port+i))
	control_port=$((base_control_port+i))
	if [ ! -d "data/tor$i" ]; then
		echo "Creating directory data/tor$i"
		mkdir "data/tor$i"
	fi
	# Take into account that authentication for the control port is disabled. Must be used in secure and controlled environments
	echo "Running: tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword \"\" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i"
	tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword "" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i
done
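A quick way to confirm that an instance is up (assuming curl is installed) is to fetch the Tor check page through one of the SOCKS ports:

# Instance i listens on port 9050+i; this checks the third instance.
curl --silent --socks4a 127.0.0.1:9052 https://check.torproject.org/ | grep -i congratulations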
Python Script
The script below stores its results in MongoDB, in the “scraping” database under the “imdb.ratings” collection. To adjust the number of simultaneous workers, change the “Discovery.NWorkers” variable. Note that the number of workers must be less than or equal to the number of Tor instances.
#!/usr/bin/python

import httplib
import socks
import urllib2
from Queue import Queue
from threading import Thread, Condition, Lock
from threading import active_count as threading_active_count

import time
from pymongo import Connection
import pymongo

url_format = 'http://www.imdb.com/user/ur{0}/ratings'

http_codes_counter = {}

MONGODB_HOSTNAME = '192.168.0.118'

"""
https://gist.github.com/869791

SocksiPy + urllib handler

version: 0.2
author: e

This module provides a Handler which you can use with urllib2 to allow it to tunnel your connection through a socks.sockssocket socket, without monkey patching the original socket...
"""

class SocksiPyConnection(httplib.HTTPConnection):
    def __init__(self, proxytype, proxyaddr, proxyport=None, rdns=True, username=None, password=None, *args, **kwargs):
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        httplib.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))

class SocksiPyHandler(urllib2.HTTPHandler):
    def __init__(self, *args, **kwargs):
        self.args = args
        self.kw = kwargs
        urllib2.HTTPHandler.__init__(self)

    def http_open(self, req):
        def build(host, port=None, strict=None, timeout=0):
            conn = SocksiPyConnection(*self.args, host=host, port=port, strict=strict, timeout=timeout, **self.kw)
            return conn
        return self.do_open(build, req)

class Monitor(Thread):
    def __init__(self, queue, discovery):
        Thread.__init__(self)
        self.queue = queue
        self.discovery = discovery
        self.finish_signal = False

    def finish(self):
        self.finish_signal = True

    def run(self):
        while not self.finish_signal:
            time.sleep(5)
            print "Elements in Queue:", self.queue.qsize(), "Active Threads:", threading_active_count(), "Exceptions Counter:", self.discovery.exception_counter

class Worker(Thread):
    def __init__(self, queue, discovery, socks_proxy_port):
        Thread.__init__(self)
        self.queue = queue
        self.discovery = discovery
        self.socks_proxy_port = socks_proxy_port
        self.opener = urllib2.build_opener(SocksiPyHandler(socks.PROXY_TYPE_SOCKS4, 'localhost', self.socks_proxy_port))
        self.conn = Connection(MONGODB_HOSTNAME, 27017)
        self.db = self.conn.scraping
        self.coll = self.db.imdb.ratings

    def get_url(self, url):
        try:
            #h = urllib2.urlopen(url)
            h = self.opener.open(url)
            return h.getcode()
        except urllib2.URLError, e:
            return e.code

    def run(self):
        while True:
            try:
                index = self.queue.get()
                if index == None:
                    self.queue.put(None) # Notify the next worker
                    break
                url = url_format.format(index)
                code = self.get_url(url)
                self.coll.update({'index': index}, {'$set': {'last_response': code}})
                self.discovery.lock.acquire()
                self.discovery.records_to_process -= 1
                if self.discovery.records_to_process == 0:
                    self.discovery.lock.notify()
                self.discovery.lock.release()
            except (socks.Socks4Error, httplib.BadStatusLine), e:
                # TypeError: 'Socks4Error' object is not callable
                print e
                self.discovery.exception_counter_lock.acquire()
                self.discovery.exception_counter += 1
                self.discovery.exception_counter_lock.release()
                pass # leave this element for the next cycle
                time.sleep(1.5)

class Croupier(Thread):
    Base = 0
    Top = 25000000
    #Top = 1000

    def __init__(self, queue, discovery):
        Thread.__init__(self)
        self.conn = Connection(MONGODB_HOSTNAME, 27017)
        self.db = self.conn.scraping
        self.coll = self.db.imdb.ratings
        self.finish_signal = False
        self.queue = queue
        self.discovery = discovery
        self.discovery.records_to_process = 0

    def run(self):
        # Look if imdb collection is empty. Only if its empty we create all the items
        c = self.coll.count()
        if c == 0:
            print "Inserting items"
            self.coll.ensure_index([('index', pymongo.ASCENDING), ('last_response', pymongo.ASCENDING)])
            for i in xrange(Croupier.Base, Croupier.Top):
                self.coll.insert({'index': i, 'url': url_format.format(i), 'last_response': 0})
        else:
            print "Using #", c, " persisted items"
        while True:
            #items = self.coll.find({'last_response': {'$ne': 200}})
            items = self.coll.find({'$and': [{'last_response': {'$ne': 200}}, {'last_response': {'$ne': 404}}]}, timeout=False)
            self.discovery.records_to_process = items.count()
            if self.discovery.records_to_process == 0:
                break
            for item in items:
                self.queue.put(item['index'])
            # Wait until the last item is updated on the db
            self.discovery.lock.acquire()
            while self.discovery.records_to_process != 0:
                self.discovery.lock.wait()
            self.discovery.lock.release()
            # time.sleep(5)
        # Send a 'signal' to workers to finish
        self.queue.put(None)

    def finish(self):
        self.finish_signal = True

class Discovery:
    NWorkers = 71
    SocksProxyBasePort = 9050
    Contention = 10000

    def __init__(self):
        self.queue = Queue(Discovery.Contention)
        self.workers = []
        self.lock = Condition()
        self.exception_counter_lock = Lock()
        self.records_to_process = 0
        self.exception_counter = 0

    def start(self):
        croupier = Croupier(self.queue, self)
        croupier.start()

        for i in range(Discovery.NWorkers):
            worker = Worker(self.queue, self, Discovery.SocksProxyBasePort + i)
            self.workers.append(worker)

        for w in self.workers:
            w.start()

        monitor = Monitor(self.queue, self)
        monitor.start()

        for w in self.workers:
            w.join()

        croupier.join()

        print "Queue finished with:", self.queue.qsize(), "elements"

        monitor.finish()

def main():
    discovery = Discovery()
    discovery.start()

if __name__ == '__main__':
    main()

#
# MISC NOTES
#
# - How many IMDB ratings pages are currently indexed by Google? query: inurl:www.imdb.com/user/*/ratings
#
# - [pymongo] cursor id '239432858681488351' not valid at server
#   Options: http://groups.google.com/group/mongodb-user/browse_thread/thread/4ed6e3d77fb1c2cf?pli=1
#   That error generally means that the cursor timed out on the server -
#   this could be the case if you are performing a long running operation
#   while iterating over the cursor. The best bet is probably to turn off
#   the timeout by passing "timeout=False" in your call to find:
#
Once the script finishes, you can list the users with public ratings using the following MongoDB query: db.imdb.ratings.find({'last_response': 200})
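If you prefer a file, something like mongoexport can dump the same result set. The host, database, and collection names here match the script’s defaults; adjust them to your setup:

mongoexport --host 192.168.0.118 -d scraping -c imdb.ratings -q '{"last_response": 200}' -o public_ratings.json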
Try exporting the movie ratings as well. This is the easiest part, because each public ratings page offers a comma-separated-values export, so you do not need XPath queries to extract the data.
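Once such an export is downloaded, plain CSV parsing is enough. A minimal sketch (the file name is hypothetical and the column set depends on IMDB’s export format):

import csv

# Iterate over the exported ratings file, one dictionary per rated movie.
with open('ratings.csv', 'rb') as f:
    for row in csv.DictReader(f):
        print row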
Additional observations
- We are not using MongoDB just because it is fancy, but because it is very practical for quickly prototyping and persisting data along the way. The well-known “global lock” limitation of MongoDB (shared by many other databases) does not significantly affect its ability to store data efficiently.
- We use SocksiPy to allow us to use different proxies at the same time.
- If you are serious about using Tor to build a distributed infrastructure you might consider running Tor proxies on AWS EC2 instances as needed.
- Do not forget to run Tor instances in a secure environment since the control port is open to everyone without authentication.
- Our solution scales easily: to increase throughput, launch more Tor instances and raise Discovery.NWorkers accordingly.
- If you get many 503 return codes, try balancing the number of proxies against the request rate and delaying each worker’s activity; a backoff sketch follows this list.
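A minimal sketch of that kind of delay, assuming a hypothetical helper get_url_code() that returns the HTTP status for a URL (the Worker.get_url method above plays this role):

import time

def polite_get(url, base_delay=1.5, max_retries=5):
    delay = base_delay
    for _ in range(max_retries):
        code = get_url_code(url)  # hypothetical helper, e.g. Worker.get_url above
        if code != 503:
            return code
        time.sleep(delay)  # back off, doubling the wait on every 503
        delay *= 2
    return 503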
See Also
- Running Your Own Anonymous Rotating Proxies
- Web Scraping Ajax and Javascript Sites
- Luminati Distributed Web Crawling
Thanks for this, it helped me get Python connecting with Tor, which I was finding difficult :)
Thanks a lot for your explanation of running multiple Tor instances across multiple threads. However, I am having a hard time forking your code since I am using MySQL instead of MongoDB. Could you please talk about how to use multiple Tor instances with MySQL? Thanks!
Hi Bin, what issues are you finding in converting from MongoDB to MySQL?
Hi Sebastian, I have already figured that out; you have to rewrite every command that updates the database from MongoDB syntax to MySQL. By the way, your code sometimes cannot finish after I changed the Croupier to run only once. I am pretty sure there is only one element left in the queue, which is supposed to be None, but it keeps printing “Elements in Queue: 0 Active Threads: xx Exceptions Counter: 1”. It seems the monitor cannot stop. Has that happened to you?
Do we really need to leave the control port password blank? I don’t see the control port being used in the Python script anywhere.
I am trying to connect to .onion sites through Python. I can connect to Tor successfully (I got to check.torproject.org and saw the Congratulations text), but when I try .onion sites my program cannot connect and says “name or service is not known”. Do you have any idea where I went wrong?
That sounds like Python is doing the DNS resolution locally instead of through Tor. Plain SOCKS4 only accepts IP addresses, so the hostname is resolved on your machine, and a .onion name can only be resolved inside the Tor network. SOCKS4a and SOCKS5 let the proxy resolve the hostname instead. Change socks.PROXY_TYPE_SOCKS4 to socks.PROXY_TYPE_SOCKS5 (with remote DNS enabled) and that should fix the problem.
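In this code that means changing the opener built in Worker.__init__, roughly like this (rdns=True is SocksiPy’s flag for resolving hostnames at the proxy; it is the default, shown here for clarity):

self.opener = urllib2.build_opener(SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, 'localhost', self.socks_proxy_port, rdns=True))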
As you mentioned, I’ve been trying to automate Google searches, but when I try to connect to Google through Tor their server detects that I’m using Tor and rejects the request. Do you know any way to bypass such a scheme?
Google is extremely good at preventing scraping. You’re getting caught by Google’s rate limiters – remember there are lots of other people using these same exit nodes. As far as I know there’s no way around this – you’ll likely need to find a different proxy service entirely.
For automated searches please look at our article: http://databigbang1.wpenginepowered.com/google-search-no-api/ and consider the Google Custom Search API: https://news.ycombinator.com/item?id=2712386
I have some suggestions; it would be interesting to receive contributions on this subject.
This list is from a draft of an unpublished article on this subject:
USING AIRPLANE MODE ON THE MOBILE
– http://diegobasch.com/some-fresh-twitter-stats-as-of-july-2012
– http://superuser.com/questions/28409/how-do-i-force-my-iphone-to-obtain-a-new-ip-address
USING MULTIPLE WIRELESS NETWORKS WITH DISTRIBUTED LONG RANGE ANTENNAS
– http://neosmart.net/blog/2006/multiple-wireless-networks-with-one-wi-fi-card/
– http://www0.cs.ucl.ac.uk/staff/a.bittau/finalreport3.pdf
SECURITY BUGS
– http://www.ihteam.net/hacking-news/using-facebook-as-a-proxy/
USING GOOGLE APP ENGINE (Thanks https://twitter.com/brutuscat)
– http://stackoverflow.com/questions/529523/web-scraping-with-google-app-engine
PUTTING AN AGENT ON FRIENDS’ MACHINES/MOBILES
Hi Sebastian,
Why can’t I run multiple Tor instances? There is a warning when I run the shell script: “Could not bind to 127.0.0.1:9051: Address already in use.” and “Failed to parse/validate config”.
It is possible that you already have a process listening on that port, or that a previous Tor instance was still shutting down when you restarted the script, so the port had not been freed yet.
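A quick way to see what is holding the port (assuming lsof is available):

lsof -i TCP:9051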
Thanks for sharing it! It helped me a lot with my scraping plans! Just a minor typo in your description: the results are saved under the “imdb.ratings” collection not the “scraping.ratings” one.
Thanks, it was updated.
TIP: highlight the movies from your voting history when you navigate IMDB; see this extension: https://chrome.google.com/webstore/detail/my-imdb/ngicopfkgbodejbbfalbmobdpjebhhmb
Hello, thanks for the code, it is working very well.
However, all my instances are using the same IP address.
Do you have any idea how to change that?
Actually, it seems the opener is using my local connection instead of tor instances.
Anybody facing similar issue?
OK, the issue comes from the fact that I’m using mechanize instead of urllib2:
self.browser = mechanize.Browser()
self.opener = mechanize.build_opener(SocksiPyHandler(socks.PROXY_TYPE_SOCKS4, 'localhost', self.socks_proxy_port))
It seems mechanize is not able to connect through Tor with the same command (it connects through my local IP instead).
Does anyone have any idea regarding the right command for using mechanize?
Does the other article about implementing your own rotating proxy help?
Hello,
Because of JavaScript, I’m thinking of starting a browser instance in each thread, each one using a specific Tor circuit, and then controlling each browser’s behavior from its Python thread.
However, I’m still not able to find out how to force Firefox to use a specific Tor circuit…
Please help!
Why not just enable .exit notation? Then you can just get a list of exit nodes and randomly choose one each time you start a new session. No need for running multiple nodes.
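For readers unfamiliar with it: .exit notation appends an exit node’s nickname to the hostname, and older Tor versions honor it when AllowDotExit is enabled in torrc (the feature is deprecated in modern Tor; the node nickname below is hypothetical):

# torrc: AllowDotExit 1
curl --socks4a 127.0.0.1:9050 http://www.imdb.com.somenickname.exit/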
Excellent contribution.