Rotating Proxies with HAProxy
Most web browsers and scrapers can only be configured to use one proxy per protocol. You can get around this limitation by running different instances of browsers and scrapers. Google Chrome and Firefox allow multiple profiles. However, running hundreds of browser instances is unwieldy.
A better option is to set up your own proxy that rotates among a set of Tor proxies. The Tor application implements a SOCKS proxy. Start multiple Tor instances on one or more machines and networks, then configure and run an HTTP load balancer to expose a single point of connection instead of adding the rotation logic to the client application. In the Distributed Scraping With Multiple Tor Circuits article we learned how to set up multiple Tor SOCKS proxies for web scraping and crawling. However, our sample code launched multiple threads, each of which used a different proxy. In this example we use the HAProxy load balancer with a round-robin strategy to rotate our proxies.
When you are crawling and scraping sites that rely on JavaScript, using a real browser with a high-performance JavaScript engine like V8 may be the best approach. Just configuring our rotating proxy in the browser does the trick. Another option is HtmlUnit, but a browser built on the V8 JavaScript engine parses web pages and runs JavaScript more quickly. If you are using a browser, you must be particularly careful to keep the scraped site from correlating your multiple requests. Try disabling cookies, local storage, and image loading, leaving only JavaScript enabled, and cache as many requests as possible. If you need to support cookies, you have to run different browsers with different profiles.
Setup and Configuration
Prerequisites
HAProxy Configuration File
rotating-tor-proxies.cfg
global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend rotatingproxies
    bind *:3128
    default_backend tors
    option http_proxy

backend tors
    option http_proxy
    server tor1 localhost:3129
    server tor2 localhost:3130
    server tor3 localhost:3131
    server tor4 localhost:3132
    server tor5 localhost:3133
    server tor6 localhost:3134
    server tor7 localhost:3135
    server tor8 localhost:3136
    server tor9 localhost:3137
    server tor10 localhost:3138
    balance roundrobin
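The backend needs one server line per DeleGate port, which becomes tedious to maintain by hand as the pool grows. The following is a minimal sketch of generating those lines; the function name gen_server_lines and its two parameters are hypothetical, and the output is meant to be pasted into (or appended to) the backend section of rotating-tor-proxies.cfg.

```shell
#!/bin/bash
# Print one HAProxy "server" line per port, named tor1, tor2, ...
# $1 = first HTTP port, $2 = number of proxies (both hypothetical parameters).
gen_server_lines() {
    local base_port=$1 count=$2 i
    for ((i = 0; i < count; i++)); do
        printf '    server tor%d localhost:%d\n' "$((i + 1))" "$((base_port + i))"
    done
}

# The ten DeleGate ports used in this article (3129 to 3138).
gen_server_lines 3129 10
```

Each generated line gets its own server name, which keeps the entries distinguishable in the HAProxy logs and statistics.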
Running
Run the following script, which launches several instances of Tor, then runs one instance of delegated per Tor instance, and finally runs HAProxy to rotate the proxy servers. We have to use DeleGate because HAProxy does not support SOCKS.
#!/bin/bash
base_socks_port=9050
base_http_port=3129 # leave 3128 for HAProxy
base_control_port=8118

# Create data directory if it doesn't exist
if [ ! -d "data" ]; then
    mkdir "data"
fi

for i in {0..9}
do
    socks_port=$((base_socks_port+i))
    control_port=$((base_control_port+i))
    http_port=$((base_http_port+i))
    if [ ! -d "data/tor$i" ]; then
        echo "Creating directory data/tor$i"
        mkdir "data/tor$i"
    fi
    # Take into account that authentication for the control port is disabled. Must be used in secure and controlled environments
    echo "Running: tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword \"\" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i"
    tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword "" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i

    echo "Running: ./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port"
    ./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port
done

haproxy -f rotating-tor-proxies.cfg
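Once everything is running, one way to check the rotation is to send several requests through the HAProxy port (3128 in the configuration above) and count how many distinct exit IPs come back. The following is a minimal sketch, assuming curl is installed. The IP echo endpoint http://ipinfo.io/ip is an assumption, and the function names are hypothetical. Each curl invocation opens a new connection, so keep-alive does not pin the requests to one backend.

```shell
#!/bin/bash
# Fetch the public IP reported by an echo service through an HTTP proxy.
# The endpoint is an assumption; any URL that returns the caller's IP works.
fetch_ip_via_proxy() {
    local proxy=$1
    curl --silent --max-time 30 --proxy "$proxy" http://ipinfo.io/ip
}

# Count distinct non-empty lines read from standard input.
count_distinct() {
    grep -v '^[[:space:]]*$' | sort -u | wc -l | tr -d ' '
}

# Issue N requests through the rotating proxy and print the number of
# distinct IPs seen. Defaults: proxy localhost:3128, 10 requests.
check_rotation() {
    local proxy=${1:-localhost:3128} requests=${2:-10} i
    for ((i = 0; i < requests; i++)); do
        fetch_ip_via_proxy "$proxy"
        echo    # make sure every response ends with a newline
    done | count_distinct
}

# Usage (requires the Tor, DeleGate, and HAProxy processes to be running):
#   check_rotation localhost:3128 10
```

A result close to the number of Tor instances suggests the round-robin is working. A result of 1 suggests that all requests reach the same backend or that the Tor instances share an exit node.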
Just the kind of information I was looking for
Following your idea I implemented this using Monit (http://mmonit.com/monit/)
See the Gist here https://gist.github.com/2519840
Thanks Mauro!
Many thanks.
“HTTP_X_FORWARDED_FOR” is added to the request header. Do you know which program added it, and how to remove it?
“HTTP_X_FORWARDED_FOR” is added by your proxy, and it may contain your real IP. Personally, I find nothing useful about transparent proxies; they just add an obstruction and latency to your internet connection. You should use an anonymous or elite proxy to conceal your IP.
Hey Y’all,
Great article, and really helpful for some scraping needs I have. However, I can’t seem to get HAProxy to round-robin, or at least not rapidly. I’m using a HAProxy/Polipo/Tor combination, and everything is running and communicating correctly. I’m testing my setup by issuing HTTP requests (C# HttpWebRequest) through HAProxy to a page I created on a development server that does nothing but show the current requesting IP address. My HAProxy configuration is the same as yours above. If I run my test console application, and make 10 requests in a loop, I get the same IP address every time. If I stop/restart my console application, the IP address is different. So it appears that HAProxy IS doing round-robin correctly, but not within a loop in the executing console app. Is there something I’m not understanding about how HAProxy decides when to move to the next server? Any thoughts you have would be greatly appreciated.
By the way, in case any of your readers need it, I’m creating multiple Polipo/Tor instances programmatically in C# like this:
int baseSocksPort = 9050;
int baseHttpPort = 3129;
for (int i = 0; i < 10; i++)
{
    // Give each instance its own pair of ports so the instances do not collide.
    int socksPort = baseSocksPort + 2 * i;
    int controlPort = socksPort + 1;
    Process newTor = new Process();
    StringBuilder torArgs = new StringBuilder();
    torArgs.Append(@"-f C:\Dev\RequestProxying\torrc");
    torArgs.Append(" --PidFile tor" + i.ToString() + ".pid");
    torArgs.Append(" --SocksPort " + socksPort.ToString());
    torArgs.Append(" --ControlPort " + controlPort.ToString());
    torArgs.Append(@" --DataDirectory C:\Dev\RequestProxying\data\tor" + i.ToString());
    newTor.StartInfo = new ProcessStartInfo(@"C:\Program Files (x86)\VidaliaBundle\Tor\tor.exe", torArgs.ToString());
    newTor.StartInfo.CreateNoWindow = true;
    newTor.StartInfo.RedirectStandardOutput = true;
    newTor.StartInfo.UseShellExecute = false;
    newTor.Start();
}
It's a bit easier to escape things and add arguments this way than via batch file.
Thanks,
Landon
Hi Landon, as a preliminary check, can you look at your Polipo/Tor logs to see if HAProxy is sending requests to the same Polipo/Tor pair or if Tor is using the same exit node on different Polipo/Tor pairs?
Sebastian,
Thanks for the quick reply. It took me a while, but I finally realized that the reason I was getting the same IP when proxying in a loop was because of the default keep alive settings. Obviously I was calling HAProxy very quickly in that loop, so keep alive just used the same connection. I added “option httpclose” to my HAProxy config file, and now I get a different exit node IP for (almost) every request in the loop! Drove me about crazy, but it’s finally explained!
Thanks again for suggesting this approach in the article — I think it’s going to help me quite a bit. Cheers, Landon
everything in blog.databingbang.com is great
Sorry for the comments above.
I am trying to redo what you have explained in your post. However, to be safe, I am doing all these tests on AWS EC2, which runs a 64-bit Ubuntu OS. As far as I know, I cannot go through your process because DeleGate is not available on Ubuntu.
Someone agreed with me on this point:
http://ajitabhpandey.info/2011/03/delegate-a-multi-platform-multi-purpose-proxy-server/
I prefer this round-robin method to the one that you used scraping IMDB, mostly because this is easier.
Do you think you could come up with a substitute for DeleGate so that your fans using Ubuntu could benefit?
Thanks a lot and your posts are greatly helpful.
Bin
I used DeleGate on Ubuntu. Try to compile it instead of retrieving it from a repository.
Hi Sebastian,
Any reason you didn’t simply load-balance the SOCKS connections between Delegate and your Tor nodes instead of load-balancing your HTTP connections between your client and Delegate? In other words, instead of having [client => HAProxy => Delegate => Tor], you would have [client => Delegate => HAProxy => Tor]. That would be only one instance of Delegate, instead of one per Tor node. Caching (if used) would also be shared among all Tor nodes.
We just use Delegate to translate from an HTTP proxy to a SOCKS one since HAProxy doesn’t support SOCKS proxies. A Tor instance exposes a SOCKS proxy.
We can combine your idea with this implementation by adding another Delegate instance between the client and HAProxy.
Thank you for your answer Sebastian.
However, in my understanding, HAProxy is totally able to load-balance connections to SOCKS servers like Tor, as it does any other TCP connection. This is actually what I currently do, but as I cannot find anyone else doing the same, I may be mistaken. Your opinion on that?
Are you 100% sure that HAProxy supports connecting directly to socks proxies? I know that when this article was written it didn’t have that option AND I am not finding an existing option to do that in the docs.
Can you point to the documentation and provide a snippet of your configuration here?
I am 100% sure that HAProxy is a TCP load balancer that can load-balance any TCP connection, including SOCKS. I have been doing that for over a year now, although using Polipo (which I am not fond of) instead of Delegate in _front_ of HAProxy, and it works well, according to the “HAProxy Statistics Report” web interface.
However, it is important to keep in mind that HAProxy does not load-balance HTTP requests “inside” a SOCKS tunnel. Only the SOCKS connections themselves can be load-balanced. If all HTTP requests are forwarded through a single SOCKS tunnel that lasts forever, they can never be load-balanced.
To take advantage of load balancing, one must open multiple SOCKS connections and close them regularly so that new SOCKS connections are routed to different Tor nodes. This is how Polipo seems to behave somehow, although I don’t really know how to explicitly configure it to do so. Can Delegate do that as well?
In the end, it is probably not as efficient as running a dedicated Polipo/Delegate instance for each Tor node (where each HTTP connection is load-balanced independently), but it is also certainly less cumbersome. I am just surprised nobody has documented that before.
Still relevant.
Hi Sebastian,
Thanks for sharing such useful information. I have a question; can you help me? As far as we know, when we start Tor and DeleGate, each Tor instance keeps the same IP for its lifetime until we restart Tor. I want each Tor instance to change its IP automatically. Is there any way to do it?
Thanks a lot and your posts are greatly helpful.
FirstFeel
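For the question above, one possibility is Tor's control port, which the launch script in this article opens for each instance with authentication disabled. Sending SIGNAL NEWNYM asks Tor to use fresh circuits for new connections. The following is a minimal sketch, not the author's method. It assumes nc (netcat) is installed and that the control ports start at 8118 as in the script; the function names are hypothetical.

```shell
#!/bin/bash
# Build the control-protocol commands: authenticate with an empty password
# (matching --HashedControlPassword "" in the launch script), request a new
# identity, then close the session. Each command ends in CRLF.
newnym_payload() {
    printf 'AUTHENTICATE ""\r\nSIGNAL NEWNYM\r\nQUIT\r\n'
}

# Send the payload to one Tor control port on localhost.
# Assumes nc is available; -w 5 gives up after 5 seconds of inactivity.
request_new_identity() {
    local port=$1
    newnym_payload | nc -w 5 localhost "$port"
}

# Usage, for the ten instances started by the launch script:
#   for i in {0..9}; do request_new_identity $((8118 + i)); done
```

Tor may rate-limit the signal, and a new circuit does not guarantee a different exit IP, so it is worth verifying the result against an IP echo page.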
Elite Private Proxies API on Mashape.
Short-lived elite proxies updated every minute.
Every request creates a private hostname that expires after 5 minutes.
You can ask for a geo-located proxy or random party.
Includes a proxy-count-by-country endpoint.
They are all private proxies, not scraped from public lists.
https://market.mashape.com/garrylachman/elite-proxies
…
Hi, thanks a lot for your tutorial. I face a problem connecting to HTTPS (SSL) websites. How can I fix this?
regards
Hey, I’m also facing an issue with SSL websites. If you found the solution, could you please share it with everyone?
Love the script, thanks a lot.
However, when I try to modify the script to build a new identity on all Tor services, I get an error.
I added MaxCircuitDirtiness 60:
Jun 09 12:23:39.688 [warn] Failed to parse/validate config: Unknown option ’60’. Failing.
Jun 09 12:23:39.688 [err] Reading config failed–see warnings above.
Can you help with this?
best regards
Must be dumb, but how do I test this?
I tried this: curl -x x.x.x.x:3129 ipinfo.io
and I get an HAProxy message. What port is the proxy IP supposed to answer on?
Can you write a tutorial on using your own list of proxies? For example, if I already have a proxy list containing 1000 proxies and want to rotate them every 3 minutes, then access those proxies through one single local IP, how do I do that? Thanks.