Running Your Own Anonymous Rotating Proxies

Rotating Proxies with HAProxy

Most web browsers and scrapers can only be configured to use one proxy per protocol. You can get around this limitation by running different instances of browsers and scrapers. Google Chrome and Firefox allow multiple profiles. However, running hundreds of browser instances is unwieldy.

A better option is to set up your own proxy that rotates among a set of Tor proxies. The Tor application implements a SOCKS proxy. Start multiple Tor instances on one or more machines and networks, then configure and run an HTTP load balancer to expose a single point of connection instead of adding the rotation logic to the client application. In the Distributed Scraping With Multiple Tor Circuits article we learned how to set up multiple Tor SOCKS proxies for web scraping and crawling. However, our sample code launched multiple threads, each of which used a different proxy. In this example we use the HAProxy load balancer with a round-robin strategy to rotate our proxies.

When you are crawling and scraping sites that rely on JavaScript, using a real browser with a high-performance JavaScript engine like V8 may be the best approach. Just configuring our rotating proxy in the browser does the trick. Another option is HtmlUnit, but a browser with the V8 JavaScript engine parses web pages and runs JavaScript more quickly. If you are using a browser, you must be particularly careful to keep the scraped site from correlating your multiple requests. Try disabling cookies, local storage, and image loading, leaving only JavaScript enabled, and cache as many requests as possible. If you need to support cookies, you have to run different browsers with different profiles.
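
As a rough illustration of running separate browser profiles through one proxy, the sketch below prints a Chrome command line per profile instead of launching anything. The helper name `browser_command`, the `google-chrome` binary name, and the default proxy address `http://localhost:3128` (the port the HAProxy frontend in this article binds to) are assumptions; `--user-data-dir` and `--proxy-server` are standard Chromium command-line switches.

```shell
#!/bin/sh
# Sketch: print a Chrome command line that sends traffic through the rotating
# proxy and keeps cookies in its own profile directory.
# browser_command is a hypothetical helper; the binary name and the default
# proxy address are assumptions, adjust them to your system.
browser_command() (
	profile_dir=$1
	proxy=${2:-http://localhost:3128}
	echo "google-chrome --user-data-dir=$profile_dir --proxy-server=$proxy"
)
```

For example, `for n in 1 2 3; do browser_command "profiles/p$n"; done` prints three commands, one per profile directory, which you can review before running them.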

Setup and Configuration

Prerequisites

  1. Tor
  2. DeleGate
  3. HAProxy

HAProxy Configuration File

rotating-tor-proxies.cfg

global
        daemon
        maxconn 256

defaults
        mode http
        timeout connect 5000ms
        timeout client 50000ms
        timeout server 50000ms

frontend rotatingproxies
        bind *:3128
        default_backend tors
        option http_proxy

backend tors
        option http_proxy
        server tor1 localhost:3129
        server tor2 localhost:3130
        server tor3 localhost:3131
        server tor4 localhost:3132
        server tor5 localhost:3133
        server tor6 localhost:3134
        server tor7 localhost:3135
        server tor8 localhost:3136
        server tor9 localhost:3137
        server tor10 localhost:3138
        balance roundrobin
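
Since the server lines follow a simple pattern, the backend section can be generated rather than typed by hand. The following is a minimal sketch, not part of the original setup: `generate_backend` is a hypothetical helper, and it assumes the HTTP proxies listen on consecutive local ports starting at 3129. HAProxy asks for distinct server names within a backend, so the sketch numbers them tor1, tor2, and so on.

```shell
#!/bin/sh
# Sketch: print an HAProxy backend section with one uniquely named server
# per local HTTP proxy port. generate_backend is a hypothetical helper.
generate_backend() (
	base_http_port=$1   # first HTTP proxy port, e.g. 3129
	count=$2            # number of Tor/DeleGate pairs, e.g. 10
	echo "backend tors"
	echo "        option http_proxy"
	i=0
	while [ "$i" -lt "$count" ]; do
		echo "        server tor$((i + 1)) localhost:$((base_http_port + i))"
		i=$((i + 1))
	done
	echo "        balance roundrobin"
)
```

Calling `generate_backend 3129 10` prints a backend covering ports 3129 through 3138, which can be pasted into rotating-tor-proxies.cfg.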

Running

Run the following script. It launches several instances of Tor, then one instance of delegated per Tor instance, and finally HAProxy to rotate among the proxy servers. We have to use DeleGate because HAProxy does not support SOCKS, so DeleGate translates each Tor SOCKS proxy into an HTTP proxy.

#!/bin/bash
base_socks_port=9050
base_http_port=3129 # leave 3128 for HAProxy
base_control_port=8118

# Create data directory if it doesn't exist
if [ ! -d "data" ]; then
	mkdir "data"
fi

# Launch 10 Tor/DeleGate pairs (i = 0..9), one per server in the HAProxy backend
for i in {0..9}
do
	socks_port=$((base_socks_port+i))
	control_port=$((base_control_port+i))
	http_port=$((base_http_port+i))
	if [ ! -d "data/tor$i" ]; then
		echo "Creating directory data/tor$i"
		mkdir "data/tor$i"
	fi
	# Note: authentication for the control port is disabled here, so run this only in a secure, controlled environment.

	echo "Running: tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword \"\" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i"

	tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword "" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i

	echo 	"Running: ./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port"

	./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port
done

haproxy -f rotating-tor-proxies.cfg
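
One way to check that rotation is working is to request an IP-echo page through port 3128 several times and count how many distinct addresses come back. The sketch below is a hedged example, not part of the original script: `count_distinct` and `sample_exit_ips` are hypothetical helper names, and the echo URL is whatever plain-HTTP service you trust to return the caller's IP as text (for instance `http://ipinfo.io/ip`, which is an assumption here).

```shell
#!/bin/sh
# Sketch: sample the exit IP seen through the rotating proxy.

# count_distinct reads one value per line on stdin and prints how many
# distinct non-empty values it saw.
count_distinct() {
	grep -v '^$' | sort -u | wc -l | tr -d ' '
}

# sample_exit_ips PROXY URL N: fetch URL through PROXY N times, one result
# per line. Each curl invocation opens a new connection, which gives the
# load balancer a chance to pick the next server.
sample_exit_ips() (
	proxy=$1; url=$2; n=$3
	i=0
	while [ "$i" -lt "$n" ]; do
		curl --silent --max-time 30 --proxy "$proxy" "$url" || true
		echo
		i=$((i + 1))
	done
)
```

A run such as `sample_exit_ips localhost:3128 http://ipinfo.io/ip 10 | count_distinct` prints the number of different exit addresses observed in ten requests. Different Tor instances can occasionally share an exit node, so the count may be lower than the number of instances.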

See Also

  1. Distributed Scraping With Multiple Tor Circuits
  2. Web Scraping Ajax and Javascript Sites

Resources

  1. HAProxy The Reliable, High Performance TCP/HTTP Load Balancer
  2. DeleGate Multi-Purpose Application Level Gateway
  3. Python twisted proxyclient cascade / upstream to squid
  4. How SOPA’s ‘circumvention’ ban could put a target on Tor

Comments

  • F240268

    Just the kind of information I was looking for

  • http://plususefulness.blogspot.com/ Mauro Asprea

    Following your idea I implemented this using Monit (http://mmonit.com/monit/)
    See the Gist here https://gist.github.com/2519840

    • http://blog.databigbang.com Sebastian Wain

      Thanks Mauro!

  • redice

    Many thanks.
    “HTTP_X_FORWARDED_FOR” is added to the request headers. Do you know which program added it? How can I remove it?

    • bobo

      “HTTP_X_FORWARDED_FOR” is added by your proxy, and it may contain your real IP. Personally, I find nothing useful about transparent proxies; they just add latency to your internet connection. You should use an anonymous or elite proxy to conceal your IP.

  • http://www.facebook.com/landon.campbell.982 Landon Campbell

    Hey Y’all,

    Great article, and really helpful for some scraping needs I have. However, I can’t seem to get HAProxy to round-robin, or at least not rapidly. I’m using a HAProxy/Polipo/Tor combination, and everything is running and communicating correctly. I’m testing my setup by issuing HTTP requests (C# HttpWebRequest) through HAProxy to a page I created on a development server that does nothing but show the current requesting IP address. My HAProxy configuration is the same as yours above. If I run my test console application, and make 10 requests in a loop, I get the same IP address every time. If I stop/restart my console application, the IP address is different. So it appears that HAProxy IS doing round-robin correctly, but not within a loop in the executing console app. Is there something I’m not understanding about how HAProxy decides when to move to the next server? Any thoughts you have would be greatly appreciated.

    By the way, in case any of your readers need it, I’m creating multiple Polipo/Tor instances programmatically in C# like this:

    int baseSocksPort = 9050;
    int baseHttpPort = 3129;

    for (int i = 0; i < 10; i++)
    {
        // NOTE: as posted, SocksPort and ControlPort are not offset by i,
        // so every instance requests the same ports.
        Process newTor = new Process();
        StringBuilder torArgs = new StringBuilder();
        torArgs.Append(@"-f C:\Dev\RequestProxying\torrc");
        torArgs.Append(" --PidFile tor" + i.ToString() + ".pid");
        torArgs.Append(" --SocksPort " + baseSocksPort.ToString());
        torArgs.Append(" --ControlPort " + (baseSocksPort + 1).ToString());
        torArgs.Append(@" --DataDirectory C:\Dev\RequestProxying\data\tor" + i.ToString());
        newTor.StartInfo = new ProcessStartInfo(@"C:\Program Files (x86)\VidaliaBundle\Tor\tor.exe", torArgs.ToString());
        newTor.StartInfo.CreateNoWindow = true;
        newTor.StartInfo.RedirectStandardOutput = true;
        newTor.StartInfo.UseShellExecute = false;
        newTor.Start();
    }

    It's a bit easier to escape things and add arguments this way than via batch file.

    Thanks,

    Landon

    • http://blog.databigbang.com Sebastian Wain

      Hi Landon, as a preliminary check, can you look at your Polipo/Tor logs to see whether HAProxy is sending requests to the same Polipo/Tor pair, or whether Tor is using the same exit node on different Polipo/Tor pairs?

      • http://www.facebook.com/landon.campbell.982 Landon Campbell

        Sebastian,

        Thanks for the quick reply. It took me a while, but I finally realized that the reason I was getting the same IP when proxying in a loop was because of the default keep alive settings. Obviously I was calling HAProxy very quickly in that loop, so keep alive just used the same connection. I added “option httpclose” to my HAProxy config file, and now I get a different exit node IP for (almost) every request in the loop! Drove me about crazy, but it’s finally explained!

        Thanks again for suggesting this approach in the article — I think it’s going to help me quite a bit. Cheers, Landon

  • Bin Wang

    everything in blog.databingbang.com is great

    • Bin Wang

      Sorry for the comments above.

      I am trying to redo what you have explained in your post. However, to be safe, I am doing all these tests on AWS EC2, running 64-bit Ubuntu. As far as I know, I cannot go through your process because DeleGate is not available on Ubuntu.

      Someone agreed with me on this point:
      http://ajitabhpandey.info/2011/03/delegate-a-multi-platform-multi-purpose-proxy-server/

      I prefer this round-robin method to the one that you used scraping IMDB, mostly because this is easier.

      Do you think you could come up with a substitute for DeleGate so that your fans using Ubuntu could benefit?

      Thanks a lot and your posts are greatly helpful.

      Bin

      • http://blog.databigbang.com Sebastian Wain

        I used DeleGate on Ubuntu. Try to compile it instead of retrieving it from a repository.

  • John Smith

    Hi Sebastian,
    Any reason you didn’t simply load-balance the SOCKS connections between Delegate and your Tor nodes instead of load-balancing your HTTP connections between your client and Delegate? In other words, instead of having [client => HAProxy => Delegate => Tor], you would have [client => Delegate => HAProxy => Tor]. That would be only one instance of Delegate, instead of one per Tor node. Caching (if used) would also be shared among all Tor nodes.

    • http://blog.databigbang.com/ Sebastian Wain

      We just use Delegate to translate from an HTTP proxy to a SOCKS one since HAProxy doesn’t support SOCKS proxies. A Tor instance exposes a SOCKS proxy.

      We can combine your idea with this implementation adding another Delegate instance between the client and the HAProxy.

      • John Smith

        Thank you for your answer Sebastian.
        However, in my understanding, HAProxy is totally able to load-balance connections to SOCKS servers like Tor, as it can any other TCP connection. This is actually what I currently do, but as I cannot find anyone else doing the same, I may be mistaken. Your opinion on that?

        • http://blog.databigbang.com/ Sebastian Wain

          Are you 100% sure that HAProxy supports connecting directly to socks proxies? I know that when this article was written it didn’t have that option AND I am not finding an existing option to do that in the docs.

          Can you point to the documentation and provide a snippet of your configuration here?

          • John Smith

            I am 100% sure that HAProxy is a TCP load-balancer that can load-balance any TCP connection, including SOCKS. I have been doing that for over a year now, although using Polipo (which I am not fond of) instead of Delegate in _front_ of HAProxy, and it works well, according to the “HAProxy Statistics Report” web interface.

            However, it is important to keep in mind that HAProxy does not load-balance HTTP requests “inside” a SOCKS tunnel. Only SOCKS connections themselves can be load-balanced. If all HTTP requests are forwarded through a single SOCKS tunnel that lasts forever, they can never be load-balanced.

            To take advantage of load-balancing, one must open multiple SOCKS connections and close them regularly so that new SOCKS connections are routed to different Tor nodes. This is how Polipo seems to behave somehow, although I don’t really know how to explicitly configure it to do so. Can Delegate do that as well?

            In the end, it is probably not as efficient as running a dedicated Polipo/Delegate instance for each Tor node (where each HTTP connection is load-balanced independently), but it is also certainly less cumbersome. I am just surprised nobody has documented that before.

  • traverseda

    Still relevant.

  • firstfeel

    Hi Sebastian,

    Thanks for sharing this very useful information. I have a question, can you help me? As we know, when we start Tor and delegate, each Tor instance keeps the same IP for its whole lifetime, until we restart Tor. I want each Tor instance to change its IP automatically. Is there any way to do it?

    Thanks a lot and your posts are greatly helpful.

    FirstFeel

  • nooby

    Hi, thanks a lot for your tutorial. I have a problem connecting to HTTPS (SSL) websites. How can I fix this?

    regards

  • someone

    love the script thanks a lot,

    however when i try to modify the script to build new identity on all tor services i get error

    i added MaxCircuitDirtiness 60

    Jun 09 12:23:39.688 [warn] Failed to parse/validate config: Unknown option '60'. Failing.

    Jun 09 12:23:39.688 [err] Reading config failed--see warnings above.

    can you help with this

    best regards

  • chovysblog

    Must be dumb, but how do I test this?

    I tried this: curl -x x.x.x.x:3129 ipinfo.io
    and i get an haproxy message. what port is the proxy ip supposed to answer on?

    • chovysblog

      use port 3128 :)

      nice article.

  • chovysblog

    I find a problem with some sites and maximum number of redirects reached.

    Is there an HAProxy setting I can use to tell it to follow all redirects?

    with `request()` module in node.js:

    Exceeded maxRedirects. Probably stuck in a redirect loop https://www.thesun.co.uk/news/2081744/russia-brings-super-secret-spy-sub-back-into-service-after-16-years-languishing-in-port/

    from command line using my proxy:

    curl -x x.x.x.x:3128 'https://www.thelocal.se/20161031/hey-swedes-lets-salvage-our-submarine-together-says-russia' -L
    curl: (56) Received HTTP code 502 from proxy after CONNECT
