Rotating Proxies with HAProxy
Most web browsers and scrapers can only be configured to use one proxy per protocol. You can get around this limitation by running different instances of browsers and scrapers. Google Chrome and Firefox allow multiple profiles. However, running hundreds of browser instances is unwieldy.
A better option is to set up your own proxy that rotates among a set of Tor proxies. The Tor application implements a SOCKS proxy. Start multiple Tor instances on one or more machines and networks, then configure and run an HTTP load balancer to expose a single point of connection instead of adding the rotation logic to the client application. In the Distributed Scraping With Multiple Tor Circuits article we learned how to set up multiple Tor SOCKS proxies for web scraping and crawling. However, our sample code launched multiple threads, each of which used a different proxy. In this example we use the HAProxy load balancer with a round-robin strategy to rotate our proxies.
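To make the round-robin idea concrete, here is a minimal shell sketch of the selection rule: each request goes to the next port in a fixed cycle, wrapping around at the end. It only illustrates the strategy and is not how HAProxy implements it internally; the names `next_proxy_port`, `proxy_port`, and `count` are hypothetical, and the port range assumes ten local HTTP proxies starting at 3129.

```shell
base_http_port=3129   # first local HTTP proxy port (assumed, as in this article)
instances=10          # number of proxies behind the balancer (assumed)
count=0               # requests dispatched so far

# Set proxy_port to the next port in the cycle and advance the counter.
next_proxy_port() {
  proxy_port=$((base_http_port + count % instances))
  count=$((count + 1))
}

# Dispatch 12 requests to show the cycle wrapping back to the first port.
request=1
while [ "$request" -le 12 ]; do
  next_proxy_port
  echo "request $request -> localhost:$proxy_port"
  request=$((request + 1))
done
```

The function writes to a global variable rather than printing the port, because calling it inside `$(...)` would run it in a subshell and lose the updated counter.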
When you are crawling and scraping sites that rely on JavaScript, using a real browser with a high-performance JavaScript engine like V8 may be the best approach. Just configuring our rotating proxy in the browser does the trick. Another option is HtmlUnit, but a browser built on the V8 JavaScript engine parses web pages and runs JavaScript more quickly. If you are using a browser, you must be particularly careful to keep the scraped site from correlating your multiple requests. Try disabling cookies, local storage, and image loading, leaving only JavaScript enabled, and cache as many requests as possible. If you need to support cookies, you have to run different browsers with different profiles.
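As a rough sketch of the multiple-profile approach, the snippet below prints the command lines that would start several Chrome instances, each with its own profile directory and all pointing at the rotating proxy. The binary name `google-chrome`, the profile location under `/tmp/scraper-profiles`, and the helper `chrome_command` are assumptions for illustration; `--proxy-server` and `--user-data-dir` are standard Chrome command-line switches.

```shell
proxy="http://localhost:3128"            # HAProxy frontend (assumed from this article)
profiles_dir="/tmp/scraper-profiles"     # hypothetical location for browser profiles

# Print the command that would start Chrome with a separate profile directory.
# The binary name varies by platform; google-chrome is an assumption.
chrome_command() {
  echo "google-chrome --proxy-server=$proxy --user-data-dir=$profiles_dir/profile$1"
}

# Show the commands for three independent profiles.
for n in 1 2 3; do
  chrome_command "$n"
done
```

The commands are printed rather than executed so that you can review them first; each profile keeps its own cookies and local storage, which is what keeps the sessions separate.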
Setup and Configuration
Prerequisites
HAProxy Configuration File
rotating-tor-proxies.cfg
global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend rotatingproxies
    bind *:3128
    default_backend tors
    option http_proxy

backend tors
    option http_proxy
    server tor1 localhost:3129
    server tor2 localhost:3130
    server tor3 localhost:3131
    server tor4 localhost:3132
    server tor5 localhost:3133
    server tor6 localhost:3134
    server tor7 localhost:3135
    server tor8 localhost:3136
    server tor9 localhost:3137
    server tor10 localhost:3138
    balance roundrobin
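If you change the number of Tor instances, writing the server lines by hand is error-prone. The following sketch generates them, giving each server a distinct name so the entries can be told apart in HAProxy's logs and statistics. The function name `print_backend_servers` and its two arguments are hypothetical; the output is meant to be pasted into the `backend tors` section.

```shell
# Print one HAProxy "server" line per local HTTP proxy, each with a unique name.
# $1: first HTTP proxy port, $2: number of proxy instances.
print_backend_servers() {
  first_port=$1
  total=$2
  i=0
  while [ "$i" -lt "$total" ]; do
    echo "    server tor$((i + 1)) localhost:$((first_port + i))"
    i=$((i + 1))
  done
}

# Ten servers on ports 3129 to 3138, matching the ports used in this article.
print_backend_servers 3129 10
```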
Running
Run the following script, which launches several instances of Tor, then runs one instance of delegated per Tor instance, and finally runs HAProxy to rotate among the proxy servers. We have to use DeleGate because HAProxy does not support SOCKS.
#!/bin/bash
base_socks_port=9050
base_http_port=3129 # leave 3128 for HAProxy
base_control_port=8118

# Create data directory if it doesn't exist
if [ ! -d "data" ]; then
    mkdir "data"
fi

for i in {0..9}
do
    socks_port=$((base_socks_port+i))
    control_port=$((base_control_port+i))
    http_port=$((base_http_port+i))

    if [ ! -d "data/tor$i" ]; then
        echo "Creating directory data/tor$i"
        mkdir "data/tor$i"
    fi

    # Take into account that authentication for the control port is disabled. Must be used in secure and controlled environments
    echo "Running: tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword \"\" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i"
    tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword "" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i

    echo "Running: ./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port"
    ./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port
done

haproxy -f rotating-tor-proxies.cfg
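Once everything is running, a quick way to check that rotation works is to send several requests through port 3128 and count how many distinct exit addresses come back. The sketch below assumes `curl` is installed and that `http://checkip.amazonaws.com` (or any similar plain-HTTP service that echoes the caller's IP) is reachable; the function names `fetch_ip` and `count_distinct_ips` are hypothetical.

```shell
proxy_url="http://localhost:3128"             # HAProxy frontend from the configuration file
ip_echo_url="http://checkip.amazonaws.com"    # assumption: any plain-HTTP IP echo service works

# Fetch the public IP address as seen by the remote service, via the rotating proxy.
fetch_ip() {
  curl --silent --max-time 30 --proxy "$proxy_url" "$ip_echo_url"
}

# Make $1 requests and print the number of distinct addresses observed.
count_distinct_ips() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    fetch_ip
    echo            # make sure each response ends with a newline
    i=$((i + 1))
  done | grep -v '^$' | sort -u | wc -l | tr -d ' '
}
```

For example, `count_distinct_ips 10` sends ten requests. A result greater than 1 suggests requests are leaving through different Tor circuits, although two circuits can occasionally share the same exit node, so the count may be lower than the number of Tor instances.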