Running Your Own Anonymous Rotating Proxies

Rotating Proxies with HAProxy

Most web browsers and scrapers can only be configured to use one proxy per protocol. You can get around this limitation by running different instances of browsers and scrapers. Google Chrome and Firefox allow multiple profiles. However, running hundreds of browser instances is unwieldy.

A better option is to set up your own proxy that rotates among a set of Tor proxies. The Tor application implements a SOCKS proxy. Start multiple Tor instances on one or more machines and networks, then configure and run an HTTP load balancer to expose a single point of connection, instead of adding the rotation logic to the client application. In the Distributed Scraping With Multiple Tor Circuits article we learned how to set up multiple Tor SOCKS proxies for web scraping and crawling. However, that sample code launched multiple threads, each of which used a different proxy. In this example we use the HAProxy load balancer with a round-robin strategy to rotate among our proxies.

When you are crawling and scraping sites that rely on JavaScript, using a real browser with a high-performance JavaScript engine like V8 may be the best approach. Configuring our rotating proxy in the browser is all it takes. Another option is HtmlUnit, but the V8 JavaScript engine parses web pages and runs JavaScript more quickly. If you are using a browser, you must be particularly careful to keep the scraped site from correlating your multiple requests. Try disabling cookies, local storage, and image loading, leaving only JavaScript enabled, and cache as many requests as possible. If you need to support cookies, you have to run different browsers with different profiles.
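As a rough sketch of the "different browsers with different profiles" idea, the helper below prints a Chrome command line that uses the rotating proxy and its own profile directory. It assumes the HAProxy frontend listens on localhost:3128 as configured later in this article. The binary name google-chrome and the helper name chrome_command are assumptions for illustration, so adjust them for your system.

```shell
#!/bin/bash
proxy="http://localhost:3128"   # HAProxy frontend (see the configuration section)
chrome_bin="google-chrome"      # assumption: the Chrome binary name on your system

# Hypothetical helper: print the command line for one isolated browser profile.
# Argument: the profile directory to use (must not contain spaces).
chrome_command() {
    printf '%s --proxy-server=%s --user-data-dir=%s\n' "$chrome_bin" "$proxy" "$1"
}

# Show the commands for three separate profiles without launching anything
chrome_command data/profile1
chrome_command data/profile2
chrome_command data/profile3
```

Each profile directory keeps its own cookies and local storage, so running one of the printed commands per profile gives you browser instances that do not share state.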

Setup and Configuration

Prerequisites

  1. Tor
  2. DeleGate
  3. HAProxy

HAProxy Configuration File

rotating-tor-proxies.cfg

global
        daemon
        maxconn 256

defaults
        mode http
        timeout connect 5000ms
        timeout client 50000ms
        timeout server 50000ms

frontend rotatingproxies
        bind *:3128
        default_backend tors
        option http_proxy

backend tors
        option http_proxy
        server tor1 localhost:3129
        server tor2 localhost:3130
        server tor3 localhost:3131
        server tor4 localhost:3132
        server tor5 localhost:3133
        server tor6 localhost:3134
        server tor7 localhost:3135
        server tor8 localhost:3136
        server tor9 localhost:3137
        server tor10 localhost:3138
        balance roundrobin
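If you change the number of Tor instances, the server lines in the backend have to change with it. The sketch below generates them from a base port and a count. The function name generate_servers is hypothetical, and the port layout assumes the same base HTTP port (3129) used by the launch script in the next section.

```shell
#!/bin/bash
# Hypothetical helper: print one HAProxy "server" line per DeleGate HTTP port.
# Arguments: base HTTP port, number of Tor/DeleGate pairs.
generate_servers() {
    base_http_port="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Server names must be unique within a backend, so number them tor1..torN
        printf '        server tor%d localhost:%d\n' "$((i + 1))" "$((base_http_port + i))"
        i=$((i + 1))
    done
}

# Print the ten server lines used in the backend above
generate_servers 3129 10
```

The output can be pasted into the tors backend in place of the hand-written server lines.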

Running

Run the following script. It launches several instances of Tor, starts one instance of delegated (the DeleGate daemon) per Tor instance, and finally runs HAProxy to rotate among the proxy servers. We have to use DeleGate because HAProxy does not support SOCKS, so each Tor SOCKS port is exposed as an HTTP proxy.

#!/bin/bash
base_socks_port=9050
base_http_port=3129 # leave 3128 for HAProxy
base_control_port=8118

# Create the data directory if it doesn't exist
mkdir -p data

# Launch 10 Tor/DeleGate pairs (i = 0..9), one per server in the HAProxy backend
for i in {0..9}
do
	socks_port=$((base_socks_port+i))
	control_port=$((base_control_port+i))
	http_port=$((base_http_port+i))
	if [ ! -d "data/tor$i" ]; then
		echo "Creating directory data/tor$i"
		mkdir "data/tor$i"
	fi
	# Note: authentication for the control port is disabled here. Only use this in a secure, controlled environment.

	echo "Running: tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword \"\" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i"

	tor --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword "" --ControlPort $control_port --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i

	echo 	"Running: ./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port"

	./delegate/src/delegated -P$http_port SERVER=http SOCKS=localhost:$socks_port
done

haproxy -f rotating-tor-proxies.cfg
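Once everything is running, one way to check that requests really rotate is to ask an IP echo service for your apparent address several times through the HAProxy frontend. This is only a sketch: the function names are hypothetical, and the echo URL (http://checkip.amazonaws.com) is an assumption, so substitute any plain-HTTP service that returns the caller's IP.

```shell
#!/bin/bash
proxy="http://localhost:3128"            # HAProxy frontend from rotating-tor-proxies.cfg
echo_url="http://checkip.amazonaws.com"  # assumption: any plain-HTTP IP echo service

# Fetch the apparent client IP once through the rotating proxy.
fetch_ip() {
    curl --silent --max-time 30 --proxy "$proxy" "$echo_url"
}

# Read one IP per line on stdin and print "count ip", most frequent first.
summarize_ips() {
    sort | uniq -c | sort -rn
}

# Send N requests (default 20) and summarize the exit IPs that were seen.
check_rotation() {
    n="${1:-20}"
    i=0
    while [ "$i" -lt "$n" ]; do
        fetch_ip
        i=$((i + 1))
    done | summarize_ips
}
```

Calling check_rotation 20 after the script has started should list several different addresses if the round-robin rotation works. Tor instances may occasionally share an exit node, so seeing fewer than ten distinct IPs is not necessarily a failure.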

See Also

  1. Distributed Scraping With Multiple Tor Circuits
  2. Web Scraping Ajax and Javascript Sites

Resources

  1. HAProxy The Reliable, High Performance TCP/HTTP Load Balancer
  2. DeleGate Multi-Purpose Application Level Gateway
  3. Python twisted proxyclient cascade / upstream to squid
  4. How SOPA’s ‘circumvention’ ban could put a target on Tor