I'm trying to use a Tor proxy for scraping, and everything works fine in one thread, but it is slow.
I'm trying to do something simple:
import time

import requests
from multiprocessing import Pool
from stem import Signal
from stem.control import Controller

def get_new_ip():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password="password")
        controller.signal(Signal.NEWNYM)
        time.sleep(controller.get_newnym_wait())

def check_ip():
    get_new_ip()
    session = requests.session()
    session.proxies = {'http': 'socks5h://localhost:9050',
                       'https': 'socks5h://localhost:9050'}
    r = session.get('http://httpbin.org/ip')
    print(r.text)

with Pool(processes=3) as pool:
    for _ in range(9):
        pool.apply_async(check_ip)
    pool.close()
    pool.join()
When I run it, I see the output:
{"origin": "95.179.181.1, 95.179.181.1"}
{"origin": "95.179.181.1, 95.179.181.1"}
{"origin": "95.179.181.1, 95.179.181.1"}
{"origin": "151.80.53.232, 151.80.53.232"}
{"origin": "151.80.53.232, 151.80.53.232"}
{"origin": "151.80.53.232, 151.80.53.232"}
{"origin": "145.239.169.47, 145.239.169.47"}
{"origin": "145.239.169.47, 145.239.169.47"}
{"origin": "145.239.169.47, 145.239.169.47"}
Why is this happening and how do I give each thread its own IP?
By the way, I tried libraries like TorRequest and TorCtl; the result is the same.
I understand that Tor appears to have a delay before issuing a new IP, but why does the same IP end up in different processes?
If you want different IPs for each connection, you can also use Stream Isolation over SOCKS by specifying a different proxy username:password combination for each connection.
With this method, you only need one Tor instance, and each requests client can use a different stream with a different exit node.
To set this up, add unique proxy credentials for each requests.session object like so: socks5h://username:password@localhost:9050
import random
from multiprocessing import Pool

import requests

def check_ip():
    session = requests.session()
    creds = str(random.randint(10000, 0x7fffffff)) + ":" + "foobar"
    session.proxies = {'http': 'socks5h://{}@localhost:9050'.format(creds),
                       'https': 'socks5h://{}@localhost:9050'.format(creds)}
    r = session.get('http://httpbin.org/ip')
    print(r.text)

with Pool(processes=8) as pool:
    for _ in range(9):
        pool.apply_async(check_ip)
    pool.close()
    pool.join()
Tor Browser isolates streams on a per-domain basis by setting the credentials to firstpartydomain:randompassword, where randompassword is a random nonce for each unique first-party domain.
If you're crawling the same site and want random IPs, use a random username:password combination for each session. If you're crawling random domains and want to reuse the same circuit for all requests to a given domain, use Tor Browser's domain:randompassword scheme for the credentials.
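As an illustration of the domain:randompassword scheme, here is a minimal sketch that caches one session per first-party domain; the get_session_for_domain helper and the session cache are my own illustration, not part of Tor Browser or the code above:

import random
from urllib.parse import urlparse

import requests

# Cache of per-domain sessions, so every request to the same first-party
# domain reuses the same credentials (and therefore the same circuit).
_sessions = {}

def get_session_for_domain(url):
    # Hypothetical helper: builds domain:randomnonce credentials once per domain.
    domain = urlparse(url).hostname
    if domain not in _sessions:
        creds = '{}:{}'.format(domain, random.randint(0, 0x7fffffff))
        session = requests.session()
        session.proxies = {
            'http': 'socks5h://{}@localhost:9050'.format(creds),
            'https': 'socks5h://{}@localhost:9050'.format(creds),
        }
        _sessions[domain] = session
    return _sessions[domain]

session = get_session_for_domain('http://httpbin.org/ip')
print(session.get('http://httpbin.org/ip').text)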
You only have one proxy, which is listening on port 9050. All 3 processes send their requests in parallel through that proxy, so they share the same IP.
What is happening is:
- All 3 processes ask the proxy for a new IP.
- The proxy either requests a new IP 3 times, receives 3 responses and applies the last one, or it recognizes that it is already waiting for a new IP, disregards 2 of the requests, and answers all 3 together. Which of these happens depends on the proxy implementation.
- The processes send their requests through the proxy, which results in the same IP for all of them.
- The processes complete and another 3 processes are started. Rinse and repeat.
That is why the IPs are the same for every block of 3 requests.
You'll need 3 independent proxies to have 3 different IPs at the same time.
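For reference, one way to start those 3 independent Tor instances from Python is stem's launch_tor_with_config. This is only a rough sketch: the ports and data directories are arbitrary placeholders, and control-port authentication (e.g. a HashedControlPassword generated with tor --hash-password) still has to be configured to match the authenticate() calls used below.

import stem.process

tor_processes = [
    stem.process.launch_tor_with_config(
        config={
            'SocksPort': str(socks_port),
            'ControlPort': str(control_port),
            'DataDirectory': '/tmp/tor{}'.format(socks_port),  # placeholder path
        },
        take_ownership=True,  # the tor process exits when this script does
    )
    for socks_port, control_port in [(9050, 9051), (9060, 9061), (9070, 9071)]
]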
EDIT:
Possible solution using locks, assuming 3 proxies running in the background:
import contextlib
import threading
import time
from multiprocessing.pool import ThreadPool

import requests
from stem import Signal
from stem.control import Controller

_controller_ports = [
    # (controller lock, SOCKS port, control port)
    (threading.Lock(), 9050, 9051),
    (threading.Lock(), 9060, 9061),
    (threading.Lock(), 9070, 9071),
]

def get_new_ip_for(port):
    with Controller.from_port(port=port) as controller:
        controller.authenticate(password="password")
        controller.signal(Signal.NEWNYM)
        time.sleep(controller.get_newnym_wait())

@contextlib.contextmanager
def get_port_with_new_ip():
    # Grab the first proxy that is not in use, rotate its IP, and hand its
    # SOCKS port to the caller; release the lock once the caller is done.
    while True:
        for lock, con_port, manage_port in _controller_ports:
            if lock.acquire(blocking=False):
                try:
                    get_new_ip_for(manage_port)
                    yield con_port
                finally:
                    lock.release()
                return
        time.sleep(1)

def check_ip():
    with get_port_with_new_ip() as port:
        session = requests.session()
        session.proxies = {'http': f'socks5h://localhost:{port}',
                           'https': f'socks5h://localhost:{port}'}
        r = session.get('http://httpbin.org/ip')
        print(r.text)

# A thread pool is used so the workers share the threading.Lock objects;
# with a process pool, each worker would get its own copies of the locks.
with ThreadPool(processes=3) as pool:
    for _ in range(9):
        pool.apply_async(check_ip)
    pool.close()
    pool.join()