How can I get a new IP from Tor for every request in th

Posted 2020-07-17 06:47

Question:

I am trying to use a Tor proxy for scraping. Everything works fine in a single thread, but it is slow, so I tried to parallelize it with something simple:

import time
from multiprocessing import Pool

import requests
from stem import Signal
from stem.control import Controller


def get_new_ip():
    # Ask Tor for a new identity, then wait until the next NEWNYM is allowed.
    with Controller.from_port(port = 9051) as controller:
        controller.authenticate(password="password")
        controller.signal(Signal.NEWNYM)
        time.sleep(controller.get_newnym_wait())


def check_ip():
    get_new_ip()
    session = requests.session()
    session.proxies = {'http': 'socks5h://localhost:9050', 'https': 'socks5h://localhost:9050'}
    r = session.get('http://httpbin.org/ip')
    print(r.text)


with Pool(processes=3) as pool:
    for _ in range(9):
        pool.apply_async(check_ip)
    pool.close()
    pool.join()

When I run it, I see the output:

{"origin": "95.179.181.1, 95.179.181.1"}
{"origin": "95.179.181.1, 95.179.181.1"}
{"origin": "95.179.181.1, 95.179.181.1"}
{"origin": "151.80.53.232, 151.80.53.232"}
{"origin": "151.80.53.232, 151.80.53.232"}
{"origin": "151.80.53.232, 151.80.53.232"}
{"origin": "145.239.169.47, 145.239.169.47"}
{"origin": "145.239.169.47, 145.239.169.47"}
{"origin": "145.239.169.47, 145.239.169.47"}

Why is this happening, and how do I give each thread its own IP? By the way, I tried libraries like TorRequests and TorCtl, and the result is the same.

I understand that Tor seems to have a delay before issuing a new IP, but why does the same IP end up in different processes?

Answer 1:

If you want different IPs for each connection, you can also use Stream Isolation over SOCKS by specifying a different proxy username:password combination for each connection.

With this method, you only need one Tor instance and each requests client can use a different stream with a different exit node.

In order to set this up, add unique proxy credentials for each requests.session object like so: socks5h://username:password@localhost:9050

import random
from multiprocessing import Pool
import requests

def check_ip():
    session = requests.session()
    creds = str(random.randint(10000,0x7fffffff)) + ":" + "foobar"
    session.proxies = {'http': 'socks5h://{}@localhost:9050'.format(creds), 'https': 'socks5h://{}@localhost:9050'.format(creds)}
    r = session.get('http://httpbin.org/ip')
    print(r.text)


with Pool(processes=8) as pool:
    for _ in range(9):
        pool.apply_async(check_ip)
    pool.close()
    pool.join()

Tor Browser isolates streams on a per-domain basis by setting the credentials to firstpartydomain:randompassword, where randompassword is a random nonce for each unique first party domain.

If you're crawling the same site and you want random IPs, then use a random username:password combination for each session. If you are crawling random domains and want to use the same circuit for all requests to a given domain, use Tor Browser's method of domain:randompassword for credentials.
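The two credential schemes above could be sketched roughly as follows. This is only an illustration: `proxies_for`, `TOR_SOCKS` and `_domain_nonces` are hypothetical names, and the URL's hostname is used as a stand-in for the first-party domain, which is a simplification of what Tor Browser does.

```python
import secrets
from urllib.parse import urlsplit

# Assumption: a single Tor instance with its SOCKS port on localhost:9050.
TOR_SOCKS = "localhost:9050"

# One random nonce per hostname, kept for the lifetime of the process.
_domain_nonces = {}


def proxies_for(url, per_domain=True):
    """Build a requests-style proxies dict carrying isolation credentials."""
    if per_domain:
        # Same hostname -> same credentials -> requests can share a circuit.
        user = urlsplit(url).hostname or "unknown"
        password = _domain_nonces.setdefault(user, secrets.token_hex(8))
    else:
        # Fresh random credentials on every call, so each session is isolated.
        user, password = secrets.token_hex(8), secrets.token_hex(8)
    proxy = f"socks5h://{user}:{password}@{TOR_SOCKS}"
    return {"http": proxy, "https": proxy}
```

A session would then be configured with `session.proxies = proxies_for(url)` for per-domain circuits, or `proxies_for(url, per_domain=False)` when every session should get its own stream.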



Answer 2:

You only have one proxy, which is listening on port 9050. All 3 processes send their requests in parallel through that proxy, so they share the same IP.

What is happening is:

  1. All 3 processes ask the proxy to get a new IP
  2. The proxy either requests a new IP 3 times, receives 3 responses and applies the last one, or it recognizes that it is already waiting for a new IP and disregards 2 of the requests, answering all 3 of them together. Which of these happens depends on the proxy implementation.
  3. The processes send their requests through the proxy, which results in the same IP.
  4. The processes are completed and another 3 processes are initiated. Rinse and repeat.

That is why the IPs are the same for every block of 3 requests.
You'll need 3 independent proxies to have 3 different IPs at the same time.
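
One way to get 3 independent proxies is to run 3 Tor instances, each with its own configuration file. The sketch below only generates the text of such files. `SocksPort`, `ControlPort` and `DataDirectory` are standard torrc options, but the helper name `torrc_for`, the port numbering and the `/tmp/tor-instances` path are assumptions made for illustration. Control-port authentication (for example a `HashedControlPassword` line) still has to be added separately.

```python
def torrc_for(index, base_port=9050, step=10, data_root="/tmp/tor-instances"):
    """Return torrc text for the Tor instance with the given index."""
    socks_port = base_port + index * step   # 9050, 9060, 9070, ...
    control_port = socks_port + 1           # 9051, 9061, 9071, ...
    lines = [
        f"SocksPort {socks_port}",
        f"ControlPort {control_port}",
        # Each instance needs its own data directory.
        f"DataDirectory {data_root}/tor{index}",
    ]
    return "\n".join(lines) + "\n"


for i in range(3):
    print(f"# torrc.{i}")
    print(torrc_for(i))
```

Each generated file would then be passed to its own process, for example `tor -f torrc.0`.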


EDIT:

Possible solution using locks and assuming 3 proxies running in the background:

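import contextlib
import threading
import time
from multiprocessing.pool import ThreadPool

import requests
from stem import Signal
from stem.control import Controller

# threading.Lock objects are only shared between threads of one process,
# so this version uses a thread pool rather than a process pool.
_controller_ports = [
    # (Controller Lock, connection port, management port)
    (threading.Lock(), 9050, 9051),
    (threading.Lock(), 9060, 9061),
    (threading.Lock(), 9070, 9071),
]


def get_new_ip_for(port):
    # Ask the Tor instance behind this control port for a new identity.
    with Controller.from_port(port=port) as controller:
        controller.authenticate(password="password")
        controller.signal(Signal.NEWNYM)
        time.sleep(controller.get_newnym_wait())


@contextlib.contextmanager
def get_port_with_new_ip():
    while True:
        for lock, con_port, manage_port in _controller_ports:
            if lock.acquire(blocking=False):
                try:
                    get_new_ip_for(manage_port)
                    yield con_port
                finally:
                    # Release the proxy even if the request raised.
                    lock.release()
                # Return so the generator yields exactly once.
                return
        # Every proxy is busy; wait before polling again.
        time.sleep(1)


def check_ip():
    with get_port_with_new_ip() as port:
        session = requests.session()
        proxy = f'socks5h://localhost:{port}'
        session.proxies = {'http': proxy, 'https': proxy}
        r = session.get('http://httpbin.org/ip')
        print(r.text)


with ThreadPool(processes=3) as pool:
    for _ in range(9):
        pool.apply_async(check_ip)
    pool.close()
    pool.join()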