Multithreaded crawler while using a Tor proxy

Posted 2019-01-28 21:21

Question:

I am trying to build a multithreaded crawler that uses Tor proxies. I am using the following to establish the Tor connection:

import random
import socket

import socks  # SocksiPy / PySocks
from stem import Signal
from stem.control import Controller

controller = Controller.from_port(port=9151)

def connectTor():
    # Route all new sockets through Tor's SOCKS port (Tor Browser default: 9150)
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
    socket.socket = socks.socksocket


def renew_tor():
    global request_headers
    request_headers = {
        "Accept-Language": "en-US,en;q=0.5",
        "User-Agent": random.choice(BROWSERS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": "http://thewebsite2.com",
        "Connection": "close"
    }

    # Request a fresh Tor circuit (new identity)
    controller.authenticate()
    controller.signal(Signal.NEWNYM)

Here is the URL fetcher:

import requests
from bs4 import BeautifulSoup

def get_soup(url):
    while True:
        try:
            connectTor()
            r = requests.Session()
            response = r.get(url, headers=request_headers)
            the_page = response.content.decode('utf-8', errors='ignore')
            the_soup = BeautifulSoup(the_page, 'html.parser')
            if "captcha" in the_page.lower():
                print("flag condition matched while url: ", url)
                renew_tor()
            else:
                return the_soup
        except Exception as e:
            print("Error while URL:", url, str(e))

I am then creating a multithreaded fetch job:

from concurrent import futures

with futures.ThreadPoolExecutor(200) as executor:
    for url in zurls:
        future = executor.submit(fetchjob, url)

I then get the following error, which I do not see when I use multiprocessing:

 Socket connection failed (Socket error: 0x01: General SOCKS server failure)

I would appreciate any advice on avoiding the SOCKS error and on improving the performance of the crawling method while keeping it multithreaded.

Answer 1:

This is a perfect example of why monkey patching socket.socket is bad.

This replaces the socket class used by all socket connections (which is nearly everything) with the SOCKS socket.

When you go to connect to the controller later, it attempts to use the SOCKS protocol to communicate instead of establishing a direct connection.
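To make that concrete, here is a minimal sketch of the failure mode (assuming the Tor Browser ports from the question: SOCKS on 9150, control on 9151):

import socket
import socks
from stem.control import Controller

# Global monkey patch: from here on, every new socket -- including the one
# stem opens to the control port -- is wrapped in a SOCKS handshake.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
socket.socket = socks.socksocket

# This now asks the Tor SOCKS proxy to reach 127.0.0.1:9151 instead of
# connecting directly, and fails with "General SOCKS server failure".
controller = Controller.from_port(port=9151)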

Since you're already using requests, I'd suggest getting rid of SocksiPy and the socket.socket = socks.socksocket code and using the SOCKS proxy functionality built into requests:

proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}

response = r.get(url, headers=request_headers, proxies=proxies)
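For illustration only, here is a rough sketch of how the fetcher from the question could look with this approach (assuming the Tor Browser ports 9150/9151 used in the question rather than the 9050 shown above, and that requests is installed with SOCKS support, e.g. pip install requests[socks]):

import requests
from bs4 import BeautifulSoup
from stem import Signal
from stem.control import Controller

PROXIES = {
    'http': 'socks5h://127.0.0.1:9150',
    'https': 'socks5h://127.0.0.1:9150'
}

def renew_tor():
    # The controller connects to the control port directly; with no global
    # socket patching, stem and the worker threads no longer interfere.
    with Controller.from_port(port=9151) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)

def get_soup(url, headers):
    while True:
        with requests.Session() as session:
            response = session.get(url, headers=headers, proxies=PROXIES)
            page = response.content.decode('utf-8', errors='ignore')
            if "captcha" in page.lower():
                renew_tor()   # rotate the circuit and retry
                continue
            return BeautifulSoup(page, 'html.parser')

Each call builds its own Session and passes the proxies explicitly, so nothing global is shared between the worker threads.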