Synchronise muti-threads in Python

2019-09-06 18:14发布

问题:

The class BrokenLinkTest in the code below does the following.

  1. takes a web page url
  2. finds all the links in the web page
  3. get the headers of the links concurrently (this is done to check if the link is broken or not)
  4. print 'completed' when all the headers are received.

from bs4 import BeautifulSoup
import requests

class BrokenLinkTest(object):

    def __init__(self, url):
        self.url = url
        self.thread_count = 0
        self.lock = threading.Lock()

    def execute(self):
        soup = BeautifulSoup(requests.get(self.url).text)
        self.lock.acquire()
        for link in soup.find_all('a'):
            url = link.get('href')
            threading.Thread(target=self._check_url(url))
        self.lock.acquire()

    def _on_complete(self):
        self.thread_count -= 1
        if self.thread_count == 0: #check if all the threads are completed
            self.lock.release()
            print "completed"

    def _check_url(self, url):
        self.thread_count += 1
        print url
        result = requests.head(url)
        print result
        self._on_complete()


BrokenLinkTest("http://www.example.com").execute()

Can the concurrency/synchronization part be done in a better way. I did it using threading.Lock. This is my first experiment with python threading.

回答1:

def execute(self):
    soup = BeautifulSoup(requests.get(self.url).text)
    threads = []
    for link in soup.find_all('a'):
        url = link.get('href')
        t = threading.Thread(target=self._check_url, args=(url,))
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()

You could use the join method to wait for all the threads to finish.

Note I also added a start call, and passed the bound method object to the target param. In your original example you were calling _check_url in the main thread and passing the return value to the target param.



回答2:

All threads in Python run on the same core, so you won't be gaining any performance by doing it this way. Also - it's very unclear what is actually happening?

  1. You are never actually starting a threads, you are just initializing it
  2. The threads themselves do absolutely nothing other than decrementing the thread count

You may only gain performance in a thread-based scenario if your program is delivering work to the IO (sending requests, writing to file and so on), where other threads can work in the meanwhile.