I'm building a web scraping app in Python with the Django web framework. I need to scrape multiple queries using the BeautifulSoup library. Here is a snapshot of the code I have written:
import requests
from bs4 import BeautifulSoup

for url in websites:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a", {"class": "dev-link"})
Right now the pages are scraped sequentially, and I want to scrape them in parallel instead. I don't have much experience with threading in Python.
Can someone tell me how I can scrape in parallel? Any help would be appreciated.
You can use Hadoop (http://hadoop.apache.org/) to run your jobs in parallel. It is a very good tool for running parallel tasks, though it is heavyweight for a job of this size.
Try this solution. Note that a thread target's return value is discarded, so the results are collected into a shared list instead, and each thread is joined so the main thread waits for all downloads to finish:

import threading
import requests
from bs4 import BeautifulSoup

results = []

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # list.append is thread-safe, so the threads can share this list
    results.append(soup.find_all("a", {"class": "dev-link"}))

threads = [threading.Thread(target=fetch_links, args=(url,))
           for url in websites]
for t in threads:
    t.start()
for t in threads:
    t.join()
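If you are on Python 3, concurrent.futures does the same thing with less bookkeeping. A minimal sketch, reusing the dev-link selector from the question (the max_workers value and timeout are illustrative choices, not requirements):

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    # Each call blocks on network I/O, so a thread pool overlaps the downloads
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("a", {"class": "dev-link"})

websites = []  # fill in the URLs to scrape

# map() submits one task per URL and yields results in input order
with ThreadPoolExecutor(max_workers=8) as executor:
    all_links = list(executor.map(fetch_links, websites))
```

The pool also takes care of joining the worker threads when the with block exits.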
Downloading web page content via requests.get() is a blocking I/O operation, so Python threading can actually improve performance here despite the GIL: while one thread waits on the network, others can run.
If you want to use multithreading, then:
import threading
import requests
from bs4 import BeautifulSoup

class Scraper(threading.Thread):
    def __init__(self, thread_id, name, url):
        threading.Thread.__init__(self)
        self.id = thread_id
        self.name = name
        self.url = url
        self.links = []

    def run(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # run() cannot return a value to the caller, so store the result
        self.links = soup.find_all("a")

# list the websites in the below list
websites = []

threads = []
for i, url in enumerate(websites, 1):
    thread = Scraper(i, "thread" + str(i), url)
    # start() runs run() in a new thread; calling run() directly
    # would execute it sequentially in the current thread
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
    # print(thread.links)
This might be helpful.
When it comes to Python and scraping, Scrapy is probably the way to go.
Scrapy uses the Twisted networking library for parallelism, so you don't have to worry about threading and the Python GIL.
If you must use BeautifulSoup, check this library out.