How to scrape multiple HTML pages in parallel with BeautifulSoup

Posted 2019-06-14 02:35

Question:

I'm building a web-scraping app in Python with the Django web framework. I need to scrape multiple queries using the BeautifulSoup library. Here is a snippet of the code I have written:

import requests
from bs4 import BeautifulSoup

for url in websites:
    # each request completes before the next one starts
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a", {"class": "dev-link"})

As written, the pages are scraped sequentially, and I want to fetch them in parallel. I don't have much experience with threading in Python. Can someone tell me how to do the scraping in parallel? Any help would be appreciated.

Answer 1:

You can use Hadoop (http://hadoop.apache.org/) to run your jobs in parallel. It is a very good tool for running parallel tasks.



Answer 2:

Try this solution.

import threading
import requests
from bs4 import BeautifulSoup

results = {}

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # a thread target's return value is discarded, so store the links instead
    results[url] = soup.find_all("a", {"class": "dev-link"})

threads = [threading.Thread(target=fetch_links, args=(url,))
           for url in websites]

for t in threads:
    t.start()

for t in threads:
    t.join()

Downloading page content via requests.get() is a blocking I/O operation that releases the GIL while it waits, so Python threading can genuinely improve throughput here.
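For example, roughly the same thing can be written with concurrent.futures from the standard library, which also collects the results for you (the URL below is a placeholder; `websites` stands in for your own list):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

websites = ["https://example.com"]  # placeholder; use your own URLs

def fetch_links(url):
    # requests.get() blocks on network I/O, so threads can overlap the waiting
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("a", {"class": "dev-link"})

# cap the number of worker threads so you don't hammer the target sites
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch_links, websites))

# results[i] holds the links scraped from websites[i]

executor.map() preserves the input order, so each result lines up with its URL.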



Answer 3:

If you want to use multithreading, then:

import threading
import requests
from bs4 import BeautifulSoup

class Scrapper(threading.Thread):
    def __init__(self, threadId, name, url):
        threading.Thread.__init__(self)
        self.name = name
        self.id = threadId
        self.url = url
        self.links = []

    def run(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # store the result on the instance; run()'s return value would be lost
        self.links = soup.find_all("a")

# list the websites in the list below
websites = []

threads = []
for i, url in enumerate(websites, start=1):
    thread = Scrapper(i, "thread" + str(i), url)
    # start() runs run() in a separate thread; calling run() directly is sequential
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
    # print(thread.links)

This might be helpful.



Answer 4:

When it comes to Python and scraping, Scrapy is probably the way to go.

Scrapy uses the Twisted asynchronous networking library for parallelism, so you don't have to worry about threading or the Python GIL.
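
As a rough illustration (the spider name, URL, and selector below are placeholders, not your actual code), a minimal Scrapy spider looks like this:

import scrapy

class DevLinkSpider(scrapy.Spider):
    name = "dev_links"                    # placeholder spider name
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Scrapy schedules these requests concurrently through Twisted,
        # so there is no manual thread management
        for href in response.css("a.dev-link::attr(href)").getall():
            yield {"link": href}

You can run it with `scrapy runspider spider.py -o links.json`, and Scrapy manages the concurrency for you (see the CONCURRENT_REQUESTS setting).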

If you must use BeautifulSoup, check this library out.