I'm building a web scraping app in Python with the Django web framework. I need to scrape multiple queries using the BeautifulSoup library. Here is a snapshot of the code I have written:
import requests
from bs4 import BeautifulSoup

for url in websites:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a", {"class": "dev-link"})
Right now the pages are scraped sequentially, and I want to scrape them in parallel instead. I don't have much experience with threading in Python.
Can someone tell me how I can scrape in parallel? Any help would be appreciated.
You can use Hadoop (http://hadoop.apache.org/) to run your jobs in parallel. It is a very good tool for running parallel tasks, though it is heavyweight for a job of this size.
Try this solution. Note that a thread target's return value is discarded, so the results are collected into a shared list instead, and each thread is joined so the main thread waits for all downloads to finish:

import threading
import requests
from bs4 import BeautifulSoup

results = []

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # list.append is thread-safe, so the threads can share this list
    results.append(soup.find_all("a", {"class": "dev-link"}))

threads = [threading.Thread(target=fetch_links, args=(url,))
           for url in websites]
for t in threads:
    t.start()
for t in threads:
    t.join()
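If you are on Python 3, concurrent.futures does the same thing with less bookkeeping. A minimal sketch, reusing the dev-link selector from the question (the max_workers value and timeout are illustrative choices, not requirements):

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    # Each call blocks on network I/O, so a thread pool overlaps the downloads
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("a", {"class": "dev-link"})

websites = []  # fill in the URLs to scrape

# map() submits one task per URL and yields results in input order
with ThreadPoolExecutor(max_workers=8) as executor:
    all_links = list(executor.map(fetch_links, websites))
```

The pool also takes care of joining the worker threads when the with block exits.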
Downloading web page content via requests.get() is a blocking I/O operation, so Python threading can actually improve performance here despite the GIL: while one thread waits on the network, others can run.
If you want to use multithreading, then:
import threading
import requests
from bs4 import BeautifulSoup

class Scraper(threading.Thread):
    def __init__(self, thread_id, name, url):
        threading.Thread.__init__(self)
        self.id = thread_id
        self.name = name
        self.url = url
        self.links = []

    def run(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # run() cannot return a value to the caller, so store the result
        self.links = soup.find_all("a")

# list the websites in the below list
websites = []

threads = []
for i, url in enumerate(websites, 1):
    thread = Scraper(i, "thread" + str(i), url)
    # start() runs run() in a new thread; calling run() directly
    # would execute it sequentially in the current thread
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
    # print(thread.links)
This might be helpful.
When it comes to Python and scraping, Scrapy is probably the way to go.
Scrapy uses the Twisted networking library for parallelism, so you don't have to worry about threading and the Python GIL.
If you must use BeautifulSoup, check this library out.