I am trying to scrape pages concurrently using the selenium and multiprocessing modules. Roughly, my approach is:
- create a queue holding as many webdriver instances as there are workers
- create a pool of workers
- each worker pulls a webdriver instance from the queue
- when the function returns, the webdriver instance is put back on the queue
Here is the code:
#!/usr/bin/env python
# encoding: utf-8
import time
import codecs
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from multiprocessing import Pool
from Queue import Queue
def download_and_save(link_tuple):
    link_id, link = link_tuple
    print link_id
    w = q.get()
    with codecs.open('%s.html' % link_id, 'w', encoding='utf-8') as f:

def main(num_processes):
    links = [
    n = len(links)
    link_tuples = [(link_id, link) for link_id, link in zip(xrange(n), links)]
    pool = Pool(num_processes)
    pool.map(download_and_save, link_tuples)

if __name__ == '__main__':
    num_processes = 2
    q = Queue()
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
    for i in range(num_processes):
        w = webdriver.PhantomJS(desired_capabilities=dcap)
This script runs, but the saved HTML files are either duplicated or missing.
Here is a different approach that I've had success with: you keep your workers in __main__, and the workers pull from the task_q.
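A minimal sketch of that idea, not a drop-in replacement for the script in the question. It is written for Python 3, and several names are my own assumptions: `FakeDriver` is a hypothetical stand-in so the example runs without a browser, and `worker`, `scrape`, `done_q`, and `STOP` are illustrative names. Each worker process creates its own driver once and reuses it for every task it takes from `task_q`, so no driver is ever shared between processes.

```python
import multiprocessing as mp

STOP = None  # sentinel: a worker exits when it pulls this from task_q


class FakeDriver(object):
    """Hypothetical stand-in for a selenium webdriver (assumption, not selenium API)."""

    def get(self, url):
        # A real driver would load the page; here we just fake its source.
        self.page_source = u"<html>%s</html>" % url

    def quit(self):
        pass


def worker(task_q, done_q, make_driver):
    """Create one driver for this process and serve tasks until STOP arrives."""
    driver = make_driver()
    try:
        while True:
            task = task_q.get()
            if task is STOP:
                break
            link_id, link = task
            driver.get(link)
            done_q.put((link_id, driver.page_source))
    finally:
        driver.quit()


def scrape(links, num_workers, make_driver=FakeDriver):
    """Start the workers, feed them (link_id, link) tasks, collect the pages."""
    task_q = mp.Queue()
    done_q = mp.Queue()
    workers = [mp.Process(target=worker, args=(task_q, done_q, make_driver))
               for _ in range(num_workers)]
    for p in workers:
        p.start()
    for task in enumerate(links):
        task_q.put(task)
    for _ in workers:
        task_q.put(STOP)  # one sentinel per worker
    # Collect every result before join() so no worker blocks on a full pipe.
    # Note: this sketch has no error handling; a crashed worker would hang it.
    pages = dict(done_q.get() for _ in links)
    for p in workers:
        p.join()
    return pages
```

To use it, call `scrape(links, 2)` under `if __name__ == '__main__':` with `links` as a list, then write each returned page to `'%s.html' % link_id` as in the question. With real selenium you would pass a `make_driver` function defined at module level that returns the webdriver, for example one built with `webdriver.PhantomJS(desired_capabilities=dcap)`.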