How can I download multiple links simultaneously? My script below works but only downloads one at a time and it is extremely slow. I can't figure out how to incorporate multithreading in my script.
The Python script:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'w').write(converted)
    print(name)
The HTML file, called links.html:
<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
As of 2017 there are some other options, like asyncio and ThreadPoolExecutor.
Here is an example using ThreadPoolExecutor (included in Python's concurrent.futures module).
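A minimal sketch, assuming Python 3, adapted from the pattern shown in the concurrent.futures documentation; the URL list and worker count here are just placeholders:

import concurrent.futures
import urllib.request

# Placeholder list of URLs to fetch; substitute your own links.
URLS = ['http://www.youversion.com/bible/gen.%d.nmv-fas' % i for i in range(1, 5)]

def load_url(url, timeout):
    # Download a single URL and return its body.
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# Run up to 5 downloads at a time; each submit() queues one task.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))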
The submit() function submits a task to the executor's internal queue (queue management is done for you).
You can set max_workers to a few times the number of CPU cores in practice; do some tests to see how far you can go, depending on context-switching overhead.
For more info: https://docs.python.org/3/library/concurrent.futures.html
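Since asyncio was mentioned above, here is a rough equivalent for comparison. This sketch assumes Python 3.7+ and the third-party aiohttp package (not part of the standard library):

import asyncio
import aiohttp

# Placeholder URL list; substitute your own links.
URLS = ['http://www.youversion.com/bible/gen.%d.nmv-fas' % i for i in range(1, 5)]

async def fetch(session, url):
    # Each coroutine downloads one page without blocking the others.
    async with session.get(url) as response:
        return await response.read()

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, url) for url in URLS))
        for url, data in zip(URLS, results):
            print(url, len(data), 'bytes')

asyncio.run(main())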
It looks to me like the consumer-producer problem; see Wikipedia. You may use a thread-safe queue to hand the URLs to a pool of worker threads, along these lines:
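A minimal sketch, assuming Python 3; the URL list, worker count, and sentinel scheme are illustrative:

import queue
import threading
import urllib.request

# Placeholder URL list; substitute your own links.
URLS = ['http://www.youversion.com/bible/gen.%d.nmv-fas' % i for i in range(1, 5)]

q = queue.Queue()

def worker():
    # Consumer: pull URLs off the queue until a sentinel arrives.
    while True:
        url = q.get()
        if url is None:  # sentinel value: no more work
            break
        data = urllib.request.urlopen(url).read()
        print(url, len(data), 'bytes')
        q.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

# Producer: enqueue all the URLs, then one sentinel per worker.
for url in URLS:
    q.put(url)
for _ in threads:
    q.put(None)

for t in threads:
    t.join()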
I use multiprocessing for parallelizing things -- for some reason I like it better than threading.
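A minimal sketch of that approach, assuming Python 3; the URL list and process count are placeholders:

import multiprocessing
import urllib.request

# Placeholder URL list; substitute your own links.
URLS = ['http://www.youversion.com/bible/gen.%d.nmv-fas' % i for i in range(1, 5)]

def download(url):
    # Runs in a separate worker process; returns the page body.
    return urllib.request.urlopen(url).read()

if __name__ == '__main__':
    # Pool.map fans the URLs out across 4 worker processes.
    with multiprocessing.Pool(processes=4) as pool:
        for url, data in zip(URLS, pool.map(download, URLS)):
            print(url, len(data), 'bytes')

Note that for I/O-bound work like downloading, threads are usually sufficient; processes mainly pay off when parsing dominates, since they sidestep the GIL at the cost of pickling data between processes.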