I am trying to use urllib3 in simple thread to fetch several wiki pages. The script will
Create 1 connection for every thread (I don't understand why) and Hang forever. Any tip, advice or simple example of urllib3 and threading
import threadpool
from urllib3 import connection_from_url
HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True)
def fetch(url, fiedls):
kwargs={'retries':6}
return HTTP_POOL.get_url(url, fields, **kwargs)
pool = threadpool.ThreadPool(5)
requests = threadpool.makeRequests(fetch, iterable)
[pool.putRequest(req) for req in requests]
@Lennart's script got this error:
http://en.wikipedia.org/wiki/2010-11_Premier_LeagueTraceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
http://en.wikipedia.org/wiki/List_of_MythBusters_episodeshttp://en.wikipedia.org/wiki/List_of_Top_Gear_episodes http://en.wikipedia.org/wiki/List_of_Unicode_characters result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
After adding import threadpool; import urllib3
and tpool = threadpool.ThreadPool(4)
@user318904's code got this error:
Traceback (most recent call last):
File "crawler.py", line 21, in <module>
tpool.map_async(fetch, urls)
AttributeError: ThreadPool instance has no attribute 'map_async'
Here is my take, a more current solution using Python3 and
concurrent.futures.ThreadPoolExecutor
.Some remarks
Python Cookbook
by Beazley and Jones.urllib3
.download
(like printing, saving to a file, etc.), there is no additional effort in joining the threads.ThreadPoolExecutor.submit
actually returns whateverdownload
would return, wrapped in aFuture
.HTTPConnection
's in a connection pool (viamaxsize
). Otherwise you might encounter (harmless) warnings when all threads try to access the same server (as in the example).Obviously it will create one connection per thread, how should else each thread be able to fetch a page? And you try to use the same connection, made from one url, for all urls. That can hardly be what you intended.
This code worked just fine:
Thread programming is hard, so I wrote workerpool to make exactly what you're doing easier.
More specifically, see the Mass Downloader example.
To do the same thing with urllib3, it looks something like this:
For more sophisticated code, have a look at workerpool.EquippedWorker (and the tests here for example usage). You can make the pool be the
toolbox
you pass in.I use something like this: