How to download a file with urllib3?

Posted 2019-03-21 16:59

Question:

This is based on another question on this site: What's the best way to download file using urllib3. However, I cannot comment there, so I am asking a new question:

How to download a (larger) file with urllib3?

I tried to use the same code that works with urllib2 (Download file from web in Python 3), but it fails with urllib3:

http = urllib3.PoolManager()

with http.request('GET', url) as r, open(path, 'wb') as out_file:       
    #shutil.copyfileobj(r.data, out_file) # this writes a zero file
    shutil.copyfileobj(r.data, out_file)

This fails with: 'bytes' object has no attribute 'read'

I then tried to use the code in that question, but it gets stuck in an infinite loop because data is always empty:

http = urllib3.PoolManager()
r = http.request('GET', url)

with open(path, 'wb') as out:
    while True:
        data = r.read(4096)         
        if data is None:
            break
        out.write(data)
r.release_conn()

However, if I read everything in memory, the file gets downloaded correctly:

http = urllib3.PoolManager()
r = http.request('GET', url)
with open(path, 'wb') as out:
    out.write(r.data)

I do not want to do this, as I may be downloading very large files. It is unfortunate that the urllib3 documentation does not cover the best practice on this topic.

(Also, please do not suggest requests or urllib2, because they are not flexible enough when it comes to self-signed certificates.)
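For context, this is roughly how I create the pool against a self-signed certificate (a minimal sketch; the CA bundle path is only a placeholder):

import urllib3

# Minimal sketch (placeholder path): verify a self-signed certificate by
# pointing urllib3 at the CA bundle that signed it.
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs='/path/to/self-signed-ca.pem',
)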

Answer 1:

You were very close; the piece that was missing is setting preload_content=False (this will be the default in an upcoming version). Also, you can treat the response as a file-like object rather than going through the .data attribute (which is a magic property that will hopefully be deprecated someday).

- with http.request('GET', url) ...
+ with http.request('GET', url, preload_content=False) ...

This code should work:

http = urllib3.PoolManager()

with http.request('GET', url, preload_content=False) as r, open(path, 'wb') as out_file:       
    shutil.copyfileobj(r, out_file)

urllib3's response object also respects the io interface, so you can do things like...

import io
response = http.request(..., preload_content=False)
buffered_response = io.BufferedReader(response, 2048)
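For example, here is a minimal sketch of iterating the body line by line through that buffered reader (this assumes a line-oriented text body, which is not stated above; url is assumed to be defined as in the question):

import io
import urllib3

http = urllib3.PoolManager()
response = http.request('GET', url, preload_content=False)
buffered_response = io.BufferedReader(response, 2048)

# BufferedReader provides the usual io conveniences: readline(), peek(), iteration, ...
for line in buffered_response:
    print(line)  # each line is a bytes object

response.release_conn()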

As long as you add preload_content=False to any of your three attempts and treat the response as a file-like object, they should all work.
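For instance, a minimal sketch of the chunked loop from the question with that change applied (note that read() returns an empty bytes object at the end of the stream rather than None, so the loop checks for emptiness; url and path are assumed to be defined as in the question):

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', url, preload_content=False)

with open(path, 'wb') as out:
    while True:
        data = r.read(4096)
        if not data:  # read() returns b'' once the body is exhausted
            break
        out.write(data)

r.release_conn()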

It is unfortunate that the urllib3 documentation does not cover the best practice on this topic.

You're totally right; I hope you'll consider helping us document this use case by sending a pull request here: https://github.com/shazow/urllib3