How to download large file in python with requests

2018-12-31 06:34发布

Requests is a really nice library. I'd like to use it for download big files (>1GB). The problem is it's not possible to keep whole file in memory I need to read it in chunks. And this is a problem with the following code

import requests

def DownloadFile(url)
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=512 * 1024): 
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
    f.close()
    return 

By some reason it doesn't work this way. It still loads response into memory before save it to a file.

UPDATE

If you need a small client (Python 2.x /3.x) which can download big files from FTP, you can find it here. It supports multithreading & reconnects (it does monitor connections) also it tunes socket params for the download task.

4条回答
怪性笑人.
2楼-- · 2018-12-31 07:07

I figured out what should be changed. The trick was to set stream = True in the get() method.

After this python process stopped to suck memory (stays around 30kb regardless size of the download file).

Thank you @danodonovan for you syntax I use it here:

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                #f.flush() commented by recommendation from J.F.Sebastian
    return local_filename

See http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow for further reference.

查看更多
宁负流年不负卿
3楼-- · 2018-12-31 07:09

Not exactly what OP was asking, but... it's ridiculously easy to do that with urllib:

from urllib.request import urlretrieve
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
dst = 'ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, dst)

Or this way, if you want to save it to a temporary file:

from urllib.request import urlopen
from shutil import copyfileobj
from tempfile import NamedTemporaryFile
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
with urlopen(url) as fsrc, NamedTemporaryFile(delete=False) as fdst:
    copyfileobj(fsrc, fdst)

I watched the process:

watch 'ps -p 18647 -o pid,ppid,pmem,rsz,vsz,comm,args; ls -al *.iso'

And I saw the file growing, but memory usage stayed at 17 MB. Am I missing something?

查看更多
柔情千种
4楼-- · 2018-12-31 07:19

It's much easier if you use Response.raw and shutil.copyfileobj():

import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

    return local_filename

This streams the file to disk without using excessive memory, and the code is simple.

查看更多
人气声优
5楼-- · 2018-12-31 07:24

Your chunk size could be too large, have you tried dropping that - maybe 1024 bytes at a time? (also, you could use with to tidy up the syntax)

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return 

Incidentally, how are you deducing that the response has been loaded into memory?

It sounds as if python isn't flushing the data to file, from other SO questions you could try f.flush() and os.fsync() to force the file write and free memory;

    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())
查看更多
登录 后发表回答