Downloading files from an HTTP server in Python

Posted 2019-01-24 17:26

Question:

Using urllib2, we can get the HTTP response from a web server. If that server simply holds a list of files, we could parse through the listing and download each file individually. However, I'm not sure what the easiest, most Pythonic way to parse through that listing would be.

When you get the whole HTTP response for a generic file-server listing through urllib2's urlopen() method, how can we neatly download each file?

Answer 1:

urllib2 might be OK for retrieving the list of files, but for downloading large numbers of binary files, PycURL (http://pycurl.sourceforge.net/) is a better choice. This works for my IIS-based file server:

import re
import urllib2
import pycurl

url = "http://server.domain/"
path = "path/"
# Matches the file names in the IIS directory-listing HTML.
pattern = '<A HREF="/%s.*?">(.*?)</A>' % path

# Fetch the directory listing with urllib2.
response = urllib2.urlopen(url + path).read()

# Download each listed file with pycurl, writing straight to disk.
for filename in re.findall(pattern, response):
    fp = open(filename, "wb")
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url + path + filename)
    curl.setopt(pycurl.WRITEDATA, fp)
    curl.perform()
    curl.close()
    fp.close()


Answer 2:

You can use urllib.urlretrieve (in Python 3.x: urllib.request.urlretrieve):

import urllib
urllib.urlretrieve('http://site.com/', filename='filez.txt')

This should work :)
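
In Python 3.x the equivalent call would look like this (same example URL and filename as above):

import urllib.request

urllib.request.urlretrieve('http://site.com/', filename='filez.txt')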

And here is a function that does the same thing (using urllib):

import urllib

def download(url):
    # Save the resource under its own name (the last path component), in binary mode.
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'wb')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()


Answer 3:

Can you guarantee that the URL you're requesting is a directory listing? If so, can you guarantee the format of the directory listing?

If so, you could use lxml to parse the returned document and find all of the elements that hold the path to a file, then iterate over those elements and download each file.
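
For example, a minimal sketch along those lines (assuming lxml is installed; the listing URL here is made up, and the XPath simply grabs every <a> href):

import urllib2
import urlparse
import lxml.html

listing_url = 'http://server.domain/path/'  # hypothetical directory listing

# Parse the listing page and pull out every link target.
doc = lxml.html.fromstring(urllib2.urlopen(listing_url).read())
for href in doc.xpath('//a/@href'):
    # Resolve relative links against the listing URL before downloading.
    print urlparse.urljoin(listing_url, href)

Each resolved URL can then be downloaded with urlretrieve or pycurl as shown in the other answers.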



Answer 4:

Here's an untested solution:

import urllib2

# Read the list of URLs to fetch, one per line, from a file on the server.
response = urllib2.urlopen('http://server.com/file.txt')
urls = response.read().replace('\r', '').split('\n')

for file_url in urls:
    if not file_url:
        continue  # skip blank lines
    print 'Downloading ' + file_url

    response = urllib2.urlopen(file_url)

    # Save under the last path component, in binary mode.
    handle = open(file_url.split('/')[-1], 'wb')
    handle.write(response.read())
    handle.close()

It's untested, so treat it as a starting point. It assumes you have an actual list of file URLs, one per line, inside another file. Good luck!



Answer 5:

  1. Download the index file

    If it's really huge, it may be worth reading a chunk at a time; otherwise it's probably easier to just grab the whole thing into memory.

  2. Extract the list of files to get

    If the list is XML or HTML, use a proper parser; if there is a lot of string processing to do, use a regex; otherwise simple string methods will do.

    Again, you can parse it all at once or incrementally. Incremental parsing is somewhat more efficient and elegant, but unless you are processing tens of thousands of lines it's probably not critical.

  3. For each file, download it and save it to a file (a sketch combining the three steps follows this list).

    If you want to try to speed things up, you could try running multiple download threads;

    another (significantly faster) approach might be to delegate the work to a dedicated downloader program like Aria2 http://aria2.sourceforge.net/ - note that Aria2 can be run as a service and controlled via XMLRPC, see http://sourceforge.net/apps/trac/aria2/wiki/XmlrpcInterface#InteractWitharia2UsingPython
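
A rough sketch tying the three steps together (untested; the index URL, the link regex, and the thread count are assumptions you would adapt to your own server):

import re
import urllib2
import urlparse
from threading import Thread

INDEX_URL = 'http://server.domain/path/'        # assumed index page
LINK_RE = re.compile(r'href="([^"?]+)"', re.I)  # assumed listing format
NUM_THREADS = 4

def download(url):
    # Step 3: stream the file to disk a chunk at a time.
    response = urllib2.urlopen(url)
    with open(url.split('/')[-1], 'wb') as out:
        while True:
            chunk = response.read(64 * 1024)
            if not chunk:
                break
            out.write(chunk)

def worker(urls):
    for url in urls:
        download(url)

# Step 1: grab the index page.
index_html = urllib2.urlopen(INDEX_URL).read()

# Step 2: extract the links and resolve them to absolute URLs, skipping directories.
file_urls = [urlparse.urljoin(INDEX_URL, href)
             for href in LINK_RE.findall(index_html)
             if not href.endswith('/')]

# Step 3: split the work over a few download threads.
threads = [Thread(target=worker, args=(file_urls[i::NUM_THREADS],))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()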



Answer 6:

My suggestion would be to use BeautifulSoup (which is an HTML/XML parser) to parse the page for a list of files. Then, pycURL would definitely come in handy.

Another method, after you've got the list of files, is to use urllib.urlretrieve in a way similar to wget in order to simply download the file to a location on your filesystem.
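
A short sketch of that combination (assuming BeautifulSoup 4; the listing URL is made up, and urllib.urlretrieve stands in for pycurl on the download side):

import urllib
import urlparse
from bs4 import BeautifulSoup

listing_url = 'http://server.domain/path/'  # hypothetical directory listing

soup = BeautifulSoup(urllib.urlopen(listing_url).read(), 'html.parser')
for anchor in soup.find_all('a'):
    href = anchor.get('href')
    if not href or href.endswith('/'):
        continue  # skip parent- and sub-directory links
    file_url = urlparse.urljoin(listing_url, href)
    # wget-style: save under the file's own name in the current directory.
    urllib.urlretrieve(file_url, href.split('/')[-1])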



Answer 7:

This is a non-conventional way, but it works:

# Inside a class that keeps a pycurl.Curl instance as self.curl (URL already set):
fPointer = open(picName, 'wb')
self.curl.setopt(self.curl.WRITEFUNCTION, fPointer.write)


The conventional way is urllib.urlretrieve(link, picName).