Question:
Using urllib2, we can get the http response from a web server. If that server simply holds a list of files, we could parse through the files and download each individually. However, I'm not sure what the easiest, most pythonic way to parse through the files would be.
When you get a whole http response of the generic file server list, through urllib2's urlopen() method, how can we neatly download each file?
Answer 1:
Urllib2 might be OK to retrieve the list of files. For downloading large numbers of binary files, PycURL (http://pycurl.sourceforge.net/) is a better choice. This works for my IIS-based file server:
import re
import urllib2
import pycurl

url = "http://server.domain/"
path = "path/"
pattern = '<A HREF="/%s.*?">(.*?)</A>' % path

# Fetch the directory listing and pull the filenames out of the anchors.
response = urllib2.urlopen(url + path).read()
for filename in re.findall(pattern, response):
    # Stream each file straight to disk with pycurl.
    fp = open(filename, "wb")
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url + path + filename)
    curl.setopt(pycurl.WRITEDATA, fp)
    curl.perform()
    curl.close()
    fp.close()
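The regex above can be checked without touching a server. A minimal sketch, assuming a hypothetical IIS-style listing fragment (the HTML snippet and filenames are invented for illustration):

```python
import re

path = "path/"
# A made-up fragment of an IIS directory listing for this path.
listing = ('<A HREF="/path/report.pdf">report.pdf</A><br>'
           '<A HREF="/path/data.csv">data.csv</A>')

# Same pattern as above: capture the link text of anchors under /path/.
pattern = '<A HREF="/%s.*?">(.*?)</A>' % path
print(re.findall(pattern, listing))  # ['report.pdf', 'data.csv']
```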
Answer 2:
You can use urllib.urlretrieve (in Python 3.x: urllib.request.urlretrieve):
import urllib
urllib.urlretrieve('http://site.com/', filename='filez.txt')
This should work :)
And here is a function that can do the same thing (using urllib):
def download(url):
    webFile = urllib.urlopen(url)
    # Open in binary mode so non-text files are not corrupted.
    localFile = open(url.split('/')[-1], 'wb')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()
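For clarity, this is how the function above derives the local filename, plus a caveat worth knowing (the example URLs are invented):

```python
# The last path component becomes the local filename:
url = "http://site.com/dir/filez.txt"
print(url.split('/')[-1])  # 'filez.txt'

# Caveat: a URL ending in a slash yields an empty name,
# so a real script should check for that case.
print("http://site.com/dir/".split('/')[-1])  # ''
```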
Answer 3:
Can you guarantee that the URL you're requesting is a directory listing? If so, can you guarantee the format of the directory listing?
If so, you could use lxml to parse the returned document and find all of the elements that hold the path to a file, then iterate over those elements and download each file.
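As a sketch of this idea using only the standard library in place of lxml (the HTML fragment is invented; with lxml itself you would do something like `lxml.html.fromstring(doc).xpath('//a/@href')`):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# An invented directory-listing fragment.
page = '<a href="a.txt">a.txt</a> <a href="b.bin">b.bin</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['a.txt', 'b.bin']
```

Once you have the list of hrefs, iterating over them and downloading each one works the same as in the other answers.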
Answer 4:
Here's an untested solution:

import urllib2

# Fetch a file that contains one URL per line.
response = urllib2.urlopen('http://server.com/file.txt')
urls = response.read().replace('\r', '').split('\n')

for file in urls:
    print 'Downloading ' + file
    response = urllib2.urlopen(file)
    # Save under the last path component, in binary mode.
    handle = open(file.split('/')[-1], 'wb')
    handle.write(response.read())
    handle.close()

It's untested, so it may need tweaking. This assumes you have an actual list of file URLs inside another file. Good luck!
Answer 5:
1. Download the index file. If it's really huge, it may be worth reading a chunk at a time; otherwise it's probably easier to just grab the whole thing into memory.

2. Extract the list of files to get. If the list is XML or HTML, use a proper parser; if there is a lot of string processing to do, use regexes; otherwise simple string methods will do. Again, you can parse it all at once or incrementally. Incremental parsing is somewhat more efficient and elegant, but unless you are processing many tens of thousands of lines it's probably not critical.

3. Download each file and save it. If you want to speed things up, you could try running multiple download threads; another (significantly faster) approach is to delegate the work to a dedicated downloader program like Aria2 (http://aria2.sourceforge.net/). Note that Aria2 can be run as a service and controlled via XML-RPC; see http://sourceforge.net/apps/trac/aria2/wiki/XmlrpcInterface#InteractWitharia2UsingPython
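The three steps above could be sketched like this in modern Python 3 (urllib.request replaces urllib2; the index format, a plain list of URLs one per line, and the example URLs are assumptions for illustration):

```python
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlopen, urlretrieve

def parse_index(text):
    """Step 2: extract one URL per non-empty line (assumed index format)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def local_name(url):
    """Derive a local filename from the last path component of the URL."""
    return posixpath.basename(urlsplit(url).path)

def mirror(index_url):
    """Step 1: fetch the index; step 3: download each listed file."""
    text = urlopen(index_url).read().decode('utf-8', 'replace')
    for url in parse_index(text):
        urlretrieve(url, local_name(url))

# The pure helpers can be exercised without any network access:
print(parse_index("http://server.com/a.txt\n\nhttp://server.com/b.bin\n"))
print(local_name("http://server.com/dir/b.bin"))  # 'b.bin'
```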
Answer 6:
My suggestion would be to use BeautifulSoup (which is an HTML/XML parser) to parse the page for a list of files. Then, pycURL would definitely come in handy.
Another method, after you've got the list of files, is to use urllib.urlretrieve in a way similar to wget in order to simply download the file to a location on your filesystem.
Answer 7:
This is a non-conventional way, but it works:

fPointer = open(picName, 'wb')
self.curl.setopt(self.curl.WRITEFUNCTION, fPointer.write)

And this is the more conventional way:

urllib.urlretrieve(link, picName)