I need to download several files via HTTP in Python.
The most obvious way to do it is just using urllib2:
import urllib2
u = urllib2.urlopen('http://server.com/file.html')
localFile = open('file.html', 'w')
localFile.write(u.read())
localFile.close()
But I'll have to deal with URLs that are nasty in some way, like this one: http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf. When downloaded via the browser, the file gets a human-readable name, i.e. accounts.pdf.

Is there any way to handle that in Python, so I don't need to know the file names in advance and hardcode them into my script?
Kender: it is not safe -- the web server can send a wrongly formatted name such as ["file.ext] or [file.ext'], or even an empty one, and localName[0] will raise an exception. Correct code could look like this:
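Roughly, that kind of defensive handling of the header value could look like this (the regular expression and the fallback name are just one possible choice, not the original code):

import re

def filename_from_content_disposition(value, default='downloaded.file'):
    # value is the raw Content-Disposition header, e.g.
    # 'attachment; filename="accounts.pdf"'
    if not value:
        return default
    match = re.search(r'filename="?([^";]+)"?', value)
    if not match:
        return default
    name = match.group(1).strip().strip('"\'')
    # Refuse anything empty or containing path separators.
    if not name or '/' in name or '\\' in name:
        return default
    return name

This handles the cases mentioned above: unbalanced quotes are stripped, and a missing or empty filename falls back to the default instead of raising.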
Based on comments and @Oli's answer, I made a solution like this:
It takes the file name from Content-Disposition; if that header isn't present, it falls back to the filename from the URL (if a redirection happened, the final URL is taken into account).
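Roughly, that approach could look like this, using urllib2 as in the question (the function name and the header-parsing regex are illustrative):

import os
import re
import urllib2

def download_file(url):
    r = urllib2.urlopen(url)

    # Prefer the name the server suggests via Content-Disposition.
    filename = None
    cd = r.info().getheader('Content-Disposition')
    if cd:
        m = re.search(r'filename="?([^";]+)"?', cd)
        if m:
            filename = os.path.basename(m.group(1).strip().strip('"\''))

    # Otherwise fall back to the last segment of the final URL, so any
    # redirection is taken into account (geturl() is the URL that was
    # actually served, not necessarily the one we asked for).
    if not filename:
        filename = os.path.basename(r.geturl().split('?')[0]) or 'index.html'

    with open(filename, 'wb') as f:
        f.write(r.read())
    return filename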
Combining much of the above, here is a more pythonic solution:
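One way such a combined version might look, streaming the body to disk with shutil.copyfileobj instead of reading it all into memory (the name fetch and its defaults are illustrative):

import os
import re
import shutil
import urllib2
from contextlib import closing

def fetch(url, filename=None):
    with closing(urllib2.urlopen(url)) as r:
        if filename is None:
            cd = r.info().getheader('Content-Disposition') or ''
            m = re.search(r'filename="?([^";]+)"?', cd)
            if m:
                filename = os.path.basename(m.group(1).strip().strip('"\''))
            else:
                filename = os.path.basename(r.geturl().split('?')[0]) or 'index.html'
        # Copy the response to disk in chunks rather than buffering it all.
        with open(filename, 'wb') as f:
            shutil.copyfileobj(r, f)
    return filename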
Using wget:
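Assuming this refers to the third-party wget package (pip install wget), a minimal example:

import wget

url = 'http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf'
# wget.download() chooses an output file name on its own and returns it.
filename = wget.download(url)
print(filename)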
Using urlretrieve:
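A minimal example with Python 2's urllib.urlretrieve (urllib.request.urlretrieve in Python 3), picking the target name ourselves:

import urllib

url = 'http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf'
# urlretrieve saves the resource under the given name and returns
# (filename, headers); the headers can still be inspected for
# Content-Disposition afterwards if needed.
filename, headers = urllib.urlretrieve(url, 'accounts.pdf')
print(filename)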
urlretrieve also creates the directory structure if it doesn't exist.
Download scripts like that tend to push a header telling the user-agent what to name the file:
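Such a header typically looks like this (using the file name from the question as the value):

Content-Disposition: attachment; filename="accounts.pdf"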
If you can grab that header, you can get the proper filename.
There's another thread that has a little bit of code to offer up for Content-Disposition-grabbing.