urllib2 file name

2019-01-08 12:43发布

站内文章 / Python

15 0

疯言疯语

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

If I open a file using urllib2, like so:

remotefile = urllib2.urlopen('http://example.com/somefile.zip')

Is there an easy way to get the file name other then parsing the original URL?

EDIT: changed openfile to urlopen... not sure how that happened.

EDIT2: I ended up using:

filename = url.split('/')[-1].split('#')[0].split('?')[0]

Unless I'm mistaken, this should strip out all potential queries as well.

回答1:

Did you mean urllib2.urlopen?

You could potentially lift the intended filename if the server was sending a Content-Disposition header by checking remotefile.info()['Content-Disposition'], but as it is I think you'll just have to parse the url.

You could use urlparse.urlsplit, but if you have any URLs like at the second example, you'll end up having to pull the file name out yourself anyway:

>>> urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
>>> urlparse.urlsplit('http://example.com/somedir/somefile.zip')
('http', 'example.com', '/somedir/somefile.zip', '', '')

Might as well just do this:

>>> 'http://example.com/somefile.zip'.split('/')[-1]
'somefile.zip'
>>> 'http://example.com/somedir/somefile.zip'.split('/')[-1]
'somefile.zip'

回答2:

If you only want the file name itself, assuming that there's no query variables at the end like http://example.com/somedir/somefile.zip?foo=bar then you can use os.path.basename for this:

[user@host]$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04) 
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.basename("http://example.com/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip?foo=bar")
'somefile.zip?foo=bar'

Some other posters mentioned using urlparse, which will work, but you'd still need to strip the leading directory from the file name. If you use os.path.basename() then you don't have to worry about that, since it returns only the final part of the URL or file path.

回答3:

I think that "the file name" isn't a very well defined concept when it comes to http transfers. The server might (but is not required to) provide one as "content-disposition" header, you can try to get that with remotefile.headers['Content-Disposition']. If this fails, you probably have to parse the URI yourself.

回答4:

Just saw this I normally do..

filename = url.split("?")[0].split("/")[-1]

回答5:

Using urlsplit is the safest option:

url = 'http://example.com/somefile.zip'
urlparse.urlsplit(url).path.split('/')[-1]

回答6:

Do you mean urllib2.urlopen? There is no function called openfile in the urllib2 module.

Anyway, use the urllib2.urlparse functions:

>>> from urllib2 import urlparse
>>> print urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')

Voila.

回答7:

You could also combine both of the two best-rated answers : Using urllib2.urlparse.urlsplit() to get the path part of the URL, and then os.path.basename for the actual file name.

Full code would be :

>>> remotefile=urllib2.urlopen(url)
>>> try:
>>>   filename=remotefile.info()['Content-Disposition']
>>> except KeyError:
>>>   filename=os.path.basename(urllib2.urlparse.urlsplit(url).path)

回答8:

The os.path.basename function works not only for file paths, but also for urls, so you don't have to manually parse the URL yourself. Also, it's important to note that you should use result.url instead of the original url in order to follow redirect responses:

import os
import urllib2
result = urllib2.urlopen(url)
real_url = urllib2.urlparse.urlparse(result.url)
filename = os.path.basename(real_url.path)

回答9:

I guess it depends what you mean by parsing. There is no way to get the filename without parsing the URL, i.e. the remote server doesn't give you a filename. However, you don't have to do much yourself, there's the urlparse module:

In [9]: urlparse.urlparse('http://example.com/somefile.zip')
Out[9]: ('http', 'example.com', '/somefile.zip', '', '', '')

回答10:

not that I know of.

but you can parse it easy enough like this:

url = 'http://example.com/somefile.zip'
print url.split('/')[-1]

回答11:

using requests, but you can do it easy with urllib(2)

import requests
from urllib import unquote
from urlparse import urlparse

sample = requests.get(url)

if sample.status_code == 200:
    #has_key not work here, and this help avoid problem with names

    if filename == False:

        if 'content-disposition' in sample.headers.keys():
            filename = sample.headers['content-disposition'].split('filename=')[-1].replace('"','').replace(';','')

        else:

            filename = urlparse(sample.url).query.split('/')[-1].split('=')[-1].split('&')[-1]

            if not filename:

                if url.split('/')[-1] != '':
                    filename = sample.url.split('/')[-1].split('=')[-1].split('&')[-1]
                    filename = unquote(filename)

回答12:

You probably can use simple regular expression here. Something like:

In [26]: import re
In [27]: pat = re.compile('.+[\/\?#=]([\w-]+\.[\w-]+(?:\.[\w-]+)?$)')
In [28]: test_set 

['http://www.google.com/a341.tar.gz',
 'http://www.google.com/a341.gz',
 'http://www.google.com/asdasd/aadssd.gz',
 'http://www.google.com/asdasd?aadssd.gz',
 'http://www.google.com/asdasd#blah.gz',
 'http://www.google.com/asdasd?filename=xxxbl.gz']

In [30]: for url in test_set:
   ....:     match = pat.match(url)
   ....:     if match and match.groups():
   ....:         print(match.groups()[0])
   ....:         

a341.tar.gz
a341.gz
aadssd.gz
aadssd.gz
blah.gz
xxxbl.gz

回答13:

Using PurePosixPath which is not operating system—dependent and handles urls gracefully is the pythonic solution:

>>> from pathlib import PurePosixPath
>>> path = PurePosixPath('http://example.com/somefile.zip')
>>> path.name
'somefile.zip'
>>> path = PurePosixPath('http://example.com/nested/somefile.zip')
>>> path.name
'somefile.zip'

Notice how there is no network traffic here or anything (i.e. those urls don't go anywhere) - just using standard parsing rules.

回答14:

import os,urllib2
resp = urllib2.urlopen('http://www.example.com/index.html')
my_url = resp.geturl()

os.path.split(my_url)[1]

# 'index.html'

This is not openfile, but maybe still helps :)

标签： python url urllib2

疯言疯语

女 | 书童

私信

收藏的人(0)

Ta的文章更多文章

0条评论

还没有人评论过~