How to check whether a URL is a web page link or a file link

Posted 2020-03-05 02:11

Question:

Suppose I have the following links:

    http://example.com/index.html
    http://example.com/stack.zip
    http://example.com/setup.exe
    http://example.com/news/

In the links above, the first and fourth are web page links, while the second and third are file links.

The .zip and .exe links are only examples; there may be many other file types.

Is there a standard way to distinguish a file URL from a web page link? Thanks in advance.

Answer 1:

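import urllib  # Python 2 API; in Python 3, urlopen lives in urllib.request
import mimetypes


def guess_type_of(link, strict=True):
    # First guess the MIME type from the URL's extension (no network access).
    link_type, _ = mimetypes.guess_type(link)
    if link_type is None and strict:
        # No usable extension: ask the server for its Content-Type header.
        u = urllib.urlopen(link)
        link_type = u.headers.gettype() # or using: u.info().gettype()
    return link_type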
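
The function above relies on the Python 2 `urllib` API (`urllib.urlopen` and `headers.gettype()`). A minimal sketch of the same idea for Python 3 follows, assuming `urllib.request.urlopen` and the `get_content_type()` method of the response headers. The name `guess_type_py3` is hypothetical and not part of the original answer.

```python
import mimetypes
import urllib.request


def guess_type_py3(link, strict=True):
    """Return the MIME type of a link, or None if it cannot be determined."""
    # Step 1: guess from the file extension in the URL (no network access).
    link_type, _ = mimetypes.guess_type(link)
    if link_type is None and strict:
        # Step 2: fetch the URL and read the Content-Type the server reports.
        with urllib.request.urlopen(link) as response:
            link_type = response.headers.get_content_type()
    return link_type
```

With `strict=False` the function never touches the network and returns `None` for URLs without a recognizable extension, such as `http://example.com/news/`. Note that `get_content_type()` falls back to `text/plain` when the server sends no Content-Type header at all.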

Demo:

links = ['http://stackoverflow.com/q/21515098/538284', # an HTML page
         'http://upload.wikimedia.org/wikipedia/meta/6/6d/Wikipedia_wordmark_1x.png', # a PNG file
         'http://commons.wikimedia.org/wiki/File:Typing_example.ogv', # an HTML page
         'http://upload.wikimedia.org/wikipedia/commons/e/e6/Typing_example.ogv'   # an OGV file
]

for link in links:
    print(guess_type_of(link))

Output:

text/html
image/x-png
text/html
application/ogg
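
To get back to the page-versus-file distinction from the question, the MIME types printed above can be checked against a set of types treated as web pages. This is only a sketch under an assumption: which types count as "web pages" is a judgment call, and the `PAGE_TYPES` set and the `is_web_page` name below are illustrative, not part of the original answer.

```python
# Assumption: only HTML and XHTML responses are treated as web pages.
PAGE_TYPES = {'text/html', 'application/xhtml+xml'}


def is_web_page(mime_type):
    """Return True if the MIME type denotes a web page, False otherwise."""
    if mime_type is None:
        return False  # unknown type: cannot confirm it is a page
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base_type = mime_type.split(';', 1)[0].strip().lower()
    return base_type in PAGE_TYPES


for mime in ['text/html', 'image/x-png', 'application/ogg']:
    print(mime, '->', 'web page' if is_web_page(mime) else 'file')
```

Anything outside `PAGE_TYPES` is reported as a file here; depending on the use case, types such as `text/plain` might need their own rule.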


Answer 2:

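import urllib  # Python 2 API; in Python 3, use urllib.request.urlopen
mytest = urllib.urlopen('http://www.sec.gov')
mytest.headers.items()  # all response headers as (name, value) tuples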

[('content-length', '20833'), ('expires', 'Sun, 02 Feb 2014 19:36:12 GMT'), ('server', 'SEC'), ('connection', 'close'), ('cache-control', 'max-age=0'), ('date', 'Sun, 02 Feb 2014 19:36:12 GMT'), ('content-type', 'text/html')]

mytest.headers.items() is a list of tuples. In my example, the last item in the list describes the content.

I am not sure whether the length of the list varies, so rather than relying on the position you could iterate through it to find the tuple whose first element is 'content-type'.
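
The lookup described above can be sketched as a small helper that scans the list of tuples for the content-type entry, wherever it appears. The name `find_content_type` is hypothetical, and the sample data is a shortened copy of the header list shown in this answer.

```python
def find_content_type(header_items):
    """Return the value of the content-type header, or None if absent."""
    for name, value in header_items:
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == 'content-type':
            return value
    return None


headers = [('content-length', '20833'), ('server', 'SEC'),
           ('connection', 'close'), ('content-type', 'text/html')]
print(find_content_type(headers))  # prints: text/html
```

The headers object should also support a direct lookup such as `mytest.headers.get('content-type')`, which would avoid the loop entirely.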