Determine if a URL returns a PDF or an HTML file

Posted 2020-03-24 05:04

I am requesting URLs using the requests package in Python (e.g. file = requests.get(url)). The URLs do not include a file extension, and sometimes an HTML file is returned and sometimes a PDF.

Is there a way of determining whether the returned file is a PDF or HTML (or, more generally, what the file format is)? The browser is able to tell, so I assume it must be indicated somewhere in the response.

1 answer
劫难
#2 · 2020-03-24 05:21

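The format is given in the response's Content-Type header, which will be text/html for an HTML page or application/pdf for a PDF. The value may carry extra parameters (e.g. text/html; charset=utf-8), so test for the media type rather than comparing the whole string: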

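 import requests

 r = requests.get('http://example.com/file')
 # The header may be missing, so default to an empty string.
 content_type = r.headers.get('content-type', '').lower()

 if 'application/pdf' in content_type:
     ext = '.pdf'
 elif 'text/html' in content_type:
     ext = '.html'
 else:
     ext = ''
     print('Unknown type: {}'.format(content_type))

 # Without stream=True the body has already been read, so r.raw.read()
 # would return nothing; r.content holds the downloaded bytes.
 with open('myfile' + ext, 'wb') as f:
     f.write(r.content)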
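
Servers sometimes omit the Content-Type header or send a generic value such as application/octet-stream. As a rough fallback, and only as a sketch, you could also look at the first bytes of the body: PDF files begin with %PDF-, and HTML usually starts with a doctype or an <html> tag. The helper name guess_extension and the 1024-byte sniffing window below are my own choices for illustration, not part of requests.

```python
def guess_extension(content_type, body):
    """Return '.pdf', '.html', or '' from a Content-Type value and body bytes."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = (content_type or '').split(';')[0].strip().lower()
    if media_type == 'application/pdf':
        return '.pdf'
    if media_type in ('text/html', 'application/xhtml+xml'):
        return '.html'

    # Header missing or unhelpful: inspect the start of the body instead.
    head = body[:1024].lstrip().lower()
    if head.startswith(b'%pdf-'):
        return '.pdf'
    if head.startswith(b'<!doctype html') or b'<html' in head:
        return '.html'
    return ''
```

With a response r from requests.get, this would be called as guess_extension(r.headers.get('content-type'), r.content). The header is checked first because it is what the server declares; the byte check is only a heuristic and can misclassify unusual files.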