urllib2.urlopen cannot get image, but browser can

Posted 2019-03-03 10:42

Question:

There is a link to a GIF image, but urllib2 can't download it.

import urllib.request as urllib2  # Python 3 module, aliased to the old Python 2 name

uri = 'http://ums.adtechjp.com/mapuser?providerid=1074;userid=AapfqIzytwl7ks8AA_qiU_BNUs8AAAFYqnZh4Q'
file = None
try:
    req = urllib2.Request(uri, headers={'User-Agent': 'Mozilla/5.0'})
    file = urllib2.urlopen(req)
except urllib2.HTTPError as err:
    print('HTTP error!!!')
    file = err  # an HTTPError can be read like a normal response
    print(err.code)
except urllib2.URLError as err:
    print('URL error!!!')
    print(err.reason)

# Read only if a response (or an HTTPError) was actually obtained.
if file is not None:
    data = file.read(1024)
    print(data)

After the script finishes, data is still empty. Why does this happen? No HTTPError is raised, and in the browser console I can see a valid GIF and an HTTP response status of 200 OK. Thank you.

Answer 1:

Check all the headers that the browser sends to the server.

This page needs two headers: User-Agent and Cookie.

If you use DevTools in Chrome or Firefox, you will see that the browser (if it has no cookie yet) first receives a response that sets a cookie and returns 302 Moved Temporarily, redirecting to the same URL. The browser then repeats the request with the cookie and receives the image.
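The same first response can be inspected from Python instead of DevTools. This is only a rough sketch: `NoRedirect` and `first_response` are hypothetical helper names, and it assumes the server still behaves as described above. It stops urllib from following the redirect so that the status code, Set-Cookie and Location headers of the first response become visible.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow redirects (hypothetical helper)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response
        # instead of silently requesting the new URL.
        return None


def first_response(uri):
    """Return (status code, Set-Cookie, Location) of the first response only."""
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(uri, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        resp = opener.open(req)
    except urllib.error.HTTPError as err:
        resp = err  # a 3xx/4xx/5xx response still carries its headers
    try:
        return (resp.code,
                resp.headers.get('Set-Cookie'),
                resp.headers.get('Location'))
    finally:
        resp.close()
```

Calling `first_response(uri)` with the URL from the question should, if the description above holds, return a 302 status together with the cookie and the redirect target.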

You can try my cookie and it may return the image. But normally you have to make two requests: the first to get the cookie and the second (with the cookie) to get the image.

import urllib.request as urllib2

uri = 'http://ums.adtechjp.com/mapuser?providerid=1074;userid=AapfqIzytwl7ks8AA_qiU_BNUs8AAAFYqnZh4Q'

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'JEB2=583077046E650E2495131DE8FD2F1371',
}

f = None
try:
    req = urllib2.Request(uri, headers=headers)
    f = urllib2.urlopen(req)
except urllib2.HTTPError as err:
    print('HTTP error!!!')
    f = err  # an HTTPError can be read like a normal response
    print(err.code)
except urllib2.URLError as err:
    print('URL error!!!')
    print(err.reason)

# Read only if a response (or an HTTPError) was actually obtained.
if f is not None:
    data = f.read(1024)
    print(data)
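Instead of hardcoding a cookie, the two-request flow can be left to the standard library. The following is a minimal sketch, not a tested fix for this particular server: `build_cookie_opener` and `fetch` are hypothetical helper names. It uses `http.cookiejar` so that the cookie set by the first response is stored and sent again when urllib follows the redirect.

```python
import http.cookiejar
import urllib.error
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Same User-Agent as in the answer; sent with every request.
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    return opener, jar


def fetch(uri, size=1024):
    """Fetch up to `size` bytes; the jar carries the cookie across the redirect."""
    opener, jar = build_cookie_opener()
    try:
        with opener.open(uri) as f:
            return f.read(size)
    except urllib.error.HTTPError as err:
        print('HTTP error:', err.code)
    except urllib.error.URLError as err:
        print('URL error:', err.reason)
    return b''
```

With this, `print(fetch(uri))` takes the place of the manual `Cookie` header; whether the image actually comes back still depends on the server behaving as described.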

If you use the requests module, it handles all of this automatically, so you don't need to make the two requests yourself.

import requests

uri = 'http://ums.adtechjp.com/mapuser?providerid=1074;userid=AapfqIzytwl7ks8AA_qiU_BNUs8AAAFYqnZh4Q'

headers = {
    'User-Agent': 'Mozilla/5.0',
}

r = requests.get(uri, headers=headers)

print(r.content)