URLDecoding requests

2020-03-31 07:08发布

问题:

I am trying to get the original url from requests. Here is what I have so far:

res = requests.get(...)
url = urllib.unquote(res.url).decode('utf8') 

I then get an error that says:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)

The original url I requested is:

https://www.microsoft.com/de-at/store/movies/american-pie-pr\xc3\xa4sentiert-nackte-tatsachen/8d6kgwzl63ql

And here is what happens when I try printing:

>>> print '111', res.url
111 https://www.microsoft.com/de-at/store/movies/american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '222', urllib.unquote( res.url )
222 https://www.microsoft.com/de-at/store/movies/american-pie-präsentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '333', urllib.unquote(res.url).decode('utf8') 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)

Why is this occurring, and how would I fix this?

回答1:

UnicodeEncodeError: 'ascii' codec can't encode characters

You are trying to decode a string that is Unicode already. It raises AttributeError on Python 3 (unicode string has no .decode() method there). Python 2 tries to encode the string into bytes first using sys.getdefaultencoding() ('ascii') before passing it to .decode('utf8') which leads to UnicodeEncodeError.

In short, do not call .decode() on Unicode strings, use this instead:

print urllib.unquote(res.url.encode('ascii')).decode('utf-8')

Without .decode() call, the code prints bytes (assuming a bytestring is passed to unquote()) that may lead to mojibake if the character encoding used by your environment is not utf-8. To avoid mojibake, always print Unicode (don't print text as bytes), do not hardcode the character encoding of your environment inside your script i.e., .decode() is necessary here.


There is a bug in urllib.unquote() if you pass it a Unicode string:

>>> print urllib.unquote(u'​%C3%A4')
ä
>>> print urllib.unquote('​%C3%A4') # utf-8 output
ä

Pass bytestrings to unquote() on Python 2.