I am trying to get the original url from requests
. Here is what I have so far:
res = requests.get(...)
url = urllib.unquote(res.url).decode('utf8')
I then get an error that says:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)
The original url I requested is:
https://www.microsoft.com/de-at/store/movies/american-pie-pr\xc3\xa4sentiert-nackte-tatsachen/8d6kgwzl63ql
And here is what happens when I try printing:
>>> print '111', res.url
111 https://www.microsoft.com/de-at/store/movies/american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '222', urllib.unquote( res.url )
222 https://www.microsoft.com/de-at/store/movies/american-pie-präsentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '333', urllib.unquote(res.url).decode('utf8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)
Why is this occurring, and how would I fix this?
UnicodeEncodeError: 'ascii' codec can't encode characters
You are trying to decode a string that is Unicode already. It raises AttributeError
on Python 3 (unicode string has no .decode()
method there). Python 2 tries to encode the string into bytes first using sys.getdefaultencoding()
('ascii'
) before passing it to .decode('utf8')
which leads to UnicodeEncodeError
.
In short, do not call .decode()
on Unicode strings, use this instead:
print urllib.unquote(res.url.encode('ascii')).decode('utf-8')
Without .decode()
call, the code prints bytes (assuming a bytestring is passed to unquote()
) that may lead to mojibake if the character encoding used by your environment is not utf-8. To avoid mojibake, always print Unicode (don't print text as bytes), do not hardcode the character encoding of your environment inside your script i.e., .decode()
is necessary here.
There is a bug in urllib.unquote()
if you pass it a Unicode string:
>>> print urllib.unquote(u'%C3%A4')
ä
>>> print urllib.unquote('%C3%A4') # utf-8 output
ä
Pass bytestrings to unquote()
on Python 2.