I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd'
and I want to decode this string.
I used urllib.unquote_plus(str)
but it works wrong.
- expected :
çöasd+fjkls%asd
- result :
çöasd fjkls%asd
double coded utf-8 characters(%C3%A7
and %C3%B6
) are decoded wrong.
My python version is 2.7 under a linux distro.
What is the best way to get expected result?
You have 3 or 4 or 5 problems ... but repr()
and unicodedata.name()
are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar
.
Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
If as you say you start off with a unicode object:
>>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
>>> print repr(s0)
u'%C3%A7%C3%B6asd+fjkls%25asd'
this is an accidental nonsense. If you apply urllibX.unquote_YYYY()
to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd'
) which would cause your shown symptoms when printed. You should convert your original unicode object to a str object immediately:
>>> s1 = s0.encode('ascii')
>>> print repr(s1)
'%C3%A7%C3%B6asd+fjkls%25asd'
then you should unquote it:
>>> import urllib2
>>> s2 = urllib2.unquote(s1)
>>> print repr(s2)
'\xc3\xa7\xc3\xb6asd+fjkls%asd'
Looking at the first 4 bytes of that, it's encoded in UTF-8. If you do print s2
, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:
>>> s3 = s2.decode('utf8')
>>> print repr(s3)
u'\xe7\xf6asd+fjkls%asd'
and inspect it to see what we've actually got:
>>> import unicodedata
>>> for c in s3[:6]:
... print repr(c), unicodedata.name(c)
...
u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
u'a' LATIN SMALL LETTER A
u's' LATIN SMALL LETTER S
u'd' LATIN SMALL LETTER D
u'+' PLUS SIGN
Looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; I'm doing this portably and just happen to be doing this in a Command Prompt on Windows.
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print s3
çöasd+fjkls%asd
Note: the unicode object was explicitly encoded using sys.stdout.encoding. Fortunately all the unicode characters in s3
are representable in that encoding (and cp1252 and latin1).
Using either unquote
or unquote_plus
will give you a byte string. If you want a Unicode string then you have to decode the byte string to unicode:
>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd').decode('utf8'))
çöasd fjkls%asd
>>>
Compared with:
>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd'))
çöasd fjkls%asd
>>>
Note that your input string must be a byte string: if you pass unicode to unquote/unquote_plus
then you'll get a bit of a mess. If this is the case then encode it first:
>>> print(urllib.unquote_plus(u'%C3%A7%C3%B6asd+fjkls%25asd'.encode('ascii')).decode('utf8'))
çöasd fjkls%asd
Try urllib2
once more:
print urllib2.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
'%C3%A7%C3%B6asd+fjkls%25asd' - this is not a unicode string.
This is a url-encoded string. Use urllib2.unquote() instead.
You have a double problem: your string is unicode encoded and contains caracter urlencoded. Some match. You can normalize your string to ascci to be sure it won't be interpreted incorrectly:
>>> s = '%C3%A7%C3%B6asd+fjkls%25asd' # ascii string
>>> print urllib2.unquote(s) # works as expected
çöasd+fjkls%asd
>>> s = u'%C3%A7%C3%B6asd+fjkls%25asd' # unicode string
>>> print urllib2.unquote(s) # decode stuff that it shouldn't
çöasd+fjkls%asd
>>> print urllib2.unquote(s.encode('ascii')) # encode the unicode string to ascii: works!
çöasd+fjkls%asd
You are using unquote_plus
method which is taking space
into account and converting to +
. Just use unquote
method and you should be fine.
>>> import urllib
>>> print urllib.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd+fjkls%asd
>>> print urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd fjkls%asd