Consider this.
# -*- coding: utf-8 -*-
import re

data = "cdbsb \xe2\x80\xa6 abc"
print data
# prints: cdbsb … abc
print re.findall(ur"[\u2026]", data)
Why can't re find this unicode character? I have already checked that \xe2\x80\xa6 === … === U+2026.
An alternative solution:
data is of str type and holds the character as raw bytes written as hex escapes, but the search term is of unicode type. The print statement encodes unicode objects with sys.stdout.encoding by default, while a str is written out as its raw bytes. When I try to print data as it is, the output differs from that of data.decode('utf-8'). I am using Python 2.7.
If you go through the link provided by nhahtdh, Solving Unicode Problems in Python 2.7, you can see that the original string was in bytes and we were searching for unicode, so it should never have worked. Following this, we can solve it in 2 ways.
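The two ways are not spelled out here, so the following is only a sketch of one option that seems plausible: keep everything as bytes by encoding the search term to UTF-8 and matching that byte sequence. The names ellipsis_bytes and byte_matches are made up for illustration, and the b"..." prefix is used so the snippet also parses on Python 3 (in Python 2.7 it is the same as a plain str).

```python
# -*- coding: utf-8 -*-
import re

# Same value as in the question; b"..." is a plain str in Python 2.7.
data = b"cdbsb \xe2\x80\xa6 abc"

# U+2026 is one code point as unicode, but three bytes once encoded as UTF-8.
ellipsis_bytes = u"\u2026".encode("utf-8")

# Byte pattern against byte data, so both sides have the same type.
# No [...] character class here: on bytes it would match each byte separately.
byte_matches = re.findall(re.escape(ellipsis_bytes), data)

print(len(byte_matches))
```

This only works as long as data really is UTF-8; if the bytes came from some other encoding, the encoded pattern would have to use that encoding too.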
My guess is that the issue is that data is a byte string. Your console encoding is probably utf-8 (you can check this with sys.stdout.encoding), so when the string is printed, the console interprets its bytes as utf-8 and shows the character … as a result. But most probably re does not do this decoding for you. If you decode data from utf-8 into a unicode string, you should get the desired result when using re.findall. Example -
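The original example is not shown here, so this is a minimal sketch of what the decoding approach might look like, not the answerer's exact code. It assumes data is valid UTF-8. The variable names text and matches are made up, and u"[\u2026]" is used in place of ur"[\u2026]" so the snippet also parses on Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Same value as in the question; b"..." is a plain str in Python 2.7.
data = b"cdbsb \xe2\x80\xa6 abc"

# Decode the UTF-8 bytes into a unicode string before searching.
text = data.decode("utf-8")

# Unicode pattern against unicode text, so the code point can be compared directly.
matches = re.findall(u"[\u2026]", text)

# Print the code points instead of the character itself,
# so the output does not depend on the console encoding.
print([hex(ord(ch)) for ch in matches])  # ['0x2026']
```

Decoding once at the point where the bytes enter the program, and working with unicode from then on, tends to avoid this kind of mismatch elsewhere as well.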