Unicode search not working

Consider this.

# -*- coding: utf-8 -*-
import re

data = "cdbsb \xe2\x80\xa6 abc"
print data
# prints: cdbsb … abc
print re.findall(ur"[\u2026]", data)
# prints: []

Why can't re find this Unicode character? I have already checked that the bytes in data are its UTF-8 encoding:

\xe2\x80\xa6 === … === U+2026

Tags: python regex python-2.7 python-unicode

4 Answers

祖国的老花朵

#2 · 2019-04-16 13:42

An alternative solution:

>>> data = "cdbsb \xe2\x80\xa6 abc"
>>> print data 
cdbsb … abc
>>> if u"\u2026".encode('utf8') in data: print True
... 
True
>>> if u"\u2026" in data.decode('utf8'): print True
... 
True


在下西门庆

#3 · 2019-04-16 13:47

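data is a str (a byte string) that holds the UTF-8 encoding of the character, written as hex escapes, while the search pattern is a unicode object. print writes the bytes of a str to the console unchanged, and the console interprets them in its own encoding, which sys.stdout.encoding reports. My console is not UTF-8, so printing data as it is gives different output from printing data.decode('utf-8'). I am using Python 2.7.

import re
import sys

data = "cdbsb \xe2\x80\xa6 abc"
search = ur"[\u2026]"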

print sys.stdout.encoding
## windows-1254

print data, type(data)
## cdbsb â€¦ abc <type 'str'>

print data.decode(sys.stdout.encoding)
## cdbsb â€¦ abc

print data.decode('utf-8')
## cdbsb … abc

print search, type(search)
## […] <type 'unicode'>

print re.findall(search, data.decode('utf-8'))
## [u'\u2026']
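
The garbled â€¦ output above can be reproduced without a Windows console. This is a minimal sketch, assuming the three bytes are decoded with Python's cp1254 codec (the codec behind the windows-1254 console shown above); the variable names are only for illustration, and the b"" and u"" prefixes let it run on both Python 2.7 and Python 3.

```python
# The UTF-8 encoding of U+2026 (horizontal ellipsis) is three bytes.
raw = b"\xe2\x80\xa6"

# Decoding with the codec the bytes were written in gives one character.
as_utf8 = raw.decode("utf-8")

# Decoding with the console's codec (assumed here to be cp1254) maps
# each byte to its own character, which is the garbled output.
as_cp1254 = raw.decode("cp1254")

print(repr(as_utf8))
print(repr(as_cp1254))
```

Only the first decoding gives text that a unicode pattern such as [\u2026] can match.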


戒情不戒烟

#4 · 2019-04-16 13:55

If you go through the link provided by nhahtdh, Solving Unicode Problems in Python 2.7, you can see that the original string was bytes while the pattern we were searching for was unicode. So it should never have worked.

encode(): Gets you from Unicode → bytes

decode(): Gets you from bytes → Unicode
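
The two directions can be checked with a short round trip. This is only a sketch with illustrative variable names, written so that it runs on Python 2.7 and Python 3 alike.

```python
# Text (unicode) form of the character from the question.
ellipsis_text = u"\u2026"

# encode(): unicode -> bytes
ellipsis_bytes = ellipsis_text.encode("utf-8")

# decode(): bytes -> unicode
round_trip = ellipsis_bytes.decode("utf-8")

print(repr(ellipsis_bytes))
print(round_trip == ellipsis_text)
```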

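With these two operations there are two ways to solve it: decode data to unicode before searching, or leave it as bytes and search for the bytes.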

# -*- coding: utf-8 -*-
import re

# Way 1: decode the bytes to unicode, then search with a unicode pattern.
data = "cdbsb \xe2\x80\xa6 abc".decode("utf-8")
print data
print re.findall(ur"[\u2026]", data)
# Encode the match back to UTF-8 bytes for printing.
print re.findall(ur"[\u2026]", data)[0].encode("utf-8")

# Way 2: leave the data as bytes and search for the byte sequence.
data1 = "cdbsb \xe2\x80\xa6 abc"
print data1
print re.findall(r"\xe2\x80\xa6", data1)[0]
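
Note that the byte search above drops the square brackets from the original pattern. A small sketch of why, using illustrative variable names and b"" literals so it runs on Python 2.7 and Python 3: inside a character class each byte is a separate alternative, so the class matches the three bytes one at a time instead of as one sequence.

```python
import re

data = b"cdbsb \xe2\x80\xa6 abc"

# Without brackets the pattern is the three-byte sequence.
as_sequence = re.findall(b"\xe2\x80\xa6", data)

# With brackets the pattern is "any one of these three bytes".
as_class = re.findall(b"[\xe2\x80\xa6]", data)

print(repr(as_sequence))
print(repr(as_class))
```

So a character class is only safe for non-ASCII characters once the data has been decoded to unicode.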


Melony?

#5 · 2019-04-16 13:56

My guess is that the issue is that data is a byte string. Your console encoding is probably UTF-8, so when you print the string the console interprets its bytes as UTF-8 and shows the character … (you can check the encoding with sys.stdout.encoding).

But re does not do this decoding for you.

If you decode data from UTF-8 into a unicode object, you get the desired result from re.findall. Example:

>>> import re
>>> data = "cdbsb \xe2\x80\xa6 abc"
>>> print re.findall(ur"[\u2026]", data.decode('utf-8') )
[u'\u2026']
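
If the same search is needed in several places, the decoding step can be wrapped once. This is a sketch only: find_in_utf8 is a hypothetical helper name, not part of re, and it assumes byte input is UTF-8 unless another encoding is passed. It runs on Python 2.7 and Python 3.

```python
import re

def find_in_utf8(pattern, data, encoding="utf-8"):
    """Hypothetical helper: decode byte input, then run re.findall."""
    # On Python 2.7, bytes is an alias for str, so this check also
    # catches plain byte strings like the one in the question.
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return re.findall(pattern, data)

matches = find_in_utf8(u"[\u2026]", b"cdbsb \xe2\x80\xa6 abc")
print(repr(matches))
```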
