How to match an emoticon in a sentence with regular expressions

Posted 2019-06-08 13:16

I'm using Python to process Weibo (a Twitter-like service in China) sentences. Some of the sentences contain emoticons whose Unicode code points are \ue317 and the like. To process a sentence, I need to encode it with gbk, see below:

 string1_gbk = string1.decode('utf-8').encode('gb2312')

This raises a UnicodeEncodeError: 'gbk' codec can't encode character u'\ue317'

I tried \\ue[0-9a-zA-Z]{3}, but it did not work. How could I match these emoticons in sentences?

Tags: python regex emoticons

3 answers

Summer. ? 凉城

Reply #2 · 2019-06-08 13:47

Try

string1_gbk = string1.decode('utf-8').encode('gb2312', 'replace')

Should output ? instead of those emoticons.
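To make the effect of the error handler concrete, here is a minimal sketch. It assumes Python 3 (where the text is already a `str`, so no `decode('utf-8')` step is needed), and the sample sentence is made up for illustration. It compares the default strict behaviour with the `'replace'` and `'ignore'` handlers of `str.encode`:

```python
# Python 3 sketch; the sample text is an assumption for illustration only.
text = "今天很开心\ue317"  # ends with a private-use character (the emoticon)

try:
    text.encode("gb2312")              # default errors="strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True               # the emoticon cannot be encoded

# "replace" substitutes "?" for each character the codec cannot encode.
replaced = text.encode("gb2312", "replace")
# "ignore" silently drops each character the codec cannot encode.
ignored = text.encode("gb2312", "ignore")

print(strict_failed)
print(replaced.decode("gb2312"))
print(ignored.decode("gb2312"))
```

`'replace'` keeps a visible marker where an emoticon used to be, while `'ignore'` removes it without a trace; which one fits depends on whether later processing needs to know that something was there.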

Python Docs - Python Wiki


来，给爷笑一个

Reply #3 · 2019-06-08 13:49

It may be because the backslash is a special escape character in regexp syntax. The following worked for me:

>>> import re
>>> # raw string: the text holds a literal backslash followed by "ue317"
>>> test_str = r'blah blah blah \ue317 blah blah \ueaa2 blah ue317'
>>> re.findall(r'\\ue[0-9A-Za-z]{3}', test_str)
['\\ue317', '\\ueaa2']

Notice it doesn't erroneously match the ue317 at the end, which has no preceding backslash. Obviously, use re.sub() if you wish to replace those character strings.
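If the string holds the actual characters rather than the backslash text (as it does after decoding to Unicode, or in any Python 3 `str`), a character-class range is one way to match them. The following is a minimal sketch, assuming Python 3 and assuming the emoticons all lie in the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF), which covers `\ue317` and `\ueaa2`. The names `PUA_RE` and `strip_private_use` are hypothetical:

```python
import re

# Assumption: every emoticon is a code point in the BMP Private Use Area.
PUA_RE = re.compile("[\ue000-\uf8ff]")

def strip_private_use(text, replacement=""):
    """Replace each private-use character in text with replacement."""
    return PUA_RE.sub(replacement, text)

# Here \ue317 and \ueaa2 are single characters; the trailing "ue317" is plain text.
sample = "blah \ue317 blah \ueaa2 blah ue317"

found = PUA_RE.findall(sample)
cleaned = strip_private_use(sample)

print(len(found))
print(cleaned)
print(cleaned.encode("gb2312"))  # no longer raises UnicodeEncodeError
```

Removing the characters before encoding avoids the error without relying on an error handler, and the same pattern can be passed to `re.sub()` with any placeholder text.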


做个烂人

Reply #4 · 2019-06-08 13:54

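The six-character text '\ue317' is not a substring of u"asdasd \ue317 asad": it is only the human-readable escape notation for a single Unicode character, so a regexp that looks for a literal backslash followed by ue317 cannot match it. Such a regexp does work on repr(u'\ue317').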
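The distinction can be checked directly. This is a small sketch assuming Python 3, where the `unicode_escape` codec is used as a stand-in for the escaped `repr()` text that Python 2 produces:

```python
import re

s = "asdasd \ue317 asad"   # \ue317 is ONE character in this string

# The backslash text is not in the string, but the character itself is.
literal_present = "\\ue317" in s
char_present = "\ue317" in s

# Escaping turns the character into the six-character text \ue317,
# which the backslash-based pattern can then match.
escaped = s.encode("unicode_escape").decode("ascii")
matches = re.findall(r"\\ue[0-9a-f]{3}", escaped)

print(len(s), literal_present, char_present)
print(escaped)
print(matches)
```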

