I'm using Python to process sentences from Weibo (a Twitter-like service in China).
Some sentences contain emoticons whose Unicode code points look like \ue317.
To process a sentence, I need to encode it as gbk, see below:
string1_gbk = string1.decode('utf-8').encode('gbk')
This raises UnicodeEncodeError: 'gbk' codec can't encode character u'\ue317'.
I tried the regexp \\ue[0-9a-zA-Z]{3}, but it did not work. How can I match these emoticons in the sentences?
Try passing the 'replace' error handler as the second argument to encode().
It should output ? instead of those emoticons.
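A minimal sketch of that suggestion, assuming the sentence has already been decoded to a Unicode string. The sample text and variable names are made up for illustration, and the 'ignore' line is an extra option not mentioned above:

```python
# Illustrative input: a decoded sentence containing one emoticon character.
text = u"hello \ue317 world"

# 'replace' substitutes '?' for every character gbk cannot represent.
encoded = text.encode("gbk", "replace")

# 'ignore' drops such characters instead of replacing them.
stripped = text.encode("gbk", "ignore")

print(encoded)
print(stripped)
```

With 'replace' the emoticon becomes a single ?, while 'ignore' removes it and leaves the surrounding spaces in place.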
It may be because the backslash is a special escape character in regexp syntax. The following worked for me:
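A sketch of what such a match might look like, assuming the text contains the escape sequence as literal characters (a backslash followed by ue317) rather than the emoticon character itself. The sample string is hypothetical:

```python
import re

# Hypothetical input: a raw string, so "\ue317" is six literal characters.
sample = r"asdasd \ue317 asad ue317"

# In a raw pattern, "\\" matches exactly one literal backslash.
pattern = re.compile(r"\\ue[0-9a-zA-Z]{3}")

matches = pattern.findall(sample)
print(matches)

# re.sub() removes the matched sequences instead of listing them.
cleaned = pattern.sub("", sample)
print(cleaned)
```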
Notice it doesn't erroneously match the ue317 at the end, which has no preceding backslash. Obviously, use re.sub() if you wish to replace those character strings.
'\ue317' is not a substring of u"asdasd \ue317 asad". It is the human-readable representation of a single Unicode character, so it cannot be matched by that regexp. The regexp works with repr(u'\ue317').
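To illustrate the point, here is a sketch that matches the character itself with a code point range instead of its escaped spelling. It assumes the emoticons all fall in the Unicode Private Use Area (U+E000 to U+F8FF). That range is an assumption, since only \ue317 is mentioned above:

```python
import re

text = u"asdasd \ue317 asad"

# The backslash-u spelling is not in the string; the emoticon is one character.
has_escape_text = u"\\ue317" in text
has_character = u"\ue317" in text

# Assumed range: the Private Use Area, U+E000 through U+F8FF.
emoticon = re.compile(u"[\ue000-\uf8ff]")

found = emoticon.findall(text)
cleaned = emoticon.sub(u"", text)

# With the emoticon removed, the gbk encode no longer raises.
encoded = cleaned.encode("gbk")
print(has_escape_text, has_character, len(found))
```

If the emoticons use other code points, the character class would need to be widened accordingly.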