Python regex fails to match Unicode characters above \uFFFF

How do I parse a unicode 'string' for characters greater than \uFFFF?

I tried both re and regex, but neither appears to properly match Unicode characters whose code points need more than four hex digits (more than 2 bytes).

Take any unicode string (for example, a tweet text which is encoded in utf-8)

emotes = regex.findall('[\u263A\u263B\u062A\u32E1]',tweet_json_obj['text'])
if emotes: print "Happy:{0}".format(len(emotes))

The output is the number of smiley faces contained in the text, and it works great!

But if I try to match the Emoticons block of Unicode characters (http://www.fileformat.info/info/unicode/block/emoticons/index.htm):

emotes = regex.findall('[\u01F600-\u01F64F]',tweet_json_obj['text'])
if emotes: print "Emoticon:{0}".format(len(emotes))

The output is a count that covers every character in the string except whitespace. Why is regex matching every character in the tweet, or at least everything that looks like string.printable?

Expected results are a return of 0 for a majority of the dataset, as I don't expect people to be inserting these emoticons, but they might... so I'd like to check for their existence. Is my regex incorrect?

Tags: python regex python-2.7 unicode

1 answer

不美不萌又怎样

Reply #2 · 2019-05-10 03:32

Codepoints outside of the BMP use \Uxxxxxxxx (an uppercase U followed by 8 hex digits). You are using \uxxxx, which only takes four hex digits, so the trailing 00 is not part of the Unicode codepoint:

>>> len(u'\u01f600')
3
>>> len(u'\U0001f600')
1
>>> u'\u01f600'[0]
u'\u01f6'
>>> u'\u01f600'[1:]
u'00'
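
To see why the original pattern matched nearly every character, it can help to look at what that character class actually contains. The following is only an illustrative sketch, not part of the original answer; it assumes Python 3 (or a wide Python 2 build), and `sample` is a made-up string standing in for a tweet.

```python
import re

# \u consumes only four hex digits, so the asker's class is really:
# U+01F6, '0', the range '0' (U+0030) through U+01F6, '4' and 'F'.
# That range covers digits, ASCII letters and much of Latin-1.
broken = re.compile(u'[\u01F600-\u01F64F]')

# \U consumes eight hex digits, so this is the intended Emoticons range.
fixed = re.compile(u'[\U0001F600-\U0001F64F]')

sample = u'Hi there 42!'  # hypothetical text with no emoticons in it

broken_hits = broken.findall(sample)  # letters and digits, no spaces, no '!'
fixed_hits = fixed.findall(sample)    # nothing, as the asker expected

print(len(broken_hits), len(fixed_hits))
```

Spaces (U+0020) and '!' (U+0021) sort below '0', which is consistent with the asker seeing everything matched "minus white spaces".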

You need to use a unicode pattern here:

u'[\U0001F600-\U0001F64F]'

Demo:

>>> import re
>>> re.search(u'[\U0001F600-\U0001F64F]', u'\U0001F600')
<_sre.SRE_Match object at 0xb73ead08>
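
Applied to the counting task from the question, the corrected pattern might be used roughly as follows. This is a minimal sketch assuming Python 3 or a wide Python 2 build; `count_emoticons` and the two sample strings are hypothetical names introduced here, with the strings standing in for `tweet_json_obj['text']`.

```python
import re

# Emoticons block, U+1F600 through U+1F64F, written with 8-digit \U escapes.
EMOTICON_RE = re.compile(u'[\U0001F600-\U0001F64F]')


def count_emoticons(text):
    """Return how many characters of `text` fall in the Emoticons block."""
    return len(EMOTICON_RE.findall(text))


# Hypothetical sample strings standing in for tweet_json_obj['text'].
plain = u'just some plain text'
happy = u'so happy \U0001F600\U0001F601 today'

print(count_emoticons(plain), count_emoticons(happy))
```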

You need to use a UCS-4 (wide) Python build; otherwise non-BMP codepoints are represented as UTF-16 surrogate pairs, which do not work well with regular expressions.

If len(u'\U0001f600') returns 2, you are using a narrow UCS-2 build instead. You can also look at sys.maxunicode: a wide build returns 1114111, a narrow build 65535.

On a UCS-2 system, for this specific case, you could match the UTF-16 surrogates with an expression as well:
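
The two checks above could be wrapped in a small helper. This is just a sketch; `is_wide_build` is a name made up here, and on Python 3.3 or later it should always report a wide build because the narrow/wide distinction no longer exists there.

```python
import sys


def is_wide_build():
    """True when a single Python character can hold any Unicode code point."""
    return sys.maxunicode > 0xFFFF


# On a narrow build this literal is stored as two surrogate code units.
astral_length = len(u'\U0001f600')

print(is_wide_build(), sys.maxunicode, astral_length)
```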

ur'\ud83d[\ude00-\ude4f]'

This matches the UTF-16 surrogate pairs that make up the same range as [\U0001F600-\U0001F64F], but on narrow builds:

>>> import sys
>>> sys.maxunicode
65535
>>> import re
>>> re.search(u'\ud83d[\ude00-\ude4f]', u'\U0001F600')
<_sre.SRE_Match object at 0x105e9f5e0>
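
The surrogate values in that expression follow from the standard UTF-16 arithmetic. The sketch below shows one way to derive them; `to_surrogates` and `surrogate_range_pattern` are hypothetical helpers, and the second one assumes both ends of the range share the same high surrogate, which holds for the Emoticons block.

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is inside the BMP or out of range')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def surrogate_range_pattern(first, last):
    """Return pattern text for first..last, to paste into a ur'' literal.

    Assumes both code points share one high surrogate.
    """
    high_a, low_a = to_surrogates(first)
    high_b, low_b = to_surrogates(last)
    if high_a != high_b:
        raise ValueError('range spans more than one high surrogate')
    return u'\\u%04x[\\u%04x-\\u%04x]' % (high_a, low_a, low_b)


pattern_text = surrogate_range_pattern(0x1F600, 0x1F64F)
print(pattern_text)
```

A range that crosses a high-surrogate boundary would need several alternatives joined with `|`, which this sketch deliberately does not attempt.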
