python regex fails to match a specific Unicode > 2

Posted 2019-05-10 03:13

How do I parse a unicode 'string' for characters greater than \uFFFF?

I tried both re and regex, but neither appears to properly match Unicode characters above \uFFFF, that is, code points that need more than four hex digits.

Take any Unicode string (for example, the text of a tweet, which is encoded in UTF-8):

emotes = regex.findall('[\u263A\u263B\u062A\u32E1]',tweet_json_obj['text'])
if emotes: print "Happy:{0}".format(len(emotes))

The output is the number of smiley faces contained in the text, and it works great!

But if I try to match the Emoticons block of Unicode characters (http://www.fileformat.info/info/unicode/block/emoticons/index.htm):

emotes = regex.findall('[\u01F600-\u01F64F]',tweet_json_obj['text'])
if emotes: print "Emoticon:{0}".format(len(emotes))

The output is a count of every character in the string, minus whitespace. How is it that regex is matching every character in the tweet, or at least what looks like string.printable?

Expected results are a return of 0 for a majority of the dataset, as I don't expect people to be inserting these emoticons, but they might... so I'd like to check for their existence. Is my regex incorrect?

1 answer
不美不萌又怎样
#2 · 2019-05-10 03:32

Code points outside the BMP use \Uxxxxxxxx (an uppercase U followed by 8 hex digits). You are using \uxxxx, which takes only four hex digits, so the trailing 00 is not part of the Unicode code point but two separate literal characters:

>>> len(u'\u01f600')
3
>>> len(u'\U0001f600')
1
>>> u'\u01f600'[0]
'\u01f6'
>>> u'\u01f600'[1:]
'00'
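
This is also why the pattern in the question matched nearly everything: once \u01F6 is consumed, the character class contains U+01F6, a literal 0, the range from 0 (U+0030) up to U+01F6, and the literals 4 and F. That range covers digits and letters but not spaces or most ASCII punctuation. A minimal sketch, assuming Python 3 and a made-up sample string:

```python
import re

# '\u01F6' is read as one escape; the following '00' are plain characters.
pattern = '[\u01F600-\u01F64F]'

# Effective class: U+01F6, '0', the range '0'..U+01F6, '4', 'F'.
sample = 'Hello world 123!'  # hypothetical input, not from the question
matches = re.findall(pattern, sample)

# Letters and digits fall inside '0'..U+01F6; ' ' and '!' sort below '0'.
print(matches)
```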

You need to use a unicode pattern here:

u'[\U0001F600-\U0001F64F]'

Demo:

>>> import re
>>> re.search(u'[\U0001F600-\U0001F64F]', u'\U0001F600')
<_sre.SRE_Match object at 0xb73ead08>
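
Applied to the counting task in the question, the corrected range could be used roughly as follows. This is a sketch that assumes Python 3, where string literals are already Unicode; the helper name count_emoticons and the sample strings are illustrative, not part of the original code.

```python
import re

# Emoticons block, U+1F600..U+1F64F, written with 8-digit \U escapes.
EMOTICONS = re.compile('[\U0001F600-\U0001F64F]')

def count_emoticons(text):
    """Return how many code points in text fall in the Emoticons block."""
    return len(EMOTICONS.findall(text))

print(count_emoticons('no emoticons here'))
print(count_emoticons('hi \U0001F600 there \U0001F64F'))
```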

You need a UCS-4 (wide) Python build; otherwise non-BMP code points are represented as UTF-16 surrogate pairs, which do not work well with regular expressions.

If len(u'\U0001f600') returns 2, you are on a narrow UCS-2 build. You can also check sys.maxunicode: a wide build returns 1114111, a narrow build 65535.
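
The sys.maxunicode check can be wrapped in a small helper. This is only a sketch, and is_wide_build is a hypothetical name. Note that since Python 3.3 the narrow/wide distinction no longer exists, so on those versions the function always reports a wide build.

```python
import sys

def is_wide_build():
    """True when one string element can hold any code point up to U+10FFFF."""
    # 0x10FFFF == 1114111; a narrow build would report 0xFFFF (65535).
    return sys.maxunicode == 0x10FFFF

print(is_wide_build())
```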

On a UCS-2 build, for this specific case, you could instead match the UTF-16 surrogates directly:

ur'\ud83d[\ude00-\ude4f]'

This matches the UTF-16 surrogate pairs that cover the same range as [\U0001F600-\U0001F64F], and it works on narrow builds:

>>> import sys
>>> sys.maxunicode
65535
>>> import re
>>> re.search(u'\ud83d[\ude00-\ude4f]', u'\U0001F600')
<_sre.SRE_Match object at 0x105e9f5e0>
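
The surrogate values in that pattern follow from the standard UTF-16 encoding rule: subtract 0x10000 from the code point, then put the top 10 bits into a high surrogate starting at 0xD800 and the low 10 bits into a low surrogate starting at 0xDC00. A sketch of that arithmetic, with to_surrogate_pair as an illustrative helper name:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# Both ends of the Emoticons block share the high surrogate 0xD83D,
# which is why a single \ud83d followed by a low-surrogate range suffices.
for cp in (0x1F600, 0x1F64F):
    high, low = to_surrogate_pair(cp)
    print('U+%X -> \\u%04x \\u%04x' % (cp, high, low))
```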