How do I parse a unicode 'string' for characters greater than \uFFFF?
I tried re and regex, but neither appears to properly match Unicode characters whose code points need more than two bytes (four hex digits).
Take any unicode string (for example, a tweet's text, which is encoded in UTF-8):
emotes = regex.findall('[\u263A\u263B\u062A\u32E1]',tweet_json_obj['text'])
if emotes: print "Happy:{0}".format(len(emotes))
The output is the number of smiley faces contained within the text; it works great!
But if I try to match the Emoticons block of Unicode characters (http://www.fileformat.info/info/unicode/block/emoticons/index.htm):
emotes = regex.findall('[\u01F600-\u01F64F]',tweet_json_obj['text'])
if emotes: print "Emoticon:{0}".format(len(emotes))
The output is a match count covering every character in the string, minus whitespace. How is it that regex is matching every character in the tweet, or at least what looks like string.printable?
Expected results are a return of 0 for a majority of the dataset, as I don't expect people to be inserting these emoticons, but they might... so I'd like to check for their existence. Is my regex incorrect?
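Codepoints outside of the BMP use \Uxxxxxxxx (so uppercase U and 8 hex characters). You are using \uxxxx, which only takes four hex characters; the 00 is not part of the Unicode codepoint. Your pattern is therefore read as \u01F6, a literal 0, the range 0-\u01F6, and the literals 4 and F. That range covers digits, letters, and much of the ASCII punctuation, but not spaces, which is why nearly every character matched.
You need to use a unicode pattern here, such as u'[\U0001F600-\U0001F64F]'.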
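The difference between the two escapes can be sketched as follows. This is a minimal illustration, not the original demo: it uses the standard-library re module rather than regex so that it is self-contained, the sample text is made up, and it assumes Python 3 or a wide Python 2 build (on a narrow build the second pattern will not compile as intended).

```python
import re

# Hypothetical sample text: ten letters, one space, one emoticon, one '!'.
text = u'Hello world \U0001F600!'

# Mistaken pattern: \u01F6 consumes only four hex digits, so the class holds
# U+01F6, '0', the range '0'..U+01F6, '4' and 'F'.
wrong = re.compile(u'[\u01F600-\u01F64F]')

# Intended pattern: \U takes eight hex digits, giving the Emoticons block.
right = re.compile(u'[\U0001F600-\U0001F64F]')

# Matches every letter, but not the space, the '!' or the emoticon.
print(len(wrong.findall(text)))

# Matches only the emoticon.
print(len(right.findall(text)))
```

The first count is 10 and the second is 1, which mirrors the behaviour described in the question: the four-digit form matches almost all ordinary text.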
You need to use a UCS4 Python build; otherwise non-BMP codepoints are implemented using UTF-16 surrogate pairs, which won't work very well with regular expressions.
If len(u'\U0001f600') returns 2 then you are using a narrow UCS2 build instead. Alternatively, you can look at sys.maxunicode; a wide build returns 1114111, a narrow build 65535.
On a UCS2 system, for this specific case, you could match the UTF-16 surrogates with an expression as well, such as u'\ud83d[\ude00-\ude4f]'. This matches the UTF-16 surrogate pairs that make up the same range as [\U0001F600-\U0001F64F], but on narrow builds.
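As a rough sketch of how the build check and the surrogate arithmetic fit together: the helper names surrogate_pair and emoticon_pattern below are hypothetical, introduced only for illustration, and the standard-library re module stands in for regex. On Python 3.3 and later sys.maxunicode is always 1114111, so only the wide branch is taken there.

```python
import re
import sys


def surrogate_pair(codepoint):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def emoticon_pattern():
    """Pick a pattern for U+1F600..U+1F64F that suits the running build."""
    if sys.maxunicode > 0xFFFF:
        # Wide build: each emoticon is a single code point.
        return u'[\U0001F600-\U0001F64F]'
    # Narrow build: each emoticon is the high surrogate D83D followed by
    # a low surrogate in the range DE00..DE4F.
    return u'\ud83d[\ude00-\ude4f]'


# Both ends of the block share the same high surrogate, which is why a
# single character class on the low surrogate is enough.
print('%04X %04X' % surrogate_pair(0x1F600))
print('%04X %04X' % surrogate_pair(0x1F64F))

emotes = re.findall(emoticon_pattern(), u'ok \U0001F600 and \U0001F64F')
print(len(emotes))
```

The surrogate shortcut works here only because the whole block shares one high surrogate; a range that spans several high surrogates would need a longer alternation.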