I'd like to be able to detect emoji in text and look up their names.
I've had no luck using unicodedata module and I suspect that I'm not understanding the UTF-8 conventions.
I'd guess that I need to load my doc as as utf-8, then break the unicode "strings" into unicode symbols. Iterate over these and look them up.
#new example loaded using pandas and encoding UTF-8
'A man tried to get into my car\U0001f648'
type(test) = unicode
import unicodedata as uni
uni.name(test[0])
Out[89]: 'LATIN CAPITAL LETTER A'
uni.name(test[-3])
Out[90]: 'LATIN SMALL LETTER R'
uni.name(test[-1])
ValueError Traceback (most recent call last)
<ipython-input-105-417c561246c2> in <module>()
----> 1 uni.name(test[-1])
ValueError: no such name
# just to be clear
uni.name(u'\U0001f648')
ValueError: no such name
I looked up the unicode symbol via google and it's a legit symbol. Perhaps the unicodedata module isn't very comprehensive...?
I'm considering making my own look up table from here. Interested in other ideas...this one seems do-able.