Question:

I'd like to be able to detect emoji in text and look up their names.

I've had no luck using the unicodedata module, and I suspect that I'm not understanding the UTF-8 conventions.

I'd guess that I need to load my doc as UTF-8, then break the unicode "strings" into individual unicode symbols, iterate over these, and look them up.

# new example loaded using pandas with encoding UTF-8
test = u'A man tried to get into my car\U0001f648'

type(test)  # unicode

import unicodedata as uni
uni.name(test[0])
Out[89]: 'LATIN CAPITAL LETTER A'

uni.name(test[-3])
Out[90]: 'LATIN SMALL LETTER R'    

uni.name(test[-1])
ValueError                                Traceback (most recent call last)
<ipython-input-105-417c561246c2> in <module>()
----> 1 uni.name(test[-1])
ValueError: no such name

# just to be clear
uni.name(u'\U0001f648')
ValueError: no such name

I looked up the Unicode symbol via Google and it's a legit symbol. Perhaps the unicodedata module isn't very comprehensive...?

I'm considering making my own lookup table from here. Interested in other ideas... this one seems doable.

Answer 1:

My problem was using Python 2.7 for the unicodedata module. Using Conda, I created a Python 3.3 environment, and now unicodedata works as expected, so I've given up on all the weird hacks I was working on.

# using python 3.3
import unicodedata as uni

In [2]: uni.name('\U0001f648')
Out[2]: 'SEE-NO-EVIL MONKEY'

Thanks to Mark Ransom for pointing out that I originally had Mojibake from not correctly importing my data. Thanks again for your help.
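For the original goal (finding emoji in a string and looking up their names), here is a minimal Python 3 sketch. The helper name `named_symbols` is made up for illustration, and filtering on the Unicode category "So" (Symbol, other) is only a rough heuristic that I'm assuming is good enough here: it catches most pictographic emoji but also non-emoji symbols, and it misses a few emoji-related characters. As far as I know, Python 2.7's unicodedata is based on Unicode 5.2, which predates the emoticons block, and that would explain the "no such name" error; printing `unidata_version` shows what your interpreter ships with.

```python
import unicodedata

def named_symbols(text):
    """Return (character, name) pairs for characters in category 'So'.

    Hypothetical helper: 'So' is a rough stand-in for "looks like an emoji".
    """
    found = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # The second argument is returned instead of raising ValueError
            # when the character has no name in this Unicode database.
            found.append((ch, unicodedata.name(ch, "UNKNOWN")))
    return found

sample = "A man tried to get into my car\U0001f648"

# Version of the Unicode database bundled with this interpreter.
print(unicodedata.unidata_version)
print(named_symbols(sample))
```

Passing a default to `unicodedata.name` keeps the loop from blowing up on characters that are newer than the bundled database.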

Answer 2:

Here's a way to read the file you linked to. It's translated from Python 2, so there might be a glitch or two.

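import re
import urllib.request  # Python 3 replacement for urllib2

# Capture the hex code point after "U+" and the name that follows the "# ... (x)" comment.
rexp = re.compile(r'U\+([0-9A-Za-z]+)[^#]*# [^)]*\) *(.*)')
mapping = {}
for line in urllib.request.urlopen('ftp://ftp.unicode.org/Public/emoji/1.0/emoji-data.txt'):
    line = line.decode('utf-8')
    m = rexp.match(line)
    if m:
        # Key is the character itself, value is its name.
        mapping[chr(int(m.group(1), 16))] = m.group(2)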
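To check the regular expression without hitting the network, you can run it against a single line. The sample line below is an assumption: I wrote it to match the layout the regex expects (a "U+" code point, some semicolon-separated fields, then a "#" comment with the character in parentheses followed by its name), so the real file may look different.

```python
import re

# Same pattern as in the answer above.
rexp = re.compile(r'U\+([0-9A-Za-z]+)[^#]*# [^)]*\) *(.*)')

# Hypothetical line, shaped the way the pattern expects; not copied from the real file.
sample_line = "U+1F648 ;\temoji ;\tL1 ;\tsecondary ;\tj\t# V6.0 (\U0001f648) SEE-NO-EVIL MONKEY"

mapping = {}
m = rexp.match(sample_line)
if m:
    # group(1) is the hex code point, group(2) is the character name.
    mapping[chr(int(m.group(1), 16))] = m.group(2)

print(mapping)
```

If the real file's lines don't start with "U+", `rexp.match` returns None for every line and `mapping` stays empty, so an empty result is a hint that the pattern needs adjusting.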