I have a list that has WhatsApp emoticons encoded as utf-8 characters. The table I am using to decode the emoticons is at http://apps.timwhitlock.info/emoji/tables/unicode
With this table I am trying to count the number of emoticons used, which I have successfully done using regex techniques. The problem is I have created a dictionary where the keys are the utf-8 characters as strings and the key_values are integers. The following:
print d_emo
for k, v in d_emo.items():
print k.encode('utf8'), v
produces this output:
{'\\xF0\\x9F\\x98\\xA2': 2, '\\xF0\\x9F\\x98\\x82': 1, '\\xF0\\x9F\\x98\\x86': 2, '\\xF0\\x9F\\x98\\x89': 1, '\\xF0\\x9F\\x8D\\xB5': 2, '\\xF0\\x9F\\x8D\\xB0': 4, '\\xF0\\x9F\\x8D\\xAB': 2, '\\xF0\\x9F\\x8D\\xA9': 2, '\\xF0\\x9F\\x98\\x98': 1, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}
\xF0\x9F\x98\xA2 2
\xF0\x9F\x98\x82 1
\xF0\x9F\x98\x86 2
\xF0\x9F\x98\x89 1
\xF0\x9F\x8D\xB5 2
\xF0\x9F\x8D\xB0 4
\xF0\x9F\x8D\xAB 2
\xF0\x9F\x8D\xA9 2
\xF0\x9F\x98\x98 1
\xE2\x98\xBA 33
\xE2\x98\x95 1
If I use this code:
for k, v in d_emo.items():
print k.encode('utf-8').decode('unicode_escape'), v
I get
ð¢ 2
ð 1
ð 2
ð 1
ðµ 2
ð° 4
ð« 2
ð© 2
ð 1
⺠33
â 1
I should be getting smiley faces and the like. Any suggestions? This is in Python 2.7.
This will decode the Unicode characters correctly, but in Python 2.X you are somewhat limited when using characters outside the BMP (Basic Multilingual Plane, characters U+0000 to U+FFFF):
Output: