Correctly extract Emojis from a Unicode string

I am working in Python 2 and have a string containing emojis as well as other Unicode characters. I need to convert it to a list in which each entry is a single character or emoji.

x = u'

2 answers

闹够了就滚 · #2 · 2019-02-06 02:58

I would use the uniseg library (pip install uniseg):

# -*- coding: utf-8 -*-
from uniseg import graphemecluster as gc

print list(gc.grapheme_clusters(u'

欢心 · #3 · 2019-02-06 03:05

First of all, in Python 2 you need to use Unicode strings (u'<...>') for Unicode characters to be treated as Unicode characters. You also need a correct source encoding declaration if you want to type the characters themselves in source code rather than their \UXXXXXXXX escapes.

Now, as per "Python: getting correct string length when it contains surrogate pairs" and "Python returns length of 2 for single Unicode character string", in Python 2 "narrow" builds (with sys.maxunicode == 65535), characters outside the Basic Multilingual Plane (code points above U+FFFF) are represented as surrogate pairs, and this is not transparent to string functions. This was only fixed in Python 3.3 (PEP 393).

The simplest resolution (short of migrating to 3.3+) is to compile a Python "wide" build from source (configured with --enable-unicode=ucs4), as outlined on the 3rd link. In a wide build every character of a unicode string takes 4 bytes, so memory use can grow considerably, but if you routinely need to handle characters above U+FFFF, this is probably an acceptable price.
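To tell which kind of build you are running, you can inspect sys.maxunicode. The snippet below is only a small sketch; the helper name is_narrow_build is made up for illustration and is not part of any library.

```python
import sys


def is_narrow_build():
    """Return True if unicode strings are stored as UTF-16 code units."""
    # Narrow Python 2 builds report 0xFFFF here; wide builds and
    # Python 3.3+ report 0x10FFFF.
    return sys.maxunicode == 0xFFFF


# U+1F600 lies outside the Basic Multilingual Plane, so len() sees
# two code units on a narrow build and one code point otherwise.
print(is_narrow_build())
print(len(u"\U0001F600"))
```

If the first line printed is True, the surrogate-pair handling discussed in the rest of this answer applies to your interpreter.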

The solution for a "narrow" build is to write a custom set of string functions (len, slice; maybe as a subclass of unicode) that detect surrogate pairs and handle each pair as a single character. I couldn't readily find an existing one (which is strange), but it's not too hard to write:


As per the Wikipedia article on UTF-16 (section "U+10000 to U+10FFFF"):

- the 1st code unit (high surrogate) is in the range 0xD800..0xDBFF
- the 2nd code unit (low surrogate) is in the range 0xDC00..0xDFFF
- these ranges are reserved and thus cannot occur as regular characters
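
To make the ranges above concrete, here is a sketch of the standard UTF-16 arithmetic that maps a code point above U+FFFF to its surrogate pair and back. The function names to_surrogate_pair and from_surrogate_pair are my own, chosen for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not above U+FFFF: %#x" % code_point)
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 is stored as the two code units 0xD83D, 0xDE00.
print([hex(unit) for unit in to_surrogate_pair(0x1F600)])
```

Because the 20-bit offset is split into two 10-bit halves, the high surrogate can never leave 0xD800..0xDBFF and the low one can never leave 0xDC00..0xDFFF.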



So here's the code to detect a surrogate pair:

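def is_surrogate(s,i):
    """Return True if s[i] and s[i+1] form a UTF-16 surrogate pair."""
    if 0xD800 <= ord(s[i]) <= 0xDBFF:        # s[i] is a high surrogate
        try:
            low = s[i+1]
        except IndexError:
            # a lone high surrogate at the very end of the string
            return False
        if 0xDC00 <= ord(low) <= 0xDFFF:      # followed by a low surrogate
            return True
        else:
            raise ValueError("Illegal UTF-16 sequence: %r" % s[i:i+2])
    else:
        return False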
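
The same check can back a replacement for len. The following is a self-contained sketch rather than a drop-in for the code above: the name char_len is hypothetical, and unlike is_surrogate it counts an unpaired surrogate as one character instead of raising.

```python
def char_len(s):
    """Count characters in s, treating each valid surrogate pair as one."""
    count = 0
    i = 0
    n = len(s)
    while i < n:
        is_pair = (0xD800 <= ord(s[i]) <= 0xDBFF
                   and i + 1 < n
                   and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF)
        # Step over both code units of a pair, otherwise over one unit.
        i += 2 if is_pair else 1
        count += 1
    return count


# The surrogates are spelled out so the example behaves the same on any build.
print(char_len(u"a\ud83d\ude00b"))
```

Spelling the surrogates explicitly (\ud83d\ude00) gives four code units that count as three characters, regardless of whether the interpreter is a narrow or a wide build.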


And a function that returns a simple slice:

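def slice(s,start,end):
    """Return s[start:end] with start and end counted in characters, so that
    a surrogate pair counts as one. Only non-negative indices are handled.
    Note that this name shadows the built-in slice."""
    l=len(s)
    i=0
    # Walk up to start; every pair met shifts both bounds by one code unit.
    while i<start and i<l:
        if is_surrogate(s,i):
            start+=1
            end+=1
            i+=1
        i+=1
    # Walk on to end; every pair inside the slice extends it by one code unit.
    while i<end and i<l:
        if is_surrogate(s,i):
            end+=1
            i+=1
        i+=1
    return s[start:end]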
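
For the task in the question (turning the string into a list of characters), the same idea can be applied directly. This is a minimal sketch under the same assumptions; split_chars is a name I made up, and lone surrogates are passed through as single items rather than rejected.

```python
def split_chars(s):
    """Split s into a list of characters, keeping surrogate pairs together."""
    chars = []
    i = 0
    n = len(s)
    while i < n:
        width = 1
        if (0xD800 <= ord(s[i]) <= 0xDBFF
                and i + 1 < n
                and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
            width = 2                      # high + low surrogate
        chars.append(s[i:i + width])
        i += width
    return chars


# Three items: u"a", the surrogate pair for U+1F600, and u"b".
print(len(split_chars(u"a\ud83d\ude00b")))
```

Note that this splits by code point, not by grapheme cluster: an emoji built from several code points (a skin-tone modifier or a zero-width-joiner sequence, for example) still comes out as several list entries, which is the case a grapheme segmentation library such as uniseg is meant to cover.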


Here, the price you pay is performance, as these functions are much slower than the built-ins:

>>> import timeit
>>> ux=u"a"*5000+u"\U00100000"*30000+u"b"*50000
>>> timeit.timeit('slice(ux,10000,100000)','from __main__ import slice,ux',number=1000)
46.44128203392029    # seconds for 1000 calls, i.e. about 46 ms per call
>>> timeit.timeit('ux[10000:100000]','from __main__ import slice,ux',number=1000000)
8.814016103744507    # seconds for 1000000 calls, i.e. about 8.8 us per call

    
                                                                    
                                                        
            
              