Replacing Emoji Unicode Range from Arabic Tweets u

I am trying to replace emoji in Arabic tweets using Java.

I used this code:

String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز

2 answers

走好不送 · reply #2 · 2019-03-21 15:37

Java 5 and 6

If you are stuck on a Java 5 or 6 JVM and want to match characters in the range U+1F601 to U+1F64F, use surrogate pairs in the character class:

Pattern emoticons = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");


This method also works in Java 7 and above. In Sun/Oracle's implementation (as you can see by decompiling Pattern.compile()), the String containing the pattern is converted into an array of code points before compilation, so each surrogate pair is treated as a single code point.
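
As a rough illustration, the surrogate-pair class can be used to strip matching emoticons from a string. This is only a sketch: the class name SurrogateRangeDemo and the helper stripEmoticons are made up for this example and do not come from the question.

```java
import java.util.regex.Pattern;

final class SurrogateRangeDemo {
    // javac converts the Unicode escapes in the literal into UTF-16 chars,
    // so the regex engine receives the range U+1F601..U+1F64F directly.
    static final Pattern EMOTICONS =
            Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // Hypothetical helper: deletes every character in that range.
    static String stripEmoticons(String text) {
        return EMOTICONS.matcher(text).replaceAll("");
    }
}
```

Note that U+1F600 is left untouched by this class, because the range starts at U+1F601.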

Java 7 and above


You can use the \x{...} construct from David Wallace's answer, which is available from Java 7.
Alternatively, you can specify the whole Emoticons Unicode block, which spans code points U+1F600 to U+1F64F (note that it starts at U+1F600, not U+1F601).

Pattern emoticons = Pattern.compile("\\p{InEmoticons}");


Since support for the Emoticons block was added in Java 7, this method is also only valid from Java 7.
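
A minimal sketch of the block-based pattern in use, assuming Java 7 or later. The class name EmoticonsBlockDemo and the helper replaceEmoticons are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmoticonsBlockDemo {
    // Matches any code point in the Emoticons block (U+1F600..U+1F64F).
    static final Pattern EMOTICONS = Pattern.compile("\\p{InEmoticons}");

    // Hypothetical helper: swaps each emoticon for a literal replacement.
    // quoteReplacement keeps '$' and '\' in the replacement from being
    // interpreted as group references.
    static String replaceEmoticons(String text, String replacement) {
        return EMOTICONS.matcher(text)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }
}
```

Unlike the explicit range U+1F601 to U+1F64F, this pattern also matches U+1F600.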
Although the other methods are preferred, you can also specify supplementary characters by writing the surrogate pair as escapes in the regex itself. There is no reason to do this in source code, but this Java 7 change corrects the behavior in applications where a regex is used for searching and the character cannot be pasted in directly.

Pattern emoticons = Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");
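
To illustrate the case where the regex arrives as plain text, here is a small sketch assuming Java 7 or later. The class EscapedSurrogateDemo, the constant ESCAPED_CLASS and the helper matchesWhole are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

final class EscapedSurrogateDemo {
    // The character class written with regex-level escapes only, as it
    // might be typed into a search field as plain ASCII text.
    static final String ESCAPED_CLASS = "[\\uD83D\\uDE01-\\uD83D\\uDE4F]";

    // Hypothetical helper: compiles the regex text and tests the whole input.
    static boolean matchesWhole(String regexText, String input) {
        return Pattern.compile(regexText).matcher(input).matches();
    }
}
```

On Java 7 and above, the escaped pair is collapsed into one code point, so the class behaves like the range U+1F601 to U+1F64F.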


/!\ Warning

Never mix the two syntaxes when you specify a supplementary code point, like:


"[\\uD83D\uDE01-\\uD83D\\uDE4F]"
"[\uD83D\\uDE01-\\uD83D\\uDE4F]"


In Oracle's implementation, these patterns match the code point U+D83D plus the range from code point U+DE01 to code point U+1F64F.
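
The effect of the mixed syntax can be checked with a character that is clearly not an emoticon. The sketch below assumes the Oracle/OpenJDK behavior described above on Java 7 or later; other implementations may differ. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class MixedSyntaxDemo {
    // Consistent: every surrogate is written as a regex-level escape.
    static final Pattern CONSISTENT =
            Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");

    // Mixed: the first low surrogate is a compiler-level escape, so it
    // reaches the regex engine as a raw char after an escaped high surrogate.
    static final Pattern MIXED =
            Pattern.compile("[\\uD83D\uDE01-\\uD83D\\uDE4F]");

    static boolean consistentMatches(String s) {
        return CONSISTENT.matcher(s).matches();
    }

    static boolean mixedMatches(String s) {
        return MIXED.matcher(s).matches();
    }
}
```

Under that assumption, the mixed pattern also accepts U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A), which lies inside the accidental range U+DE01 to U+1F64F, while the consistent pattern rejects it.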


Note

In Java 5 and 6 (Oracle's implementation), Pattern.u() does not collapse valid regex-escaped surrogate pairs such as "\\uD83D\\uDE01". As a result, the pattern is interpreted as two lone surrogates, which will fail to match anything.

Rolldiameter · reply #3 · 2019-03-21 15:43

From the Javadoc for the Pattern class:


  A Unicode character can also be represented in a regular-expression by
  using its Hex notation(hexadecimal code point value) directly as
  described in construct \x{...}, for example a supplementary character
  U+2011F can be specified as \x{2011F}, instead of two consecutive
  Unicode escape sequences of the surrogate pair \uD840\uDD1F.


This means that the regular expression you're looking for is ([\x{1F601}-\x{1F64F}]). Of course, when you write it as a Java String literal, you must escape the backslashes.

Pattern unicodeOutliers = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");


Note that the construct \x{...} is only available from Java 7.
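
For completeness, a short sketch of the \x{...} pattern applied to a string, assuming Java 7 or later. The class HexNotationDemo and the helpers countOutliers and removeOutliers are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexNotationDemo {
    // Code points U+1F601..U+1F64F written in hex notation.
    static final Pattern UNICODE_OUTLIERS =
            Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

    // Hypothetical helper: counts how many code points in the text match.
    static int countOutliers(String text) {
        Matcher m = UNICODE_OUTLIERS.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Hypothetical helper: removes every matching code point.
    static String removeOutliers(String text) {
        return UNICODE_OUTLIERS.matcher(text).replaceAll("");
    }
}
```

Each match covers one whole code point, so a supplementary character is counted once even though it occupies two chars in the String.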
    
                                                                    
                                                        
            
              