What is the regex to extract all the emojis from a

I have a String encoded in UTF-8. For example:

Thats a nice joke

14 answers

孤傲高冷的网名
Reply #2 · 2020-01-23 03:49

You may also use the emoji4j library.

String emojiText = "A

Explosion°爆炸
Reply #3 · 2020-01-23 03:49

Regex is too slow, and the emoji set is updated very frequently.

Try this project: simple-emoji-4j

Compatible with Emoji 12.0 (2018.10.15)

Usage is simple:

EmojiUtils.containsEmoji(str)

相关推荐>>
Reply #4 · 2020-01-23 03:57

Emoji regex:

public static final String sEmojiRegex = "(?:[\\u2700-\\u27bf]|" +

        "(?:[\\ud83c\\udde6-\\ud83c\\uddff]){2}|" +
        "[\\ud800\\udc00-\\uDBFF\\uDFFF]|[\\u2600-\\u26FF])[\\ufe0e\\ufe0f]?(?:[\\u0300-\\u036f\\ufe20-\\ufe23\\u20d0-\\u20f0]|[\\ud83c\\udffb-\\ud83c\\udfff])?" +

        "(?:\\u200d(?:[^\\ud800-\\udfff]|" +

        "(?:[\\ud83c\\udde6-\\ud83c\\uddff]){2}|" +
        "[\\ud800\\udc00-\\uDBFF\\uDFFF]|[\\u2600-\\u26FF])[\\ufe0e\\ufe0f]?(?:[\\u0300-\\u036f\\ufe20-\\ufe23\\u20d0-\\u20f0]|[\\ud83c\\udffb-\\ud83c\\udfff])?)*|" +

        "[\\u0023-\\u0039]\\ufe0f?\\u20e3|\\u3299|\\u3297|\\u303d|\\u3030|\\u24c2|[\\ud83c\\udd70-\\ud83c\\udd71]|[\\ud83c\\udd7e-\\ud83c\\udd7f]|\\ud83c\\udd8e|[\\ud83c\\udd91-\\ud83c\\udd9a]|[\\ud83c\\udde6-\\ud83c\\uddff]|[\\ud83c\\ude01-\\ud83c\\ude02]|\\ud83c\\ude1a|\\ud83c\\ude2f|[\\ud83c\\ude32-\\ud83c\\ude3a]|[\\ud83c\\ude50-\\ud83c\\ude51]|\\u203c|\\u2049|[\\u25aa-\\u25ab]|\\u25b6|\\u25c0|[\\u25fb-\\u25fe]|\\u00a9|\\u00ae|\\u2122|\\u2139|\\ud83c\\udc04|[\\u2600-\\u26FF]|\\u2b05|\\u2b06|\\u2b07|\\u2b1b|\\u2b1c|\\u2b50|\\u2b55|\\u231a|\\u231b|\\u2328|\\u23cf|[\\u23e9-\\u23f3]|[\\u23f8-\\u23fa]|\\ud83c\\udccf|\\u2934|\\u2935|[\\u2190-\\u21ff]";


Some emojis for testing (1627):

// count = 1627
public static final String sEmojiTest = "                                                                    
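
A pattern string like sEmojiRegex would typically be applied with Pattern and Matcher.find() to pull out each match. The sketch below is an assumption about usage, not part of the answer: EmojiExtractor and extractAll are hypothetical names, and a much smaller stand-in pattern (only U+2600 to U+26FF, one of the ranges in the regex above) is used to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiExtractor {
    // Collects every substring of text matched by the given regex, in order.
    static List<String> extractAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Stand-in pattern (an assumption for brevity): only the U+2600 to U+26FF
        // block, which is one of the alternatives in the longer regex.
        String standIn = "[\\u2600-\\u26FF]";
        String text = "sun \u2600 rain \u2614 done";
        System.out.println(extractAll(standIn, text).size()); // prints 2
    }
}
```

Passing the full sEmojiRegex string as the first argument should work the same way, since extractAll does not depend on which pattern it is given.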
                                                        
            

              

Fickle 薄情
Reply #5 · 2020-01-23 03:58

        
  The PDF that you just mentioned says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So let's say I want to capture any character lying within this range. Now what to do?


Okay, but I will just note that the emoji in your question are outside that range! :-)

The fact that these are above 0xFFFF complicates things, because Java strings store UTF-16. So we can't just use one simple character class for it. We're going to have surrogate pairs. (More: http://www.unicode.org/faq/utf_bom.html)

U+1F300 in UTF-16 ends up being the pair \uD83C\uDF00; U+1F5FF ends up being \uD83D\uDDFF. Note that the first character went up, so the range crosses at least one high-surrogate boundary. So we have to know what ranges of surrogate pairs we're looking for.
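
The pair values can be checked with the UTF-16 arithmetic: subtract 0x10000, then add the top 10 bits to 0xD800 and the low 10 bits to 0xDC00. A small sketch (the class and method names are illustrative, not from the answer) that compares the manual math with Character.toChars:

```java
class SurrogateMath {
    // High (leading) surrogate for a supplementary code point, per the UTF-16 definition.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate for a supplementary code point.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x1F300, 0x1F5FF}) {
            // Character.toChars is the JDK's own conversion; it should agree with the manual math.
            char[] jdk = Character.toChars(cp);
            System.out.printf("U+%X -> \\u%04X\\u%04X (JDK: \\u%04X\\u%04X)%n",
                    cp, (int) high(cp), (int) low(cp), (int) jdk[0], (int) jdk[1]);
        }
    }
}
```

Run as written, this should report \uD83C\uDF00 for U+1F300 and \uD83D\uDDFF for U+1F5FF, matching the pairs quoted above.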

Not being steeped in knowledge about the inner workings of UTF-16, I wrote a program to find out (source at the end — I'd double-check it if I were you, rather than trusting me). It tells me we're looking for \uD83C followed by anything in the range \uDF00-\uDFFF (inclusive), or \uD83D followed by anything in the range \uDC00-\uDDFF (inclusive).

So armed with that knowledge, in theory we could now write a pattern:

// This is wrong, keep reading
Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");


That's an alternation of two non-capturing groups, the first group for the pairs starting with \uD83C, and the second group for the pairs starting with \uD83D.

But that fails (doesn't find anything). I'm fairly sure it's because we're trying to specify half of a surrogate pair in various places:

Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
// Half of a pair --------------^------^------^-----------^------^------^


We can't just split up surrogate pairs like that, they're called surrogate pairs for a reason. :-)

Consequently, I don't think a pattern built from split surrogate halves like this can work. One approach that does work is to search through char arrays.

char arrays hold UTF-16 code units, so we can find those half-pairs in the data if we look for them the hard way:

String s = new StringBuilder()
                .append("Thats a nice joke ")
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .append(" ")
                .appendCodePoint(0x1F61B)
                .toString();
char[] chars = s.toCharArray();
int index;
char ch1;
char ch2;

index = 0;
while (index < chars.length - 1) { // -1 because we're looking for two-char-long things
    ch1 = chars[index];
    if ((int)ch1 == 0xD83C) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDF00 && (int)ch2 <= 0xDFFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    else if ((int)ch1 == 0xD83D) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDC00 && (int)ch2 <= 0xDDFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    ++index;
}


Obviously that's just debug-level code, but it does the job. (In your given string, with its emoji, of course it won't find anything as they're outside the range. But if you change the upper bound on the second pair to 0xDEFF instead of 0xDDFF, it will. No idea if that would also include non-emojis, though.)
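
For comparison with the char-array scan above, java.util.regex.Pattern (Java 7 and later) also accepts a code point written as \x{h...h}, so a range of supplementary characters can be expressed without writing surrogate halves. A minimal sketch, assuming the 1F300-1F5FF range discussed in this answer; the class and method names are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodePointRangeRegex {
    // \x{h...h} names a code point directly, so no surrogate halves appear in the pattern.
    static final Pattern PICTOGRAPHS = Pattern.compile("[\\x{1F300}-\\x{1F5FF}]");

    // Counts how many times PICTOGRAPHS matches in s.
    static int countMatches(String s) {
        Matcher m = PICTOGRAPHS.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String inRange = new StringBuilder("star ").appendCodePoint(0x1F31F).toString();
        String outOfRange = new StringBuilder("joke ").appendCodePoint(0x1F606).toString();
        System.out.println(countMatches(inRange));    // prints 1
        System.out.println(countMatches(outOfRange)); // prints 0
    }
}
```

As with the scan above, the emoji in the question (U+1F606) fall outside this range, so the second call finds nothing; widening the upper bound to 1F6FF would include them.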



Source of my program to find out what the surrogate ranges were:

public class FindRanges {

    public static void main(String[] args) {
        char last0 = '\0';
        char last1 = '\0';
        for (int x = 0x1F300; x <= 0x1F5FF; ++x) {
            char[] chars = new StringBuilder().appendCodePoint(x).toString().toCharArray();
            if (chars[0] != last0) {
                if (last0 != '\0') {
                    System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
                }
                System.out.print("\\u" + Integer.toHexString((int)chars[0]).toUpperCase() + " \\u" + Integer.toHexString((int)chars[1]).toUpperCase());
                last0 = chars[0];
            }
            last1 = chars[1];
        }
        if (last0 != '\0') {
            System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
        }
    }
}


Output:

\uD83C \uDF00-\uDFFF
\uD83D \uDC00-\uDDFF
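
The char-array scan earlier in this answer compares surrogate halves by hand. A sketch of the same search done by code point, using String.codePointAt and Character.charCount so the surrogate ranges never have to be worked out; CodePointScan and indicesInRange are hypothetical names, and the upper bound 0x1F6FF corresponds to the 0xDEFF change suggested above:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointScan {
    // Returns the char index of every code point in [lo, hi] found in s.
    static List<Integer> indicesInRange(String s, int lo, int hi) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // reads a full surrogate pair when present
            if (cp >= lo && cp <= hi) {
                hits.add(i);
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return hits;
    }

    public static void main(String[] args) {
        String s = new StringBuilder("Thats a nice joke ")
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .toString();
        System.out.println(indicesInRange(s, 0x1F300, 0x1F6FF)); // prints [18, 20]
    }
}
```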
    
                                                                    
                                                        
            
              

何必那么认真
Reply #6 · 2020-01-23 04:00

I had a similar problem. The following served me well and matches surrogate pairs:

public class SplitByUnicode {
    public static void main(String[] argv) throws Exception {
        String string = "Thats a nice joke                                                                     
                                                        
            

              

\"骚年 ilove
Reply #7 · 2020-01-23 04:00

This worked for me in Java 8:

public static String mysqlSafe(String input) {
    if (input == null) return null;
    StringBuilder sb = new StringBuilder();

    for (int i = 0; i < input.length(); i++) {
        // Characters above U+FFFF, which includes many emojis, take two chars
        // (a surrogate pair) in Java, e.g. a rocket emoji is "\uD83D\uDE80".
        if (i < input.length() - 1
                && Character.isSurrogatePair(input.charAt(i), input.charAt(i + 1))) {
            i += 1; // also skip the second char of the pair
            continue;
        }
        sb.append(input.charAt(i));
    }

    return sb.toString();
}
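
If the goal is only to drop everything outside the Basic Multilingual Plane, as the loop above does, a regex-based sketch of the same idea is possible. This assumes Java 7 or later for the \x{h...h} syntax; SupplementaryStripper and stripSupplementary are hypothetical names. Like the loop, it removes all supplementary characters, not just emojis, and leaves BMP emojis such as U+2600 untouched:

```java
class SupplementaryStripper {
    // Removes every supplementary code point (U+10000 and above); BMP text is kept.
    static String stripSupplementary(String input) {
        if (input == null) return null;
        return input.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    public static void main(String[] args) {
        String rocket = new StringBuilder("go ")
                .appendCodePoint(0x1F680)
                .append("!")
                .toString();
        System.out.println(stripSupplementary(rocket)); // prints: go !
    }
}
```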

    
                                                                    
                                                        
            
              