What is the regex to extract all the emojis from a

Posted 2020-01-23 03:47

I have a String encoded in UTF-8. For example:

Thats a nice joke                 

14 answers
做自己的国王
#2 · 2020-01-23 04:10

Assuming that you are asking for standard Unicode emoji ranges (there are different blocks by vendor) you may consider these three ranges:

  • 0x20a0 - 0x32ff
  • 0x1f000 - 0x1ffff
  • 0xfe4e5 - 0xfe4ee

In addition to the thoughtful explanation that T.J.Crowder has shared with us, it is worth noting that from Java 7 onward you can match characters that UTF-16 encodes as surrogate pairs with ease.

Take a look at the docs:

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

A Unicode character can also be represented in a regular-expression by using its Hex notation(hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.
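As a rough sketch of what that looks like in practice, the three ranges listed above can be written as one character class using \x{...}. The class name, the method name, and the sample string are made up for illustration, and the ranges are only the approximate ones from this answer, not a complete emoji definition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class CodePointRangeMatcher {
    // The three ranges from the list above. \x{...} needs Java 7 or later.
    static final Pattern EMOJI_RANGES = Pattern.compile(
            "[\\x{20a0}-\\x{32ff}\\x{1f000}-\\x{1ffff}\\x{fe4e5}-\\x{fe4ee}]");

    // Returns every matching code point as its own String
    // (a supplementary character comes back as a two-char surrogate pair).
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMOJI_RANGES.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed sample input; the escapes are U+1F606 and U+1F61B.
        String text = "Thats a nice joke \uD83D\uDE06\uD83D\uDE06 \uD83D\uDE1B";
        System.out.println(findAll(text));
    }
}
```

Because the regex engine works on code points, each surrogate pair is matched as a single character rather than as two separate chars.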

Nevertheless, if you cannot switch to Java 7, you can extend the valuable UnicodeEscaper provided by Guava.

Here is an implementation for the sake of example:

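import com.google.common.escape.UnicodeEscaper;

public class SimpleEscaper extends UnicodeEscaper
{
    @Override
    protected char[] escape(int codePoint)
    {
        // Replace code points in the 0x1f000 - 0x1ffff range with their hex value
        if (0x1f000 <= codePoint && codePoint <= 0x1ffff)
        {
            return Integer.toHexString(codePoint).toCharArray();
        }

        // Leave every other code point unchanged
        return Character.toChars(codePoint);
    }
}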
我欲成王,谁敢阻挡
#3 · 2020-01-23 04:12

There are two ways to solve this sticky problem.

The first is to use a third-party library such as emoji-java or emoji4j, both mentioned above. They give you ready-made methods such as containsEmoji or removesEmoji, but in your own apps you then have to keep up with new releases of these libraries.

As for me, I wanted a simple solution to this problem.

After a whole day of searching, I found a magic regex:

"(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)"

I have tested it in Java and it works. It solved my problem perfectly.

You can view it on the GitHub page:
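For anyone wondering how to apply it, here is a minimal sketch of extracting and stripping matches with Pattern and Matcher. To keep it readable it uses only a short subset of the alternatives from the regex above; the class name, the method names, and the sample string are my own, so substitute the full pattern in real code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class SurrogatePairRegexDemo {
    // Abbreviated subset of the pattern above, for illustration only.
    // The \\u escapes are resolved by the Java compiler, so the regex engine
    // sees real surrogate pairs and treats each pair as one code point.
    static final Pattern EMOJI = Pattern.compile(
            "(?:[\uD83C\uDF00-\uD83D\uDDFF]"
            + "|[\uD83D\uDE00-\uD83D\uDE4F]"
            + "|[\u2600-\u26FF]\uFE0F?)");

    // Collects every match, including a trailing variation selector if present.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMOJI.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Removes every match from the text.
    static String strip(String text) {
        return EMOJI.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed sample input: U+2600 U+FE0F and U+1F600.
        String text = "Sunny \u2600\uFE0F day \uD83D\uDE00";
        System.out.println(extract(text));
        System.out.println(strip(text));
    }
}
```

The optional \uFE0F after the BMP ranges keeps the variation selector attached to its symbol, so stripping does not leave an invisible character behind.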

https://github.com/zly394/EmojiRegex

Notes:

The answer provided by @Eric Nakagawa contains some errors and does not work properly.
