Replacing Emoji Unicode Range from Arabic Tweets u

Posted 2019-03-21 15:13

I am trying to replace emoji in Arabic tweets using Java.

I used this code:

String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز                 

2 answers
走好不送
#2 · 2019-03-21 15:37

Java 5 and 6

If you are stuck running your program on a Java 5 or 6 JVM and you want to match characters in the range from U+1F601 to U+1F64F, use surrogate pairs in the character class:

Pattern emoticons = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

This method also works in Java 7 and above: in Sun/Oracle's implementation (as decompiling Pattern.compile() shows), the String containing the pattern is converted into an array of code points before compilation, so each literal surrogate pair is treated as a single code point.
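As a minimal sketch of the surrogate-pair approach, the class below compiles the same character class and uses it to delete matches. The class name SurrogatePairRange and the helper strip are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class SurrogatePairRange {
    // Character class written with literal surrogate pairs: U+1F601 to U+1F64F.
    // The compiler turns these escapes into the actual surrogate chars.
    static final Pattern EMOTICONS =
            Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // Removes every code point in the range from the input.
    static String strip(String text) {
        return EMOTICONS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String grin = "\uD83D\uDE01"; // U+1F601, two chars but one code point
        System.out.println(grin.length());                         // 2
        System.out.println(grin.codePointCount(0, grin.length())); // 1
        System.out.println(strip("a" + grin + "b"));               // ab
    }
}
```

Note that U+1F600 lies just outside this range, so it is left in place.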

Java 7 and above

  1. You can use the construct \x{...} in David Wallace's answer, which is available from Java 7.

  2. Alternatively, you can specify the whole Emoticons Unicode block, which spans code points U+1F600 (rather than U+1F601) to U+1F64F.

    Pattern emoticons = Pattern.compile("\\p{InEmoticons}");
    

    Since support for the Emoticons block was added in Java 7, this method is also only valid from Java 7.

  3. Although the other methods are preferred, you can also specify supplementary characters by writing the surrogate pair as escapes inside the regex itself. There is no reason to do this in source code, but this change in Java 7 corrects the behavior of applications where a regex is used for searching and the character cannot be pasted directly.

    Pattern emoticons = Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");
    

    /!\ Warning

    Never mix the two syntaxes when you specify a supplementary code point, for example:

    • "[\\uD83D\uDE01-\\uD83D\\uDE4F]"

    • "[\uD83D\\uDE01-\\uD83D\\uDE4F]"

    In Oracle's implementation, both of these match the lone code point U+D83D plus the range from code point U+DE01 to code point U+1F64F, not the intended range.
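The three Java 7+ options above can be compared side by side. This is a rough sketch assuming a Java 7 or later runtime; the class name EmoticonSyntaxes and the helper matches are hypothetical. It probes each pattern with code points at and just outside the edges of the range.

```java
import java.util.regex.Pattern;

// Hypothetical comparison class, for illustration only.
final class EmoticonSyntaxes {
    // Option 1: code point escapes understood by the regex engine.
    static final Pattern HEX = Pattern.compile("[\\x{1F601}-\\x{1F64F}]");
    // Option 2: the whole Emoticons block, which also includes U+1F600.
    static final Pattern BLOCK = Pattern.compile("\\p{InEmoticons}");
    // Option 3: regex-escaped surrogate pairs, every backslash doubled.
    static final Pattern ESCAPED =
            Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");

    // True if the pattern matches the single given code point.
    static boolean matches(Pattern p, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        int[] samples = {0x1F600, 0x1F601, 0x1F64F, 0x1F650};
        for (int cp : samples) {
            System.out.printf("U+%X hex=%b block=%b escaped=%b%n",
                    cp, matches(HEX, cp), matches(BLOCK, cp), matches(ESCAPED, cp));
        }
    }
}
```

Only the block form should accept U+1F600; all three should agree on every other sample.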

Note

In Oracle's implementation for Java 5 and 6, Pattern.u() does not combine a valid regex-escaped surrogate pair such as "\\uD83D\\uDE01" into a single code point. As a result, the pattern is interpreted as two lone surrogates, which will fail to match anything.

Rolldiameter
#3 · 2019-03-21 15:43

From the Javadoc for the Pattern class:

A Unicode character can also be represented in a regular-expression by using its Hex notation(hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.

This means that the regular expression that you're looking for is ([\x{1F601}-\x{1F64F}]). Of course, when you write this as a Java String literal, you must escape the backslashes.

Pattern unicodeOutliers = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

Note that the construct \x{...} is only available from Java 7.
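To apply this to the original problem, the pattern can be passed through a Matcher and every match replaced with an empty string. The sketch below assumes Java 7 or later; the class name EmoticonStripper, the method strip, and the sample text are hypothetical and stand in for the truncated tweet in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class EmoticonStripper {
    // Same range as in the answer: U+1F601 to U+1F64F.
    static final Pattern EMOTICONS =
            Pattern.compile("[\\x{1F601}-\\x{1F64F}]");

    // Returns the tweet with every code point in the range removed.
    static String strip(String tweet) {
        return EMOTICONS.matcher(tweet).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample: the Arabic word "marhaban" (hello), written with
        // BMP escapes to keep the source ASCII, then a space and U+1F602.
        String tweet = "\u0645\u0631\u062D\u0628\u0627 \uD83D\uDE02";
        String cleaned = strip(tweet);
        // Five Arabic letters and one space remain.
        System.out.println(cleaned.length()); // 6
    }
}
```

Arabic letters sit in the Basic Multilingual Plane, far from this range, so they pass through untouched.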
