I am trying to remove emoji from Arabic tweets using Java.
I used this code:
String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز
Java 5 and 6
If you are stuck running your program on a Java 5 or 6 JVM and you want to match characters in the range from U+1F601 to U+1F64F, use surrogate pairs in the character class:
"[\uD83D\uDE01-\uD83D\uDE4F]"
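As a rough illustration of the surrogate-pair approach, the sketch below strips code points U+1F601 to U+1F64F from a string. The class name `SurrogatePairEmojiFilter`, the method `stripEmoji`, and the sample text are made up for this example and are not from the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question's code.
final class SurrogatePairEmojiFilter {
    // The Unicode escapes in this literal are resolved by the Java compiler,
    // so the regex engine receives two real surrogate pairs and treats them
    // as the bounds of the code point range U+1F601..U+1F64F.
    private static final Pattern EMOJI =
            Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    static String stripEmoji(String text) {
        return EMOJI.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Two Arabic letters, a space, U+1F602 as a surrogate pair, then "ok".
        String sample = "\u0645\u0631 \uD83D\uDE02ok";
        System.out.println(stripEmoji(sample));
    }
}
```

Nothing here relies on APIs newer than Java 5, which is the point of this variant.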
This method remains valid in Java 7 and above: if you decompile the Pattern.compile() method in Sun/Oracle's implementation, you can see that the String containing the pattern is converted into an array of code points before compilation.
Java 7 and above
You can use the \x{...} construct from David Wallace's answer, which is available from Java 7. Alternatively, you can specify the whole Emoticons Unicode block with \p{InEmoticons}; the block spans from code point U+1F600 (instead of U+1F601) to U+1F64F.
Since support for the Emoticons block was added in Java 7, this method is also only valid from Java 7.
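To make the difference between the two Java 7 options concrete, here is a minimal sketch that compares the explicit hex range with the Emoticons block on a string containing U+1F600 and U+1F601. The class name `EmoticonsBlockDemo` and its members are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; requires Java 7 or later.
final class EmoticonsBlockDemo {
    // Hex code point notation. The backslash is doubled so that the regex
    // engine, not the Java compiler, interprets the escape.
    static final Pattern RANGE = Pattern.compile("[\\x{1F601}-\\x{1F64F}]");

    // The whole Emoticons block, U+1F600..U+1F64F.
    static final Pattern BLOCK = Pattern.compile("\\p{InEmoticons}");

    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // "x", then U+1F600 (first code point of the block), then U+1F601.
        String sample = new StringBuilder("x")
                .appendCodePoint(0x1F600)
                .appendCodePoint(0x1F601)
                .toString();
        // RANGE leaves U+1F600 in place (one char plus one surrogate pair).
        System.out.println(strip(RANGE, sample).length()); // prints 3
        // BLOCK removes both emoji.
        System.out.println(strip(BLOCK, sample).length()); // prints 1
    }
}
```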
Although the other methods are preferred, you can also specify supplementary characters by writing each surrogate as a regex escape, as in "[\\uD83D\\uDE01-\\uD83D\\uDE4F]". While there is no reason to do this in source code, this change in Java 7 corrects the behavior in applications where the regex is entered at run time for searching and pasting the character directly is not possible.
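The run-time case can be sketched as follows, assuming a Java 7+ JVM. The pattern is treated as plain ASCII text, as it would be if it came from a search box or a configuration file. `RuntimePatternDemo`, `strip`, and the sample text are hypothetical names and data for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo; assumes a Java 7+ runtime.
final class RuntimePatternDemo {
    // Compiles a pattern that arrives as plain text at run time and
    // removes every match from the given text.
    static String strip(String regexText, String text) {
        return Pattern.compile(regexText).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes as literal ASCII text,
        // which is what a user would type. No surrogate char appears in
        // the pattern string itself.
        String typedByUser = "[\\uD83D\\uDE01-\\uD83D\\uDE4F]";
        String sample = new StringBuilder("goal ")
                .appendCodePoint(0x1F602)
                .append('!')
                .toString();
        System.out.println(strip(typedByUser, sample)); // prints "goal !"
    }
}
```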
/!\ Warning
Never mix the two syntaxes when you specify a supplementary code point, like:
"[\\uD83D\uDE01-\\uD83D\\uDE4F]"
"[\uD83D\\uDE01-\\uD83D\\uDE4F]"
In Oracle's implementation, both of these patterns match the lone code point U+D83D plus the range from code point U+DE01 to code point U+1F64F, which is far wider than the intended range.
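One way to observe the pitfall is to test a character that is not an emoji but falls between U+DE01 and U+1F64F. The sketch below uses U+FE8D, an Arabic presentation form, and assumes the Java 7+ Oracle/OpenJDK behavior described above; the class name `MixedSyntaxPitfall` is made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the pitfall; it relies on the behavior described
// for Oracle's implementation on Java 7+.
final class MixedSyntaxPitfall {
    // Consistent: every escape is left to the regex engine.
    static final Pattern CONSISTENT =
            Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");

    // Mixed: the second escape is resolved by the compiler, so the
    // engine no longer sees two adjacent regex escapes to combine.
    static final Pattern MIXED =
            Pattern.compile("[\\uD83D\uDE01-\\uD83D\\uDE4F]");

    public static void main(String[] args) {
        // U+FE8D is an Arabic presentation form, not an emoji, but it
        // lies between U+DE01 and U+1F64F.
        String alef = "\uFE8D";
        System.out.println(CONSISTENT.matcher(alef).matches());
        System.out.println(MIXED.matcher(alef).matches());
    }
}
```

If the description above holds on your JVM, only the mixed pattern matches, which means it would also strip legitimate Arabic presentation forms from a tweet.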
Note
In Oracle's implementation for Java 5 and 6, Pattern.u() doesn't collapse valid regex-escaped surrogate pairs such as "\\uD83D\\uDE01". As a result, the pattern is interpreted as 2 lone surrogates, which will fail to match anything.
According to the Javadoc for the Pattern class, a supplementary code point can be written in hexadecimal notation with the \x{...} construct. This means that the regular expression that you're looking for is
([\x{1F601}-\x{1F64F}])
Of course, when you write this as a Java String literal, you must escape the backslashes. Note that the \x{...} construct is only available from Java 7.
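A minimal sketch of that regular expression written as a Java String literal, with each backslash doubled. It uses the capturing group to collect the matched emoji rather than remove them. `EmojiFinder` and `findEmoji` are hypothetical names, and Java 7+ is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; requires Java 7 or later.
final class EmojiFinder {
    // The regex from the text, with each backslash doubled for the
    // Java string literal.
    private static final Pattern EMOJI =
            Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

    // Returns the code points of all matched emoji, in order.
    static List<Integer> findEmoji(String text) {
        List<Integer> found = new ArrayList<Integer>();
        Matcher matcher = EMOJI.matcher(text);
        while (matcher.find()) {
            // Group 1 holds one emoji, stored as a surrogate pair.
            found.add(matcher.group(1).codePointAt(0));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = new StringBuilder("a")
                .appendCodePoint(0x1F601)
                .append("b")
                .appendCodePoint(0x1F64F)
                .toString();
        for (int codePoint : findEmoji(sample)) {
            System.out.println("U+" + Integer.toHexString(codePoint).toUpperCase());
        }
    }
}
```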