Android compare UTF-8 string with UTF-8 input stri

In my Android application, I want to compare a UTF-8 string, for example "bãi", with the string the user types into an EditText.
However, if I type "bãi" into the EditText and read the input with edittext.getText().toString(), the returned string does not equal "bãi".

I also tried

input = new String(input.getBytes("UTF-8"), "UTF-8");

but it does not work: input.equals("bãi") still returns false.

Does anyone know how to solve this problem? Thanks for any help.

In Unicode, certain characters can be represented in more than one way. For example, in the word bãi the middle character can be represented in two ways:

a single codepoint U+00E3 (LATIN SMALL LETTER A WITH TILDE)
two codepoints U+0061 (LATIN SMALL LETTER A) and U+0303 (COMBINING TILDE)

For display, both should look the same.
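
To make the difference visible, here is a minimal sketch that lists the code points of both representations. The class name CodePointDump and the helper describe are hypothetical names chosen for this illustration; only standard String and Character methods are used.

```java
public class CodePointDump {
    // Hypothetical helper: renders each code point of s as U+XXXX, separated by spaces.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance by 1 or 2 chars, depending on the code point
            if (i < s.length()) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("b\u00e3i"));   // precomposed form
        System.out.println(describe("ba\u0303i"));  // decomposed form
    }
}
```

Logging the user input this way should show which of the two representations the keyboard actually produced.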

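For string comparison, this poses a problem: equals() compares the underlying code units, so the two representations are not equal even though they look identical. The solution is to normalize both strings to the same form first, as described in Unicode Standard Annex #15 (Unicode Normalization Forms).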

Normalization is supported in Java (including Android) by the java.text.Normalizer class.

The code below shows the result:

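import java.text.Normalizer;
import java.text.Normalizer.Form;

// Same visible text, two different code point sequences
String s1 = "b\u00e3i";   // precomposed: U+00E3
String s2 = "ba\u0303i";  // decomposed: U+0061 U+0303
System.out.println(String.format("Before normalization: %s == %s => %b", s1, s2, s1.equals(s2)));

// Normalize both strings to the same form (here NFD) before comparing
String n1 = Normalizer.normalize(s1, Form.NFD);
String n2 = Normalizer.normalize(s2, Form.NFD);
System.out.println(String.format("After normalization:  %s == %s => %b", n1, n2, n1.equals(n2)));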

It outputs:

Before normalization: bãi == bãi => false
After normalization:  bãi == bãi => true

BTW: The form Form.NFD decomposes the strings, i.e. it creates the longer representation with two codepoints. Form.NFC would create the shorter form.
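
Applied to the question, a comparison helper could look like the sketch below. It assumes the typed text arrives in decomposed form, which the question does not confirm; the class name NormalizedCompare, the helper equalsNormalized, and the variable typed are hypothetical. It uses Form.NFC, but any form works as long as both sides use the same one.

```java
import java.text.Normalizer;

public class NormalizedCompare {
    // Hypothetical helper: true if a and b are equal after NFC normalization.
    public static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        // Stands in for edittext.getText().toString(); assumed to be decomposed.
        String typed = "ba\u0303i";
        System.out.println(typed.equals("b\u00e3i"));            // plain comparison
        System.out.println(equalsNormalized(typed, "b\u00e3i")); // normalized comparison
    }
}
```

In the app, the call would be equalsNormalized(edittext.getText().toString(), "bãi"). Normalizing both arguments avoids depending on how the source file or the keyboard encodes the character.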