Android compare UTF-8 string with UTF-8 input stri

Posted 2019-05-07 19:11

Question:

In my Android application, I want to compare a UTF-8 string, for example "bãi", with a string that the user types into an EditText.
However, if I type "bãi" into the EditText and read the input with edittext.getText().toString(), the returned string does not equal "bãi".

I also tried

String input = new String(input.getBytes("UTF-8"), "UTF-8");

but it does not work: input.equals("bãi") still returns false.

Does anyone know how to solve this problem? Thanks for any help.

Answer 1:

In Unicode, certain characters can be represented in more than one way. For example, in the word bãi the middle character can be represented in two ways:

  1. a single codepoint U+00E3 (LATIN SMALL LETTER A WITH TILDE)
  2. two codepoints U+0061 (LATIN SMALL LETTER A) and U+0303 (COMBINING TILDE)

For display, both should look the same.
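
To make the difference visible, here is a minimal sketch that prints the length and code points of both representations. The class name CodePointDump and the helper codePoints are made up for this illustration, and the code assumes plain Java 8 or later.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Render each code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String composed = "b\u00e3i";    // single code point for the middle character
        String decomposed = "ba\u0303i"; // base letter followed by a combining tilde
        // prints "3: U+0062 U+00E3 U+0069"
        System.out.println(composed.length() + ": " + codePoints(composed));
        // prints "4: U+0062 U+0061 U+0303 U+0069"
        System.out.println(decomposed.length() + ": " + codePoints(decomposed));
    }
}
```

Both strings render as "bãi", but they differ in length and content, which is why equals() reports false.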

For string comparison, this poses a problem. The solution is to normalize both strings to the same form first, as described in Unicode Standard Annex #15 (Unicode Normalization Forms).

Normalization is supported in Java (including Android) by the java.text.Normalizer class.

The code below shows the result:

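// Requires these imports:
// import java.text.Normalizer;
// import java.text.Normalizer.Form;

String s1 = "b\u00e3i";   // precomposed: U+00E3
String s2 = "ba\u0303i";  // decomposed: U+0061 followed by U+0303
System.out.println(String.format("Before normalization: %s == %s => %b", s1, s2, s1.equals(s2)));

// Bring both strings into the same normalization form before comparing.
String n1 = Normalizer.normalize(s1, Form.NFD);
String n2 = Normalizer.normalize(s2, Form.NFD);
System.out.println(String.format("After normalization:  %s == %s => %b", n1, n2, n1.equals(n2)));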

It outputs:

Before normalization: bãi == bãi => false
After normalization:  bãi == bãi => true

BTW: The form Form.NFD decomposes the strings, i.e. it creates the longer representation with two codepoints. Form.NFC would create the shorter form.
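
Applied to the question, the comparison could be wrapped in a small helper like the sketch below. The class NormalizedCompare and the method equalsNormalized are hypothetical names, and choosing NFC here is an assumption; any form works as long as both sides use the same one.

```java
import java.text.Normalizer;

class NormalizedCompare {
    // Compare two strings after converting both to the same normalization form (NFC).
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String expected = "b\u00e3i"; // the reference string, precomposed
        String typed = "ba\u0303i";   // stands in for edittext.getText().toString()
        System.out.println(equalsNormalized(expected, typed)); // prints "true"
    }
}
```

In the Android app, the second argument would be the value returned by edittext.getText().toString().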



Tags: android, utf-8