java.nio.charset.Charset.forName("utf8").decode decodes a byte sequence of
ED A0 80 ED B0 80
into the Unicode codepoint:
U+10000
java.nio.charset.Charset.forName("utf8").decode also decodes a byte sequence of
F0 90 80 80
into the Unicode codepoint:
U+10000
This is verified by the code below.
This seems to tell me that the UTF-8 encoding scheme decodes ED A0 80 ED B0 80
and F0 90 80 80
into the same Unicode codepoint.
However, if I visit https://www.google.com/search?query=%ED%A0%80%ED%B0%80,
I can see that it is clearly different from the page https://www.google.com/search?query=%F0%90%80%80.
Since Google Search uses the UTF-8 encoding scheme as well (correct me if I'm wrong),
this suggests that UTF-8 does not decode ED A0 80 ED B0 80
and F0 90 80 80
into the same Unicode codepoint(s).
So basically I was wondering: by the official standard, should UTF-8 decode the byte sequence ED A0 80 ED B0 80
into the Unicode codepoint U+10000?
Code:
public class Test {
    public static void main(String[] args) {
        // First input: the surrogates D800 and DC00, each written as a 3-byte sequence.
        java.nio.ByteBuffer bb = java.nio.ByteBuffer.wrap(new byte[] { (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 });
        java.nio.CharBuffer cb = java.nio.charset.Charset.forName("utf8").decode(bb);
        // Print every decoded UTF-16 code unit in hex.
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
        System.out.println();
        // Second input: the regular 4-byte UTF-8 encoding of U+10000.
        bb = java.nio.ByteBuffer.wrap(new byte[] { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 });
        cb = java.nio.charset.Charset.forName("utf8").decode(bb);
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
    }
}
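ED A0 80 ED B0 80 is the UTF-8-style encoding of the UTF-16 surrogate pair D800 DC00.
This is NOT allowed in UTF-8: UTF-8 may only encode Unicode scalar values, and those exclude the surrogate range U+D800..U+DFFF.
However, such an encoding is used in CESU-8 and Java's "Modified UTF-8".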
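The relationship between those bytes and the surrogate pair can be checked by hand. Below is a minimal sketch of the bit arithmetic, assuming a hand-written helper rather than the JDK decoder; the class name SurrogateBytes and both method names are made up for illustration. It unpacks a three-byte sequence the way UTF-8 lays out its bits, with no validation, and separately flags an ED lead byte followed by A0..BF, which is the pattern that well-formed UTF-8 forbids.

```java
class SurrogateBytes {
    // Unpack a 3-byte UTF-8-style sequence (1110xxxx 10xxxxxx 10xxxxxx).
    // Deliberately performs no validation, so surrogates pass through.
    static int unpack3(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    // Returns true if the input contains ED followed by A0..BF,
    // i.e. a surrogate (U+D800..U+DFFF) written as three bytes.
    static boolean hasEncodedSurrogate(byte[] bytes) {
        for (int i = 0; i + 1 < bytes.length; i++) {
            int lead = bytes[i] & 0xFF;
            int next = bytes[i + 1] & 0xFF;
            if (lead == 0xED && next >= 0xA0 && next <= 0xBF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] surrogateStyle = { (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 };
        byte[] regular = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };

        System.out.println(Integer.toHexString(unpack3(0xED, 0xA0, 0x80))); // d800
        System.out.println(Integer.toHexString(unpack3(0xED, 0xB0, 0x80))); // dc00
        System.out.println(hasEncodedSurrogate(surrogateStyle));            // true
        System.out.println(hasEncodedSurrogate(regular));                   // false
    }
}
```

The scan in hasEncodedSurrogate is only a quick check for this one pattern, not a full UTF-8 validator.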
It appears, based on the search box, that Google is using some kind of encoding auto-detection. If you pass it F0 90 80 80, which is valid UTF-8, it interprets it as UTF-8.
Java's UTF8 is really a CESU-8 variant. The first case is using surrogate pairs encoded in UTF8 "style".
F0 90 80 80 decodes as U+10000, or LINEAR B SYLLABLE B008 A.
ED A0 80 ED B0 80 decodes as U+D800 U+DC00.
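The two results are related: combining the surrogate pair D800 DC00 by the UTF-16 rule gives U+10000, which is why a decoder that tolerates encoded surrogates ends up with the same string. Here is a small sketch using standard JDK calls; the class name CodePointDemo and its helper methods are hypothetical, chosen only for this illustration.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // Combine a UTF-16 surrogate pair into one code point:
    // 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    static int combine(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    // Encode a single code point as UTF-8 bytes.
    static byte[] encode(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes and return the first code point of the result.
    static int decodeFirst(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        int cp = combine((char) 0xD800, (char) 0xDC00);
        System.out.println("U+" + Integer.toHexString(cp).toUpperCase()); // U+10000
        System.out.println(Character.getName(cp));

        // The valid UTF-8 form of that code point is the 4-byte sequence.
        for (byte b : encode(cp)) {
            System.out.printf("%02X ", b & 0xFF); // F0 90 80 80
        }
        System.out.println();
    }
}
```

So the only standard UTF-8 byte sequence for U+10000 is F0 90 80 80; the six-byte form is the CESU-8 way of writing the same character.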