Is ED A0 80 ED B0 80 a valid UTF-8 byte sequence?

java.nio.charset.Charset.forName("utf8").decode decodes a byte sequence of

 ED A0 80 ED B0 80

into the Unicode codepoint:

 U+10000

java.nio.charset.Charset.forName("utf8").decode also decodes a byte sequence of

 F0 90 80 80

into the Unicode codepoint:

 U+10000

This is verified by the code below.

This seems to tell me that UTF-8 decodes both ED A0 80 ED B0 80 and F0 90 80 80 to the same Unicode code point.

However, the page at https://www.google.com/search?query=%ED%A0%80%ED%B0%80 is clearly different from the page at https://www.google.com/search?query=%F0%90%80%80.

Since Google Search is using the UTF-8 encoding scheme (correct me if I'm wrong) as well, this suggests that UTF-8 does not decode ED A0 80 ED B0 80 and F0 90 80 80 to the same Unicode code point(s).

So, according to the official standard, should a UTF-8 decoder turn the byte sequence ED A0 80 ED B0 80 into the Unicode code point U+10000?

Code:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class Test {

    // Decode the bytes as UTF-8 and print each resulting UTF-16 code unit in hex.
    static void dump(byte[] bytes) {
        CharBuffer cb = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
        for (int i = 0; i < cb.limit(); i++) {
            System.out.println(Integer.toHexString(cb.get(i)));
        }
    }

    public static void main(String[] args) {
        // UTF-8-style encoding of each half of the surrogate pair D800 DC00
        dump(new byte[] { (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 });
        System.out.println();
        // Four-byte UTF-8 encoding of U+10000
        dump(new byte[] { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 });
    }
}

Tags: java language-agnostic unicode utf-8

3 Answers

Viruses.

Reply #2 · 2019-04-05 17:18

ED A0 80 ED B0 80 is the UTF-8 encoding of the UTF-16 surrogate pair D800 DC00. This is NOT allowed in UTF-8:

However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance)...need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.

However, such an encoding is used in CESU-8 and Java's "Modified UTF-8".
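To make the difference concrete, here is a small sketch that encodes U+10000 both ways. It assumes that DataOutputStream.writeUTF is an acceptable stand-in for Java's "Modified UTF-8" (it also writes a two-byte length prefix first); the class and method names are my own, not from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingComparison {

    // Render bytes as space-separated uppercase hex, e.g. "F0 90 80 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Standard UTF-8: a supplementary code point becomes one four-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF: a two-byte
    // length prefix, then every UTF-16 code unit encoded on its own.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+10000 is stored in a Java String as the surrogate pair D800 DC00.
        String s = new String(Character.toChars(0x10000));
        System.out.println("UTF-8:          " + hex(standardUtf8(s)));
        System.out.println("Modified UTF-8: " + hex(modifiedUtf8(s)));
    }
}
```

After the length prefix, the second line should show the six bytes from the question, while the first shows the four-byte form.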

Since the Google Search is using UTF-8 encoding scheme (correct me if I'm wrong) as well,

It appears, based on the search box, that Google is using some kind of encoding auto-detection. If you pass it F0 90 80 80, which is valid UTF-8, it interprets it as UTF-8 (

干净又极端

Reply #3 · 2019-04-05 17:32

Java's "UTF8" decoder is really a CESU-8 variant. The first case is a surrogate pair encoded in UTF-8 "style".
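Whether this holds may depend on the JDK version, so it is worth checking on your own runtime. Charset.decode(ByteBuffer) silently replaces malformed input, which can hide a rejection; the sketch below instead uses a CharsetDecoder configured to report errors. The class and method names are made up for illustration. On recent JDKs I would expect the six-byte sequence to be rejected, but treat that as an assumption to verify rather than a guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class StrictUtf8Check {

    // Returns true if this JDK's UTF-8 decoder accepts the bytes as they are,
    // without replacing or skipping anything.
    static boolean isAccepted(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder reported malformed or unmappable input.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] surrogates = { (byte) 0xED, (byte) 0xA0, (byte) 0x80,
                              (byte) 0xED, (byte) 0xB0, (byte) 0x80 };
        byte[] fourByte = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        System.out.println("ED A0 80 ED B0 80 accepted: " + isAccepted(surrogates));
        System.out.println("F0 90 80 80 accepted:       " + isAccepted(fourByte));
    }
}
```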

干净又极端

Reply #4 · 2019-04-05 17:35

F0 90 80 80

decodes as U+10000, or LINEAR B SYLLABLE B008 A.

ED A0 80 ED B0 80

decodes as U+D800 U+DC00.
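
The two results above can be reproduced by hand with plain bit arithmetic on the UTF-8 bit patterns. This is only a sketch of the payload extraction, with no validation at all (a conforming decoder has to reject the surrogate results); the class and method names are invented for the example.

```java
// Hypothetical helper class, for illustration only.
final class ManualUtf8Arithmetic {

    // Payload bits of a three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
    static int decode3(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    // Payload bits of a four-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    static int decode4(int b1, int b2, int b3, int b4) {
        return ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12)
                | ((b3 & 0x3F) << 6) | (b4 & 0x3F);
    }

    public static void main(String[] args) {
        int high = decode3(0xED, 0xA0, 0x80);
        int low = decode3(0xED, 0xB0, 0x80);
        int codePoint = decode4(0xF0, 0x90, 0x80, 0x80);

        System.out.printf("ED A0 80 ED B0 80 -> U+%04X U+%04X%n", high, low);
        System.out.printf("F0 90 80 80       -> U+%04X%n", codePoint);

        // Both three-byte results fall in the surrogate range, and combining
        // them the UTF-16 way gives the same code point as the four-byte form.
        System.out.println("surrogate pair: "
                + Character.isSurrogatePair((char) high, (char) low));
        System.out.printf("combined        -> U+%04X%n",
                Character.toCodePoint((char) high, (char) low));
    }
}
```

So the six bytes only reach U+10000 if the decoder takes the extra UTF-16 step of combining the two surrogates, which is what CESU-8 style decoding does.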