UTF8 Encoding?

Posted 2020-05-01 07:14

What is UTF-8 encoding, and why are text files saved in this format larger than files saved in other formats?

For example, I typed 'A' in Notepad and saved the file in UTF-8 format.

After that, the file size was 4 bytes. Why?

3 answers
老娘就宠你
Reply #2 · 2020-05-01 07:57

That's only because of the BOM (byte order mark). UTF-8 itself only uses more than one byte for characters whose code point is greater than 127 (non-ASCII), so a lone 'A' takes 1 byte and the other 3 bytes are the BOM.

Not all text editors do this. Notepad is notorious for it (the useless UTF-8 BOM).
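To illustrate the point about code points above 127, here is a minimal Python sketch that counts the bytes UTF-8 needs for a few sample characters. The characters chosen and the `sizes` name are only illustrative and do not come from the question.

```python
# Count how many bytes UTF-8 needs for a few sample characters.
# Escapes are used so the example does not depend on this file's own encoding.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji

sizes = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in sizes.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```

Only the ASCII character stays at one byte; none of these counts include a BOM, because `str.encode("utf-8")` does not write one.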

欢心
Reply #3 · 2020-05-01 08:11

It's almost certainly because whatever you're using to save the file is also including the byte order mark, which in UTF-8 is 0xEF 0xBB 0xBF.

As for what UTF-8 is - it's a Unicode encoding which uses progressively more bytes for higher Unicode values; importantly, ASCII characters are stored as single bytes (the same bytes as they would be in ASCII). So any ASCII file is also a UTF-8 file with the same text. This web page has more, as does Wikipedia.
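The 4-byte result can be reproduced in Python, assuming the editor behaves like Python's `utf-8-sig` codec (UTF-8 with a leading BOM). This is only a sketch of the byte layout, not a statement about how any particular editor is implemented.

```python
import codecs

text = "A"

plain = text.encode("utf-8")         # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with the BOM prepended

print(len(plain), len(with_bom))     # 1 4

# The extra 3 bytes are exactly the UTF-8 BOM, EF BB BF.
assert with_bom == codecs.BOM_UTF8 + plain
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# For ASCII-only text, the ASCII and UTF-8 encodings are the same bytes.
assert text.encode("ascii") == plain
```

Decoding with `utf-8-sig` strips the BOM again, so `with_bom.decode("utf-8-sig")` gives back `"A"`.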

兄弟一词,经得起流年.
Reply #4 · 2020-05-01 08:12

Because a BOM (byte order mark) was inserted at the start of the file.

The BOM is a special character, U+FEFF, that is not meant to carry any meaning of its own; it serves only as a way to detect the encoding of a file. You can read about it here: http://unicode.org/faq/utf_bom.html#BOM

In the case of UTF-8, the BOM is encoded as \xEF \xBB \xBF, which is where the 3 extra bytes come from. Notepad and other text editors look for the BOM to guess the encoding of the file. If an editor sees \xFF \xFE, it will assume the file is UCS-2 (UTF-16) encoded in little-endian format. A leading \xFE \xFF means UCS-2 (UTF-16) encoded in big-endian format.
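The BOM check described above can be sketched in Python as follows. `sniff_bom` is a hypothetical helper written for illustration; it is not how Notepad or any specific editor is implemented, and it deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess a codec name from a leading BOM; return None if there is none.

    Hypothetical helper for illustration only. UTF-32 BOMs are not handled.
    """
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None


print(sniff_bom(b"\xef\xbb\xbfA"))  # utf-8-sig
print(sniff_bom(b"\xff\xfeA\x00"))  # utf-16-le
print(sniff_bom(b"\xfe\xff\x00A"))  # utf-16-be
print(sniff_bom(b"A"))              # None
```

A file with no BOM gives `None`, which is why editors fall back to other heuristics or a default encoding in that case.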
