What's the difference between UTF-8 and UTF-8 without a BOM?

Posted 2018-12-31 01:21

What's the difference between UTF-8 and UTF-8 without a BOM? Which is better?

20 answers
ら面具成の殇う
#2 · 2018-12-31 01:38

From http://en.wikipedia.org/wiki/Byte-order_mark:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.

My real problem with the absence of BOM is the following. Suppose we've got a file which contains:

abc

Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:

abc-αβγ

Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
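To make the byte counts concrete, here is a minimal Python sketch. It assumes that "ANSI" on the other user's machine means the Windows Greek code page (cp1253); that choice is an illustration, not something stated above.

```python
# Illustrative sketch: the same three Greek letters in two encodings.
text = "αβγ"

utf8_bytes = text.encode("utf-8")    # two bytes per Greek letter
ansi_bytes = text.encode("cp1253")   # assumed "ANSI" code page: one byte per letter

print(len(utf8_bytes))  # 6
print(len(ansi_bytes))  # 3
```

A file that holds the 3-byte form is not valid UTF-8, which is the mismatch that shows up later in the chain.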

骚的不知所云
#3 · 2018-12-31 01:39

I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.

I have been using multiple languages (even Cyrillic) on my pages for a long time, and when the files are saved without a BOM and I re-open them in an editor (as cherouvim also noted), some characters are corrupted.

Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.

I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.

像晚风撩人
#4 · 2018-12-31 01:39

As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.

Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer that destroyed the layout again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOMfed HTML file. Never again.

Also, I just found this in Wikipedia:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns
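The effect described in that quote can be sketched in a few lines of Python. The helper name starts_with_shebang and the sample script are hypothetical; the check only imitates a loader that compares the first two bytes against 0x23 0x21.

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """Return True if the first two bytes are 0x23 0x21 ('#!')."""
    return data[:2] == b"#!"

script = b"#!/bin/sh\necho hello\n"          # hypothetical script contents
script_with_bom = codecs.BOM_UTF8 + script   # same script saved as UTF-8 + BOM

print(starts_with_shebang(script))           # True
print(starts_with_shebang(script_with_bom))  # False: the file starts with EF BB BF
```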

伤终究还是伤i
#5 · 2018-12-31 01:40

When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.

But this is not the case when we have text, CSV and XML files, either on Windows or Linux.

For example, a plain text file on Windows or Linux, one of the simplest things imaginable, is (usually) not UTF-8.

Save it as XML and declare it as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

It will not display (it will not be read) correctly, even if it's declared as UTF-8.

I had a string of data containing French letters, that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in IDE and "Create New File") or adding the BOM at the beginning of the file

$file="\xEF\xBB\xBF".$string;

I was not able to save the French letters in an XML file.
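For comparison, here is a rough Python equivalent of the PHP line above. The XML content is a made-up sample; Python's "utf-8-sig" codec writes the same three bytes (EF BB BF) in front of the UTF-8 data, while plain "utf-8" writes none.

```python
# Hypothetical sample document with French letters.
text = '<?xml version="1.0" encoding="UTF-8"?>\n<title>Élève à Noël</title>\n'

plain = text.encode("utf-8")        # UTF-8 without a BOM
signed = text.encode("utf-8-sig")   # same bytes, prefixed with EF BB BF

print(signed[:3].hex())             # efbbbf
print(signed[3:] == plain)          # True
```

Either way the text has to be encoded as UTF-8 explicitly; the BOM only labels bytes that are already UTF-8.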

十年一品温如言
#6 · 2018-12-31 01:41

The UTF-8 BOM is a sequence of bytes (EF BB BF) at the start of a text stream that allows the reader to more reliably guess that a file is encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
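A reader can check for the three signature bytes directly. This is a minimal Python sketch: has_utf8_bom is a hypothetical helper name, while codecs.BOM_UTF8 and the "utf-8-sig" codec come from the standard library.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

with_bom = b"\xef\xbb\xbfabc"
without_bom = b"abc"

print(has_utf8_bom(with_bom))     # True
print(has_utf8_bom(without_bom))  # False

# "utf-8-sig" drops a leading BOM if present; plain "utf-8" keeps it as U+FEFF.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))  # True
print(with_bom.decode("utf-8")[0] == "\ufeff")                          # True
```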

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

低头抚发
#7 · 2018-12-31 01:41

Question: What's the difference between UTF-8 and UTF-8 without a BOM? Which is better?

Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.

On the meaning of the BOM and UTF-8:

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.

Argument for NOT using a BOM:

The primary motivation for not using a BOM is backwards-compatibility with software that is not Unicode-aware... Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.

Argument FOR using a BOM:

The argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.

Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not, because the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries are trying to distinguish allow all possible byte sequences). Therefore not all Unicode-aware programs perform such an analysis and instead rely on the BOM.

In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.
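The claim that UTF-8 is comparatively easy to detect can be illustrated with a strict decode attempt. This is only a sketch of the idea, not how any of the named libraries work; looks_like_utf8 is a hypothetical helper and cp1253 is an arbitrary 8-bit code page chosen for the example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

utf8_sample = "αβγ".encode("utf-8")
legacy_sample = "αβγ".encode("cp1253")  # 8-bit code page, chosen for illustration

print(looks_like_utf8(utf8_sample))    # True
print(looks_like_utf8(legacy_sample))  # False
print(looks_like_utf8(b"abc"))         # True: pure ASCII is also valid UTF-8
```

Text that is pure ASCII passes the check in either encoding, which matches the quoted remark about files that contain only ASCII characters.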

On which is better, WITH or WITHOUT the BOM:

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”

My Conclusion:

Use the BOM only if compatibility with a software application is absolutely essential.

Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic, as it is for other applications.


† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.
