What's different between UTF-8 and UTF-8 without a BOM? Which is better?
相关问题
- UrlEncodeUnicode and browser navigation errors
- WebElement.getText() function and utf8
- How to convert a string to a byte array which is c
- Character Encoding in iframes
- Unicode issue with makemessages --all Django 1.6.2
相关文章
- iconv() Vs. utf8_encode()
- Why is `'↊'.isnumeric()` false?
- How to display unicode in SVG?
- When sending XML to JMS should I use TextMessage o
- Spanish Characters in HTML Page Title
- Google app engine datastore string encoding proble
- UnicodeEncodeError when saving ImageField containi
- How can i get know that my String contains diacrit
From http://en.wikipedia.org/wiki/Byte-order_mark:
Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.
My real problem with the absence of BOM is the following. Suppose we've got a file which contains:
Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:
Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.
I am using multiple languages (even Cyrillic) on my pages for a long time and when the files are saved without BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.
Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.
I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.
As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.
Invariably the layout would get destroyed when saving. It took my some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer destroying the layout, again. After fiddling with the linked CSS files for hours to no avail I discovered that Internet Explorer didn't like the BOMfed HTML file. Never again.
Also, I just found this in Wikipedia:
When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.
But this is not the case when we have text, CSV and XML files, either on Windows or Linux.
For example, a text file in Windows or Linux, one of the easiest things imaginable, it is not (usually) UTF-8.
Save it as XML and declare it as UTF-8:
It will not display (it will not be be read) correctly, even if it's declared as UTF-8.
I had a string of data containing French letters, that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in IDE and "Create New File") or adding the BOM at the beginning of the file
I was not able to save the French letters in an XML file.
The UTF-8 BOM is a sequence of Bytes at the start of a text-stream (EF BB BF) that allows the reader to more reliably guess a file as being encoded in UTF-8.
Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
According to the Unicode standard, the BOM for UTF-8 files is not recommended:
Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.
On the meaning of the BOM and UTF-8:
Argument for NOT using a BOM:
Argument FOR using a BOM:
On which is better, WITH or WITHOUT the BOM:
My Conclusion:
Use the BOM only if compatibility with a software application is absolutely essential.
Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such
type
andmore
do not expect the BOM to be present. If the BOM is present, it can be problematic as it is for other applications.† The
chcp
command offers support for UTF-8 (without the BOM) via code page 65001.