What's different between UTF-8 and UTF-8 without a BOM? Which is better?
The other excellent answers have already noted that the UTF-8 BOM consists of the three bytes:
EF BB BF
But, as additional information, the BOM for UTF-8 could be a good way to "smell" whether a string was encoded in UTF-8... or it could be a legitimate string in any other encoding...
For example, the data [EF BB BF 41 42 43] could either be:
- the legitimate ISO-8859-1 string "ABC"
- the legitimate UTF-8 string "ABC" (preceded by a BOM)
So while it can be nice to guess the encoding of a file's content by looking at its first bytes, you should not rely on this, as shown by the example above.
Encodings should be known, not divined.
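To see the ambiguity concretely, here is a small Python sketch using the bytes from the example above ("utf-8-sig" is Python's BOM-aware UTF-8 codec):

```python
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

# Read as UTF-8 with a BOM: the "utf-8-sig" codec strips the BOM.
print(data.decode("utf-8-sig"))    # ABC

# Read as ISO-8859-1: the very same bytes are the legitimate string "ï»¿ABC".
print(data.decode("iso-8859-1"))   # ï»¿ABC
```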
BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, an HTML file, a JSON response, RSS, etc.) and causes the kind of embarrassment seen in the recent encoding issue during Obama's talk on Twitter. It's very annoying when it shows up in places that are hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.
UTF-8 with a BOM only helps if the file actually contains some non-ASCII characters. If the BOM is included and there aren't any, then it will possibly break older applications that would otherwise have interpreted the file as plain ASCII. These applications will definitely fail when they come across a non-ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.
Edit: Just want to make it clear that I prefer not to have the BOM at all. Add it only if some old rubbish breaks without it, and replacing that legacy application is not feasible.
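If you do have to cope with files that may or may not carry a BOM, reading tolerantly and writing without one is a reasonable policy. A minimal sketch in Python (the file names are just placeholders):

```python
# "utf-8-sig" decodes UTF-8 and silently strips a leading BOM if present,
# so it accepts files both with and without a BOM.
with open("data.txt", encoding="utf-8-sig") as f:
    text = f.read()

# When writing, plain "utf-8" emits no BOM, which is usually what you want.
with open("out.txt", "w", encoding="utf-8") as f:
    f.write(text)
```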
Don't make anything expect a BOM for UTF-8.
One practical difference is that if you write a shell script for Mac OS X and save it as UTF-8 with a BOM (as some editors do by default), you will get a response along the lines of:
#!/bin/bash: No such file or directory
in response to the shebang line specifying which shell you wish to use:
#!/bin/bash
If you save it as UTF-8 with no BOM (say, in BBEdit), all will be well.
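This is easy to reproduce on any Unix-like system. Here is a sketch in Python (the exact failure mode varies: the kernel may refuse the exec outright, or a shell may try to run the BOM-prefixed line as a command):

```python
import os
import subprocess
import tempfile

script = b"#!/bin/sh\necho it works\n"

for label, prefix in (("without BOM", b""), ("with BOM", b"\xef\xbb\xbf")):
    fd, path = tempfile.mkstemp()
    os.write(fd, prefix + script)   # the BOM, if any, ends up before "#!"
    os.close(fd)
    os.chmod(path, 0o755)
    try:
        out = subprocess.run([path], capture_output=True, text=True)
        print(label, "->", out.stdout.strip())
    except OSError as exc:
        print(label, "->", exc)     # e.g. [Errno 8] Exec format error
    finally:
        os.remove(path)
```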
The Unicode Byte Order Mark (BOM) FAQ provides a concise answer: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."
It's an old question with many good answers, but one thing should be added. All the answers are very general. What I'd like to add are examples of BOM usage that actually cause real problems, and yet many people don't know about them.
BOM breaks scripts
Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts, or any other executable that needs to be run by an interpreter all start with a shebang line, which looks like one of these (typical examples):
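```
#!/bin/sh
#!/bin/bash
#!/usr/bin/perl
#!/usr/bin/python3
#!/usr/bin/env node
```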
It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But the "#!" characters are not just characters: they are in fact a magic number that happens to be composed of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it has a different magic number, and that can lead to problems.
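To make the magic-number point concrete, here is a small Python sketch comparing the first bytes of the same script saved with and without a BOM:

```python
# The kernel recognizes a script by the two-byte magic number 0x23 0x21 ("#!").
good = b"#!/bin/sh\necho hi\n"
bad = b"\xef\xbb\xbf" + good      # the same script, saved with a UTF-8 BOM

print(good[:2])                   # b'#!' -> a valid script magic number
print(bad[:3])                    # b'\xef\xbb\xbf' -> what the kernel sees instead
print(bad.startswith(b"#!"))      # False: the file no longer looks like a script
```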
See the Wikipedia article Shebang, section Magic number.
BOM is illegal in JSON
See RFC 7159, Section 8.1: "Implementations MUST NOT add a byte order mark to the beginning of a JSON text."
BOM is redundant in JSON
Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and the endianness used in any JSON stream (see this answer for details).
BOM breaks JSON parsers
Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:
Determining the encoding and endianness of JSON by examining the first four bytes for NUL bytes:
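```
00 00 00 xx  UTF-32BE
00 xx 00 xx  UTF-16BE
xx 00 00 00  UTF-32LE
xx 00 xx 00  UTF-16LE
xx xx xx xx  UTF-8
```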
Now, if the file starts with a BOM, it will look like this:
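```
00 00 FE FF  UTF-32BE
FE FF 00 xx  UTF-16BE
FF FE 00 00  UTF-32LE
FF FE xx 00  UTF-16LE
EF BB BF xx  UTF-8
```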
Note that:
- UTF-32BE doesn't start with three NULs, so it won't be recognized
- UTF-32LE: the first byte is not followed by three NULs, so it won't be recognized
- UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
- UTF-16LE has only one NUL in the first four bytes, so it won't be recognized
Depending on the implementation, all of these may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.
Additionally, if the implementation tests for valid JSON as I recommend, it will reject even input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character < 128 as it should according to the RFC.
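For illustration, here is a minimal sketch of that detection scheme in Python (detect_json_encoding is my own name, not a library function), showing how a BOM defeats it:

```python
def detect_json_encoding(b: bytes) -> str:
    # RFC 4627: the first two characters of a JSON text are always ASCII,
    # so the NUL-byte pattern of the first four bytes reveals the encoding.
    if len(b) >= 4:
        if b[0] == 0 and b[1] == 0 and b[2] == 0:
            return "utf-32-be"   # 00 00 00 xx
        if b[0] == 0 and b[2] == 0:
            return "utf-16-be"   # 00 xx 00 xx
        if b[1] == 0 and b[2] == 0 and b[3] == 0:
            return "utf-32-le"   # xx 00 00 00
        if b[1] == 0 and b[3] == 0:
            return "utf-16-le"   # xx 00 xx 00
    return "utf-8"               # xx xx xx xx

doc = '{"a":1}'
print(detect_json_encoding(doc.encode("utf-16-le")))               # utf-16-le (correct)
print(detect_json_encoding(("\ufeff" + doc).encode("utf-16-le")))  # utf-8 (wrong: the BOM hid the NUL pattern)
```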
Other data formats
A BOM in JSON is not needed, is illegal, and breaks software that works correctly according to the RFC. It should be a no-brainer to simply not use it, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules, or different data types. Of course, anyone is free to use things like BOMs or anything else if they need to - just don't call it JSON then.
For data formats other than JSON, take a look at how the format actually works. If the only allowed encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs, even as an optional feature, would only make the format more complicated and error-prone.
Other uses of BOM
As for uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed information specifically about scripting and serialization, because these are examples of BOM characters causing real problems.