What's different between UTF-8 and UTF-8 without a BOM? Which is better?
There are at least three problems with putting a BOM in UTF-8 encoded files:
- A file that holds no text is no longer empty, because it still contains the BOM.
- A file that holds only text within the ASCII subset of UTF-8 is no longer pure ASCII, which can break tools that expect ASCII input.
- Concatenating several files no longer works cleanly, because every file after the first contributes a stray BOM in the middle of the result.
And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8.
See also the citation at the bottom of the Wikipedia page on the byte order mark: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2
This question already has a million-and-one answers, and many of them are quite good, but I wanted to try to clarify when a BOM should or should not be used.
As mentioned, any use of the UTF-8 BOM (Byte Order Mark) to determine whether a string is UTF-8 is educated guesswork. If proper metadata is available (like charset="utf-8"), then you already know what you're supposed to be using; otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code EF BB BF. If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8, and you can go from there. When forced to make this guess, however, additional error checking while reading is still a good idea in case something comes up garbled. You should only assume those bytes are not a UTF-8 BOM (i.e. that the file is latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.
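To make that concrete, here is a minimal Python sketch of the guess described above; the file name "input.txt" is just a placeholder:

```python
# Sketch: treat the UTF-8 BOM (EF BB BF) as a strong hint, not proof.
UTF8_BOM = b"\xef\xbb\xbf"

with open("input.txt", "rb") as f:      # hypothetical input file
    has_bom = f.read(3) == UTF8_BOM

if has_bom:
    # High probability of UTF-8; "utf-8-sig" also strips the BOM.
    # Keep error checking in case the content is garbled anyway.
    text = open("input.txt", encoding="utf-8-sig").read()
else:
    # No BOM: validate by decoding strictly against the encoding.
    text = open("input.txt", encoding="utf-8", errors="strict").read()
```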
Why is a BOM not recommended?
As noted elsewhere in this thread, the Unicode standard does not recommend it: software that is not BOM-aware will misinterpret or display the extra leading bytes.
When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file-system metadata), and the programs in use like BOMs, you should encode with a BOM. This is especially true on Windows, where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.
When it comes down to it, the only files I ever really have problems with are CSVs. Depending on the program, they either must or must not have a BOM. For example, if you're using Excel 2007+ on Windows, a CSV must be encoded with a BOM if you want to open it smoothly rather than resort to importing the data.
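As an illustration, Python's built-in utf-8-sig codec writes exactly that signature; a small sketch (the output file name is hypothetical):

```python
import csv

# "utf-8-sig" prepends EF BB BF, the cue Excel 2007+ needs to
# recognize the file as UTF-8 instead of a legacy code page.
with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["José", "Zürich"])
```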
It should be noted that for some files you must not have the BOM, even on Windows. Examples are SQL*Plus or VBScript files. If such files contain a BOM, you get an error when you try to execute them.

Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:
Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.
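A quick Python sketch shows why U+FEFF can serve this purpose:

```python
# The BOM character encodes to opposite byte orders in the two
# UTF-16 variants, so the first two bytes reveal the endianness.
print("\ufeff".encode("utf-16-be"))  # b'\xfe\xff'
print("\ufeff".encode("utf-16-le"))  # b'\xff\xfe'
# Read in the wrong order, FE FF becomes U+FFFE, which is permanently
# unassigned, so its appearance signals that the byte order is wrong.
```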
UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.

Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.
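A small Python sketch of the kind of problem BOM-unaware software runs into: decoded with plain UTF-8, the BOM survives as a junk character at the start of the text.

```python
data = b"\xef\xbb\xbfhello"           # UTF-8 BOM followed by "hello"

print(repr(data.decode("utf-8")))     # '\ufeffhello' -- BOM leaks through
print(repr(data.decode("utf-8-sig"))) # 'hello'       -- BOM stripped
```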
A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
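Such a validity check is a few lines in Python, for example; note how the same accented text passes as UTF-8 but fails when encoded as Latin-1:

```python
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")   # strict decoding by default
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("naïve".encode("utf-8")))    # True
print(looks_like_utf8("naïve".encode("latin-1")))  # False: stray 0xEF byte
```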
UTF-8 with BOM is better identified. I reached this conclusion the hard way. I am working on a project where one of the outputs is a CSV file containing Unicode characters.
If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it in Notepad as UTF-8, or in Notepad++ as UTF-8 with BOM), Excel opens it fine.
Prepending the BOM to UTF-8 text files is explicitly permitted by RFC 3629, "UTF-8, a transformation format of ISO 10646", November 2003, http://tools.ietf.org/html/rfc3629 (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html).