I know that the default encoding of XML is UTF-8. All XML consumers MUST and so on and so forth. So this is not just a question whether or not XML has a default encoding.
I also know that the XML declaration <?xml version="1.0" ... ?> at the beginning of the document itself is optional, and that specifying the encoding therein is optional as well.
So I ask myself if the following two XML-Declarations are two expressions for the exact same thing:
<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
From my own current understanding I would say those are equivalent but I do not know. Has the equivalence of these two declarations been specified somewhere?
(Assume each of these two example lines is the first line of an XML document, preceded by no bytes at all, and that the document is UTF-8 encoded.)
The way I read the spec, UTF-8 is not the default encoding in an XML declaration. It is only the default encoding "for an entity which begins with neither a Byte Order Mark nor an encoding declaration". If a document is in UTF-16 and has a BOM, it may have an XML declaration without an encoding declaration or no XML declaration at all and still be valid XML.
Only for documents without a BOM, the two XML declarations you mentioned should be equivalent.
The Short Answer
Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you're interested in), there is no difference between the two declarations.
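As a quick check of this claim, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The element name and sample text are made up for illustration, and the result only shows how this one parser behaves, not what every XML processor does.

```python
import xml.etree.ElementTree as ET

# The same UTF-8 encoded body, once with each form of the declaration.
# "caf\u00e9" contains a non-ASCII character, so the encoding actually matters.
body = "<greeting>caf\u00e9</greeting>".encode("utf-8")
without_encoding = b'<?xml version="1.0"?>' + body
with_encoding = b'<?xml version="1.0" encoding="UTF-8"?>' + body

text_a = ET.fromstring(without_encoding).text
text_b = ET.fromstring(with_encoding).text

# Both documents should yield the same decoded text.
print(text_a == text_b)
```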
The long answer is far more interesting though.
What The Spec Says
If you look at Appendix F.1 of the XML specification, it explains the process that should be followed to determine the encoding when there is no external encoding information.
If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark, or the start of the XML declaration.
However, according to the spec, it should still read the encoding declaration.
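If they don't match, section 4.3.3 says that, in the absence of information from an external transport protocol, it is a fatal error for an entity that includes an encoding declaration to be presented to the XML processor in an encoding other than the one named in the declaration.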
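The detection step and the mismatch check can be sketched roughly as follows. This is an illustrative simplification, not a conforming implementation: it covers only the UTF-8 and UTF-16 rows of the Appendix F.1 table (UTF-32 and EBCDIC are left out), and the function names sniff_encoding, declared_encoding and declaration_conflicts are hypothetical.

```python
import re
from typing import Optional, Tuple

# Leading byte patterns, limited to the UTF-8 / UTF-16 cases discussed here.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]
NO_BOM = [
    (b"\x00<\x00?", "UTF-16BE"),  # 00 3C 00 3F
    (b"<\x00?\x00", "UTF-16LE"),  # 3C 00 3F 00
    (b"<?xm", "UTF-8"),           # 3C 3F 78 6D: some ASCII-compatible encoding
]

def sniff_encoding(data: bytes) -> Tuple[str, int]:
    """Return (detected encoding, BOM length); default to UTF-8 with no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    for prefix, name in NO_BOM:
        if data.startswith(prefix):
            return name, 0
    return "UTF-8", 0

def declared_encoding(data: bytes) -> Optional[str]:
    """Decode the start of the document as sniffed, then read encoding="..."."""
    detected, bom_len = sniff_encoding(data)
    head = data[bom_len:bom_len + 400].decode(detected, errors="replace")
    match = re.match(
        r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']', head
    )
    return match.group(1) if match else None

def declaration_conflicts(data: bytes) -> bool:
    """True when the declared encoding contradicts what the bytes show."""
    detected, bom_len = sniff_encoding(data)
    declared = declared_encoding(data)
    if declared is None:
        return False
    if detected == "UTF-8" and bom_len == 0:
        # Only "ASCII-compatible" is known here, so the declaration decides.
        return False
    # A declared "UTF-16" is compatible with either byte order.
    return not detected.upper().startswith(declared.upper())
```

For example, declaration_conflicts('<?xml version="1.0" encoding="UTF-8"?><r/>'.encode("utf-16")) flags a conflict, while the same text encoded as UTF-8 does not.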
Encoded UTF-16, Declared UTF-8
Let's see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.
Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9 at least) displays a blank document, but no actual error.
So if you include a UTF-8 encoding declaration on your UTF-8 document and someone at a later stage converts it to UTF-16, it'll work in most browsers, but fail in IE (and, I assume, most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.
Technically I think IE is the most accurate. The fact that it doesn't display an error as such might be explained by the fact that the error is occurring at the encoding level rather than the XML level. It is presumably doing its best to interpret the UTF-16 characters as UTF-8, failing to find any characters that decode, and ending up passing on an empty character sequence to the XML parser.
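To see why decoding goes wrong at the encoding level before any XML is parsed, here is a small sketch that serializes a document as UTF-16 and then tries to decode it strictly as UTF-8. It says nothing about how IE is actually implemented; the sample document and the helper name try_decode are made up for illustration.

```python
# A document that declares UTF-8 but is actually serialized as UTF-16 with a BOM.
document = '<?xml version="1.0" encoding="UTF-8"?><root>text</root>'
utf16_bytes = document.encode("utf-16")

def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the decoding failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"decode failed at byte {exc.start}: {exc.reason}"

# The UTF-16 BOM bytes (FF FE or FE FF) can never occur in valid UTF-8,
# so a strict UTF-8 decoder fails on the very first byte.
strict_result = try_decode(utf16_bytes, "utf-8")
print(strict_result)
```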
Encoded UTF-8, Declared Otherwise
You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that's not always the case.
If you encode a document as UTF-8 (with a byte order mark so it's unmistakable as anything else), but set the encoding declaration to Latin1, all of the browsers will successfully decode the content as Latin1, ignoring the UTF-8 BOM.
Again this seems right to me. Note that the BOM bytes (EF BB BF) are themselves valid Latin1 characters, so presumably they are recognized as a BOM and silently dropped before the content is decoded.
This doesn't work for all declared encodings on a UTF-8 document though. If the declared encoding is UTF-16, we're back with Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.
Essentially, anything that makes IE return a blank document is going to make other browsers ignore the declared encoding.
Other Inconsistencies
It's also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec, entities encoded in UTF-16 must begin with a Byte Order Mark, while those encoded in UTF-8 may.
However, if you try and read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML Parsing Error.
External Encoding Information
Up to now, we've been considering what happens when there is no external encoding information, but, as others have mentioned, if the document is received via HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take precedence over the document encoding.
Most of the details for the various XML MIME types are described in RFC 3023. However, the reality is somewhat different from what is specified.
First of all, text/xml with an omitted charset parameter should use a charset of US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.
Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.
The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.
In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.
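For comparison with the observed browser behavior, the rules as RFC 3023 states them can be sketched like this. It is a simplified illustration: the function name rfc3023_charset is hypothetical, the Content-Type parsing is deliberately naive, and BOM handling for the application/xml case is ignored.

```python
from typing import Optional

def rfc3023_charset(content_type: str, declared: Optional[str]) -> str:
    """Pick a charset following RFC 3023 as written (not as browsers behave).

    content_type: the raw Content-Type header value.
    declared: the encoding from the XML declaration, or None if absent.
    """
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # Look for an explicit charset parameter, e.g. "text/xml; charset=utf-8".
    charset = None
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip('"')

    if charset:
        # An explicit charset in the envelope is authoritative.
        return charset
    if media_type.startswith("text/"):
        # text/* without a charset defaults to US-ASCII.
        return "us-ascii"
    # application/xml and similar: defer to the XML spec's own rules.
    return declared or "utf-8"

print(rfc3023_charset("text/xml", "utf-8"))
print(rfc3023_charset("application/xml", None))
```

Note that with either declaration from the question, a UTF-8 document gets the same answer from this function for any given Content-Type.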
In isolation, both are equivalent. You have already cited the relevant parts of the specifications which show that both declarations are equivalent.
However, XML can have an envelope, such as the HTTP Content-Type header. The W3C specifies that this envelope information has priority over any other declarations in the file. So for example, if you are retrieving XML via HTTP, you could potentially get this:
In this case, the XML should be read as ascii, because the default charset for text/* MIME types is ascii. This is why you should use application/xml MIME types: these default to utf-8. The "application" prefix means that the relevant application specifications define things like default encoding. (I.e. the XML spec takes over.) With text/* MIME types, the default is ascii and the charset parameter must be included in the MIME type to change the charset.
Here's another case:
In this case, a conforming XML processor should read this file as win-1252, not utf-8.
Another case:
Here the encoding is win-1252.
Here the encoding is ascii.
It would not be unreasonable for the second declaration to be rejected if it arrived at the start of a document that had already been detected as having a non-UTF-8 compatible encoding (such as UTF-16). However, given your statement that the document is UTF-8 encoded, there is no difference between how they would be treated.
An externally-specified encoding would take precedence in both cases; both documents would still be treated identically.