I have some UTF-8 text in a file utf8.txt. The file contains some characters that are outside the ASCII range. I tried the following code:
var fname = "utf8.txt";
var enc = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ExceptionFallback,
DecoderFallback.ExceptionFallback);
var s = System.IO.File.ReadAllText(fname, enc);
The expected behavior is that the code should throw an exception, since it is not valid ISO-8859-1 text. Instead, the behavior is that it correctly decodes the UTF-8 text into the right characters (it looks correct in the debugger).
Is this a bug in .Net?
EDIT:
The file I tested with originally was UTF-8 with BOM. If I remove the BOM, the behavior changes. It still does not throw an exception, but it produces an incorrect Unicode string (the string does not look correct in the debugger).
EDIT:
To produce my test file, run the following code:
var fname = "utf8.txt";
var utf8_bom_e_circumflex_bytes = new byte[] {0xEF, 0xBB, 0xBF, 0xC3, 0xAA};
System.IO.File.WriteAllBytes(fname, utf8_bom_e_circumflex_bytes);
EDIT:
I think I have a firm handle on what is going on (although I don't agree with part of .Net's behavior).
If the file starts with a UTF-8 BOM, and the data is valid UTF-8, then ReadAllText will completely ignore the encoding you passed in and (properly) decode the file as UTF-8. (I have not tested what happens if the BOM is a lie and the file is not really UTF-8.) I don't agree with this behavior. I think .Net should either throw an exception or use the encoding I gave it.
If the file has no BOM, .Net has no trivial (and 100% reliable) way to determine that the text is not really ISO-8859-1, since most (all?) UTF-8 text is also valid ISO-8859-1, although gibberish. So it just follows your instructions and decodes the file with the encoding you gave it. (I do agree with this behavior.)
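For anyone who wants to reproduce the two cases side by side, here is a minimal sketch. The class and method names (BomDemo, ReadAsLatin1) are made up for illustration, and the expected results in the comments are based on the behavior described above, not on anything the documentation guarantees.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: writes the bytes to a temp file, then reads them
    // back with File.ReadAllText and a strict ISO-8859-1 encoding.
    public static string ReadAsLatin1(byte[] bytes)
    {
        var enc = Encoding.GetEncoding("ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        var path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, bytes);
            return File.ReadAllText(path, enc);
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        // UTF-8 BOM followed by the UTF-8 encoding of U+00EA (e with circumflex).
        var withBom = ReadAsLatin1(new byte[] { 0xEF, 0xBB, 0xBF, 0xC3, 0xAA });
        // The same two payload bytes without the BOM.
        var withoutBom = ReadAsLatin1(new byte[] { 0xC3, 0xAA });

        // True if the BOM overrode the requested encoding (decoded as UTF-8).
        Console.WriteLine(withBom == "\u00EA");
        // True if the bytes were decoded one-to-one as ISO-8859-1.
        Console.WriteLine(withoutBom == "\u00C3\u00AA");
    }
}
```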
In ISO-8859-1 all possible bytes have mappings to characters, so no exception will ever result from reading a non-ISO-8859-1 file as ISO-8859-1.
(True, all the bytes in the range 0x80–0x9F will become invisible control codes that you never want, but they're still valid, just useless. This is true of quite a few of the ISO-8859 encodings, which put the C1 control codes in the range 0x80–0x9F, but not all. You can certainly get an exception with other encodings that leave bytes unmapped, e.g. Windows-1252.)
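A quick way to convince yourself of this is to decode every possible byte value with the exception fallback turned on. This is only a sketch; the names Latin1Demo and DecodeAllBytes are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text;

static class Latin1Demo
{
    // Hypothetical helper: decodes the bytes 0x00..0xFF as ISO-8859-1 with
    // the exception fallback, so any unmapped byte would throw.
    public static string DecodeAllBytes()
    {
        var enc = Encoding.GetEncoding("ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        var allBytes = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();
        return enc.GetString(allBytes);
    }

    static void Main()
    {
        var s = DecodeAllBytes();
        // Each byte maps to the code point with the same numeric value,
        // including the C1 control codes in 0x80-0x9F.
        Console.WriteLine(s.Length);
        Console.WriteLine((int)s[0x85] == 0x85);
    }
}
```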
Yep. This is hinted at in the doc:
I agree with you that this behaviour is pretty stupid. I would prefer to call ReadAllBytes and pass the result through Encoding.GetString manually.
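A rough sketch of that approach, assuming you want the encoding you pass in to be used no matter what the file starts with. The names ManualDecode, ReadExact and IsStrictUtf8 are hypothetical. Note that Encoding.GetString does not strip a BOM, so with ISO-8859-1 the three BOM bytes show up as ordinary characters.

```csharp
using System;
using System.IO;
using System.Text;

static class ManualDecode
{
    // Hypothetical helper: decodes the file with exactly the given encoding,
    // with no BOM sniffing.
    public static string ReadExact(string path, Encoding enc)
    {
        return enc.GetString(File.ReadAllBytes(path));
    }

    // Hypothetical helper: returns false if the bytes are not well-formed
    // UTF-8, using a UTF8Encoding that throws on invalid bytes.
    public static bool IsStrictUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0xEF, 0xBB, 0xBF, 0xC3, 0xAA });
        var latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Five bytes in, five characters out: the BOM is not interpreted.
        Console.WriteLine(ReadExact(path, latin1).Length);
        // A lone 0xEA is "ê" in ISO-8859-1 but not valid UTF-8.
        Console.WriteLine(IsStrictUtf8(new byte[] { 0xEA }));
        File.Delete(path);
    }
}
```

If you would rather stay with a reader, the StreamReader constructor that takes a detectEncodingFromByteOrderMarks argument can be called with false to get similar behavior.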