System.IO.File.ReadAllText not throwing exception

Published 2019-05-10 19:58

Question:

I have some UTF-8 text in a file utf8.txt. The file contains some characters that are outside the ASCII range. I tried the following code:

var fname = "utf8.txt";
var enc = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
var s = System.IO.File.ReadAllText(fname, enc);

I expected this code to throw an exception, since the file is not valid ISO-8859-1 text. Instead, it correctly decodes the UTF-8 text into the right characters (the string looks correct in the debugger).

Is this a bug in .Net?

EDIT:

The file I tested with originally was UTF-8 with a BOM. If I remove the BOM, the behavior changes: it still does not throw an exception, but it now produces an incorrect Unicode string (the string looks wrong in the debugger).

EDIT:

To produce my test file, run the following code:

var fname = "utf8.txt";
var utf8_bom_e_circumflex_bytes = new byte[] {0xEF, 0xBB, 0xBF, 0xC3, 0xAA};
System.IO.File.WriteAllBytes(fname, utf8_bom_e_circumflex_bytes);

EDIT:

I think I have a firm handle on what is going on (although I don't agree with part of .Net's behavior).

  • If the file starts with a UTF-8 BOM and the data is valid UTF-8, then ReadAllText completely ignores the encoding you passed in and (properly) decodes the file as UTF-8. (I have not tested what happens when the BOM is a lie and the file is not really UTF-8.) I don't agree with this behavior: .NET should either throw an exception or use the encoding I gave it.

  • If the file has no BOM, .NET has no trivial (and 100% reliable) way to determine that the text is not really ISO-8859-1, since any byte sequence, UTF-8 included, is also valid ISO-8859-1, even if it decodes to gibberish. So it just follows your instructions and decodes the file with the encoding you gave it. (I do agree with this behavior.)
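Both bullets can be reproduced in a few lines. This is a minimal sketch of the behavior described above; the file names are made up for the demo, and the results match what the question reports (BOM detection wins when present, otherwise the supplied encoding is applied literally):

```csharp
using System;
using System.IO;
using System.Text;

class BomDemo
{
    static void Main()
    {
        var latin1 = Encoding.GetEncoding("ISO-8859-1",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        // UTF-8 bytes for "ê", with and without a BOM (hypothetical file names).
        File.WriteAllBytes("with_bom.txt", new byte[] { 0xEF, 0xBB, 0xBF, 0xC3, 0xAA });
        File.WriteAllBytes("no_bom.txt", new byte[] { 0xC3, 0xAA });

        // BOM detected: the passed-in encoding is ignored and the file is
        // decoded (and the BOM stripped) as UTF-8.
        Console.WriteLine(File.ReadAllText("with_bom.txt", latin1)); // "ê"

        // No BOM: the bytes are decoded as ISO-8859-1, producing mojibake.
        Console.WriteLine(File.ReadAllText("no_bom.txt", latin1));   // "Ãª"
    }
}
```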

Answer 1:

should throw an exception, since it is not valid ISO-8859-1 text

In ISO-8859-1, every possible byte maps to a character, so reading a non-ISO-8859-1 file as ISO-8859-1 will never produce an exception.

(True, all the bytes in the range 0x80–0x9F will become invisible control codes that you never want, but they're still valid, just useless. This is true of quite a few of the ISO-8859 encodings, which put the C1 control codes in the range 0x80–0x9F, but not all. You can certainly get an exception with other encodings that leave some bytes unmapped, e.g. Windows-1252.)
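The fallback mechanism itself is easy to demonstrate with an encoding that definitely rejects some byte sequences. Whether a particular code-page encoding such as Windows-1252 throws depends on the runtime's code-page tables (and on .NET Core/5+ those encodings must first be registered via CodePagesEncodingProvider), so this sketch uses a strict UTF-8 decoder instead, where an invalid byte is guaranteed to trigger the fallback:

```csharp
using System;
using System.Text;

class StrictDecodeDemo
{
    static void Main()
    {
        // A UTF-8 decoder that throws instead of substituting U+FFFD.
        // (Encoding.UTF8 uses replacement fallbacks, so build one explicitly.)
        var strictUtf8 = Encoding.GetEncoding("utf-8",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        try
        {
            // 0xC3 is a lone lead byte with no continuation byte:
            // invalid UTF-8, so the exception fallback fires.
            strictUtf8.GetString(new byte[] { 0xC3 });
            Console.WriteLine("no exception");
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("DecoderFallbackException");
        }
    }
}
```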

If the file starts with UTF-8 BOM, and the data is valid UTF-8, then ReadAllText will completely ignore the encoding you passed in and (properly) decode the file as UTF-8.

Yep. This is hinted at in the doc:

This method attempts to automatically detect the encoding of a file based on the presence of byte order marks.

I agree with you that this behaviour is pretty stupid. I would prefer to read the raw bytes with File.ReadAllBytes and decode them manually with Encoding.GetString.
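That workaround looks like this. Encoding.GetString does no BOM sniffing at all, so every byte, including the BOM, goes through the encoding you asked for (a sketch using the question's own test file):

```csharp
using System;
using System.IO;
using System.Text;

class ManualDecode
{
    static void Main()
    {
        var latin1 = Encoding.GetEncoding("ISO-8859-1",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        // Recreate the question's test file: UTF-8 BOM followed by "ê".
        File.WriteAllBytes("utf8.txt", new byte[] { 0xEF, 0xBB, 0xBF, 0xC3, 0xAA });

        // No BOM detection here: the three BOM bytes decode as ordinary
        // ISO-8859-1 characters, making the mis-declared encoding visible.
        var s = latin1.GetString(File.ReadAllBytes("utf8.txt"));
        Console.WriteLine(s); // "ï»¿Ãª"
    }
}
```

With ISO-8859-1 you still won't get an exception (see above, every byte is valid), but at least the result honestly reflects the encoding you specified.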