How do I convert from a possibly Windows 1252 '

Posted 2020-07-22 09:43

Question:

I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.

However, some of this output is mangled: specifically, the symbol '£' is output as U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is encoded as Windows-1252 ('ANSI').

The question is,

  1. How do I determine whether the file is encoded as Windows-1252 or UTF-8? It could be either, and

  2. How do I convert it to UTF-8 if it is in Windows-1252, preserving the symbol £ etc.?

I've looked online but cannot find a satisfactory answer.

Answer 1:

If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:

StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);

The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.

You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.
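As a minimal sketch of that read-then-write conversion: the class and method names below (EncodingConverter, ConvertToUtf8) are made up for illustration, and the sketch assumes the source file really is Windows-1252 unless it carries a byte order mark.

```csharp
using System.IO;
using System.Text;

public static class EncodingConverter
{
    // Hypothetical helper: re-encodes a Windows-1252 text file as UTF-8.
    public static void ConvertToUtf8(string sourcePath, string targetPath)
    {
        // Needed on .NET Core / .NET 5+, where code page 1252 is not
        // registered by default. On .NET Framework the code page is built in
        // and this line can be dropped (there, CodePagesEncodingProvider
        // comes from the System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // "true" lets a byte order mark, if present, override Windows-1252.
        using (var reader = new StreamReader(sourcePath, windows1252, true))
        // UTF8Encoding(false) writes UTF-8 without a BOM; pass true to emit one.
        using (var writer = new StreamWriter(targetPath, false, new UTF8Encoding(false)))
        {
            writer.Write(reader.ReadToEnd());
        }
    }
}
```

ReadToEnd is fine for small uploads; for large files, copying in chunks with Read and Write would avoid holding the whole text in memory.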

The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.

I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
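One commonly used heuristic, which is not something the answer above describes, is to attempt a strict UTF-8 decode and fall back to Windows-1252 only if the bytes are not valid UTF-8. The sketch below assumes the whole file is available as a byte array; the names EncodingSniffer and DecodeUtf8Or1252 are hypothetical. It is not foolproof: a Windows-1252 file whose bytes happen to form valid UTF-8 will be read as UTF-8 (for pure ASCII this makes no difference, since both encodings agree there).

```csharp
using System.Text;

public static class EncodingSniffer
{
    // Hypothetical helper: decodes as UTF-8 when the bytes are valid UTF-8,
    // otherwise falls back to Windows-1252.
    public static string DecodeUtf8Or1252(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for malformed sequences.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            string text = strictUtf8.GetString(bytes);
            // GetString keeps a leading BOM as U+FEFF, so drop it if present.
            return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
        }
        catch (DecoderFallbackException)
        {
            // Needed on .NET Core / .NET 5+; not required on .NET Framework,
            // where code page 1252 is built in.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }
}
```

In the question's scenario the bytes could come from the upload control (for example its FileBytes property). A lone 0xA3 byte, which is '£' in Windows-1252, is not valid UTF-8, so such a file takes the fallback path, while a UTF-8 file containing 0xC2 0xA3 decodes directly.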

In practice, I've found the following to work for most of what I do:

StreamReader reader = new StreamReader("filename", Encoding.Default, true);

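Most of the files I read are ones that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat them as a stream of bytes, which Encoding.Default does well. Note that Encoding.Default is the system's ANSI code page on .NET Framework (often Windows-1252 on Western-locale Windows), whereas on .NET Core and later it is UTF-8, so this only works as described on .NET Framework.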