Read txt files (in Unicode and UTF-8) by means of C#

Posted 2020-04-02 05:38

Question:

I created two txt files in Windows Notepad with the same content, "thank you - спасибо", and saved one as UTF-8 and the other as Unicode. In Notepad they look fine. Then I tried to read them using .NET:

...File.ReadAllText(utf8FileFullName, Encoding.UTF8);

and

...File.ReadAllText(unicodeFileFullName, Encoding.Unicode);

But in both cases I got "thank you - ???????". What's wrong?

Update: code for UTF-8

static void Main(string[] args)
{
    var encoding = Encoding.UTF8;
    var file = new FileInfo(@"D:\encodes\enc.txt");
    Console.OutputEncoding = encoding;
    var content = File.ReadAllText(file.FullName, encoding);
    Console.WriteLine("encoding: " + encoding);
    Console.WriteLine("content: " + content);
    Console.ReadLine();
}

Result: thanks ÑпаÑибо

Answer 1:

Edited: UTF-8 should support these characters. It seems that you're writing to a console, or to some other destination, whose encoding hasn't been set. If so, you need to set the encoding. For the console you can do this:

string allText = File.ReadAllText(utf8FileFullName, Encoding.UTF8);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(allText);


Answer 2:

Use Encoding.Default as the encoding:

File.ReadAllText(unicodeFileFullName, Encoding.Default);

It will fix the ???? characters.



Answer 3:

When writing Unicode or UTF-8 encoded multi-byte characters to the console, you need to set the output encoding and also make sure the console uses a font that supports those characters, so that it can display the corresponding glyphs. With your existing code, a MessageBox.Show(content), or displaying the text on a Windows or Web Form, would show it correctly.
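One way to check that the string was decoded correctly, independently of the console font and encoding, is to print its code points as plain ASCII. The following is only a minimal sketch; the class name CodePointDump and the method Describe are hypothetical helpers, not part of .NET.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of the .NET class library.
static class CodePointDump
{
    // Formats every UTF-16 code unit of the text as U+XXXX.
    // The result is pure ASCII, so any console can display it.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

Calling Console.WriteLine(CodePointDump.Describe(content)) after File.ReadAllText should then show values in the Cyrillic range for the Russian word (U+0441 U+043F U+0430 U+0441 U+0438 U+0431 U+043E for "спасибо"). If it does, the file was read correctly and only the console display is at fault.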

Have a look at http://msdn.microsoft.com/en-us/library/system.console.aspx for an explanation of how to set fonts within the console window:

"Support for Unicode characters requires the encoder to recognize a particular Unicode character, and also requires a font that has the glyphs needed to render that character. To successfully display Unicode characters to the console, the console font must be set to a non-raster or TrueType font such as Consolas or Lucida Console."

As a side note, you can use the FileStream class to read the first three bytes of the file and look for the byte order mark (BOM), so that the encoding can be chosen automatically when reading the file. For example, if byte[0] == 0xEF && byte[1] == 0xBB && byte[2] == 0xBF, then you have a UTF-8 encoded file. Refer to http://en.wikipedia.org/wiki/Byte_order_mark for more information.
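The BOM check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a complete detector: the names BomSniffer, DetectFromBom and ReadText are hypothetical, only the UTF-8 and UTF-16 marks are handled (UTF-32 is ignored), and a file without a BOM falls back to whatever encoding the caller supplies.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of the .NET class library.
static class BomSniffer
{
    // Returns the encoding implied by a leading byte order mark,
    // or null when the bytes start with no recognised BOM.
    public static Encoding DetectFromBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return new UTF8Encoding(true);     // UTF-8 with BOM
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;           // UTF-16 little-endian ("Unicode" in Notepad)
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode;  // UTF-16 big-endian
        return null;
    }

    // Reads up to three bytes with FileStream, picks the encoding from the BOM
    // (or uses the fallback), then reads the whole file as text.
    public static string ReadText(string path, Encoding fallback)
    {
        byte[] head = new byte[3];
        int read;
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            read = stream.Read(head, 0, head.Length);
        }
        Array.Resize(ref head, read);          // keep only the bytes actually read
        Encoding encoding = DetectFromBom(head) ?? fallback;
        return File.ReadAllText(path, encoding);
    }
}
```

A call such as BomSniffer.ReadText(@"D:\encodes\enc.txt", Encoding.UTF8) would then handle both of the Notepad files from the question. Note that StreamReader can also detect byte order marks on its own when its detectEncodingFromByteOrderMarks argument is true, so the manual check is mainly useful when you need to know which encoding was found.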