ReadText from file in ANSI encoding

Posted 2020-06-28 13:04

Question:

I use the Q42.Winrt library to download an HTML file to the cache. But when I call ReadTextAsync, I get this exception:

No mapping for the Unicode character exists in the target multi-byte code page. (Exception from HRESULT: 0x80070459)

My code is very simple:

var parsedPage = await WebDataCache.GetAsync(new Uri(String.Format("http://someUrl.here")));
var parsedStream = await FileIO.ReadTextAsync(parsedPage);

I opened the downloaded file and it is in ANSI encoding. I think I need to convert it to UTF-8, but I don't know how.

Answer 1:

The problem is that the original page is not encoded in Unicode. It is Windows-1251, and ReadTextAsync only handles Unicode encodings (UTF-8 or UTF-16). The way around this is to read the file as binary and then use Encoding.GetEncoding to interpret the bytes with the 1251 code page and produce the string (.NET strings are always Unicode internally).

For example,

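        // Needs Windows.Storage (FileIO), Windows.Storage.Streams (DataReader) and System.Text (Encoding) in scope.
        String parsedStream;
        var parsedPage = await WebDataCache.GetAsync(new Uri("http://bash.im"));

        // Read the cached file as raw bytes instead of letting ReadTextAsync pick the encoding.
        var buffer = await FileIO.ReadBufferAsync(parsedPage);
        using (var dr = DataReader.FromBuffer(buffer))
        {
            var bytes1251 = new Byte[buffer.Length];
            dr.ReadBytes(bytes1251);

            // Decode the bytes explicitly as Windows-1251.
            parsedStream = Encoding.GetEncoding("Windows-1251").GetString(bytes1251, 0, bytes1251.Length);
        }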

The challenge is that the stored bytes don't tell you what the code page is, so this works here but may not work for other sites. Generally, UTF-8 is what you'll get from the web, but not always. The Content-Type response header of this page names the code page (its charset parameter), but that information isn't stored in the cached file.
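
If you can capture the Content-Type header yourself at download time (the cache call above does not hand it to you, so this is an assumption about your own download code), one way to avoid hard-coding the code page is to read the charset parameter from it. The following is only a rough sketch: CharsetSniffer and FromContentType are hypothetical names, not part of Q42.Winrt or WinRT, and the UTF-8 fallback is a guess based on what most sites serve.

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper, not part of any library mentioned above.
static class CharsetSniffer
{
    // Picks an Encoding from a raw Content-Type header value such as
    // "text/html; charset=windows-1251". Falls back to UTF-8 when the
    // header is missing, has no charset, or names a charset this
    // runtime does not know.
    public static Encoding FromContentType(string contentType)
    {
        MediaTypeHeaderValue parsed;
        if (!string.IsNullOrEmpty(contentType)
            && MediaTypeHeaderValue.TryParse(contentType, out parsed)
            && !string.IsNullOrEmpty(parsed.CharSet))
        {
            try
            {
                // The charset may arrive quoted, e.g. charset="utf-8".
                return Encoding.GetEncoding(parsed.CharSet.Trim('"'));
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: use the fallback below.
            }
        }
        return Encoding.UTF8;
    }
}
```

You would then replace the hard-coded Encoding.GetEncoding("Windows-1251") call with CharsetSniffer.FromContentType(headerValue), where headerValue is whatever you saved alongside the cached file. Note that on .NET Core and later, legacy code pages such as Windows-1251 are only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package; without it, this sketch silently falls back to UTF-8.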