Encoding issue when handling a string that contain

Posted 2019-07-26 04:49

Question:

I am parsing some web content in a response from a HttpWebRequest.

The web content uses the ISO-8859-1 charset. After parsing the response and extracting the word I need, I end up with a string that contains a question-mark replacement character (see the word variable in the code below). What is the right way to transform it back into a readable string?

So, what I've tried is to convert the current word encoding into UTF-8 like this:

(I am wondering if UTF-8 could solve my problem)

string word = "ESPA�OL";

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");

byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);

string utfWord = utf.GetString(utfBytes);

Console.WriteLine(utfWord);

However, the utfWord variable prints ESPA?OL, which is still wrong. The correct output is supposed to be ESPAÑOL.

Can someone please give me the right directions to solve this, if possible?

Answer 1:

The word in question is "ESPAÑOL". This can be encoded correctly in ISO-8859-1 since all characters in the word are represented in ISO-8859-1.

You can see this for yourself using the following simple program:

using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Encoding enc = Encoding.GetEncoding("ISO-8859-1");
            string original = "ESPAÑOL";
            byte[] iso_8859_1 = enc.GetBytes(original);
            string roundTripped = enc.GetString(iso_8859_1);
            Debug.Assert(original == roundTripped);
            Console.WriteLine(roundTripped);
        }
    }
}

What this tells you is that you need to properly diagnose where the erroneous character comes from. By the time you have a � character, it is too late: the information has been lost. The � character (U+FFFD, the Unicode replacement character) indicates that, at some point, the text was converted using a character set that could not represent Ñ, or that its bytes were decoded with the wrong encoding.
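As a rough illustration of why the information cannot be recovered, the sketch below decodes the ISO-8859-1 bytes of "ESPAÑOL" once with the wrong encoding (UTF-8) and once with the right one. The class and method names (LossDemo, DecodeAsUtf8, DecodeAsIso) are hypothetical, and choosing UTF-8 as the wrong decoder is only an assumption about what might have happened in the asker's code.

```csharp
using System;
using System.Text;

public static class LossDemo
{
    // "ESPAÑOL" encoded as ISO-8859-1: Ñ is the single byte 0xD1.
    public static readonly byte[] IsoBytes = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };

    // Wrong decoder (assumed for illustration): 0xD1 followed by 'O' is not
    // valid UTF-8, so the decoder substitutes U+FFFD for it.
    public static string DecodeAsUtf8()
    {
        return Encoding.UTF8.GetString(IsoBytes);
    }

    // Right decoder: every ISO-8859-1 byte maps directly to one character.
    public static string DecodeAsIso()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(IsoBytes);
    }

    public static void Main()
    {
        string wrong = DecodeAsUtf8();
        string right = DecodeAsIso();

        // Print code points rather than glyphs so the console encoding does not matter.
        Console.WriteLine(((int)wrong[4]).ToString("X4")); // FFFD
        Console.WriteLine(((int)right[4]).ToString("X4")); // 00D1

        // Re-encoding the damaged string cannot bring 0xD1 back:
        // U+FFFD has no ISO-8859-1 byte, so the encoder falls back to '?'.
        byte[] reencoded = Encoding.GetEncoding("ISO-8859-1").GetBytes(wrong);
        Console.WriteLine(reencoded[4].ToString("X2")); // 3F
    }
}
```

The last line corresponds to the ESPA?OL output in the question: once the original byte has been replaced by U+FFFD, no later conversion can tell which character it used to be.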

A conversion from ISO-8859-1 to a Unicode encoding will correctly handle "ESPAÑOL" because that word can be encoded in ISO-8859-1.

The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.
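One place worth checking is the point where the response bytes are first turned into a string. The following is a minimal sketch, assuming the body is read through a StreamReader; the question does not show that code, so this is a guess at the setup rather than a diagnosis. ReadAll and ResponseDecoding are hypothetical names, and a MemoryStream stands in for the stream that HttpWebResponse.GetResponseStream() would return.

```csharp
using System;
using System.IO;
using System.Text;

public static class ResponseDecoding
{
    // Hypothetical helper: decodes the whole body with an explicitly chosen encoding
    // instead of relying on StreamReader's default (UTF-8).
    public static string ReadAll(Stream body, Encoding encoding)
    {
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Stand-in for the response body: "ESPAÑOL" as ISO-8859-1 bytes.
        byte[] payload = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };

        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        string word = ReadAll(new MemoryStream(payload), iso);

        // Compare against the escaped literal so the source file encoding is irrelevant.
        Console.WriteLine(word == "ESPA\u00D1OL"); // True
    }
}
```

With a real request, the stream would come from the response object, and the charset could be taken from the response headers (for example via HttpWebResponse.CharacterSet) rather than hard-coded.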