How should I decode a UTF-8 string

I have a string like:

About \xee\x80\x80John F Kennedy\xee\x80\x81\xe2\x80\x99s Assassination . unsolved mystery \xe2\x80\x93 45 years later. Over the last decade, a lot of individuals have speculated on conspiracy theories that ...

I understand that \xe2\x80\x93 is a dash character. But how should I decode the above string in C#?

Tags: c# string utf-8

3 answers

beautiful°

#2 · 2020-03-23 18:36

Scan the input string character by character. Collect each \xNN escape as a byte; when a run of escapes ends, decode the collected bytes with the UTF-8 decoder and append the resulting text. All other characters are copied unchanged:

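using System;
using System.Collections.Generic;
using System.Text;

// Replaces each run of \xNN escapes with the UTF-8 text it encodes.
static string Decode(string input)
{
    var sb = new StringBuilder();
    int position = 0;
    var bytes = new List<byte>(); // bytes of the current run of \xNN escapes
    while(position < input.Length)
    {
        char c = input[position++];
        if(c == '\\')
        {
            if(position < input.Length)
            {
                c = input[position++];
                if(c == 'x' && position <= input.Length - 2)
                {
                    // \xNN: parse the two hex digits (throws FormatException if they are not hex)
                    var b = Convert.ToByte(input.Substring(position, 2), 16);
                    position += 2;
                    bytes.Add(b);
                }
                else
                {
                    // Any other escape: flush pending bytes and keep the escape as-is
                    AppendBytes(sb, bytes);
                    sb.Append('\\');
                    sb.Append(c);
                }
                continue;
            }
        }
        // Ordinary character (or a trailing backslash): flush pending bytes, then copy it
        AppendBytes(sb, bytes);
        sb.Append(c);
    }
    AppendBytes(sb, bytes); // flush a run that ends the string
    return sb.ToString();
}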

private static void AppendBytes(StringBuilder sb, List<byte> bytes)
{
    if(bytes.Count != 0)
    {
        var str = System.Text.Encoding.UTF8.GetString(bytes.ToArray());
        sb.Append(str);
        bytes.Clear();
    }
}

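Output (\xee\x80\x80 and \xee\x80\x81 decode to the private-use characters U+E000 and U+E001, which are typically invisible or shown as placeholder glyphs):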

About John F Kennedy’s Assassination . unsolved mystery – 45 years later. Over the last decade, a lot of individuals have speculated on conspiracy theories that ...


Ridiculous、

#3 · 2020-03-23 18:51

If you have a string like that, then the wrong encoding was used when the data was decoded in the first place. There is no such thing as a "UTF-8 string": UTF-8 is the form the text takes when it is encoded into binary data (bytes). Once it is decoded into a string, it is no longer UTF-8.

You should use the UTF-8 encoding when you create the string from binary data. Once the string has been created using the wrong encoding, you can't reliably fix it.

If there is no other alternative, you could try to fix the string by encoding it again using the same wrong encoding that was used to create it, and then decoding it using the correct encoding. There is, however, no guarantee that this will work for all strings; some characters will simply be lost during the wrong decoding. Example:

// wrong use of encoding, to try to fix wrong decoding
str = Encoding.UTF8.GetString(Encoding.Default.GetBytes(str));
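
To illustrate both halves of that claim, here is a minimal sketch. It assumes the wrong encoding was ISO-8859-1, which maps every byte to a character and so round-trips without loss, and contrasts it with ASCII, which replaces every byte above 0x7F with '?'. The names MojibakeDemo and Repair are hypothetical, made up for this example. Note also that on .NET Core and .NET 5+ Encoding.Default is always UTF-8, so the one-liner above only does something useful on .NET Framework, where it is the system ANSI code page.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: re-encode with the encoding that was wrongly used,
    // then decode the recovered bytes as UTF-8.
    public static string Repair(string garbled, Encoding wrongEncoding)
    {
        return Encoding.UTF8.GetString(wrongEncoding.GetBytes(garbled));
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] utf8 = Encoding.UTF8.GetBytes("\u2019"); // E2 80 99 (right single quote)

        // Wrong decoding with ISO-8859-1 keeps every byte as one char, so it is reversible.
        string garbled = latin1.GetString(utf8);
        Console.WriteLine(Repair(garbled, latin1) == "\u2019"); // True

        // Wrong decoding with ASCII turns each byte into '?', so the data is gone.
        string lossy = Encoding.ASCII.GetString(utf8);
        Console.WriteLine(Repair(lossy, Encoding.ASCII) == "\u2019"); // False
    }
}
```

Whether the repair works therefore depends entirely on which encoding was wrongly applied, and that usually has to be guessed.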


smile是对你的礼貌

#4 · 2020-03-23 19:00

In the end, I used something like this:

public static string UnescapeHex(string data)
{
    return Encoding.UTF8.GetString(Array.ConvertAll(Regex.Unescape(data).ToCharArray(), c => (byte) c));
}
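
The one-liner works because Regex.Unescape turns each \xNN into a char whose code is NN, and the (byte) cast then recovers the original byte. The cast silently truncates any char above U+00FF, though, so input that already contains a real non-Latin-1 character would be corrupted. A possible stricter variant is sketched below; UnescapeDemo and UnescapeHexChecked are hypothetical names, and rejecting such input with an exception is an assumption about the desired behavior, not part of the answer above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class UnescapeDemo
{
    // Hypothetical stricter variant of UnescapeHex: same approach, but it
    // refuses chars that do not fit in one byte instead of truncating them.
    public static string UnescapeHexChecked(string data)
    {
        string unescaped = Regex.Unescape(data); // \xNN becomes the char with code NN
        var bytes = new byte[unescaped.Length];
        for (int i = 0; i < unescaped.Length; i++)
        {
            if (unescaped[i] > 0xFF)
                throw new ArgumentException("Character at index " + i + " does not fit in a byte.");
            bytes[i] = (byte)unescaped[i];
        }
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        string decoded = UnescapeHexChecked(@"mystery \xe2\x80\x93 45 years");
        Console.WriteLine(decoded == "mystery \u2013 45 years"); // True
    }
}
```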
