How to deal with ISO-2022-JP ( and other character

Posted 2019-06-06 07:17

Part of my application accepts arbitrary text and posts it as a status update to Twitter. Everything works fine until it comes to posting text in foreign (non-ASCII/UTF-7/UTF-8) character sets; then things no longer work.

For example, if someone posts:
に投稿できる

Within my code, as shown in the Visual Studio debugger, it becomes:
=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?=

Googling has told me that this breaks down as follows, with ? as the delimiter:

=?ISO-2022-JP is the text encoding (charset)
?B means the payload is base64 encoded
?GyRCJEtFajlGJEckLSRrGyhC?= is the encoded string

For the life of me, I can't figure out how to get this string posted as an update to Twitter in its original Japanese characters. As it stands now, sending '=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?=' to Twitter results in exactly that getting posted. I've also tried breaking the string up into pieces as above and using System.Text.Encoding to convert from ISO-2022-JP to UTF-8 and vice versa, both with and without base64 decoding first. Additionally, I've played around with the URL encoding of the status update like this:


string[] bits = tweetText.Split(new char[] { '?' });
if (bits.Length >= 4)
{
    textEncoding = System.Text.Encoding.GetEncoding(bits[1]);
    xml = oAuth.oAuthWebRequest(TwitterLibrary.oAuthTwitter.Method.POST, url,
        "status=" + System.Web.HttpUtility.UrlEncode(decodedText, textEncoding));
}

No matter what I do, the results never end up back to normal.

EDIT: Got it in the end. For those following at home, the fix was pretty close to the answer listed below. The Visual Studio debugger was steering me the wrong way, and there was a bug in the Twitter library I was using. The end result was this:


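// bits[3] is the base64 payload: decode it to bytes, then read those bytes as ISO-2022-JP
decodedText = textEncoding.GetString(System.Convert.FromBase64String(bits[3]));
// re-encode to ISO-2022-JP bytes and transcode them to UTF-8 bytes
byte[] originalBytes = textEncoding.GetBytes(decodedText);
byte[] utfBytes = System.Text.Encoding.Convert(textEncoding, System.Text.Encoding.UTF8, originalBytes);
// now, back to string form (.NET strings are Unicode, so this is the same text as after the first line)
decodedText = System.Text.Encoding.UTF8.GetString(utfBytes);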
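
For the posting half, the decoded text still has to be percent-encoded as UTF-8 when it goes into the status parameter. Below is a minimal sketch of that step in Python. The helper name build_status_body is hypothetical, and the OAuth signing that a real request needs is left out entirely:

```python
from urllib.parse import quote


def build_status_body(text: str) -> str:
    """Build a form body like 'status=...' with the text percent-encoded as UTF-8.

    Hypothetical helper for illustration; it does not sign or send anything.
    """
    # quote() encodes str input as UTF-8 by default. safe="" makes it escape
    # "/" as well, leaving only letters, digits and "-", ".", "_", "~" as-is.
    return "status=" + quote(text, safe="")


body = build_status_body("に")
# body is now "status=%E3%81%AB" (the three UTF-8 bytes of the character)
```

The point is that the charset named in the encoded-word (ISO-2022-JP) only matters for decoding the input. Once the text is a Unicode string, the outgoing request is encoded as UTF-8 regardless of where the text came from.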

Thanks all.

2 Answers
祖国的老花朵
Reply #2 · 2019-06-06 07:40

Your understanding of how the text is encoded seems correct. In Python 2,

'GyRCJEtFajlGJEckLSRrGyhC'.decode('base64').decode('ISO-2022-JP')

returns the correct Unicode string. Note that you need to decode the base64 first in order to get the ISO-2022-JP-encoded bytes, and only then decode those bytes as ISO-2022-JP.
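
The one-liner above relies on str.decode('base64'), which exists only in Python 2. A rough Python 3 equivalent might look like the sketch below. The function name decode_encoded_word is made up for illustration, and it only handles a single base64 ("B") encoded-word, not the quoted-printable ("Q") form or several words in a row:

```python
import base64
from email.header import decode_header


def decode_encoded_word(word: str) -> str:
    """Decode one encoded-word of the form =?charset?B?payload?= to a str."""
    # Splitting on "?" yields: "=", charset, encoding flag, payload, "=".
    prefix, charset, flag, payload, suffix = word.split("?")
    if prefix != "=" or suffix != "=":
        raise ValueError("not an encoded-word: %r" % word)
    if flag.upper() != "B":
        raise ValueError("only base64 (B) payloads are handled in this sketch")
    raw = base64.b64decode(payload)  # step 1: base64 -> ISO-2022-JP bytes
    return raw.decode(charset)       # step 2: ISO-2022-JP bytes -> str


subject = "=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?="
decoded = decode_encoded_word(subject)

# The standard library can do the parsing as well: decode_header returns
# (bytes, charset) pairs for encoded-words like this one.
raw_bytes, charset_name = decode_header(subject)[0]
via_stdlib = raw_bytes.decode(charset_name)
```

Both decoded and via_stdlib should hold the original Japanese text as an ordinary Unicode string, ready to be re-encoded as UTF-8 for whatever consumes it next.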

对你真心纯属浪费
Reply #3 · 2019-06-06 07:56

This produced the output you are looking for:

using System;
using System.Text;

class Program {
  static void Main(string[] args) {
    string input = "に投稿できる";
    Console.WriteLine(EncodeTwit(input));
    Console.ReadLine();
  }
  public static string EncodeTwit(string txt) {
    var enc = Encoding.GetEncoding("iso-2022-jp");
    byte[] bytes = enc.GetBytes(txt);
    // Convert.ToBase64String allocates a correctly sized result
    // (4 output characters per 3 input bytes, padded), so no manual buffer is needed.
    string base64 = Convert.ToBase64String(bytes);
    return "=?ISO-2022-JP?B?" + base64 + "?=";
  }
}

Standards are great; there are so many to choose from. ISO never disappoints: there are no fewer than three ISO-2022-JP encodings. If you have trouble, also try code pages 50221 and 50222 with Encoding.GetEncoding.
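
The same "try the other variants" idea can be sketched in Python, which also ships several ISO-2022-JP codecs. The codec names below are Python's own and are not claimed to map one-to-one onto the Windows code pages 50220/50221/50222. The order of the candidate list and the helper name decode_jis are assumptions made for illustration:

```python
import base64

# Python's ISO-2022-JP family, plain variant first. Which one matches a given
# sender is something to verify against real data, not a given.
CANDIDATES = ("iso2022_jp", "iso2022_jp_1", "iso2022_jp_ext", "iso2022_jp_2004")


def decode_jis(raw: bytes, candidates=CANDIDATES) -> str:
    """Return the text from the first candidate codec that decodes raw cleanly."""
    if not candidates:
        raise ValueError("no candidate codecs given")
    last_error = None
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next variant
    raise last_error


payload = base64.b64decode("GyRCJEtFajlGJEckLSRrGyhC")
text = decode_jis(payload)
```

A fallback like this only helps when the plain codec rejects the bytes outright. If a variant decodes without error but yields the wrong characters, the list order has to be adjusted by hand.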
