How to deal with ISO-2022-JP ( and other character

Part of my application accepts arbitrary text and posts it as an Update to Twitter. Everything works fine, until it comes to posting foreign ( non ASCII/UTF7/8 ) character sets, then things no longer work.

For example, if someone posts:
に投稿できる

It ( within my code in Visual Studio debugger ) becomes:
=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?=

Googling has told me that this represents ( minus ? as delimiters )

=?ISO-2022-JP is the text encoding
?B means it is base64 encoded
?GyRCJEtFajlGJEckLSRrGyhC? Is the encoded string

For the life of me, I can't figure out how to get this string posted as an update to Twitter in it's original Japanese characters. As it stands now, sending '=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?=' to Twitter will result in exactly that getting posted. Ive also tried breaking the string up into pieces as above, using System.Text.Encoding to convert to UTF8 from ISO-2022-JP and vice versa, base64 decoded and not. Additionally, ive played around with the URL Encoding of the status update like this:


string[] bits = tweetText.Split(new char[] { '?' });
if (bits.Length >= 4)
{
textEncoding = System.Text.Encoding.GetEncoding(bits[1]);
xml = oAuth.oAuthWebRequest(TwitterLibrary.oAuthTwitter.Method.POST, url, "status=" +   System.Web.HttpUtility.UrlEncode(decodedText, textEncoding)); 
}

No matter what I do, the results never end up back to normal.

EDIT: Got it in the end. For those following at home, it was pretty close to the answer listed below in the end. It was just Visual Studios debugger was steering me the wrong way and a bug in the Twitter Library I was using. End result was this:


decodedText = textEncoding.GetString(System.Convert.FromBase64String(bits[3]));
byte[] originalBytes = textEncoding.GetBytes(decodedText);
byte[] utfBytes = System.Text.Encoding.Convert(textEncoding, System.Text.Encoding.UTF8, originalBytes);
// now, back to string form
decodedText = System.Text.Encoding.UTF8.GetString(utfBytes);

Thanks all.

标签： c# encoding twitter

2条回答

祖国的老花朵

2楼-- · 2019-06-06 07:40

Your understanding of how the text is encoded seems correct. In python

'GyRCJEtFajlGJEckLSRrGyhC'.decode('base64').decode('ISO-2022-JP')

returns the correct unicode string. Note that you need to decode base64 first in order to get the ISO-2022-JP-encoded text.

0人赞添加讨论(0) 举报

对你真心纯属浪费

3楼-- · 2019-06-06 07:56

This produced the output you are looking for:

using System;
using System.Text;

class Program {
  static void Main(string[] args) {
    string input = "に投稿できる";
    Console.WriteLine(EncodeTwit(input));
    Console.ReadLine();
  }
  public static string EncodeTwit(string txt) {
    var enc = Encoding.GetEncoding("iso-2022-jp");
    byte[] bytes = enc.GetBytes(txt);
    char[] chars = new char[(bytes.Length * 3 + 1) / 2];
    int len = Convert.ToBase64CharArray(bytes, 0, bytes.Length, chars, 0);
    return "=?ISO-2022-JP?B?" + new string(chars, 0, len) + "?=";
  }
}

Standards are great, there are so many to choose from. ISO never disappoints, there are no less than 3 ISO-2022-JP encodings. If you have trouble then also try encodings 50221 and 50222.

0人赞添加讨论(0) 举报

How to deal with ISO-2022-JP ( and other character

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间