Convert GB2312 to UTF-8

Posted 2019-01-12 00:41

Question:

I have a text file that contains localized language strings and is currently encoded in GB2312 (Simplified Chinese), but all of my other language files are in UTF-8. I am finding it very difficult to work with this file, as none of my text editors handle it properly and they keep corrupting it. Are there any tools to convert it to UTF-8, and are there any downsides to doing this? Would it be better to just keep it as GB2312 and use a different editor (if so, can you recommend one)?

Update: I'm using Windows XP (English install).

Update #2: I've tried using Notepad++ and Notepad2 to edit the GB2312 files, but both are unable to read the files and corrupt them.

Answer 1:

You can try this online service that uses the Open Source iconv utility.
You can also install Charco, a command-line version of it on your machine.

For GB2312, you can use CP936 as the encoding. CP936 is the Windows code page for GBK, which is a superset of GB2312.
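If you would rather script the conversion than rely on an online service, here is a minimal Python sketch of the same idea. It does roughly what an `iconv -f CP936 -t UTF-8` call would do. The function name `convert_file` and the file names in the usage line are made up for illustration, and it assumes the whole file fits in memory.

```python
def convert_file(src_path, dst_path, src_encoding="cp936", dst_encoding="utf-8"):
    """Re-encode a text file from src_encoding to dst_encoding."""
    # newline="" turns off newline translation, so the original
    # line endings (CRLF or LF) pass through unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()  # raises UnicodeDecodeError on invalid bytes
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

Usage would look like `convert_file("strings_zh.txt", "strings_zh.utf8.txt")`. Writing to a new file rather than overwriting the input keeps the original intact if the source turns out not to be GB2312 after all.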

If you are a .Net developer you can make a small tool that does just that.
I've struggled with this as well and found that it was actually simple to solve from a programmatic point of view.

All you need is something like this (I tested it and it works):

In C#

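using System.IO;
using System.Text;

class Gb2312ToUtf8 {
    static void Main(string[] args) {
        string infile = args[0];   // path of the GB2312 input file
        string outfile = args[1];  // path of the UTF-8 output file

        // Code page 936 (GBK) is a superset of GB2312.
        using (StreamReader sr = new StreamReader(infile, Encoding.GetEncoding(936)))
        // Encoding.UTF8 writes a byte order mark at the start of the file.
        using (StreamWriter sw = new StreamWriter(outfile, false, Encoding.UTF8)) {
            // The using blocks flush and close both streams, so no explicit Close() is needed.
            sw.Write(sr.ReadToEnd());
        }
    }
}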

In VB.Net

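Imports System.IO
Imports System.Text

Module Gb2312ToUtf8
    Sub Main(ByVal args() As String)
        Dim infile As String = args(0)   ' path of the GB2312 input file
        Dim outfile As String = args(1)  ' path of the UTF-8 output file
        ' Code page 936 (GBK) is a superset of GB2312.
        Using sr As New StreamReader(infile, Encoding.GetEncoding(936))
            Using sw As New StreamWriter(outfile, False, Encoding.UTF8)
                ' The Using blocks flush and close both streams.
                sw.Write(sr.ReadToEnd())
            End Using
        End Using
    End Sub
End Module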


Answer 2:

I might be thinking a bit too simple here, but if it's just this one plain text file, you could try the following:

  1. Replace all & by &amp;, all < by &lt; and all > by &gt; (to be on the safe side)
  2. Prepend the following to the text file:

    <html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>

  3. Open the file in your favorite browser

  4. Select and copy all text
  5. Paste it in Notepad and save as UTF-8.

You'd be done with this before you could have written any code to do the conversion or downloaded any programs that would do the conversion for you.

Of course, I'm not a hundred percent sure this'll work, and your browser would need the correct fonts and everything, but considering you're working with these kinds of files I'm assuming you already have those.
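If the file is too large to escape by hand, steps 1 and 2 could be automated with something like the Python sketch below. It works on raw bytes, on the assumption that the file really is GB2312: there, both bytes of every two-byte character are in the range 0xA1 to 0xFE, so the ASCII bytes for `&`, `<` and `>` never occur inside a Chinese character. The name `wrap_for_browser` is just an illustration.

```python
# The header from step 2, as bytes, so the file is never decoded here.
HEADER = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=gb2312" /></head><body><pre>')

def wrap_for_browser(raw: bytes) -> bytes:
    """Escape HTML special characters and prepend the charset header."""
    # Escape & first; otherwise the ampersands introduced by
    # &lt; and &gt; would themselves be escaped a second time.
    escaped = (raw.replace(b"&", b"&amp;")
                  .replace(b"<", b"&lt;")
                  .replace(b">", b"&gt;"))
    return HEADER + escaped
```

You would read the original file in binary mode, pass its contents through this function, write the result to an `.html` file, and continue from step 3.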



Answer 3:

GB 2312 is mostly compatible with GB 18030, so any tool able to deal with the latter should treat GB 2312 correctly as well. There are many tools for converting GB 18030 to UTF-8 (or some other Unicode encoding form), but I can't recommend any specific one for Windows, because I work on Unix. If you want to write a bit of code, the iconv library or ICU comes to mind: you'll find all the conversion data readily available in these libraries.

Conversion from GB 2312 to UTF-8 is safe and lossless, because every GB 2312 character has a Unicode equivalent, so you shouldn't worry about it.
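You can convince yourself of both points with a quick round trip. This is a small Python check on a made-up sample string, not a proof for every possible file; it relies only on Python's built-in `gb2312`, `cp936` and `gb18030` codecs.

```python
sample = "简体中文 strings"           # hypothetical sample text
gb_bytes = sample.encode("gb2312")    # what the GB 2312 file would contain

# Decoders for the superset encodings read the same bytes identically.
assert gb_bytes.decode("gb18030") == sample
assert gb_bytes.decode("cp936") == sample

# GB 2312 -> UTF-8
utf8_bytes = gb_bytes.decode("gb2312").encode("utf-8")

# UTF-8 -> GB 2312 gives back the original bytes, so nothing was lost.
assert utf8_bytes.decode("utf-8").encode("gb2312") == gb_bytes
```

The reverse direction is the one to watch: a UTF-8 file can contain characters that GB 2312 has no code for, and encoding those back would fail.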