How to read a UCS-2 file?

I'm writing a program to extract information from a *.rc file encoded in UCS-2 Little Endian.

int _tmain(int argc, _TCHAR* argv[]) {
  wstring csvLine(wstring sLine);
  wifstream fin("en.rc");
  wofstream fout("table.csv");
  wofstream fout_rm("temp.txt");
  wstring sLine;
  fout << "en\n";
  while(getline(fin,sLine)) {
    if (sLine.find(L"IDS") == wstring::npos)
      fout_rm << sLine << endl;
    else
      fout << csvLine(sLine);
  }
  fout << flush;
  system("pause");
  return 0;
}

The first line of "en.rc" is #include <windows.h>, but sLine contains the following:

[0]     255 L'ÿ'
[1]     254 L'þ'
[2]     35  L'#'
[3]     0
[4]     105 L'i'
[5]     0
[6]     110 L'n'
[7]     0
[8]     99  L'c'
.       .
.       .
.       .

This program works correctly for UTF-8 files. How can I make it work for UCS-2?

Tags: c++ unicode encoding character-encoding wofstream

1 Answer

Bombasti

#2 · 2019-02-10 16:14

Wide streams use a wide stream buffer to access the file. The wide stream buffer reads bytes from the file and uses its codecvt facet to convert those bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char, std::mbstate_t>, which converts between the native character sets for wchar_t and char (i.e., like mbstowcs() does). That is why each byte of your file, including the two BOM bytes and every zero byte, shows up as its own wchar_t.

Your file isn't in the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main(int argc, char *argv[])
{
    // Open the file in binary mode so the raw bytes reach the facet unmodified
    std::wifstream fin("en.rc", std::ios::binary);

    // Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
    fin.imbue(std::locale(fin.getloc(),
              new std::codecvt_utf16<wchar_t, 0xffff, std::consume_header>));

    // ^ We set 0xFFFF as the maxcode because that's the largest value that fits
    //   in a single wchar_t on Windows, where wchar_t is 16 bits.
    //   We use std::consume_header to detect the UTF-16 BOM and take the byte order from it.

    // The following is not really the correct way to write Unicode output, but it's easy:
    // each line is converted to UTF-8 and written to the narrow stream.
    std::wstring sLine;
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    while (std::getline(fin, sLine))
    {
        std::cout << convert.to_bytes(sLine) << '\n';
    }
}

Note that there's an issue with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one codepoint. However, Windows uses UTF-16, which represents some codepoints as two wchar_ts. This means that the standard API doesn't work very well on Windows.

The consequence here is that when the file contains a surrogate pair, codecvt_utf16 will read that pair, convert it to a single codepoint value that needs more than 16 bits, and would have to truncate that value to 16 bits to store it in a wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.
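
To see why a surrogate pair cannot fit in a 16-bit wchar_t, here is a minimal sketch of the standard UTF-16 arithmetic that turns a pair into one codepoint. The function name combine_surrogates is made up for this illustration and is not part of the standard library.

```cpp
#include <cassert>

// Hypothetical helper (illustration only): combine a UTF-16 high/low
// surrogate pair into the single codepoint it encodes.
char32_t combine_surrogates(char16_t high, char16_t low)
{
    // High surrogates are 0xD800..0xDBFF, low surrogates are 0xDC00..0xDFFF.
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);

    // Each surrogate carries 10 bits; the result is offset by 0x10000,
    // so it is always above 0xFFFF and needs more than 16 bits.
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

Every result of this function is at least 0x10000, which is exactly the range a 16-bit wchar_t cannot hold.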

There are a number of other problems with wchar_t, and you might want to just avoid it entirely: What's “wrong” with C++ wchar_t?
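
If you do want to avoid wchar_t, one possible approach is to read the raw bytes yourself and assemble 16-bit code units by hand. The sketch below assumes the whole file has already been read into a std::string of bytes; decode_ucs2_le is a name invented for this example, and it does no surrogate handling, so like the code above it is effectively UCS-2 only.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): decode raw UCS-2 little-endian
// bytes into 16-bit code units, skipping a leading FF FE byte order mark.
std::u16string decode_ucs2_le(const std::string& bytes)
{
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("UCS-2 data must have an even byte count");

    std::size_t i = 0;
    // Skip the little-endian BOM (bytes 0xFF 0xFE), as seen in the question's dump.
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE)
        i = 2;

    std::u16string out;
    out.reserve((bytes.size() - i) / 2);
    for (; i < bytes.size(); i += 2) {
        // Little-endian: the low byte comes first, then the high byte.
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return out;
}
```

The input would typically come from a std::ifstream opened with std::ios::binary. Splitting the result into lines (and dropping the trailing u'\r' of Windows line endings) is left to the caller.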
