Why does a filename have different bytes after converting to UTF-8 and back?

Posted 2019-07-07 17:23

I have the following file:

I use ReadDirectoryChangesW to watch for changes in the current folder, and it reports this path for the file: L"TEST Ӡ⬨☐.ipt":

Next, I want to convert it to UTF-8 and back:

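#include <windows.h>
#include <cassert>
#include <string>
#include <vector>

// UTF-16 -> UTF-8: the first call queries the required size, the second converts.
std::string wstringToUtf8(const std::wstring& source) {
  const int size = WideCharToMultiByte(CP_UTF8, 0, source.data(), static_cast<int>(source.size()), NULL, 0, NULL, NULL);
  std::vector<char> buffer8(size);
  WideCharToMultiByte(CP_UTF8, 0, source.data(), static_cast<int>(source.size()), buffer8.data(), size, NULL, NULL);
  return std::string(buffer8.data(), buffer8.size());
}

// UTF-8 -> UTF-16: same two-call pattern in the other direction.
std::wstring utf8ToWstring(const std::string& source) {
  const int size = MultiByteToWideChar(CP_UTF8, 0, source.data(), static_cast<int>(source.size()), NULL, 0);
  std::vector<wchar_t> buffer16(size);
  MultiByteToWideChar(CP_UTF8, 0, source.data(), static_cast<int>(source.size()), buffer16.data(), size);
  return std::wstring(buffer16.data(), buffer16.size());
}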

int main() {
    // Some code with ReadDirectoryChangesW and 
    // ...
    // std::wstring fileName = L"TEST Ӡ⬨☐.ipt";
    // ...

    std::string filenameUTF8 = wstringToUtf8(fileName);
    std::wstring filename2 = utf8ToWstring(filenameUTF8);
    assert(fileName == filename2); // FAIL!
    return 0;
}

But the assertion fails. filename2:

Different bits: [29]

Why?

Tags: c++ c winapi encoding utf-8

1 answer

迷人小祖宗

#2 · 2019-07-07 17:40

57216 (0xDF80) falls in the surrogate range (0xD800 to 0xDFFF), which UTF-16 uses to encode non-BMP code points. Surrogates must be given in pairs, a high surrogate followed by a low surrogate, or decoding won't give you a correct code point.

65533 (U+FFFD, the replacement character) is a special error character that the converter substitutes because the other surrogate of the pair is missing.
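To make the substitution concrete, here is a minimal portable sketch of a UTF-16 to UTF-8 converter that swaps any unpaired surrogate for U+FFFD. It only models the behavior described above; it is not how WideCharToMultiByte is implemented, and the names appendUtf8 and utf16ToUtf8Lossy are hypothetical helpers invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point to `out` as UTF-8.
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: UTF-16 -> UTF-8, replacing unpaired surrogates with U+FFFD.
std::string utf16ToUtf8Lossy(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t cu = in[i];
        const bool isHigh = cu >= 0xD800 && cu <= 0xDBFF;
        const bool isLow = cu >= 0xDC00 && cu <= 0xDFFF;
        if (isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: combine both halves into one non-BMP code point.
            const char32_t lo = in[i + 1];
            appendUtf8(out, 0x10000 + ((cu - 0xD800) << 10) + (lo - 0xDC00));
            ++i;  // the low surrogate has been consumed
        } else if (isHigh || isLow) {
            // Lone surrogate: the original value is lost at this point.
            appendUtf8(out, 0xFFFD);
        } else {
            appendUtf8(out, cu);
        }
    }
    return out;
}
```

With this model, a lone 0xDF80 becomes the three bytes EF BF BD (U+FFFD in UTF-8), and decoding those bytes back can only ever yield 65533, never the original 57216. That is why the round trip cannot reproduce the input.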

To put it another way: your original string is not a valid UTF-16 string.
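One way to check this before converting is to scan the string for unpaired surrogates. The sketch below is a portable illustration over std::u16string; findUnpairedSurrogate is a hypothetical name, and on Windows the same loop would apply to a std::wstring, since wchar_t holds 16-bit code units there.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first unpaired surrogate,
// or std::u16string::npos if the string is well-formed UTF-16.
std::size_t findUnpairedSurrogate(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t cu = s[i];
        if (cu >= 0xD800 && cu <= 0xDBFF) {
            // A high surrogate must be followed immediately by a low surrogate.
            if (i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                ++i;  // skip the low half of a valid pair
                continue;
            }
            return i;
        }
        if (cu >= 0xDC00 && cu <= 0xDFFF) {
            // A low surrogate with no preceding high surrogate.
            return i;
        }
    }
    return std::u16string::npos;
}
```

If this returns anything other than npos for a filename, a UTF-8 round trip will not be lossless. This can happen with real files because NTFS stores names as sequences of 16-bit units without requiring them to be valid UTF-16. As far as I know, passing WC_ERR_INVALID_CHARS as the dwFlags argument of WideCharToMultiByte with CP_UTF8 makes the call fail with ERROR_NO_UNICODE_TRANSLATION instead of silently substituting U+FFFD, which is another way to detect the problem.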

More info on Wikipedia.
