VC++ compiler /source-charset:utf-8 doesn't wo

Posted 2019-07-25 15:26

Question:

While experimenting with code units under UTF-8 in Visual Studio, I encountered several pitfalls:

  1. By default, VS saves the source file in the encoding of the system locale; for me that is GB2312 (code page 936, a Chinese encoding).

    Solution: I used Save As and saved the file as UTF-8 without signature.

  2. Then I found that, by default, the compiler also interprets the source file using the system-locale encoding, which is still GB2312, so I got puzzling warnings and syntax errors.

    Solution: I compiled with /source-charset:utf-8 and the warnings and errors went away. But the reported size is 2 ('知' is encoded with 2 code units in GB2312), whereas it should be 3 under UTF-8.

'知' Unicode reference https://unicode-table.com/en/77E5/

(I think any character that exists in both your current system encoding and UTF-8, but with a different number of code units, can be used for a similar test.)
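To check the expected UTF-8 code unit count for a test character independently of any compiler flag, the code point can be encoded by hand. This is only a minimal sketch: the function name encode_utf8 is made up for illustration, and it does not reject surrogate code points.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Values above 0x10FFFF yield an empty string; surrogates are not rejected.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 code unit: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 code units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 code units: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 code units
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For '知' (U+77E5), encode_utf8(0x77E5) gives the three bytes E7 9F A5, which is the size the test program should report once the string really is UTF-8 at runtime.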

Code:

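#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "知";
    cout << s.size() << endl;  // number of char code units stored in the string
    cout << s << endl;         // raw bytes; the console decides how to display them
}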

Moreover, the Windows cmd and PowerShell also use the system-locale encoding (type chcp in cmd to see it), so I can't print characters like ə.

So there are three things I need to take care of:

  1. The source file encoding.
  2. Whether the compiler interprets the source file as expected.
  3. Whether cmd can display the character, which may fail even if 1. and 2. are satisfied.
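For the third point, one possible approach is to switch the console's output code page from inside the program. The sketch below assumes the Win32 function SetConsoleOutputCP with the CP_UTF8 constant (code page 65001); the helper names use_utf8_console and print_schwa are made up for illustration. Whether the glyph is actually displayed still depends on the console font.

```cpp
#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: ask the Windows console to treat program output as UTF-8.
// On other platforms it does nothing and reports success.
bool use_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;  // CP_UTF8 is code page 65001
#else
    return true;
#endif
}

// Print 'ə' (U+0259) from explicit UTF-8 bytes, so the bytes written do not
// depend on the source or execution charset chosen at compile time.
void print_schwa() {
    use_utf8_console();
    std::cout << "\xC9\x99" << std::endl;
}
```

Running chcp 65001 in cmd before starting the program is the manual equivalent of the call above.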

I also have some questions arising from this experience:

  1. Why does Windows behave like this? Can't it just use UTF-8 everywhere? I copied the same file to a Mac and everything worked as expected, and it is very easy to set the encoding of the Mac terminal.
  2. Some posts I found say the reason is that some encoding standards (like GB2312) were created before UTF-8 came out, and many of them are not compatible with UTF-8, so they continue to be used for compatibility.

    But I wonder how the incompatibility would arise. For example, I downloaded Notepad++ and installed all the language packs. My system's encoding is GB2312, but I can still change the display language of Notepad++ to Japanese and it displays fine, with nothing like ????.
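One concrete way to see where such an incompatibility can come from: the same bytes mean different characters in different encodings. As far as I can tell, '知' is stored in GB2312 as the two bytes D6 AA, and those two bytes also happen to form a well-formed two-byte UTF-8 sequence for a completely different code point. A minimal sketch, with a made-up helper name and no check for overlong encodings:

```cpp
#include <cstdint>

// Hypothetical helper: decode a two-byte UTF-8 sequence
// (lead byte 110xxxxx, continuation byte 10xxxxxx).
// Returns 0 if the bytes do not have that shape.
std::uint32_t decode_two_byte_utf8(unsigned char lead, unsigned char cont) {
    if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80) {
        return 0;
    }
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}
```

Here decode_two_byte_utf8(0xD6, 0xAA) gives U+05AA rather than U+77E5, so a program that guesses the wrong encoding silently reads a different character.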

Answer 1:

The term "source charset" is no coincidence here. The C++ standard explicitly differentiates between the (basic) source character set (96 common characters, all found in plain ASCII) and the execution character set.

Since you used UTF-8 as the source character set, '知' is mapped to \u77E5.

At runtime, however, you're using the execution character set. The VC++ /source-charset option does not affect VC++'s execution character set; for that there is an /execution-charset option.
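A rough way to check which execution character set the compiler actually applied is to compare the literal against explicitly written UTF-8 bytes. In the sketch below the function name literal_is_utf8 and the file name main.cpp are made up for illustration, and the file is assumed to be saved as UTF-8 without signature.

```cpp
#include <string>

// Possible MSVC command lines (main.cpp is a placeholder file name):
//   cl /EHsc /source-charset:utf-8 /execution-charset:utf-8 main.cpp
// or, setting both charsets at once:
//   cl /EHsc /utf-8 main.cpp

// Hypothetical check: true when the compiler stored the literal "知"
// as the UTF-8 bytes E7 9F A5 in the compiled program.
bool literal_is_utf8() {
    const std::string literal = "知";
    const std::string expected = "\xE7\x9F\xA5";
    return literal == expected;
}
```

With only /source-charset:utf-8 on a system whose locale encoding is GB2312, the literal is converted to the two-byte GB2312 form, so this check fails and size() reports 2, which matches what the question observed.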

But as @Matteo Italia already notes, the VC++ runtime is known to be more than a little flaky when it comes to UTF-8 I/O. std::string::size should work, but std::cout might not.