While experimenting with code units under UTF-8 in Visual Studio, I encountered several pitfalls:
By default, VS saves the source file in the encoding tied to the system locale; for me that is GB2312 (code page 936, a Chinese encoding).
Solution: I used Save As and saved the file as UTF-8 without signature.
Then I found that, by default, the compiler also interprets the source file using the system locale encoding, which is still GB2312, so I got puzzling warnings and syntax errors.
Solution: I compiled with /source-charset:utf-8, and the warnings and errors went away. But the reported size is 2 ('知' is encoded with 2 code units in GB2312), whereas it should be 3 under UTF-8.
Unicode reference for '知': https://unicode-table.com/en/77E5/
(I think any character that exists in both your current system encoding and UTF-8, but with a different number of code units, would work for a similar test.)
Code:
#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "知";            // a single character, U+77E5
    cout << s.size() << endl;   // number of code units (bytes) stored for the literal
    cout << s << endl;
}
Moreover, Windows cmd and PowerShell also use the system locale encoding (type chcp in cmd to see the active code page). So I can't print characters like ə.
So there are three things I need to take care of:
- Source file encoding
- Whether the compiler interprets the source file as expected
- cmd may not be able to display the character even if the first two are satisfied.
Besides, this experience left me with some questions:
- Why does Windows behave like this? Can't it just use UTF-8 everywhere? I copied the same file to a Mac and everything worked as expected, and it's very easy to set the Mac terminal's encoding.
Some posts I found say the reason is that some encoding standards (like GB2312) were created before UTF-8 came out, and many of them are not compatible with UTF-8, so they continue to be used for compatibility.
But I wonder how the incompatibility would arise. For example, I downloaded Notepad++ and installed all the language packs. My system's encoding is GB2312, but I can still change the display language of Notepad++ to Japanese and it displays fine, with nothing like ???? showing up.