What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?
For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)
Furthermore, can I use Unicode for strings? For example:
wstring str=L"Strange chars: â Țđ ě €€";
Encoding in C++ is quite complicated. Here is my understanding of it.
Every implementation has to support characters from the basic source character set. These include the common characters listed in §2.2/1 (§2.3/1 in C++11). These characters should all fit into one char. In addition, implementations have to support a way to name other characters, called universal-character-names. These look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).

This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation-defined. This constitutes the encoding used. Here is what it says literally (C++98 version):
For gcc, you can change it using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The proper options for this are -fexec-charset=charset for char (it defaults to utf-8) and -fwide-exec-charset=charset (which defaults to either utf-16 or utf-32, depending on the size of wchar_t).

For encoding in strings, I think you are meant to use the \u notation, e.g.:
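A minimal sketch of the \u notation in string literals, assuming a C++11 (or later) compiler and the default wide execution character set. The helper names euro_wide, euro_utf32 and euro_utf8_length are made up for this illustration; U+20AC is the euro sign from the question's example string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers, named only for this illustration.

// U+20AC (EURO SIGN) spelled as a universal-character-name in a wide literal.
// The source text is pure ASCII, so the file's on-disk encoding is irrelevant.
inline std::wstring euro_wide() {
    return L"\u20AC";
}

// The same character in a char32_t literal (C++11): one UTF-32 code unit.
inline std::u32string euro_utf32() {
    return U"\u20AC";
}

// In a u8 literal (C++11) the character is stored as UTF-8.
// sizeof counts the terminating null, so subtract one code unit.
inline std::size_t euro_utf8_length() {
    return sizeof(u8"\u20AC") - 1;
}
```

Calling euro_wide() should give a one-element string holding 0x20AC whether wchar_t is 16 or 32 bits wide, and euro_utf8_length() should give 3, since U+20AC takes three bytes in UTF-8. Because only ASCII appears in the source file, none of this should depend on -finput-charset.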