What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?
For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)
Furthermore, can I use Unicode for strings? For example:
std::wstring str = L"Strange chars: â Țđ ě €€";
AFAIK it's not standardized; you can put any kind of characters in wide strings. You just have to make sure your compiler treats the source code as Unicode for it to work right.
The C++ standard doesn't say anything about source-code file encoding, so far as I know.
The usual encoding is (or used to be) 7-bit ASCII -- some compilers (Borland's, for instance) would balk at characters that used the high bit. There's no technical reason Unicode characters can't be used, if your compiler and editor accept them -- most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure Microsoft's compiler will.
EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes produce errors on 8-bit ASCII too.
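For what it's worth, here's a minimal sketch of a UTF-8-encoded source file with non-ASCII text in a comment and a string literal; recent GCC, Clang and MSVC generally accept it, though you may need to tell the compiler the encoding explicitly (e.g. GCC's -finput-charset=UTF-8, MSVC's /utf-8, or saving the file with a BOM for MSVC):

    #include <iostream>
    #include <string>

    // 中文注释：a comment containing Chinese characters, stored as UTF-8 bytes
    int main() {
        std::string s = "Strange chars: â Țđ ě €";  // how these bytes end up encoded in the
                                                    // program depends on the execution charset
        std::cout << s << '\n';
        return 0;
    }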
In addition to litb's post, MSVC++ supports Unicode too. I understand it gets the Unicode encoding from the BOM. It definitely supports code like
int (*♫)();
or
const std::set<int> ∅;
If you're really into code obfuscation, identifiers like that open up plenty of possibilities.
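Which raw characters a compiler accepts in identifiers varies, though; the universal-character-name form (\uXXXX) is the more portable spelling. A tiny illustrative sketch, with a made-up identifier:

    #include <iostream>

    // \u00E9 is 'é'; C++11 allows this code point in identifiers, whether written
    // as an escape or (if the compiler accepts it) as a raw character.
    int \u00E9tat = 42;   // i.e. the identifier "état"

    int main() {
        std::cout << \u00E9tat << '\n';   // prints 42
        return 0;
    }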
There are two issues at play here. The first is what characters are allowed in C++ code (and comments), such as variable names. The second is what characters are allowed in strings and string literals.

As noted, C++ compilers must support a very restricted ASCII-based character set for the characters allowed in code and comments. In practice, this character set didn't work very well with some European character sets (and especially with some European keyboards that didn't have a few characters -- like square brackets -- available), so the concept of digraphs and trigraphs was introduced. Many compilers accept more than this character set these days, but there isn't any guarantee.
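For the curious, here is what digraphs look like in practice (trigraphs were similar but were removed in C++17); this is standard but rarely-seen C++:

    #include <iostream>

    // Digraph spellings: <% %> stand for { }, <: :> for [ ], and %: for #
    int main() <%
        int values<:3:> = <% 1, 2, 3 %>;
        std::cout << values<:0:> + values<:2:> << '\n';   // prints 4
        return 0;
    %>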
As for strings and string literals, C++ has the concept of a wide character and a wide-character string. However, the encoding for that character set is undefined. In practice it's almost always Unicode, but I don't think there's any guarantee here. Wide-character string literals look like L"string literal", and these can be assigned to std::wstring.
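A small sketch of that machinery; note that both the size of wchar_t and the encoding it carries are implementation-defined (commonly 16-bit UTF-16 code units on Windows, 32-bit UTF-32 on Linux):

    #include <iostream>
    #include <string>

    int main() {
        std::wstring ws = L"wide literal: Țđ ě";   // L"..." yields const wchar_t[]
        std::wcout << L"wchar_t is " << sizeof(wchar_t) * 8 << L" bits; "
                   << ws.size() << L" code units in the string\n";
        return 0;
    }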
C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8, UTF-16, and UTF-32.
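For concreteness, the C++11 literal forms look like this (a sketch; converting between these encodings, or printing them, is a separate question):

    #include <string>

    int main() {
        auto           u8s = u8"UTF-8 literal";   // const char* up to C++17, const char8_t* in C++20
        std::u16string u16 = u"UTF-16 literal";   // sequence of char16_t code units
        std::u32string u32 = U"UTF-32 literal";   // sequence of char32_t code units
        std::u32string cp  = U"\U0001F600";       // a code point outside the BMP: one char32_t
        (void)u8s;
        return (u16.size() == 14 && u32.size() == 14 && cp.size() == 1) ? 0 : 1;   // returns 0
    }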
It's also worth noting that wide characters in C++ aren't really Unicode strings as such. They are just strings of larger characters, usually 16 but sometimes 32 bits. This is implementation-defined, though; IIRC you can have an 8-bit wchar_t. You have no real guarantee as to the encoding in them, so if you are trying to do something like text processing, you will probably want a typedef to the integer type most suitable for your Unicode entity.

C++1x adds additional Unicode support in the form of UTF-8-encoded string literals (u8"text"), UTF-16 and UTF-32 data types (char16_t and char32_t, IIRC), and the corresponding string constants (u"text" and U"text"). The encoding of characters specified without \uxxxx or \Uxxxxxxxx constants is still implementation-defined, though (and there is no encoding support for complex string types outside the literals).
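A rough illustration of that last point; the unescaped character depends on how the compiler decodes the source file, while the escaped form names the code point directly (the variable names here are just placeholders):

    int main() {
        wchar_t raw = L'é';        // which wide-character value this yields is
                                   // implementation-defined (source + execution charsets)
        wchar_t esc = L'\u00E9';   // named by code point, so independent of how the
                                   // source file happened to be decoded
        return (raw == esc) ? 0 : 1;   // equal on most toolchains, but not guaranteed
    }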
In this context, if you get MSVC++ warning C4819, just change the source file encoding to "UTF-8 with BOM".
GCC 4.1 doesn't support this, but GCC 4.4 does, and the latest Qt version uses GCC 4.4, so use "UTF-8 with BOM" as the source file encoding.