Here is a beginner question on Unicode. I'm using Embarcadero C++ Builder 2009, where they supposedly changed the default strings to use Unicode.
- I type various symbols in my source editor that aren't part of standard 7-bit ASCII.
- My program is using the String type of C++ Builder to fetch user input.
- I am also adding input manually by setting a value to a wchar_t.
It would seem that there are conflicts in how the symbols are interpreted. Sometimes I get a symbol with for example the code 0x00C7 ('Ç'), but sometimes the same symbol is coded as 0xFFC7, for example in the source code editor. To my understanding, the former is proper Unicode, the latter is "something else". Can someone confirm this?
I wonder where this "something else" encoding is coming from, and how to get rid of it?
EDIT: Further research: it seems that one place where the 0xFF** encoding appears is when I do something like this:
string str = ...;
wchar_t wch = (wchar_t)str[i];
Same result no matter if it is std::string or VCL String. Is wchar_t not the same as Unicode?
The wide character type wchar_t is implementation-defined and need not correspond to the concept of a Unicode character. Check out the description of wchar_t in the Unicode Standard.
If “Ç” is changed to 0xFFC7, it looks very much like sign extension, i.e. the character is internally stored as byte 0xC7 which is then taken as a signed 8-bit integer and converted to a 16-bit integer with sign extension.
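To make the arithmetic concrete, here is a minimal sketch of the two ways a byte can be widened to 16 bits. The function names are made up for illustration, and the sketch assumes the usual platform where signed char is 8 bits, short is 16 bits, and negative numbers use two's complement.

```cpp
// Hypothetical helper: treat the byte as a signed 8-bit value, then widen it.
// The sign bit (bit 7) is copied into bits 8..15 of the result.
unsigned short widen_as_signed(unsigned char byte)
{
    signed char narrow = static_cast<signed char>(byte); // 0xC7 becomes -57
    short wide = narrow;                                  // still -57 after widening
    return static_cast<unsigned short>(wide);             // bit pattern 0xFFC7
}

// Hypothetical helper: treat the byte as an unsigned 8-bit value, then widen it.
// The upper bits are filled with zeros.
unsigned short widen_as_unsigned(unsigned char byte)
{
    return byte;                                          // 0xC7 stays 0x00C7
}
```

With these definitions, widen_as_signed(0xC7) gives 0xFFC7 and widen_as_unsigned(0xC7) gives 0x00C7. Bytes below 0x80, such as 0x41, come out the same either way, which is why plain ASCII text never shows the problem.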
I'm guessing the problem is that in your compiler char is signed (the standard allows it to be either signed or unsigned; it's implementation-defined). As such, whenever you convert chars that have bit 7 set to 1 (0x80 through 0xFF) into any larger integer type, the value is treated as negative and gets sign-extended to preserve that negative value. In other words, bit 7 gets copied to bit 8, bit 9 and so on, into all higher bits of the bigger integer type. So 0xC7 can turn into 0xFFC7 or 0xFFFFFFC7. To prevent that from happening, cast chars to unsigned chars first.
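A minimal sketch of that cast, applied to the str[i] case from the question, might look like the following. widen_byte and widen_bytes are hypothetical helper names, not part of the VCL or the standard library. The cast only removes the sign extension: the result is the correct Unicode code point only if the byte string is encoded as Latin-1 (ISO-8859-1), which is an assumption here; other code pages need a real conversion.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: widen one char without sign extension.
// Going through unsigned char maps the byte to 0..255 before it is widened.
inline wchar_t widen_byte(char c)
{
    return static_cast<wchar_t>(static_cast<unsigned char>(c));
}

// Hypothetical helper: widen every byte of a narrow string.
// Assumes the bytes are Latin-1, so each byte value equals its code point.
inline std::wstring widen_bytes(const std::string& s)
{
    std::wstring out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i)
        out.push_back(widen_byte(s[i]));
    return out;
}
```

With this, widen_byte('\xC7') yields 0x00C7 whether char is signed or unsigned on the compiler, whereas a direct (wchar_t)str[i] can yield 0xFFC7 when char is signed.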