I know that UTF-16 has two endiannesses: big-endian and little-endian.
Does the C++ standard define the endianness of std::wstring, or is it implementation-defined?
If it is defined by the standard, which section of the C++ standard provides the rules on this issue?
If it is implementation-defined, how can I determine it, e.g. under VC++? Does the compiler guarantee that the endianness of std::wstring depends strictly on the processor?
I need to know this because I want to send a UTF-16 string to others, and I must add the correct BOM at the beginning of the UTF-16 string to indicate its endianness.
In short: Given a std::wstring, how should I reliably determine its endianness?
Endianness is MACHINE dependent, not language dependent. It is determined by the processor and how it arranges data into and out of memory. When dealing with wchar_t (which is wider than a single byte), the processor itself, on a read or write, orders the multiple bytes as it needs to in order to move them to and from RAM. Code simply sees the 16-bit (or larger) value as it is represented in a processor-internal register.
To determine endianness on your own (if that is really what you want to do), you could write a KNOWN 32-bit (unsigned int) value out to RAM, then read it back through a char pointer and look at which byte comes first.
It would look something like this:
#include <cstdio>
int main() {
    unsigned int aVal = 0x11223344;
    // The first byte in memory is 0x11 (the most significant byte) only on a big-endian machine.
    unsigned char *myValReadBack = (unsigned char *)&aVal;
    if (*myValReadBack == 0x11) printf("Big endian\n");
    else printf("Little endian\n");
}
I'm sure there are other ways, but something like the above should work; double-check my little versus big, though :-)
Further, until Windows RT, VC++ really only compiled for Intel-type processors, which have only ever had one endianness (little-endian).
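If your compiler supports C++20, you can also ask the compiler instead of probing memory: std::endian in <bit> exposes the target's byte order as a compile-time constant. A minimal sketch, assuming a C++20 compiler:

#include <bit>
#include <cstdio>

int main() {
    // std::endian::native reports the byte order of the target platform (C++20).
    if constexpr (std::endian::native == std::endian::big)
        std::printf("Big endian\n");
    else if constexpr (std::endian::native == std::endian::little)
        std::printf("Little endian\n");
    else
        std::printf("Mixed endian\n");
}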
It is implementation-defined. wstring is just a string of wchar_t, and that can be any byte ordering, or for that matter, any old size.
wchar_t is not required to be UTF-16 internally, and UTF-16 endianness does not affect how wchar_t values are stored; it is a matter of how you save and read them.
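For example, a quick check of the width shows how much this varies between implementations; wchar_t is typically 2 bytes with MSVC on Windows and 4 bytes with GCC/Clang on Linux, though neither size is guaranteed:

#include <cstdio>

int main() {
    // wchar_t is commonly 2 bytes (UTF-16 code units) on Windows and
    // 4 bytes (UTF-32 code points) on most Unix-like systems.
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}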
You have to use an explicit procedure to convert a wstring to a UTF-16 byte stream before sending it anywhere. The internal endianness of wchar_t is architecture-dependent, and it is better to use an opaque conversion interface than to try to convert it manually.
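To illustrate what "explicit" means here, below is a minimal sketch that serializes a wstring to big-endian UTF-16 bytes regardless of the host's byte order. The function name toUtf16BE is invented for this example, and it assumes wchar_t already holds UTF-16 code units (as it does on Windows):

#include <cstdint>
#include <string>
#include <vector>

// Serialize a UTF-16 wstring to big-endian bytes, independent of the
// host's endianness. Assumes wchar_t holds UTF-16 code units (true on
// Windows); surrogate pairs pass through as ordinary code units.
std::vector<std::uint8_t> toUtf16BE(const std::wstring &s) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(s.size() * 2 + 2);
    bytes.push_back(0xFE);  // big-endian BOM: FE FF
    bytes.push_back(0xFF);
    for (wchar_t wc : s) {
        std::uint16_t u = static_cast<std::uint16_t>(wc);
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));   // high byte first
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF)); // then low byte
    }
    return bytes;
}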
For the purposes of sending the correct BOM, you don't need to know the endianness. Just use the code point \uFEFF. That will come out big-endian or little-endian depending on the endianness of your implementation. You don't even need to know whether your implementation is UTF-16 or UTF-32. As long as it is some Unicode encoding, you'll end up with the appropriate BOM.
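A minimal sketch of that idea; the file name and the raw write are only illustrative, and the point is that the BOM is stored in the same native byte order as the rest of the string:

#include <fstream>
#include <string>

int main() {
    std::wstring text = L"hello";
    // Prepend U+FEFF; it is stored in the same (native) byte order as the
    // rest of the string, so the BOM automatically matches the payload.
    std::wstring payload = L"\uFEFF" + text;

    std::ofstream out("out.bin", std::ios::binary);
    out.write(reinterpret_cast<const char *>(payload.data()),
              payload.size() * sizeof(wchar_t));
}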
Unfortunately, neither wchar_t nor wide streams are guaranteed to be Unicode.