I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

- When should I use std::wstring over std::string?
- Can std::string hold the entire ASCII character set, including the special characters?
- Is std::wstring supported by all popular C++ compilers?
- What exactly is a "wide character"?
Applications that are not satisfied with only 256 different characters have the option of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.
The only difference between a string and a wstring is the data type of the characters they store. A string stores chars, whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.

Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding is that the indices and lengths are measured in bytes, not characters.
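For example (a sketch, assuming both the source file and the execution charset are UTF-8, as on a typical Linux system):

```cpp
#include <iostream>
#include <string>

int main()
{
    std::string s = "olé";   // 'é' is encoded as the two bytes 0xC3 0xA9 in UTF-8

    // length() and indices count bytes, not characters: this prints 4, not 3.
    std::cout << s.length() << "\n";
}
```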
The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char; it is usually 16 or 32 bits. wstring can be used for processing text in the implementation-defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.

If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.

A good question! I think DATA ENCODING (sometimes a CHARSET is also involved) is a MEMORY EXPRESSION MECHANISM used to save data to a file or transfer data via a network, so I answer these questions as follows:
1. When should I use std::wstring over std::string?
If the programming platform or API function is a single-byte one, and we want to process or parse some Unicode data, e.g. read from a Windows .REG file or a 2-byte network stream, we should declare a std::wstring variable to process them easily. E.g.: wstring ws = L"中国a" (6 octets of memory: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get the character '中', ws[1] to get '国', and ws[2] to get 'a', etc.
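A quick sketch of that indexing (assuming the source file is saved in an encoding the compiler understands; each of these code points fits in a single wchar_t on both Windows and Linux):

```cpp
#include <cassert>
#include <string>

int main()
{
    std::wstring ws = L"中国a";   // three code points: 0x4E2D 0x56FD 0x0061

    assert(ws.length() == 3);     // indices count wchar_t units, one per character here
    assert(ws[0] == L'中');
    assert(ws[1] == L'国');
    assert(ws[2] == L'a');
}
```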
2. Can std::string hold the entire ASCII character set, including the special characters?
Yes. But note: US-ASCII proper covers only 0x00~0x7F; in the extended charsets each octet 0x00~0xFF stands for one character, including printable text such as "123abc&*_&" and the special ones you mentioned, which editors or terminals mostly print as '.' to avoid confusing the display. Some other countries extend their own "ASCII"-style charsets; Chinese, for example, uses 2 octets to stand for one character.
3. Is std::wstring supported by all popular C++ compilers?
Maybe, or mostly. I have used VC++ 6 and GCC 3.3: yes.
4. What exactly is a "wide character"?
A wide character mostly means using 2 or 4 octets to hold the characters of all countries. 2-octet UCS-2 is a representative sample; e.g. for English 'a', its memory is the 2 octets 0x0061 (versus 1 octet, 0x61, for 'a' in ASCII).
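A tiny check of those values (a sketch; it assumes an ASCII-compatible narrow execution charset and a C++11 compiler):

```cpp
#include <cstdio>

int main()
{
    static_assert(L'a' == 0x0061, "wide 'a' is code point U+0061");
    static_assert('a' == 0x61,    "narrow 'a' is the single octet 0x61");

    // 2 on Windows (UTF-16/UCS-2), 4 on Linux (UCS-4).
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}
```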
string? wstring?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

char vs. wchar_t

char is supposed to hold a character, usually an 8-bit character. wchar_t is supposed to hold a wide character, and then things get tricky: on Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

What about Unicode, then?
The problem is that neither char nor wchar_t is directly tied to Unicode.

On Linux?

Let's take a Linux OS: my Ubuntu system is already Unicode-aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:
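(a minimal sketch of the program described; it prints the raw bytes of "olé" stored both as chars and as wchar_ts, assuming a UTF-8 locale)

```cpp
#include <cstring>
#include <cwchar>
#include <iostream>

int main()
{
    const char    text[]  = "olé";    // narrow string: UTF-8 bytes
    const wchar_t wtext[] = L"olé";   // wide string: one unit per code point

    std::cout << "sizeof(char)    : " << sizeof(char) << "\n";
    std::cout << "text            : " << text << "\n";
    std::cout << "strlen(text)    : " << std::strlen(text) << "\n";
    std::cout << "text(binary)    :";
    for (std::size_t i = 0; i < std::strlen(text); ++i)
        std::cout << " " << static_cast<unsigned int>(static_cast<unsigned char>(text[i]));
    std::cout << "\n\n";

    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    std::cout << "wcslen(wtext)   : " << std::wcslen(wtext) << "\n";
    std::cout << "wtext(binary)   :";
    for (std::size_t i = 0; i < std::wcslen(wtext); ++i)
        std::cout << " " << static_cast<unsigned int>(wtext[i]);
    std::cout << "\n";
}
```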
outputs the following text:
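(representative output on a 64-bit Linux box with a UTF-8 locale and a 4-byte wchar_t)

```
sizeof(char)    : 1
text            : olé
strlen(text)    : 4
text(binary)    : 111 108 195 169

sizeof(wchar_t) : 4
wcslen(wtext)   : 3
wtext(binary)   : 111 108 233
```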
You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, std::string is already Unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with Unicode chars, because some combinations of chars are forbidden in UTF-8.

On Windows?
On Windows, this is a bit different. Win32 had to support a lot of applications working with char and on different charsets/codepages produced all over the world, before the advent of Unicode.

So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, which is Unicode encoded on 2-byte characters (or at the very least, the mostly compatible UCS-2, which is almost the same thing IIRC).

Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_ts). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using an API like SetWindowText() (a low-level API function to set the label on a Win32 GUI).
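For illustration, a minimal round-trip sketch using the MultiByteToWideChar/WideCharToMultiByte calls mentioned above (error handling omitted; the helper names Utf8ToWide and WideToUtf8 are mine, not part of the Win32 API):

```cpp
#include <windows.h>
#include <string>

// UTF-8 std::string -> UTF-16 std::wstring.
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}

// UTF-16 std::wstring -> UTF-8 std::string.
std::string WideToUtf8(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                                  static_cast<int>(wide.size()),
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                        static_cast<int>(wide.size()), &utf8[0], len, nullptr, nullptr);
    return utf8;
}
```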
Memory issues?

UTF-32 is 4 bytes per character, so there is not much to add, other than that a UTF-8 text and a UTF-16 text will always use less than or the same amount of memory as a UTF-32 text (and usually less).

If there is a memory issue, then you should know that for most Western languages, UTF-8 text will use less memory than the same UTF-16 one.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same or slightly larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs: Klingon? Elvish?), while UTF-8 will spend from 1 to 4 bytes.

See http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
Conclusion
When should I use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...
(§): unless you use a toolkit/framework saying otherwise
Can std::string hold all the ASCII character set including special characters?

Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!

On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.
Edit (after a comment from Johann Gerell):

A std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

- chars are NOT ASCII.
- A char from 0 to 127 will be held correctly.
- A char from 128 to 255 will have a signification depending on your encoding (Unicode, non-Unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.

Is std::wstring supported by almost all popular C++ compilers?

Mostly, with the exception of GCC-based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.
What exactly is a wide character?

In C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).

There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using <<, and passing a filename to std::fstream.

I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable, and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter, and I'm happy to change it if I have things wrong.
String literals
If you enter string literals that contain only characters that can be represented by your codepage, then VS stores them in your file with a 1-byte-per-character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different codepage, then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different codepage then I'm not sure if the character will change too.
If you enter any string literals that cannot be represented by your codepage then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all non-ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else, the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to codepage-encoded text, and any characters missing from the codepage are replaced with ?.

The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L, making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor, or you need to convert it to UTF-8 and put it in a std::string. Or, if you want, you can use the Windows API functions to encode it using your codepage to put it in a std::string, but then you may as well not have used a wide string literal.
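A sketch of those two routes (std::wstring_convert and codecvt_utf8_utf16 are C++11 features available in VS2015, though deprecated since C++17):

```cpp
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // A wide string literal is the one literal form VS won't mangle via the codepage.
    std::wstring ws = L"olé";

    // Route 1: keep it wide, in a std::wstring (done above).
    // Route 2: convert it to UTF-8 and keep it in a std::string.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    std::string u8 = conv.to_bytes(ws);       // "ol\xC3\xA9"
    std::wstring back = conv.from_bytes(u8);  // and back again

    (void)back;
}
```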
std::cout

When outputting to the console using << you can only use std::string, not std::wstring, and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions, and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).
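For instance, a sketch of that conversion (PrintWide is a hypothetical helper; CP_ACP selects the current locale codepage, and unmappable characters fall back to the default character, usually ?):

```cpp
#include <windows.h>
#include <iostream>
#include <string>

// Print a wide string to std::cout via the locale codepage.
void PrintWide(const std::wstring& ws)
{
    int len = WideCharToMultiByte(CP_ACP, 0, ws.data(), static_cast<int>(ws.size()),
                                  nullptr, 0, nullptr, nullptr);
    std::string narrow(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, ws.data(), static_cast<int>(ws.size()),
                        &narrow[0], len, nullptr, nullptr);
    std::cout << narrow << "\n";
}

int main()
{
    PrintWide(L"olé");   // shows correctly only if 'é' is on the current codepage
}
```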
std::fstream filenames

The Windows OS uses UCS2/UTF-16 for its filenames, so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage, you must use std::wstring. There is no other way. This is a Microsoft-specific extension to std::fstream, so it probably won't compile on other systems. If you use std::string then you can only utilise filenames that include only characters on your codepage.
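A sketch of the extension in use (the wide-filename constructor compiles with MSVC's standard library, not elsewhere):

```cpp
#include <fstream>
#include <string>

int main()
{
    std::wstring name = L"中国.txt";   // a filename most codepages cannot represent

    std::ofstream out(name.c_str());   // MSVC-specific wide-filename overload
    out << "hello\n";                  // the file's contents are still plain chars
}
```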
Your options

If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.

If you are just working on Windows, just use UCS2 std::wstring everywhere. Some purists may say use UTF-8 and then convert when needed, but why bother with the hassle.

If you are cross-platform then it's a mess, to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.

Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux; then code like the sketch below would be fine on either platform, I think.
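(a sketch of that idea; unicodestring and UNI() are the names suggested above, not standard ones)

```cpp
#include <string>

#ifdef _WIN32
typedef std::wstring unicodestring;
#define UNI(text) L##text        // becomes a wide literal L"..." on Windows
#else
typedef std::string unicodestring;
#define UNI(text) text           // stays a plain (UTF-8) literal on Linux
#endif

int main()
{
    unicodestring path = UNI("olé.txt");   // the same source line works on both platforms
    (void)path;
}
```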
Answers
So, to answer your questions:

1) If you are programming for Windows, then all the time; if cross-platform, then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform-specific #ifdefs to work around the differences; if just using Linux, then never.

2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows you can only use it for all Unicode if you choose to manually encode using UTF-8. But the Windows API and standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all of ASCII plus another 128 characters which change depending on the codepage your computer is set up to use.

3) I believe so, but if not then it is just a simple typedef of a std::basic_string using wchar_t instead of char.

4) A wide character is a character type which is bigger than the 1-byte standard char type. On Windows it is 2 bytes; on Linux it is 4 bytes.

So, every reader here should now have a clear understanding of the facts and the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].
My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.

My solution, after in-depth investigation, much frustration and the consequent experience, is the following:
- Accept that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial).
- Use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String).
- Accept that such a UTF8String object is just a dumb but cheap container. Never access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those.)
- Use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String). This is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).
- Use UCS2String instances whenever character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte representation. It is simple, fast, easy.
- Add two utility functions to convert back and forth between UTF-8 and UCS-2, as sketched below.

The conversions are straightforward; google should help here...
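One possible shape for that pair, hand-rolled for the BMP-only (UCS-2) case; surrogate pairs and invalid input are deliberately ignored, so treat it as a sketch:

```cpp
#include <cstdint>
#include <string>

typedef std::string  UTF8String;
typedef std::wstring UCS2String;

// Encode each UCS-2 code point as 1-3 UTF-8 bytes.
UTF8String ToUtf8(const UCS2String& in)
{
    UTF8String out;
    for (wchar_t wc : in) {
        std::uint32_t c = static_cast<std::uint32_t>(wc);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Decode 1-3 byte UTF-8 sequences back to UCS-2 (assumes valid input).
UCS2String FromUtf8(const UTF8String& in)
{
    UCS2String out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b0 = in[i];
        std::uint32_t c;
        if (b0 < 0x80) {
            c = b0;
            i += 1;
        } else if (b0 < 0xE0) {
            c = ((b0 & 0x1Fu) << 6) | (static_cast<unsigned char>(in[i + 1]) & 0x3Fu);
            i += 2;
        } else {
            c = ((b0 & 0x0Fu) << 12)
              | ((static_cast<unsigned char>(in[i + 1]) & 0x3Fu) << 6)
              | (static_cast<unsigned char>(in[i + 2]) & 0x3Fu);
            i += 3;
        }
        out += static_cast<wchar_t>(c);
    }
    return out;
}
```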
That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.
Alternatives & Improvements

- Conversions from and to single-byte character encodings (e.g. ISO-8859-1) can be realized with the help of plain translation tables, e.g. const wchar_t tt_iso88591[256] = {0, 1, 2, ...};, and appropriate code for conversion to and from UCS-2.
- If UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String).
- ICU or other Unicode libraries? For advanced stuff.