Having a variable-length encoding is indirectly forbidden by the standard.
So I have several questions:
How is the following part of the standard handled?
17.3.2.1.3.3 Wide-character sequences
A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.
The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.
Questions:
basic_string<wchar_t>
- How is operator[] implemented and what does it return?
  - standard: "If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined."
- Does size() return the number of elements or the length of the string?
  - standard: "Returns: a count of the number of char-like objects currently in the string." (see the short example after this list)
- How does resize() work? (unrelated to the standard, just what it does)
- How are the positions in insert(), erase() and the others handled?
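For concreteness, here is a tiny illustration of what the quoted wording promises on any conforming implementation; this part is encoding-agnostic, and the open question is only what those wchar_t elements mean on a UTF-16 platform:

```cpp
#include <cassert>
#include <string>

int main() {
    const std::wstring s = L"ab";
    assert(s[0] == L'a');             // pos < size(): indexes the underlying array
    assert(s[s.size()] == wchar_t()); // pos == size() on a const string: yields charT(), i.e. L'\0'
    assert(s.size() == 2);            // "char-like objects" = wchar_t elements, whatever encoding they form
    return 0;
}
```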
cwctype
- Pretty much everything in here. How is the variable-length encoding handled?
cwchar
- getwchar() obviously can't return a whole platform character, so how does this work? (see the sketch below)
- Plus all the rest of the character functions (the theme is the same).
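To make the getwchar() question concrete: if an implementation hands back UTF-16 code units one call at a time (an assumption about the platform, not something the standard promises), the caller would have to pair up surrogates manually, along these lines:

```cpp
#include <cstdio>
#include <cwchar>

int main() {
    wint_t w;
    while ((w = std::getwchar()) != WEOF) {
        unsigned long cp = w;
        if (w >= 0xD800 && w <= 0xDBFF) {            // high surrogate: need one more unit
            wint_t low = std::getwchar();
            if (low >= 0xDC00 && low <= 0xDFFF)      // matching low surrogate
                cp = 0x10000 + ((w - 0xD800) << 10) + (low - 0xDC00);
            else
                std::ungetwc(low, stdin);            // unpaired: push it back
        }
        std::printf("U+%04lX\n", cp);                // one line per decoded code point
    }
    return 0;
}
```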
Edit: I will be opening a bounty to get some confirmation. I want to get some clear answers or at least a clearer distribution of votes.
Edit: This is starting to get pointless. This is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those; UTF-8 encoded text will still be stored as UTF-16 once read into the string, and the same goes for output), and the rest simply contradict each other. :-/
Here's how Microsoft's STL implementation handles the variable-length encoding:
- basic_string<wchar_t>::operator[]() can return a low or a high surrogate, in isolation.
- basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.
- basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.
- basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.
- basic_string<wchar_t>::erase() can erase either half of a surrogate pair.

In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
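A small illustration of those bullet points, assuming a 16-bit wchar_t holding UTF-16 as with MSVC; U+1D11E lies outside the BMP, so it is stored as the surrogate pair 0xD834 0xDD1E:

```cpp
#include <iostream>
#include <string>

int main() {
    std::wstring s;
    s += wchar_t(0xD834);   // high surrogate
    s += wchar_t(0xDD1E);   // low surrogate; together they encode U+1D11E

    std::wcout << s.size() << L"\n";         // 2: size() counts wchar_t objects, not characters
    std::wcout << std::hex
               << unsigned(s[0]) << L" "     // d834: operator[] hands out a lone high surrogate
               << unsigned(s[1]) << L"\n";   // dd1e: and a lone low surrogate

    s.resize(1);   // perfectly legal: truncates in the middle of the pair,
                   // leaving an unpaired high surrogate in the string
    return 0;
}
```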
The STL deals with strings simply as a wrapper for an array of characters, so size() or length() on an STL string tells you how many char or wchar_t elements it contains, not necessarily how many printable characters that represents.
Assuming that you're talking about the wstring type, there would be no handling of the encoding - it just deals with wchar_t elements without knowing anything about the encoding. It's just a sequence of wchar_t's. You'll need to deal with encoding issues using other functionality.

MSVC stores wchar_t in wstrings. These can be interpreted as Unicode 16-bit words, or anything else really. If you want access to Unicode characters or glyphs, you'll have to process said raw string according to the Unicode standard. You probably also want to handle common corner cases without breaking.
Here is a sketch of such a library. It is about half as memory efficient as it could be, but it does give you in-place access to Unicode glyphs in a std::string. It relies on having a decent array_view class, but you want to write one of those anyhow (a condensed sketch follows below).

A smarter bit of code would generate the unicode_chars and unicode_glyphs on the fly with a factory iterator of some kind. A more compact implementation would keep track of the fact that the end pointer of the previous and the begin pointer of the next are always identical, and alias them together. Another optimization would be to use a small-object optimization on glyph, based on the assumption that most glyphs are one character, and use dynamic allocation if they are two. Note that I treat CGJ as a standard diacrit, and the double-diacrits as a set of 3 characters that form one (unicode_glyph), but half-diacrits don't merge things into one glyph. These are all questionable choices.
This was written in a bout of insomnia. Hope it at least somewhat works.
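A much-condensed sketch of the idea follows; it is not the library described above. The array_view, unicode_char and unicode_glyph types here are minimal stand-ins, the combining-mark test only covers U+0300 to U+036F (which happens to include CGJ and the double diacritics), and it walks UTF-16 code units in a std::wstring:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A bare-bones array_view: a non-owning [begin, end) range over contiguous data.
template <class T>
struct array_view {
    T* first;
    T* last;
    T* begin() const { return first; }
    T* end()   const { return last; }
    std::size_t size() const { return static_cast<std::size_t>(last - first); }
};

// One Unicode code point, viewed as 1 or 2 UTF-16 code units.
typedef array_view<const wchar_t> unicode_char;

// One "glyph": a base code point followed by its combining marks.
typedef std::vector<unicode_char> unicode_glyph;

inline bool is_high_surrogate(wchar_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate (wchar_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Deliberately crude: only the main combining-diacritics block counts as
// "attach to the previous base character".
inline bool is_combining(unsigned long cp) { return cp >= 0x0300 && cp <= 0x036F; }

inline unsigned long code_point(unicode_char c) {
    if (c.size() == 2)   // surrogate pair
        return 0x10000 + ((c.first[0] - 0xD800UL) << 10) + (c.first[1] - 0xDC00UL);
    return c.first[0];
}

// Split a UTF-16 string into unicode_chars, pairing surrogates where possible.
inline std::vector<unicode_char> parse_chars(const std::wstring& s) {
    std::vector<unicode_char> out;
    const wchar_t* p = s.data();
    const wchar_t* end = p + s.size();
    while (p != end) {
        std::size_t n =
            (is_high_surrogate(*p) && p + 1 != end && is_low_surrogate(p[1])) ? 2 : 1;
        unicode_char c = { p, p + n };
        out.push_back(c);
        p += n;
    }
    return out;
}

// Group unicode_chars into glyphs: each combining mark joins the glyph before it.
inline std::vector<unicode_glyph> parse_glyphs(const std::wstring& s) {
    std::vector<unicode_glyph> out;
    std::vector<unicode_char> chars = parse_chars(s);
    for (std::size_t i = 0; i < chars.size(); ++i) {
        if (!out.empty() && is_combining(code_point(chars[i])))
            out.back().push_back(chars[i]);
        else
            out.push_back(unicode_glyph(1, chars[i]));
    }
    return out;
}
```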
Two things: