How does Microsoft handle the fact that UTF-16 is a variable-length encoding?

Posted 2020-07-04 05:23

Having a variable length encoding is indirectly forbidden in the standard.

So I have several questions:

How is the following part of the standard handled?

17.3.2.1.3.3 Wide-character sequences

A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.

The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.

Questions:

basic_string<wchar_t>

  • How is operator[] implemented and what does it return?
    • standard: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
  • Does size() return the number of elements or the length of the string?
    • standard: Returns: a count of the number of char-like objects currently in the string.
  • How does resize() work?
    • not about the standard wording; I just want to know what it actually does
  • How are the position in insert(), erase() and others handled?

cwctype

  • Pretty much everything in here. How is the variable-length encoding handled?

cwchar

  • getwchar() obviously can't return a whole platform-character, so how does this work?

Plus all the rest of the character functions (the theme is the same).

Edit: I will be opening a bounty to get some confirmation. I want to get some clear answers or at least a clearer distribution of votes.

Edit: This is starting to get pointless. The thread is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those; UTF-8-encoded text will still be stored as UTF-16 once read into the string, and the same goes for output), and the rest simply contradict each other. :-/

Tags: c++ utf-16
5 Answers
Bombasti
#2 · 2020-07-04 05:39

Here's how Microsoft's STL implementation handles the variable-length encoding:

basic_string<wchar_t>::operator[] can return a low or a high surrogate, in isolation.

basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.

basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.

basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.

basic_string<wchar_t>::erase() can erase either half of a surrogate pair.

In general, the pattern should be clear: the STL does not assume that a std::wstring holds UTF-16, nor does it enforce that it remains valid UTF-16.
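
A small demonstration of the points above (a sketch, assuming a Windows/MSVC build where wchar_t is 16 bits; the emoji U+1F600 is stored as the surrogate pair 0xD83D 0xDE00):

#include <iostream>
#include <string>

int main() {
  // One code point (U+1F600), stored as two UTF-16 code units on Windows.
  std::wstring s = L"\U0001F600";

  std::wcout << s.size() << L"\n";          // prints 2, not 1
  std::wcout << std::hex
             << unsigned(s[0]) << L" "      // 0xd83d: high surrogate, in isolation
             << unsigned(s[1]) << L"\n";    // 0xde00: low surrogate

  s.resize(1);  // happily truncates in the middle of the surrogate pair;
                // the string no longer holds valid UTF-16
}

None of these operations fail or warn; the container simply manipulates wchar_t elements.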

再贱就再见
#3 · 2020-07-04 05:42

The STL treats a string as simply a wrapper around an array of characters, so size() or length() on an STL string tells you how many char or wchar_t elements it contains, not necessarily how many printable characters the string represents.
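
As a rough illustration (a sketch, assuming a 16-bit wchar_t holding UTF-16; the helper name count_code_points is mine), counting actual code points means walking the string yourself rather than calling size():

#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 encoded wstring by skipping
// low surrogates (the second half of each surrogate pair).
std::size_t count_code_points(const std::wstring& s) {
  std::size_t n = 0;
  for (wchar_t c : s) {
    if (c < 0xDC00 || c > 0xDFFF)  // not a low surrogate
      ++n;
  }
  return n;
}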

时光不老,我们不散
#4 · 2020-07-04 05:48

Assuming that you're talking about the wstring type, there is no handling of the encoding at all - it just deals with wchar_t elements without knowing anything about the encoding. It's just a sequence of wchar_t's. You'll need to deal with encoding issues using other functions.
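
For example (a sketch, assuming Windows; the Win32 call MultiByteToWideChar does the actual UTF-8 to UTF-16 conversion, while the wstring is just a passive buffer of wchar_t):

#include <windows.h>
#include <string>

// Converts UTF-8 bytes to a UTF-16 std::wstring via the Win32 API.
// The wstring itself knows nothing about either encoding.
std::wstring utf8_to_wstring(const std::string& utf8) {
  if (utf8.empty()) return std::wstring();
  int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
  std::wstring result(len, L'\0');
  MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &result[0], len);
  return result;
}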

家丑人穷心不美
#5 · 2020-07-04 05:53

MSVC stores wchar_t elements in wstrings. These can be interpreted as 16-bit Unicode code units, or as anything else really.

If you want access to Unicode characters or glyphs, you'll have to process that raw string according to the Unicode standard. You probably also want to handle common corner cases without breaking.

Here is a sketch of such a library. It is about half as memory efficient as it could be, but it does give you in-place access to Unicode glyphs in a std::wstring. It relies on having a decent array_view class, but you'll want to write one of those anyhow:

struct unicode_char : array_view<wchar_t const> {
  using array_view<wchar_t const>::array_view; // inherit array_view's constructors

  // Decode this one- or two-unit sequence into a Unicode code point.
  uint32_t value() const {
    if (size()==1)
      return front();
    Assert(size()==2);
    if (size()==2)
    {
      wchar_t high = front()-0xD800;
      wchar_t low = back()-0xDC00;
      // A surrogate pair encodes a code point at or above U+10000.
      return (uint32_t(high)<<10) + uint32_t(low) + 0x10000;
    }
    return 0; // error
  }
  static bool is_high_surrogate( wchar_t c ) {
    return (c >= 0xD800 && c <= 0xDBFF);
  }
  static bool is_low_surrogate( wchar_t c ) {
    return (c >= 0xDC00 && c <= 0xDFFF);
  }
  static unicode_char extract( array_view<wchar_t const> raw )
  {
    if (raw.empty())
      return {};
    if (raw.size()==1)
      return raw;
    if (is_high_surrogate(raw.front()) && is_low_surrogate(*std::next(raw.begin())))
      return {raw.begin(), raw.begin()+2};
    return {raw.begin(), std::next(raw.begin())};
  }
};
static std::vector<unicode_char> as_unicode_chars( array_view<wchar_t> raw )
{
  std::vector<unicode_char> retval;
  retval.reserve( raw.size() ); // usually 1:1
  while(!raw.empty())
  {
    retval.push_back( unicode_char::extract(raw) );
    Assert( retval.back().size() <= raw.size() );
    raw = {raw.begin() + retval.back().size(), raw.end()};
  }
  return retval;
}
struct unicode_glyph {
  std::array< unicode_char, 3 > buff;
  std::size_t count=0;
  // Hand out raw pointers via data(): std::array::begin() returns an iterator,
  // which is not guaranteed to be a pointer.
  unicode_char const* begin() const {
    return buff.data();
  }
  unicode_char const* end() const {
    return buff.data()+count;
  }
  std::size_t size() const { return count; }
  bool empty() const { return size()==0; }
  unicode_char const& front() const { return *begin(); }
  unicode_char const& back() const { return *std::prev(end()); }
  array_view< unicode_char const > chars() const { return {begin(), end()}; }
  array_view< wchar_t const > wchars() const {
    if (empty()) return {};
    return { front().begin(), back().end() };
  }

  void append( unicode_char next ) {
    Assert(count<3);
    buff[count++] = next;
  }
  unicode_glyph() {}

  static bool is_diacrit(unicode_char c) {
    auto v = c.value();
    return is_diacrit(v);
  }
  static bool is_diacrit(uint32_t v) {
    return
      ((v >= 0x0300) && (v <= 0x0360))
    || ((v >= 0x1AB0) && (v <= 0x1AFF))
    || ((v >= 0x1DC0) && (v <= 0x1DFF))
    || ((v >= 0x20D0) && (v <= 0x20FF))
    || ((v >= 0xFE20) && (v <= 0xFE2F));
  }
  static size_t diacrit_count(unicode_char c) {
    auto v = c.value();
    if (is_diacrit(v))
      return 1 + ((v >= 0x035C)&&(v<=0x0362));
    else
      return 0;
  }
  static unicode_glyph extract( array_view<const unicode_char> raw ) {
    unicode_glyph retval;
    if (raw.empty())
      return retval;
    if (raw.size()==1)
    {
      retval.append(raw.front());
      return retval;
    }
    retval.count = diacrit_count( *std::next(raw.begin()) )+1;
    if (retval.count > raw.size())
      retval.count = raw.size(); // don't copy past the end of raw
    std::copy( raw.begin(), raw.begin()+retval.count, retval.buff.begin() );
    return retval;
  }
};
static std::vector<unicode_glyph> as_unicode_glyphs( array_view<unicode_char> raw )
{
  std::vector<unicode_glyph> retval;
  retval.reserve( raw.size() ); // usually 1:1
  while(!raw.empty())
  {
    retval.push_back( unicode_glyph::extract(raw) );
    Assert( retval.back().size() <= raw.size() );
    raw = {raw.begin() + retval.back().size(), raw.end()};
  }
  return retval;
}
static std::vector<unicode_glyph> as_unicode_glyphs( array_view<wchar_t> raw )
{
  return as_unicode_glyphs( as_unicode_chars( raw ) );
}

A smarter bit of code would generate the unicode_chars and unicode_glyphs on the fly with a factory iterator of some kind. A more compact implementation would keep track of the fact that the end pointer of one element and the begin pointer of the next are always identical, and alias them together. Another optimization would be to use a small-object optimization in the glyph, based on the assumption that most glyphs are one character, and use dynamic allocation when they are more.

Note that I treat the CGJ (combining grapheme joiner) as a standard diacritic, and the double diacritics as a set of 3 characters that form one glyph, but half diacritics don't merge anything into one glyph. These are all questionable choices.

This was written in a bout of insomnia. Hope it at least somewhat works.

祖国的老花朵
#6 · 2020-07-04 05:57

Two things:

  1. There is no "Microsoft STL implementation". The C++ Standard Library shipped with Visual C++ is licensed from Dinkumware.
  2. The current C++ Standard knows nothing about Unicode and its encoding forms. std::wstring is merely a container for wchar_t units, which happen to be 16 bits wide on Windows. In practice, if you want to store a UTF-16 encoded string in a wstring, just keep in mind that you are really storing code units and not code points (see the sketch below).
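
For instance, here is a minimal sketch of what "storing code units" means when pushing a code point above U+FFFF into a wstring (the helper name append_code_point is illustrative, and a 16-bit wchar_t as on Windows is assumed):

#include <cstdint>
#include <string>

// Appends one Unicode code point to a UTF-16 wstring, splitting it into a
// surrogate pair when it lies outside the Basic Multilingual Plane.
void append_code_point(std::wstring& s, std::uint32_t cp) {
  if (cp < 0x10000) {
    s.push_back(static_cast<wchar_t>(cp));                      // one code unit
  } else {
    cp -= 0x10000;
    s.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));     // high surrogate
    s.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));   // low surrogate
  }
}

Calling append_code_point(s, 0x1F600) grows s.size() by two, which is exactly the code-unit versus code-point distinction referred to above.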