Trouble with std::codecvt_utf8 facet

Posted 2020-06-23 06:00

Question:

Here is a code snippet that uses the std::codecvt_utf8<> facet to convert from wchar_t to UTF-8. With Visual Studio 2012, my expectations are not met (see the condition at the end of the code). Are my expectations wrong? If so, why? Or is this a Visual Studio 2012 library issue?

#include <locale>
#include <codecvt>
#include <cstdlib>

int main ()
{
    std::mbstate_t state = std::mbstate_t ();
    std::locale loc (std::locale (), new std::codecvt_utf8<wchar_t>);
    typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;
    codecvt_type const & cvt = std::use_facet<codecvt_type> (loc);

    wchar_t ch = L'\u5FC3';
    wchar_t const * from_first = &ch;
    wchar_t const * from_mid = &ch;
    wchar_t const * from_end = from_first + 1;

    char out_buf[1];
    char * out_first = out_buf;
    char * out_mid = out_buf;
    char * out_end = out_buf + 1;

    std::codecvt_base::result cvt_res
        = cvt.out (state, from_first, from_end, from_mid,
            out_first, out_end, out_mid);

    // This is what I expect:
    if (cvt_res == std::codecvt_base::partial
        && out_mid == out_end
        && state != 0)
        ;
    else
        abort ();
}

The expectation here is that the out() function produces the UTF-8 conversion one byte at a time, but part of the if condition above is false with Visual Studio 2012.

UPDATE

The conditions that fail are out_mid == out_end and state != 0. Basically, I expect at least one byte to be produced, and I expect the state needed to produce the next byte of the UTF-8 sequence to be stored in the state variable.

Answer 1:

The standard's description of the partial return code of codecvt::do_out says exactly this:

In Table 83:

partial: not all source characters converted

In 22.4.1.4.2[locale.codecvt.virtuals]/5:

Returns: An enumeration value, as summarized in Table 83. A return value of partial, if (from_next==from_end), indicates that either the destination sequence has not absorbed all the available destination elements, or that additional source elements are needed before another destination element can be produced.

In your case, not all (in fact, zero) source characters were converted. Technically, that says nothing about the contents of the output sequence, because the condition from_next == from_end in that sentence does not hold. Speaking generally, though, "the destination sequence has not absorbed all the available destination elements" refers to valid multibyte characters: they are the elements of the multibyte character sequence produced by codecvt_utf8.
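As a practical illustration, here is a minimal sketch, assuming the output range only needs room for one complete multibyte character. The helper name to_utf8 is hypothetical (it is not from the question), and sizing the buffer with max_length() is one reasonable choice rather than the only one. Note that <codecvt> is deprecated since C++17, so newer compilers may warn.

```cpp
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;

// Hypothetical helper: converts one wide character to UTF-8 using an
// output buffer that is large enough for a whole multibyte character.
std::string to_utf8(wchar_t ch)
{
    std::locale loc(std::locale(), new std::codecvt_utf8<wchar_t>);
    codecvt_type const& cvt = std::use_facet<codecvt_type>(loc);
    std::mbstate_t state = std::mbstate_t();

    // max_length() bounds the number of bytes one wide character can need.
    std::string out(static_cast<std::size_t>(cvt.max_length()), '\0');

    wchar_t const* from_next = &ch;
    char* to_next = &out[0];
    std::codecvt_base::result res = cvt.out(
        state, &ch, &ch + 1, from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (res != std::codecvt_base::ok)
        return std::string();           // conversion did not complete

    assert(from_next == &ch + 1);       // the whole input was consumed
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

With a buffer of this size, the character from the question (U+5FC3) should come out as a single three-byte sequence instead of a partial result.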

It would be nice to have a more explicit standard wording, but here are two circumstantial pieces of evidence:

One: the old C wide-to-multibyte conversion function std::wcsrtombs (whose locale-specific variants are usually called by existing implementations of codecvt::do_out for system-supplied locales) is defined as follows:

Conversion stops [...] when the next multibyte character would exceed the limit of len total bytes to be stored into the array pointed to by dst.

And two, look at the existing implementations of codecvt_utf8: you've already explored Microsoft's, and here's what's in libc++: codecvt_utf8::do_out calls ucs2_to_utf8 on Windows and ucs4_to_utf8 on other systems, and ucs2_to_utf8 does the following (comments mine):

        else if (wc < 0x0800)
        {
            // not relevant
        }
        else // if (wc <= 0xFFFF)
        {
            if (to_end-to_nxt < 3)
                return codecvt_base::partial; // <- look here
            *to_nxt++ = static_cast<uint8_t>(0xE0 |  (wc >> 12));
            *to_nxt++ = static_cast<uint8_t>(0x80 | ((wc & 0x0FC0) >> 6));
            *to_nxt++ = static_cast<uint8_t>(0x80 |  (wc & 0x003F));
        }

Nothing is written to the output sequence if it cannot hold the whole multibyte character that results from consuming one input wide character.
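To make the all-or-nothing behavior concrete, here is a simplified sketch in the spirit of the fragment above. It is not libc++'s code: the names enc_result, encode_bmp and encode_with_capacity are hypothetical, and it only handles a single code point from the Basic Multilingual Plane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class enc_result { ok, partial, error };

// Hypothetical helper: encodes one BMP code point as UTF-8.
// to_next moves past the written bytes only if the whole sequence fits;
// otherwise nothing is written and to_next stays equal to to.
enc_result encode_bmp(std::uint16_t wc, char* to, char* to_end, char*& to_next)
{
    assert(to <= to_end);
    to_next = to;
    if (wc >= 0xD800 && wc <= 0xDFFF)
        return enc_result::error;       // lone surrogate, not encodable

    std::size_t const need = wc < 0x80 ? 1 : (wc < 0x800 ? 2 : 3);
    if (static_cast<std::size_t>(to_end - to) < need)
        return enc_result::partial;     // no room for the whole character

    if (need == 1) {
        *to_next++ = static_cast<char>(wc);
    } else if (need == 2) {
        *to_next++ = static_cast<char>(0xC0 | (wc >> 6));
        *to_next++ = static_cast<char>(0x80 | (wc & 0x3F));
    } else {
        *to_next++ = static_cast<char>(0xE0 | (wc >> 12));
        *to_next++ = static_cast<char>(0x80 | ((wc & 0x0FC0) >> 6));
        *to_next++ = static_cast<char>(0x80 | (wc & 0x003F));
    }
    return enc_result::ok;
}

// Returns the bytes produced when the output buffer holds `capacity` bytes;
// the result is empty when nothing could be written.
std::string encode_with_capacity(std::uint16_t wc, std::size_t capacity)
{
    std::vector<char> buf(capacity);
    char* next = buf.data();
    encode_bmp(wc, buf.data(), buf.data() + buf.size(), next);
    return std::string(buf.data(), next);
}
```

For U+5FC3, a capacity of 1 or 2 yields no output at all, while a capacity of 3 yields the complete sequence E5 BF 83.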



Answer 2:

Although I have no direct reference for it, I'd think this is the most logical behavior for std::codecvt::out. Consider the following scenario:

  • You use std::codecvt::out in the same manner as you did, translating no characters (possibly without knowing it) into your out_buf.
  • You now want to translate another string into your out_buf (again using std::codecvt::out) such that it is appended to the content already inside.
  • To do so, you decide to use your out_mid, as you know it points directly after the string that you translated in the first step.
  • Now, if std::codecvt::out worked according to your expectations (out_mid pointing to the character after the first), then the first character of your out_buf would never be written, which would not be what you want or expect in this case.

In essence, extern_type*& to_next (the last parameter of std::codecvt::out) serves as a reference to where you left off, so you know where to continue. In your case that is indeed the same position as where you started (the extern_type* to parameter).
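The scenario above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: append_utf8 and join_utf8 are hypothetical helper names, the buffer is sized generously with max_length(), and <codecvt> is deprecated since C++17, so newer compilers may warn.

```cpp
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;

// Hypothetical helper: converts text into [to, to_end) and returns to_next,
// the position just past the last byte that was written.
char* append_utf8(codecvt_type const& cvt, std::wstring const& text,
                  char* to, char* to_end)
{
    std::mbstate_t state = std::mbstate_t();
    wchar_t const* from_next = text.data();
    char* to_next = to;
    cvt.out(state, text.data(), text.data() + text.size(), from_next,
            to, to_end, to_next);
    assert(to <= to_next && to_next <= to_end);
    return to_next;
}

// Converts two wide strings into one buffer; the second call continues
// exactly where the first one stopped.
std::string join_utf8(std::wstring const& first, std::wstring const& second)
{
    std::locale loc(std::locale(), new std::codecvt_utf8<wchar_t>);
    codecvt_type const& cvt = std::use_facet<codecvt_type>(loc);

    std::size_t const per_char = static_cast<std::size_t>(cvt.max_length());
    std::string buf((first.size() + second.size()) * per_char, '\0');
    char* const begin = &buf[0];
    char* const end = begin + buf.size();

    char* mid = append_utf8(cvt, first, begin, end);
    mid = append_utf8(cvt, second, mid, end);   // resume at to_next

    buf.resize(static_cast<std::size_t>(mid - begin));
    return buf;
}
```

If the first call had written nothing, mid would still equal begin, and the second string would correctly start at the beginning of the buffer.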

  • cppreference.com on std::codecvt::out

  • cplusplus.com on std::codecvt::out