Here is a code snippet that uses the std::codecvt_utf8<> facet to convert from wchar_t to UTF-8. With Visual Studio 2012, my expectations are not met (see the condition at the end of the code). Are my expectations wrong? Why? Or is this a Visual Studio 2012 library issue?
#include <locale>
#include <codecvt>
#include <cstdlib>

int main ()
{
    std::mbstate_t state = std::mbstate_t ();
    std::locale loc (std::locale (), new std::codecvt_utf8<wchar_t>);
    typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;
    codecvt_type const & cvt = std::use_facet<codecvt_type> (loc);

    wchar_t ch = L'\u5FC3';
    wchar_t const * from_first = &ch;
    wchar_t const * from_mid = &ch;
    wchar_t const * from_end = from_first + 1;

    char out_buf[1];
    char * out_first = out_buf;
    char * out_mid = out_buf;
    char * out_end = out_buf + 1;

    std::codecvt_base::result cvt_res
        = cvt.out (state, from_first, from_end, from_mid,
                   out_first, out_end, out_mid);

    // This is what I expect:
    if (cvt_res == std::codecvt_base::partial
        && out_mid == out_end
        && state != 0)
        ;
    else
        abort ();
}
The expectation here is that the out() function outputs one byte of the UTF-8 conversion at a time, but the middle of the if conditional above is false with Visual Studio 2012.
UPDATE
What fail are the out_mid == out_end and state != 0 conditions. Basically, I expect at least one byte to be produced, and the state needed to produce the next byte of the UTF-8 sequence to be stored in the state variable.
The standard's description of the partial return code of codecvt::do_out says exactly this:
In Table 83:
partial: not all source characters converted
In 22.4.1.4.2 [locale.codecvt.virtuals]/5:
Returns: An enumeration value, as summarized in Table 83. A return value of partial, if (from_next==from_end), indicates that either the destination sequence has not absorbed all the available destination elements, or that additional source elements are needed before another destination element can be produced.
In your case, not all (in fact, zero) source characters were converted, which technically says nothing about the contents of the output sequence (the 'if' clause in the sentence does not apply, because from_next != from_end). Speaking generally, though, "the destination sequence has not absorbed all the available destination elements" refers to valid multibyte characters: they are the elements of the multibyte character sequence produced by codecvt_utf8.
It would be nice to have more explicit wording in the standard, but here are two pieces of circumstantial evidence:
One: C's older wide-to-multibyte conversion function std::wcsrtombs (whose locale-specific variants are usually called by existing implementations of codecvt::do_out for system-supplied locales) is defined as follows:
Conversion stops [...] when the next multibyte character would exceed the limit of len total bytes to be stored into the array pointed to by dst.
And two, look at the existing implementations of codecvt_utf8: you've already explored Microsoft's, and here's what's in libc++: codecvt_utf8::do_out calls ucs2_to_utf8 on Windows and ucs4_to_utf8 on other systems, and ucs2_to_utf8 does the following (comments mine):
else if (wc < 0x0800)
{
    // not relevant
}
else // if (wc <= 0xFFFF)
{
    if (to_end-to_nxt < 3)
        return codecvt_base::partial; // <- look here
    *to_nxt++ = static_cast<uint8_t>(0xE0 | (wc >> 12));
    *to_nxt++ = static_cast<uint8_t>(0x80 | ((wc & 0x0FC0) >> 6));
    *to_nxt++ = static_cast<uint8_t>(0x80 | (wc & 0x003F));
}
Nothing is written to the output sequence if it cannot hold the entire multibyte character that results from consuming one input wide character.
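To make the all-or-nothing behavior concrete, here is a minimal standalone sketch. It is not libc++'s code: the function name encode_bmp_char and the encode_result enum are hypothetical, invented for illustration, and the function handles only code points up to U+FFFF. It checks the remaining space first and writes either the complete UTF-8 sequence or nothing at all.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch (mirrors the idea of codecvt_base::result).
enum class encode_result { ok, partial, error };

// Hypothetical all-or-nothing encoder for one code point in U+0000..U+FFFF.
// On success, writes the full UTF-8 sequence and sets to_next past it.
// If the sequence does not fit, writes nothing and leaves to_next == to.
encode_result encode_bmp_char(std::uint16_t wc,
                              unsigned char* to, unsigned char* to_end,
                              unsigned char*& to_next)
{
    assert(to <= to_end);
    to_next = to;

    // Surrogate code units are not valid scalar values on their own.
    if (wc >= 0xD800 && wc <= 0xDFFF)
        return encode_result::error;

    const int needed = wc < 0x0080 ? 1 : (wc < 0x0800 ? 2 : 3);
    if (to_end - to < needed)
        return encode_result::partial;   // nothing has been written

    if (needed == 1) {
        *to_next++ = static_cast<unsigned char>(wc);
    } else if (needed == 2) {
        *to_next++ = static_cast<unsigned char>(0xC0 | (wc >> 6));
        *to_next++ = static_cast<unsigned char>(0x80 | (wc & 0x003F));
    } else {
        *to_next++ = static_cast<unsigned char>(0xE0 | (wc >> 12));
        *to_next++ = static_cast<unsigned char>(0x80 | ((wc & 0x0FC0) >> 6));
        *to_next++ = static_cast<unsigned char>(0x80 | (wc & 0x003F));
    }
    return encode_result::ok;
}
```

Called with U+5FC3 and a one-byte buffer, this sketch returns partial and leaves to_next at the start of the buffer, which matches the behavior observed in the question. With three bytes available it writes E5 BF 83.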
Although I can't point to a direct reference for it, I think this is the most logical behavior for std::codecvt::out. Consider the following scenario:
- You use std::codecvt::out in the same manner as you did, without translating any characters (possibly without knowing it) into your out_buf.
- You now want to translate another string into your out_buf (again using std::codecvt::out) so that it is appended to the content already inside.
- To do so, you decide to use your out_mid, since you know it points directly after the string that you translated in the first step.
- Now, if std::codecvt::out worked according to your expectations (out_mid pointing to the character after the first), then the first character of your out_buf would never be written, which is not what you would want or expect in this case.
In essence, extern_type*& to_next (the last parameter of std::codecvt::out) is there for you as a reference to where you left off, so that you know where to continue. In your case, that is indeed the same position as where you started (the extern_type* to parameter).