why does mbstowcs return “invalid multibyte charac

"קמ"ד חיר!" is the input string copy pasted from a print of the variable in gdb. Calling mbstowcs returns -1 with the other input as NULL. Any ideas on what's wrong/how to fix this?

"\327\247\327\236"\327\223 \327\227\327\231\327\250!\000\000\000" is the string with non ascii characters in octal

The program's locale is C.

Answer 1:

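The mbstowcs function interprets its input according to the LC_CTYPE category of the current locale, so it only decodes UTF-8 when a UTF-8 locale is in effect. In the "C" locale, bytes outside the ASCII range are typically rejected as invalid. I'll use Windows examples, but the general idea is the same on most operating systems.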

In the multi-byte character set world there may not be one meaning for a given octal value and there may not be one octal value for any given character. What a particular octal value means and how a character is represented (or even if it can be represented) is determined by locale.

When mbstowcs returns an error, (size_t)-1, it is telling you that it hit a byte sequence that is not a valid multibyte character in the current locale. That might mean there is no Unicode character (unlikely but not impossible) or it might mean that the locale does not define a character for a given byte value (or sequence of byte values in the case of multi-byte characters).

If you don't explicitly set your locale (by calling setlocale), the program runs in the "C" locale; calling setlocale(LC_ALL, "") switches to the locale from your system configuration. To retrieve your current locale you can call setlocale(LC_ALL, NULL) or, on Windows, _get_current_locale. Once you know your locale, you can figure out what character (if any) a particular byte value represents and then what the Unicode equivalent would be (if any).
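To make the locale dependence concrete, here is a minimal sketch that reports the active locale, asks mbstowcs how many wide characters a string would produce, and tries to switch to a UTF-8 locale. The helper names are made up for illustration, and the locale names in `use_utf8_locale` are assumptions: which of them exist, if any, depends on the platform and on what is installed.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Name of the locale currently in effect (hypothetical helper).
std::string current_locale_name()
{
    const char* name = std::setlocale(LC_ALL, nullptr);  // query only, changes nothing
    return name ? name : "";
}

// Number of wide characters s converts to in the current locale, or -1 if
// s contains a byte sequence that locale cannot decode.
// A null destination is a POSIX/Microsoft extension, not guaranteed by ISO C.
long wide_length(const char* s)
{
    const std::size_t n = std::mbstowcs(nullptr, s, 0);
    return n == static_cast<std::size_t>(-1) ? -1 : static_cast<long>(n);
}

// Tries a few UTF-8 locale names (assumed, not guaranteed to exist) and
// reports whether one of them could be selected for character handling.
bool use_utf8_locale()
{
    const char* candidates[] = {"C.UTF-8", "en_US.UTF-8", ".UTF-8"};
    for (const char* name : candidates)
        if (std::setlocale(LC_CTYPE, name))
            return true;
    return false;
}
```

Calling `wide_length` on the string from the question before and after a successful `use_utf8_locale()` shows whether the failure comes from the locale rather than from the data.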

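One way to identify a problem character is to vary the length passed in to mbstowcs until you find a single character that causes the error. A brute force approach is to start at length=1 and increase it until mbstowcs returns -1. Note that the length is counted in wide characters and only applies when a destination buffer is supplied, so this needs a real buffer rather than NULL.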
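The brute-force search could be sketched roughly like this. The function name is hypothetical, and the result depends entirely on whichever locale is active when it is called.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Returns the zero-based index (counted in characters, not bytes) of the
// first character mbstowcs cannot convert in the current locale, or -1 if
// the whole string converts cleanly.
long first_bad_character(const char* s)
{
    // A string never holds more characters than bytes, so strlen is a safe bound.
    const std::size_t max_chars = std::strlen(s);
    std::vector<wchar_t> buffer(max_chars + 1);

    for (std::size_t n = 1; n <= max_chars; ++n) {
        // Convert at most n wide characters from the start of s.
        const std::size_t converted = std::mbstowcs(buffer.data(), s, n);
        if (converted == static_cast<std::size_t>(-1))
            return static_cast<long>(n - 1);  // character n-1 is the first failure
        if (converted < n)
            return -1;  // reached the terminating NUL without any error
    }
    return -1;
}
```

Each pass restarts the conversion from the beginning, which is quadratic but fine for a one-off debugging aid.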

Update July 25th

From the discussion in the comments we discovered that the input string is (most likely) encoded as UTF-8. While the original answer is correct (so far as it goes) it doesn't go far enough. On Windows, the C runtime has historically offered no locale that handles characters encoded in UTF-8 (newer Universal CRT versions accept a ".UTF-8" locale name, but older runtimes do not).

When faced with UTF-8, instead of calling mbstowcs, we can call MultiByteToWideChar with the code page CP_UTF8, but that code will only work on Windows:

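#include <windows.h>

int main ()
{
    // UTF-8 encoded input, including the terminating NUL
    BYTE bytes [] = {0xD7,0x99,0xD7,0x95,0xD7,0x97,0xD7,0x90,0xD7,0x99,0x20,0xD7,0x95,0xD7,0x9B,0xD7,0x98,0xD7,0xA8, 0x00};

    // get length of converted string in characters
    // (sizeof (bytes) covers the NUL, so the count includes it)
    int result = MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS, (const char *)bytes,
        sizeof (bytes), NULL, 0);
    if (result == 0)
        return 1; // invalid UTF-8 or another failure; GetLastError () gives the reason

    wchar_t * name = new wchar_t [result];

    // convert string
    result = MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS, (const char *)bytes,
        sizeof (bytes), name, result);

    delete [] name;
    return result == 0; // exit code 0 on success, 1 on failure
}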
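
Where MultiByteToWideChar is not available, one portable option is to decode the UTF-8 bytes by hand. The following is only a sketch under stated assumptions: `decode_utf8` is a hypothetical helper, not a library function, and it produces 32-bit code points rather than wchar_t so that it does not depend on the platform's wchar_t width.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into Unicode code points. Returns false at the first
// malformed sequence; out then holds the characters decoded up to that point.
bool decode_utf8(const std::string& in, std::u32string& out)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t min_value[] = {0x0, 0x80, 0x800, 0x10000};

    out.clear();
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;              // stray continuation or invalid lead byte

        if (in.size() - i <= extra)
            return false;               // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                return false;           // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_value[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;               // overlong form, out of range, or surrogate
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

On platforms where wchar_t is 32 bits wide, each code point can be copied straight into a wide string; where it is 16 bits, as on Windows, code points above U+FFFF would still need to be split into surrogate pairs.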

Answer 2:

I bet it will work if you set UTF-8 like so: