Create UTF-16 string from char*

2019-03-02 16:41发布

问题:

So I have standard C string:

char* name = "Jakub";

And I want to convert it to UTF-16. I figured out, that UTF-16 will be twice as long - one character takes two chars.
So I create another string:

char name_utf_16[10];  //"Jakub" is 5 characters

Now, I believe with ASCII characters I will only use lower bytes, so for all of them it will be like 74 00 for J and so on. With that belief, I can make such code:

void charToUtf16(char* input, char* output, int length) {
    /*Todo: how to check if output is long enough?*/
    for(int i=0; i<length; i+=2)  //Step over 2 bytes
    {
        //Lets use little-endian - smallest bytes first
        output[i] = input[i];
        output[i+1] = 0;  //We will never have any data for this field
    }
}

But, with this process, I ended with "Jkb". I know no way to test this properly - I've just sent the string to Minecraft Bukkit Server. And this is what it said upon disconnecting:

13:34:19 [INFO] Disconnecting jkb?? [/127.0.0.1:53215]: Outdated server!

Note: I'm aware that Minecraft uses big-endian. Code above is just an example, in fact, I have my conversion implemented in class.

回答1:

output[i] = input[i];

This will assign every other byte of the input, because you increment i by 2. So no wonder that you obtain "Jkb". You probably wanted to write:

output[i] = input[i / 2];


回答2:

Why do you want to make your own Unicode conversion functionality when theres's existing C/C++ functions for this, like mbstowcs() which is included in <cstdlib>.

If you still want to make your own stuff, then have a look at Unicode Consortium's open source code which can be found here:

Convert UTF-16 to UTF-8 under Windows and Linux, in C



回答3:

Before I answer your question, consider this:

This area of programming is full of man traps. It makes a lot of sense to understand the differences between ASCII, UTF7/8 and ANSI/'MultiByte Character Strings (MBCS)', all of which to an english speaking programmer will look and feel identical, but need very different handling if they are introduced to a european or asian user.

ASCII: Characters are in range 32-127. only ever one byte. The clue is in the name, they are great for Americans, but not fit for purpose in the rest of the world.

ANSI/MBCS: This is the reason for 'code pages'. Characters 32-127 are the same as ASCII, but it is possible to have characters in the range of 128-255 as well for additional characters, and some of the 128-255 range can be used as a flag to mark that the character continues into a second, third or even fourth byte. To process the string correctly, you need both the string bytes and the correct code page. If you try processing the string using the wrong code page you will not have the right characters, and misinterpret whether a character is a one, two or even 4 byte character.

UTF7/8: These are 8-bit wide formatting of 21-bit unicode character points. in UTF-7 and UTF-8 unicode characters can be between one and four bytes long. The advantage that UTF encodings have over ANSI/MBCS is that there is no ambiguity caused by code pages. Each glyph in every script has a unique unicode code point, which means it is not possible to mangle the character sets by interpreting the data on a different computer with different regional settings.

So to to start to answer your question:

  1. Whilst you are making the assumption that your char* will only point to an ASCII string, that is a really dangerous choice to make, users are in control of data that is typed in, not the programmer. Windows programs will be storing this as MBCS by default.

  2. You are making the second assumption is that a UTF-16 encoding will be twice the size of an 8 bit encoding. That is not generally a safe assumption. depending on the source encoding the UTF-16 encoding may be twice the size, may be less than twice the size, and in an extreme example may actually be shorter in length.

So, what is the safe solution?

The safe option is to implement your application internally as unicode. On windows, this is a compiler option, and then means your windows controls all use wchar_t* strings for their data type. On linux I'm less sure that you can always use unicide graphics and OS libraries. You must also use the wcslen() functions to get the length of strings etc. When you interact with the outside world, be precise in the character encodings used.

To answer to your question then becomes changing the question to, what do i do when I receive non UTF-16 data?

Firstly, be very clear about what assumptions on its formatting are you making? and secondly, accept the fact that sometimes the conversion to UTF-16 may fail.

If you are clear on the source formatting, you can then choose the appropriate win32 or the stl converter to convert the format, and you should then look for evidence the conversion failed before using the result. e.g. mbstowcs in or MultiByteToWideChar() on windows. However the use of both of these approaches safely means you need to understand ALL of the above answer.

All other options introduce risk. Use mbcs strings and you will have data strings mangled by being entered using one code page, and processed using a different code page. Assume ASCII data, and when you encounter a non ascii character your code will break, and you will 'blame' the user for your short comings.