char vs wchar_t

Posted 2020-02-26 09:55

Question:

I'm trying to print out a wchar_t* string. Code goes below:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

char *ascii_ = "中日友好";  //line-1
wchar_t *wchar_ = L"中日友好";  //line-2

int main()
{
    printf("ascii_: %s\n", ascii_);  //line-3
    wprintf(L"wchar_: %s\n", wchar_);  //line-4
    return 0;
}

//Output
ascii_: 中日友好

Questions:

  1. Apparently I should not assign CJK characters to a char* pointer in line-1, but I did it anyway, and the output of line-3 is correct. Why? How can printf() in line-3 print the non-ASCII characters? Does it somehow know the encoding?

  2. I assume the code in line-2 and line-4 is correct, so why didn't I get any output from line-4?

Answer 1:

First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is an ASCII-compatible multibyte encoding, so the bytes pass through a char* string and printf() unchanged.

Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream is either byte-oriented or wide-oriented. The orientation is set by the first I/O operation on the stream (here byte-oriented, because of the printf()), and once set it cannot be changed short of reopening the stream with freopen(). After that, the wprintf() will not work because the orientation is wrong.

In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().

You cannot intermix printf() and wprintf() on the same stream (except on Windows).

EDIT:

To answer the question about why the wprintf() line doesn't work even by itself: it's probably because the code is being compiled so that the UTF-8 form of 中日友好 is stored into wchar_. However, wchar_t needs a fixed-width Unicode encoding, typically 4 bytes per character (2 bytes on Windows).

So there are two options that I can think of:

  1. Don't bother with wchar_t, and just stick with multibyte chars. This is the easy way, but it may break if the user's system is not set to a Chinese locale with a matching encoding.
  2. Use wchar_t, but encode the Chinese characters using Unicode escape sequences. This obviously makes the source code unreadable, but it will work on any machine that can display Chinese fonts, regardless of the locale.


Answer 2:

Line 1 is not ASCII; it's whatever multibyte encoding your compiler uses at compile time. On modern systems that's probably UTF-8. printf() does not know the encoding. It just sends bytes to stdout, and as long as the encodings match, everything is fine.

One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix byte-oriented and wide-character I/O on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.



Answer 3:

You are omitting one step, and that leads you to the wrong conclusion.

You have a C file on disk, containing bytes. It holds an "ASCII" string and a wide string.

The ASCII string takes the bytes exactly as they are in line 1 and outputs them. This works as long as the encoding on the user's side is the same as the one on the programmer's side.

The wide string is first decoded from the given bytes into Unicode code points, which are stored in the program. Maybe this step goes wrong on your side. On output, they are encoded again according to the encoding on the user's side. This ensures that the characters are emitted as they are intended, not as they were entered.

Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.