Reading UTF-8 from stdin using fgets() on Windows

Posted 2019-07-18 13:33

Question:

I'm trying to read a UTF-8 string from stdin using fgets(). The console input code page has been set to CP_UTF8 beforehand, and I've set the console font to Lucida Console in PowerShell. Finally, I've verified that UTF-8 output works by printing a German Ä (in UTF-8: 0xC3,0x84) to the console using printf(). That works correctly, but fgets() doesn't seem to be able to read UTF-8 from the console. Here is a small test program:

#include <stdio.h>
#include <string.h>
#include <windows.h>

int main(int argc, char *argv[])
{
    unsigned char s[64];

    memset(s, 0, 64);

    SetConsoleOutputCP(CP_UTF8);    
    SetConsoleCP(CP_UTF8);

    printf("UTF-8 Test: %c%c\n", 0xc3, 0x84);  // print Ä

    fgets((char *)s, 64, stdin);

    printf("Result: %d %d\n", s[0], s[1]);

    return 0;
}

When running this program and entering "Ä" and then hitting ENTER, it just prints the following:

Result: 0 0

i.e. nothing has been written to s. When typing "A", however, I get the following correct result:

Result: 65 10

So how can I make fgets() work with UTF-8 characters on Windows please?

EDIT

Based on Barmak's explanations, I've now updated my code to use wchar_t functions instead of the ANSI ones. However, it still doesn't work. Here is my code:

#include <stdio.h>
#include <string.h>
#include <io.h>
#include <fcntl.h>

#include <windows.h>

int main(int argc, char *argv[])
{
    wchar_t s[64];

    memset(s, 0, 64 * sizeof(wchar_t));

    _setmode(_fileno(stdin), _O_U16TEXT);       
    fgetws(s, 64, stdin);

    wprintf(L"Result: %d\n", s[0]);

    return 0;
}   

When entering A, the program prints Result: 3393, but I'd expect it to be 65. When entering Ä, the program prints Result: 0, but I'd expect it to be 196. What the heck is going on there? Why isn't it even working for ASCII characters now? My old program using just fgets() worked correctly for ASCII characters like A and only failed for non-ASCII characters like Ä. But the new version doesn't even work for ASCII characters. Or is 3393 the correct result for A? I'm pretty confused now... help please!

Answer 1:

Almost all native Windows string handling (with very rare exceptions) is done in UNICODE (UTF-16), so we should use the Unicode functions everywhere; using the ANSI variants is very bad practice. If you use the Unicode functions in your example, everything will work correctly. With ANSI it doesn't work because of a Windows bug! I can cover this in full detail (researched on Windows 8.1):

1) In the console server process there exist 2 global variables:

UINT gInputCodePage, gOutputCodePage;

They can be read and written via GetConsoleCP/SetConsoleCP and GetConsoleOutputCP/SetConsoleOutputCP, and they are used as the first argument to WideCharToMultiByte/MultiByteToWideChar whenever a conversion is needed. If you use only Unicode functions, they are never used.
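
For illustration, a minimal sketch of how a client program can query and change these two code pages through the documented Win32 calls (nothing here is specific to the bug, it just shows where gInputCodePage/gOutputCodePage come from):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    // read the input/output code pages currently stored by the console server
    UINT inCP  = GetConsoleCP();
    UINT outCP = GetConsoleOutputCP();
    printf("input CP: %u, output CP: %u\n", inCP, outCP);

    // switch both to UTF-8 (65001); the server keeps these values in
    // gInputCodePage/gOutputCodePage and consults them for ANSI I/O only
    if (!SetConsoleCP(CP_UTF8) || !SetConsoleOutputCP(CP_UTF8))
        fprintf(stderr, "failed to set console code pages\n");

    return 0;
}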

2.a) When you write UNICODE text to the console, it is written as is, without any conversion. On the server side this is done in the SB_DoSrvWriteConsole function.

2.b) When you write ANSI text to the console, SB_DoSrvWriteConsole is also called, but with one additional step: MultiByteToWideChar(gOutputCodePage, ...) converts your text to UNICODE first. Note one detail here: in that MultiByteToWideChar call, cchWideChar == cbMultiByte. If we use only the 'English' character set (characters < 0x80), the UNICODE and multibyte strings always have the same length in characters; with other languages the multibyte version usually needs more characters than the UNICODE one. On the output path this is not a problem, the output buffer is simply larger than necessary, so your printf will generally work correctly. One note only: if you hardcode a multibyte string in the source code, it will most likely be in CP_ACP form, and converting it to UNICODE with CP_UTF8 gives an incorrect result. So this depends on the encoding your source file is saved with on disk :)
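
A minimal sketch of the two output paths: the same Ä written once through the UNICODE path (WriteConsoleW, no conversion) and once through the ANSI path (WriteConsoleA, converted on the server side using the output code page):

#include <windows.h>

int main(void)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written;

    // UNICODE path: the UTF-16 string reaches SB_DoSrvWriteConsole unchanged
    const wchar_t wide[] = L"\x00C4\n";     // Ä
    WriteConsoleW(out, wide, 2, &written, NULL);

    // ANSI path: these UTF-8 bytes are first converted with
    // MultiByteToWideChar(gOutputCodePage, ...), so the output
    // code page must be CP_UTF8 for them to come out right
    SetConsoleOutputCP(CP_UTF8);
    const char utf8[] = "\xC3\x84\n";       // Ä in UTF-8
    WriteConsoleA(out, utf8, 3, &written, NULL);

    return 0;
}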

3.a) When you read from the console with UNICODE functions, you get the UNICODE text exactly as is; there is no problem here. If needed, you can then convert it to multibyte yourself.
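
A minimal sketch of that approach: read the line as UTF-16 with ReadConsoleW and convert it to UTF-8 yourself, asking WideCharToMultiByte for the required buffer size first:

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

int main(void)
{
    wchar_t ws[64];
    DWORD read = 0;

    // read the line as UTF-16 directly; no code-page conversion happens
    if (!ReadConsoleW(GetStdHandle(STD_INPUT_HANDLE), ws, 63, &read, NULL))
        return 1;
    ws[read] = L'\0';

    // convert to UTF-8 ourselves, with a correctly sized buffer
    int len = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
    if (!len)
        return 1;
    char *utf8 = malloc(len);
    WideCharToMultiByte(CP_UTF8, 0, ws, -1, utf8, len, NULL, NULL);

    printf("Result: %u %u\n", (unsigned char)utf8[0], (unsigned char)utf8[1]);
    free(utf8);
    return 0;
}

Typing Ä followed by ENTER should then print Result: 195 132, i.e. the 0xC3,0x84 bytes the question expects.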

3.b) When you read from the console with ANSI functions, the server first converts the UNICODE string to ANSI and then returns the ANSI form to you. This is done by the following function:

int ConvertToOem(UINT CodePage /*=gInputCodePage*/, PCWSTR lpWideCharStr, int cchWideChar, PSTR lpMultiByteStr, int cbMultiByte)
{
    if (CodePage == g_OEMCP)
    {
        ULONG BytesInOemString;
        return 0 > RtlUnicodeToOemN(lpMultiByteStr, cbMultiByte, &BytesInOemString, lpWideCharStr, cchWideChar * sizeof(WCHAR)) ? 0 : BytesInOemString;
    }
    return WideCharToMultiByte(CodePage, 0, lpWideCharStr, cchWideChar, lpMultiByteStr, cbMultiByte, 0, 0);
}

But let's look more closely at how ConvertToOem is called: here again cbMultiByte == cchWideChar, and this is a 100% bug! The multibyte string can be longer than the UNICODE one (in characters, of course). For example, "Ä" is 1 UNICODE character but 2 UTF-8 bytes. As a result, WideCharToMultiByte returns 0 (ERROR_INSUFFICIENT_BUFFER).
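
This is easy to reproduce in isolation: size the output buffer to the number of UTF-16 characters, as the console server does, and the conversion of "Ä" to UTF-8 fails. Something like:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    const wchar_t wide[] = L"\x00C4";   // "Ä": 1 UTF-16 character
    char out[8] = {0};

    // mimic the buggy call: cbMultiByte == cchWideChar == 1,
    // but "Ä" needs 2 bytes in UTF-8
    int n = WideCharToMultiByte(CP_UTF8, 0, wide, 1, out, 1, NULL, NULL);

    printf("converted: %d, last error: %lu\n", n, GetLastError());
    // should print 0 and 122 (ERROR_INSUFFICIENT_BUFFER)
    return 0;
}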



Answer 2:

Windows uses UTF-16. Most likely Windows' console doesn't support UTF-8.

Use UTF-16 along with the wide-string functions (wcsxxx instead of strxxx). You can then use WideCharToMultiByte to convert UTF-16 to UTF-8. Example:

#include <stdio.h>
#include <string.h>
#include <io.h> //for _setmode
#include <fcntl.h> //for _O_U16TEXT

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    _setmode(_fileno(stdin), _O_U16TEXT);
    wchar_t s[64];
    fgetws(s, 64, stdin);
    _putws(s);
    return 0;
}

Note that you can't use ANSI print functions after calling _setmode(_fileno(stdout), _O_U16TEXT); the mode has to be reset first. You may try something like the functions below, which restore the previous text mode after each call.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>   //for malloc/free
#include <io.h>       //for _setmode
#include <fcntl.h>    //for _O_U16TEXT
#include <windows.h>  //for WideCharToMultiByte/MultiByteToWideChar

char* mygets(int wlen)
{
    //may require fflush here, see _setmode documentation
    int save = _setmode(_fileno(stdin), _O_U16TEXT);
    wchar_t *wstr = malloc(wlen * sizeof(wchar_t));
    fgetws(wstr, wlen, stdin);
    _setmode(_fileno(stdin), save);   //restore the previous text mode

    //make UTF-8:
    int len = WideCharToMultiByte(CP_UTF8, 0, wstr, -1, 0, 0, 0, 0);
    if (!len) { free(wstr); return NULL; }
    char* str = malloc(len);
    WideCharToMultiByte(CP_UTF8, 0, wstr, -1, str, len, 0, 0);
    free(wstr);

    return str;
}

void myputs(const char* str)
{
    //make UTF-16
    int wlen = MultiByteToWideChar(CP_UTF8, 0, str, -1, 0, 0);
    if (!wlen) return;
    wchar_t* wstr = malloc(wlen * sizeof(wchar_t));
    memset(wstr, 0, wlen * sizeof(wchar_t));
    MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wlen);

    //may require fflush here, see _setmode documentation
    int save = _setmode(_fileno(stdout), _O_U16TEXT);
    _putws(wstr);
    _setmode(_fileno(stdout), save);  //restore the previous text mode
    free(wstr);
}

int main()
{
    char* utf8 = mygets(100);
    if (utf8)
    {
        myputs(utf8);
        free(utf8);
    }
    return 0;
}