I'm trying to read a UTF-8 string from stdin using fgets(). The console input code page has been set to CP_UTF8 beforehand, and I've set the console font to Lucida Console in PowerShell. Finally, I've verified that UTF-8 output is working by printing a German Ä (in UTF-8: 0xC3, 0x84) to the console using printf(). This works correctly, but fgets() doesn't seem to be able to read UTF-8 from the console. Here is a small test program:
#include <stdio.h>
#include <string.h>
#include <windows.h>

int main(int argc, char *argv[])
{
    unsigned char s[64];
    memset(s, 0, sizeof(s));
    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);
    printf("UTF-8 Test: %c%c\n", 0xc3, 0x84); // print Ä
    fgets((char *)s, sizeof(s), stdin);
    printf("Result: %d %d\n", s[0], s[1]);
    return 0;
}
When running this program and entering "Ä" followed by ENTER, it just prints the following:

Result: 0 0

i.e. nothing has been written to s. When typing "A", however, I get the correct result:

Result: 65 10

So how can I make fgets() work with UTF-8 characters on Windows?
EDIT

Based on Barmak's explanations, I've now updated my code to use the wchar_t functions instead of the ANSI ones. However, it still doesn't work. Here is my code:
#include <stdio.h>
#include <string.h>
#include <io.h>
#include <fcntl.h>
#include <windows.h>

int main(int argc, char *argv[])
{
    wchar_t s[64];
    memset(s, 0, sizeof(s));
    _setmode(_fileno(stdin), _O_U16TEXT);
    fgetws(s, 64, stdin);
    wprintf(L"Result: %d\n", s[0]);
    return 0;
}
When entering A, the program prints Result: 3393, but I'd expect it to be 65. When entering Ä, the program prints Result: 0, but I'd expect it to be 196. What is going on here? Why isn't it working even for ASCII characters now? My old program using plain fgets() worked correctly for ASCII characters like A; it only failed for non-ASCII characters like Ä. But the new version doesn't even work for ASCII characters. Or is 3393 the correct result for A? I'd expect it to be 65. I'm pretty confused now... help please!
Windows uses UTF-16, and most likely Windows' console doesn't support UTF-8 input. Use UTF-16 along with the wide-string functions (wcsxxx instead of strxxx). You can then use WideCharToMultiByte to convert UTF-16 to UTF-8. Note that you can't use the ANSI print functions after calling _setmode(_fileno(stdout), _O_U16TEXT); the mode has to be reset first. You may try wrapper functions that reset the text mode around such calls.

Nearly all native Windows string handling (with very rare exceptions) is Unicode (UTF-16), so we should use the Unicode functions everywhere; using the ANSI variants is very bad practice. If you use the Unicode functions in your example, everything will work correctly. With ANSI this doesn't work because of a Windows bug! I can cover this in full detail (researched on Windows 8.1):
1) Two global code-page variables exist in the console server process. They can be read and written with GetConsoleCP/SetConsoleCP and GetConsoleOutputCP/SetConsoleOutputCP, and they are used as the first argument to WideCharToMultiByte/MultiByteToWideChar whenever a conversion is needed. If you use only Unicode functions, they are never used.
2.a) When you write Unicode text to the console, it is written as-is, without any conversion. On the server side this is done in the SB_DoSrvWriteConsole function.

2.b) When you write ANSI text to the console, SB_DoSrvWriteConsole is also called, but with one additional step: MultiByteToWideChar(gOutputCodePage, ...) converts your text to Unicode first. One detail here: in that MultiByteToWideChar call, cchWideChar == cbMultiByte. If only the 'English' character set is used (chars < 0x80), the Unicode and multibyte strings are always equal in length (in chars); with other languages the multibyte form usually needs more chars than the Unicode form. That is not a problem in this direction: the output buffer is simply larger than necessary, which is fine. So printf will generally work correctly. One note only: if you hardcode a multibyte string in your source code, it will most likely be in CP_ACP form, and converting it to Unicode with CP_UTF8 gives an incorrect result; this depends on the encoding your source file is saved with on disk.
3.a) When you read from the console with Unicode functions, you get the Unicode text exactly as-is; there is no problem here. If needed, you can then convert it to multibyte yourself.
3.b) When you read from the console with ANSI functions, the server first converts the Unicode string to ANSI and then returns the ANSI form to you. This is done by the function ConvertToOem. But look more closely at how ConvertToOem is called: here again cbMultiByte == cchWideChar, and this is 100% a bug! A multibyte string can be longer than the Unicode string (in chars, of course). For example, "Ä" is 1 Unicode char but 2 UTF-8 chars. As a result, WideCharToMultiByte returns 0 (ERROR_INSUFFICIENT_BUFFER).