This is the way I try to do it:
#include <stdio.h>
#include <windows.h>
using namespace std;
int main() {
SetConsoleOutputCP(CP_UTF8);
//german chars won't appear
char const* text = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
wchar_t *unicode_text = new wchar_t[len];
MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
wprintf(L"%s", unicode_text);
}
The effect is that only US-ASCII characters are displayed. No errors are shown. The source file is encoded in UTF-8.
So, what am I doing wrong here?
to WouterH:
int main() {
SetConsoleOutputCP(CP_UTF8);
const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
wprintf(L"%s", unicode_text);
}
This also doesn't work; the effect is just the same. My font is, of course, Lucida Console.
Third take:
#include <stdio.h>
#define _WIN32_WINNT 0x05010300
#include <windows.h>
#define _O_U16TEXT 0x20000
#include <fcntl.h>
using namespace std;
int main() {
_setmode(_fileno(stdout), _O_U16TEXT);
const wchar_t *u_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
wprintf(L"%s", u_text);
}
OK, something begins to work, but the output is: ańbcdefghijklmno÷pqrs▀tuŘvwxyz
The console can be set to display UTF-8 characters: as @vladasimovic answers, SetConsoleOutputCP(CP_UTF8) can be used for that. Alternatively, you can prepare your console with the DOS command chcp 65001, or with the system call system("chcp 65001 > nul") in the main program. Don't forget to save the source code in UTF-8 as well.
To check the UTF-8 support, run
65001 should appear in the list.
The Windows console uses OEM codepages by default, and most default raster fonts support only national characters. Windows XP and newer also support TrueType fonts, which should display the missing characters (@Devenec suggests Lucida Console in his answer).
Why printf fails
As @bames53 points out in his answer, the Windows console is not a stream device: you need to write all bytes of a multibyte character together. Sometimes printf messes this up by putting the bytes into the output buffer one by one. Try using sprintf and then puts on the result, or force an fflush only once the whole output has accumulated in the buffer.
If everything fails
Note the UTF-8 format: one character is encoded as 1-4 bytes (the original design allowed longer sequences, but current UTF-8 stops at 4). Use this function to shift to the next character in the string:
...and this function to transform the bytes into unicode number:
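The two functions referred to above are not reproduced here. As a stand-in, here is a minimal sketch of what such helpers might look like. The names utf8SequenceLength, utf8Next and utf8Decode are hypothetical, the input is assumed to be a NUL-terminated UTF-8 string, and overlong or out-of-range encodings are not rejected.

```cpp
#include <cstddef>
#include <cstdint>

// Length in bytes of the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
inline std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Returns a pointer to the start of the next character by skipping the
// current byte and any continuation bytes (10xxxxxx) that follow it.
// At the terminating NUL the pointer is returned unchanged.
inline const char* utf8Next(const char* p) {
    if (*p == '\0') return p;
    ++p;
    while ((static_cast<unsigned char>(*p) & 0xC0) == 0x80) ++p;
    return p;
}

// Decodes the character starting at `p` into its Unicode code point.
// Returns U+FFFD (replacement character) for a bad lead byte or a
// truncated sequence. Overlong encodings are NOT detected.
inline std::uint32_t utf8Decode(const char* p) {
    const unsigned char* s = reinterpret_cast<const unsigned char*>(p);
    const std::size_t len = utf8SequenceLength(s[0]);
    if (len == 0) return 0xFFFD;
    if (len == 1) return s[0];
    // Keep only the payload bits of the lead byte (5, 4 or 3 bits).
    std::uint32_t cp = s[0] & (0xFFu >> (len + 1));
    for (std::size_t i = 1; i < len; ++i) {
        // A NUL or any other non-continuation byte ends the scan early.
        if ((s[i] & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }
    return cp;
}
```

Walking a string then amounts to calling utf8Decode(p) and advancing with p = utf8Next(p) until *p is '\0'.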
Then you can try to use some wild/ancient/non-standard WinAPI function like MultiByteToWideChar (don't forget to call setlocale() before!), or you can use your own mapping from the Unicode table to your active working codepage. Example:
This should print
If your codepage doesn't support those Czech diacritics, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different charsets just for Czech. Displaying readable characters across different Windows locales is a horror.
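The fallback mapping for those four code points could be sketched as follows. The function name asciiFallback and the choice of '?' for anything unmapped are assumptions made for illustration; a real table would cover whatever your target codepage lacks.

```cpp
#include <cstdint>

// Hypothetical helper: returns an ASCII stand-in for a Unicode code point.
// ASCII passes through unchanged; only the four Czech letters mentioned
// in the text are mapped, everything else becomes '?'.
inline char asciiFallback(std::uint32_t codePoint) {
    if (codePoint < 0x80) return static_cast<char>(codePoint);
    switch (codePoint) {
        case 345: return 'r';  // U+0159, r with caron
        case 237: return 'i';  // U+00ED, i with acute
        case 353: return 's';  // U+0161, s with caron
        case 283: return 'e';  // U+011B, e with caron
        default:  return '?';  // assumption: placeholder for unmapped input
    }
}
```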
By default the wide print functions on Windows do not handle characters outside the ASCII range.
There are a few ways to get Unicode data to the Windows console:
1. Use the console API directly, WriteConsoleW. You'll have to ensure you're actually writing to a console and use other means when the output is to something else.
2. Set the mode of the standard output file descriptors to one of the 'Unicode' modes, _O_U16TEXT or _O_U8TEXT. This causes the wide character output functions to correctly output Unicode data to the Windows console. If they're used on file descriptors that don't represent a console then they cause the output stream of bytes to be UTF-16 and UTF-8 respectively. N.B. after setting these modes the non-wide character functions on the corresponding stream are unusable and result in a crash. You must use only the wide character functions.
3. UTF-8 text can be printed directly to the console by setting the console output codepage to CP_UTF8, if you use the right functions. Most of the higher level functions such as basic_ostream<char>::operator<<(char*) don't work this way, but you can either use lower level functions or implement your own ostream that works around the problem the standard functions have.
The problem with the third method is this:
Unlike most operating systems, the console on Windows is not simply another file that accepts a stream of bytes. It's a special device created and owned by the program and accessed via its own unique WIN32 API. The issue is that when the console is written to, the API sees exactly the extent of the data passed in that use of its API, and the conversion from narrow characters to wide characters occurs without considering that the data may be incomplete. When a multibyte character is passed using more than one call to the console API, each separately passed piece is seen as an illegal encoding, and is treated as such.
It ought to be easy enough to work around this, but the CRT team at Microsoft views it as not their problem whereas whatever team works on the console probably doesn't care.
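The core of such a workaround can be sketched as a small buffer that never hands over a partial multibyte character. This is only an illustration of the idea: the class name Utf8Chunker is hypothetical, it is not part of the CRT or the Win32 API, and it does no validation beyond looking at the last lead byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: collects UTF-8 bytes and releases them only up to
// the last complete character, holding back an unfinished trailing
// sequence until the rest of its bytes arrive.
class Utf8Chunker {
public:
    // Appends `bytes` to any held-back tail and returns the longest
    // prefix that ends on a character boundary.
    std::string feed(const std::string& bytes) {
        pending_ += bytes;
        const std::size_t cut = completePrefixLength(pending_);
        std::string ready = pending_.substr(0, cut);
        pending_.erase(0, cut);
        return ready;
    }

    // Bytes currently held back (an incomplete sequence, or empty).
    const std::string& pending() const { return pending_; }

private:
    // Sequence length announced by a lead byte; anything that is not a
    // multibyte lead is treated as a single byte and passed through.
    static std::size_t expectedLength(unsigned char lead) {
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 1;
    }

    static std::size_t completePrefixLength(const std::string& s) {
        // Step back over at most 3 trailing continuation bytes (10xxxxxx)
        // to find the byte that starts the last sequence.
        std::size_t i = s.size();
        std::size_t back = 0;
        while (i > 0 && back < 3 &&
               (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
            --i;
            ++back;
        }
        if (i == 0) return s.size();  // no lead byte found: nothing to wait for
        const std::size_t leadPos = i - 1;
        const std::size_t need =
            expectedLength(static_cast<unsigned char>(s[leadPos]));
        const std::size_t have = s.size() - leadPos;
        return (have < need) ? leadPos : s.size();
    }

    std::string pending_;
};
```

Each string returned by feed() contains only whole characters, so it could be passed to the console in a single call without a character being split across calls.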
You might solve it by implementing your own streambuf subclass which handles the conversion to wchar_t correctly, i.e. accounting for the fact that bytes of multibyte characters may arrive separately, and maintaining conversion state between writes (e.g., std::mbstate_t).
Another trick, instead of SetConsoleOutputCP, would be using _setmode on stdout:
Don't forget to remove the call to SetConsoleOutputCP(CP_UTF8);
Result:
aäbcdefghijklmnoöpqrsßtuüvwxyz
I solved the problem in the following way:
Lucida Console doesn't seem to support umlauts, so changing the console font to Consolas, for example, works.
EDIT: fixed stupid typos and the decoding of the string literal, sorry about those.
UTF-8 doesn't work for the Windows console. Period. I have tried all combinations with no success. Problems arise because ANSI and OEM code pages assign characters differently. Some answers say there is no problem, but such answers may come from programmers who use plain 7-bit ASCII or whose ANSI and OEM code pages are identical (Chinese, Japanese).
Either you stick to UTF-16 and the wide-char functions (but you are still restricted to the 256 characters of your OEM code page, except for Chinese/Japanese), or you use OEM code page ASCII strings in your source file.
Yes, it is a complete mess.
For multilingual programs I use string resources, and wrote a LoadStringOem() function that auto-translates the UTF-16 resource to an OEM string using WideCharToMultiByte() without an intermediate buffer. As Windows auto-selects the right language out of the resource, it will hopefully load a string in a language that is convertible to the target OEM code page.
As a consequence, you should not use 8-bit typographic characters (such as the ellipsis … and quotes “”) in the English-US language resource, because English-US is what Windows selects as a fallback when no language match is found. For example, if you have resources in German, Czech, Russian, and English-US, and the user has Chinese, he/she will see English plus garbage in place of your carefully chosen typographic characters.