It is my understanding that by default, Character
is Latin_1, Wide_Character
is UCS-2, and Wide_Wide_Character
is UCS-4, but that GNAT can have specified pragma Wide_Character_Encoding(UTF8);
or -gnatW8
and that those characters and their strings will be UTF-8 encoded instead.
At least on Linux and FreeBSD, the results fit with my expectations. But on Windows the results are odd.
For either Wide or Wide_Wide variants, once a character moves beyond the ASCII set, I get a garbled mess. I beleive this is called emojibake by some. So I figured it was a codepage issue. After all, the default codepage in Windows, and therefore what the Console Host would load with, is 437 which isn't the UTF-8 codepage. chcp 65001
and now instead of the mess of extra characters, there's an immediate exception raised ADA.IO_EXCEPTIONS.DEVICE_ERROR : a-ztexio.adb:1295
. Looking at where the exception occurred, it seems to be in the putc
binding of fputc()
. But this is Standard_Output, shouldn't an EOF never happen?
Is there some kind of special consideration Windows needs? How can I get UTF-8 output?
edit:
I tried piping the output into a text file. The supposed UTF-8 encoded program still generates emojibake in the file. Not sure why this would immediately throw an exception in the console though.
So then I tried directly opening and writing to a file instead of the console/pipe. Oddly this works exactly as it should. The text is completely correct.
I've never seen this kind of behavior with any other language, so it should still be possible to get proper UTF-8 at the console, right?
The deficiency so many others, not just here, describe in the Windows Console Host has either been fixed or never existed in the first place. Based on this document, I feel it was probably always very misunderstood. Windows doesn't treat the console like files, and it's easy to fall into that trap.
Using this very straight forward code, along with what Windows needs and expects behind the scenes...
It correctly produces the following, as long as either
pragma Wide_Character_Encoding(UTF8);
or-gnatW8
is used.Piping the output of this test program into a file works as it should. Similarly, piping the output of this test program into another program works as it should. And also similarly, taking the file from piped output, and piping it into another program works as it should.
Full UTF-8 behavior as one would expect under Linux, on Windows.
What needs to be done is twofold. In the package initializer, the Console Host needs to be told what it's working with, which can be done like this.
Character output is then done through
fputwc
. According to MS Docsfputc
should never be used for UNICODE on Windows, which is part of the problem GNAT has. String output and character/string input is all similar.Based on others comments and some further research to confirm, I'm pretty sure this is a deficiency of the Windows Console Host.
edit: don't listen to this