UTF-8 on Windows with Ada

Posted 2019-07-18 11:19

It is my understanding that by default Character is Latin-1, Wide_Character is UCS-2, and Wide_Wide_Character is UCS-4, but that GNAT can be given pragma Wide_Character_Encoding (UTF8); or the -gnatW8 switch, in which case those characters and their strings will be UTF-8 encoded instead.

At least on Linux and FreeBSD, the results fit with my expectations. But on Windows the results are odd.

For either the Wide or Wide_Wide variants, once a character moves beyond the ASCII set, I get a garbled mess. I believe some call this mojibake. So I figured it was a code page issue. After all, the default code page in Windows, and therefore the one the Console Host loads with, is 437, which isn't the UTF-8 code page. After running chcp 65001, instead of the mess of extra characters an exception is raised immediately: ADA.IO_EXCEPTIONS.DEVICE_ERROR : a-ztexio.adb:1295. Looking at where the exception occurred, it seems to be in the putc binding of fputc(). But this is Standard_Output; shouldn't an EOF never happen?

Is there some kind of special consideration Windows needs? How can I get UTF-8 output?

edit:
I tried piping the output into a text file. The supposedly UTF-8 encoded program still generates mojibake in the file. I'm not sure why this would immediately raise an exception in the console, though.

So then I tried opening a file directly and writing to it instead of the console/pipe. Oddly, this works exactly as it should. The text is completely correct.

I've never seen this kind of behavior with any other language, so it should still be possible to get proper UTF-8 at the console, right?

2 Answers
不美不萌又怎样
#2 · 2019-07-18 11:44

The deficiency so many others, not just here, describe in the Windows Console Host has either been fixed or never existed in the first place. Based on this document, I feel it was probably always very misunderstood. Windows doesn't treat the console like files, and it's easy to fall into that trap.

Using this very straightforward code, along with what Windows needs and expects behind the scenes...

[Image: source code of the test program]

It correctly produces the following, as long as either pragma Wide_Character_Encoding (UTF8); or -gnatW8 is used.

[Image: output of the test program]

Piping the output of this test program into a file works as it should. Similarly, piping the output of this test program into another program works as it should. And also similarly, taking the file from piped output, and piping it into another program works as it should.

In other words, full UTF-8 behavior on Windows, just as one would expect under Linux.

What needs to be done is twofold. In the package initializer, the Console Host needs to be told what it's working with, which can be done like this.

[Image: package initializer code]
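
The code from the screenshot is not available here, so the following is only a minimal sketch of how such an initializer could look, not the answerer's actual code. It assumes GNAT on Windows linked against the Microsoft C runtime, imports `_setmode` by hand, and hard-codes `_O_U8TEXT` as 16#40000# (its value in Microsoft's fcntl.h). The package name Console_Setup and the Ada-side names Set_Mode, O_U8TEXT and Previous are made up for illustration.

```ada
--  console_setup.ads (hypothetical unit name)
package Console_Setup is
   --  Force the body to exist so its initializer runs at elaboration.
   pragma Elaborate_Body;
end Console_Setup;

--  console_setup.adb
with Interfaces.C_Streams; use Interfaces.C_Streams;

package body Console_Setup is

   --  Assumed value of _O_U8TEXT from the Microsoft CRT's <fcntl.h>.
   O_U8TEXT : constant int := 16#40000#;

   --  C prototype: int _setmode (int fd, int mode);
   --  Returns the previous translation mode, or -1 on failure.
   function Set_Mode (FD : int; Mode : int) return int
     with Import, Convention => C, External_Name => "_setmode";

   Previous : int;

begin
   --  Runs once, when the package is elaborated.
   Previous := Set_Mode (fileno (stdout), O_U8TEXT);
   if Previous = -1 then
      raise Program_Error with "_setmode (stdout, _O_U8TEXT) failed";
   end if;
end Console_Setup;
```

The two units need to go into separate files (gnatchop can split them), and the main program has to `with Console_Setup;` so that the package is elaborated at all. Once a stream is in `_O_U8TEXT` mode the Microsoft CRT expects only wide functions on it, so ordinary Ada.Text_IO output to the same stream should be avoided afterwards.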

Character output is then done through fputwc. According to the Microsoft docs, fputc should never be used for Unicode on Windows, which is part of the problem GNAT has. String output and character/string input are all handled similarly.

[Image: character output code using fputwc]
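
Since the screenshot is lost, here is a hedged sketch of what output through fputwc might look like from Ada. It is not the original code. It assumes GNAT on Windows, where wchar_t and wint_t are 16-bit unsigned values, and it repeats the `_setmode` call so that the example stands on its own. The names Put_Wide_Demo, Put, Put_WC, Set_Mode, WChar_T and the two constants are invented for this example.

```ada
pragma Wide_Character_Encoding (UTF8);  --  source file is saved as UTF-8

with Ada.IO_Exceptions;
with Interfaces.C;
with Interfaces.C_Streams; use Interfaces.C_Streams;

procedure Put_Wide_Demo is

   --  Assumption: Windows ABI, wchar_t and wint_t are unsigned short.
   subtype WChar_T is Interfaces.C.unsigned_short;
   WEOF : constant WChar_T := 16#FFFF#;

   --  Assumed value of _O_U8TEXT from the Microsoft CRT's <fcntl.h>.
   O_U8TEXT : constant int := 16#40000#;

   --  C prototype: int _setmode (int fd, int mode);
   function Set_Mode (FD : int; Mode : int) return int
     with Import, Convention => C, External_Name => "_setmode";

   --  C prototype: wint_t fputwc (wchar_t c, FILE *stream);
   function Put_WC (Ch : WChar_T; Stream : FILEs) return WChar_T
     with Import, Convention => C, External_Name => "fputwc";

   --  Write a Wide_String to standard output one UTF-16 code unit
   --  at a time; the CRT converts it to UTF-8 on the way out.
   procedure Put (Item : Wide_String) is
   begin
      for Ch of Item loop
         if Put_WC (WChar_T (Wide_Character'Pos (Ch)), stdout) = WEOF then
            raise Ada.IO_Exceptions.Device_Error;
         end if;
      end loop;
   end Put;

begin
   --  Put stdout into UTF-8 text mode before any wide output.
   if Set_Mode (fileno (stdout), O_U8TEXT) = -1 then
      raise Program_Error with "_setmode (stdout, _O_U8TEXT) failed";
   end if;

   Put ("Grüße, Ελλάδα, Привет" & Wide_Character'Val (10));
end Put_Wide_Demo;
```

This sketch only handles Wide_String, i.e. characters in the Basic Multilingual Plane. A Wide_Wide_String version would first have to convert each character above 16#FFFF# into a UTF-16 surrogate pair before handing it to fputwc.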

甜甜的少女心
#3 · 2019-07-18 11:57

Based on others' comments and some further research to confirm, I'm pretty sure this is a deficiency of the Windows Console Host.

edit: don't listen to this
