How do I convert a decimal number, 225 for example, to its corresponding Unicode character when it's being output? I can convert ASCII characters from decimal to the character like this:
int a = 97;
char b = a;
cout << b << endl;
And it outputs the letter "a", but it just outputs a question mark when I use the number 225, or any other non-ASCII value.
To start with, it's not your C++ program which converts strings of bytes written to standard output into visible characters; it's your terminal (or, more commonly these days, your terminal emulator). Unfortunately, there is no way to ask the terminal how it expects characters to be encoded, so that needs to be configured into your environment; normally, that's done by setting appropriate locale environment variables.

Like most things which have to do with terminals, the locale configuration system would probably have been done very differently if it hadn't developed with a history of many years of legacy software and hardware, most of which were originally designed without much consideration for niceties like accented letters, syllabaries or ideographs. C'est la vie.
Unicode is pretty cool, but it also had to be deployed in the face of the particular history of computer representation of writing systems, which meant making a lot of compromises in the face of the various firmly-held but radically contradictory opinions in the software engineering community (incidentally, a community in which head-butting is rather more common than compromise). The fact that Unicode has eventually become more or less the standard is a testimony to its solid technical foundations and the perseverance and political skills of its promoters and designers -- particularly Mark Davis -- and I say this despite the fact that it basically took more than two decades to get to this point.
One of the aspects of this history of negotiation and compromise is that there is more than one way to encode a Unicode string into bits. There are at least three ways, and two of those have two different versions depending on endianness; moreover, each of these coding systems has its dedicated fans (and consequently, its dogmatic detractors). In particular, Windows made an early decision to go with a mostly-16-bit encoding, UTF-16, while most unix(-like) systems use a variable-length 8-to-32-bit encoding, UTF-8. (Technically, UTF-16 is also a 16- or 32-bit encoding, but that's beyond the scope of this rant.)
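To make the difference concrete, here is a small self-contained sketch that just prints the byte and code-unit values which the Unicode standard assigns to U+00E1 (the code point at decimal 225, 'á') in each of those encoding forms; the numeric values are fixed by the standard, not by this program:

#include <cstdio>

int main() {
    const unsigned char utf8[]  = {0xC3, 0xA1};   // two bytes
    const char16_t      utf16[] = {0x00E1};       // one 16-bit code unit
    const char32_t      utf32[] = {0x000000E1};   // one 32-bit code unit

    for (unsigned char b : utf8)
        std::printf("UTF-8 byte:  0x%02X\n", static_cast<unsigned>(b));
    std::printf("UTF-16 unit: 0x%04X\n", static_cast<unsigned>(utf16[0]));
    std::printf("UTF-32 unit: 0x%08X\n", static_cast<unsigned>(utf32[0]));
    return 0;
}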
Pre-Unicode, every country/language used their own idiosyncratic 8-bit encoding (or, at least, those countries whose languages are written with an alphabet of fewer than 194 characters). Consequently, it made sense to configure the encoding as part of the general configuration of local presentation, like the names of months, the currency symbol, and what character separates the integer part of a number from its decimal fraction. Now that there is widespread (but still far from universal) convergence on Unicode, it seems odd that locales include the particular flavour of Unicode encoding, given that all flavours can represent the same Unicode strings and that the encoding is more generally specific to the particular software being used than to the national idiosyncrasy. But it is, and that's why on my Ubuntu box, the environment variable LANG is set to es_ES.UTF-8 and not just es_ES. (Or es_PE, as it should be, except that I keep running into little issues with that locale.) If you're using a linux system, you might find something similar.

In theory, that means that my terminal emulator (konsole, as it happens, but there are various others) expects to see UTF-8 sequences. And, indeed, konsole is clever enough to check the locale setting and set up its default encoding to match, but I'm free to change the encoding (or the locale settings), and confusion is likely to result.

So let's suppose that your locale settings and the encoding used by your terminal are actually in sync, which they should be on a well-configured workstation, and go back to the C++ program. Now, the C++ program needs to figure out which encoding it's supposed to use, and then transform from whatever internal representation it uses to the external encoding.
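If you want to see what the standard library will be told, you can ask for the name of the environment's default locale; just a quick sketch, and the exact string printed depends entirely on your configuration:

#include <iostream>
#include <locale>

int main() {
    // The "" locale means "whatever the environment (LC_ALL, LC_CTYPE, LANG)
    // is configured to"; on a box set up as described above, this prints
    // something like "es_ES.UTF-8".
    std::cout << std::locale("").name() << std::endl;
    return 0;
}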
Fortunately, the C++ standard library should handle that correctly, if you cooperate by:

1. Telling the standard library to use the configured locale, instead of the default "C" (i.e. only unaccented characters, as per English) locale; and

2. Using strings and iostreams based on wchar_t (or some other wide character format).

If you do that, in theory you don't need to know either what wchar_t means to your standard library or what a particular bit pattern means to your terminal emulator. So let's try that; a minimal sketch of what that can look like follows just below. That works on my system. YMMV. Good luck.
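One way to put those two points together, as a sketch that assumes your environment locale is a UTF-8 one (as discussed above):

#include <iostream>
#include <locale>

int main() {
    // Use the locale configured in the environment (e.g. es_ES.UTF-8)
    // instead of the default "C" locale.
    std::locale::global(std::locale(""));
    // Make sure the wide output stream uses that locale for its
    // wide-character-to-byte conversion.
    std::wcout.imbue(std::locale());

    wchar_t b = 225;            // U+00E1, 'á'
    std::wcout << b << std::endl;
    return 0;
}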
Small side-note: I've run into lots of people who think that wcout automatically writes "wide characters", so that using it will produce UTF-16 or UTF-32 or something. It doesn't. It produces exactly the same encoding as cout. The difference is not what it outputs but what it accepts as input. In fact, it can't really be different from cout, because both of them are connected to the same OS stream, which can only have one encoding (at a time).

You might ask why it is necessary to have two different iostreams. Why couldn't cout have just accepted wchar_t and std::wstring values? I don't actually have an answer for that, but I suspect it is part of the philosophy of not paying for features you don't need. Or something like that. If you figure it out, let me know.

If for some reason you want to handle this entirely on your own:
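Here is a minimal sketch of such a hand-rolled encoder; the function name to_utf8 and its interface are just illustrative choices:

#include <string>

// Encode a single Unicode code point (up to U+10FFFF) as a UTF-8 byte sequence.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {               // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}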
And then to use it:
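(Again just a sketch, assuming the to_utf8 function above and a terminal that expects UTF-8.)

#include <iostream>

int main() {
    // 225 is U+00E1; on a UTF-8 terminal this prints 'á'.
    std::cout << to_utf8(225) << std::endl;
    return 0;
}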
Note that this is just a standard UTF-8 encoding algorithm, so if your platform does not assume UTF-8 it might not render correctly. (Thanks, @EmilioGaravaglia)