Decimal to Unicode Char in C++

Posted 2019-07-21 09:27

Question:

How do I convert a decimal number, 225 for example, to its corresponding Unicode character when it's being output? I can convert ASCII characters from decimal to the character like this:

int a = 97;
char b = a;
cout << b << endl;

And it outputs the letter "a", but it just prints a question mark when I use the number 225, or any other non-ASCII value.

Answer 1:

To start with, it's not your C++ program which converts strings of bytes written to standard output into visible characters; it's your terminal (or, more commonly these days, your terminal emulator). Unfortunately, there is no way to ask the terminal how it expects characters to be encoded, so that needs to be configured into your environment; normally, that's done by setting appropriate locale environment variables.
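
If you want to see which of those variables your environment actually sets, you can peek at them from a program. This is just a quick diagnostic sketch; by POSIX convention, LC_ALL overrides LC_CTYPE, which in turn overrides LANG.

#include <cstdio>
#include <cstdlib>

int main() {
  // The variables the locale system consults for the character encoding,
  // listed from lowest to highest priority (POSIX convention).
  const char* vars[] = { "LANG", "LC_CTYPE", "LC_ALL" };
  for (const char* name : vars) {
    const char* value = std::getenv(name);
    std::printf("%-8s = %s\n", name, value ? value : "(unset)");
  }
  return 0;
}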

Like most things which have to do with terminals, the locale configuration system would probably have been done very differently if it hadn't developed with a history of many years of legacy software and hardware, most of which were originally designed without much consideration for niceties like accented letters, syllabaries or ideographs. C'est la vie.

Unicode is pretty cool, but it also had to be deployed against the particular history of computer representation of writing systems, which meant making a lot of compromises among the various firmly-held but radically contradictory opinions in the software engineering community (incidentally, a community in which head-butting is rather more common than compromise). The fact that Unicode has eventually become more or less the standard is a testimony to its solid technical foundations and to the perseverance and political skills of its promoters and designers, particularly Mark Davis, and I say this despite the fact that it took more than two decades to get to this point.

One of the aspects of this history of negotiation and compromise is that there is more than one way to encode a Unicode string into bits. There are at least three ways, and two of those have two different versions depending on endianness; moreover, each of these coding systems has its dedicated fans (and consequently, its dogmatic detractors). In particular, Windows made an early decision to go with a mostly-16-bit encoding, UTF-16, while most unix(-like) systems use a variable-length 8-to-32-bit encoding, UTF-8. (Technically, UTF-16 is also a 16- or 32-bit encoding, but that's beyond the scope of this rant.)
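
To make the "more than one way" point concrete, here is a little sketch that prints the code units which a single code point, U+00E1, turns into under UTF-8 and UTF-16. (The u8 and u literal prefixes have been standard since C++11; under C++20 the u8 literal's element type becomes char8_t rather than char, which is why the loop casts before printing.)

#include <cstdio>

int main() {
  // U+00E1 (á) spelled as UTF-8 and UTF-16 string literals.
  const auto& utf8  = u8"\u00E1";   // expected bytes:     C3 A1
  const auto& utf16 =  u"\u00E1";   // expected code unit: 00E1

  std::printf("UTF-8 : ");
  for (auto c : utf8)
    if (c) std::printf("%02X ", static_cast<unsigned>(static_cast<unsigned char>(c)));
  std::printf("\nUTF-16: ");
  for (auto c : utf16)
    if (c) std::printf("%04X ", static_cast<unsigned>(c));
  std::printf("\n");
  return 0;
}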

Pre-Unicode, every country/language used its own idiosyncratic 8-bit encoding (or, at least, those countries whose languages are written with an alphabet of fewer than 194 characters). Consequently, it made sense to configure the encoding as part of the general configuration of local presentation, like the names of months, the currency symbol, and what character separates the integer part of a number from its decimal fraction. Now that there is widespread (but still far from universal) convergence on Unicode, it seems odd that locales include the particular flavour of Unicode encoding, given that all flavours can represent the same Unicode strings and that the encoding is more generally specific to the particular software being used than to any national idiosyncrasy. But it is, and that's why on my Ubuntu box, the environment variable LANG is set to es_ES.UTF-8 and not just es_ES. (Or es_PE, as it should be, except that I keep running into little issues with that locale.) If you're using a Linux system, you might find something similar.
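
You can ask the C++ library what it resolves that configuration to; the exact name string is platform dependent, and std::locale("") will throw if the environment names a locale that isn't actually installed:

#include <iostream>
#include <locale>

int main() {
  // std::locale("") is built from the environment (LANG / LC_* on POSIX).
  // On my machine this prints something like "es_ES.UTF-8"; yours will differ.
  std::cout << "configured locale: " << std::locale("").name() << '\n';
  return 0;
}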

In theory, that means that my terminal emulator (konsole, as it happens, but there are many others) expects to see UTF-8 sequences. And, indeed, konsole is clever enough to check the locale setting and set up its default encoding to match, but I'm free to change the encoding (or the locale settings), and confusion is likely to result.

So let's suppose that your locale settings and the encoding used by your terminal are actually in sync, which they should be on a well-configured workstation, and go back to the C++ program. Now, the C++ program needs to figure out which encoding it's supposed to use, and then transform from whatever internal representation it uses to the external encoding.

Fortunately, the C++ standard library should handle that correctly, if you cooperate by:

  1. Telling the standard library to use the configured locale, instead of the default "C" locale (i.e. only unaccented characters, as in English); and

  2. Using strings and iostreams based on wchar_t (or some other wide character format).

If you do that, in theory you don't need to know either what wchar_t means to your standard library or what a particular bit pattern means to your terminal emulator. So let's try that:

#include <iostream>
#include <locale>

int main(int argc, char** argv) {
  // std::locale()   is the "global" locale
  // std::locale("") is the locale configured through the locale system
  // At startup, the global locale is set to std::locale("C"), so we need
  // to change that if we want locale-aware functions to use the configured
  // locale.
  // This sets the global locale to the configured locale.
  std::locale::global(std::locale(""));

  // The various standard io streams were initialized before main() started,
  // so they were all imbued with the locale that was global at that point,
  // std::locale("C"). If we want them to behave in a locale-aware manner,
  // including using the hopefully correct encoding for output, we need to
  // "imbue" each iostream with the now-current global locale.
  // We don't have to do all of these in this simple example,
  // but it's probably a good idea.
  std::cin.imbue(std::locale());
  std::cout.imbue(std::locale());
  std::cerr.imbue(std::locale());
  std::wcin.imbue(std::locale());
  std::wcout.imbue(std::locale());
  std::wcerr.imbue(std::locale());

  // Writing a wchar_t to cout won't print the character you want, because
  // cout's character type is char. wcout, on the other hand, accepts both
  // wchar_t and char; it will "widen" char. So it's convenient to use wcout:
  std::wcout << "a acute: " << wchar_t(225) << std::endl;
  std::wcout << "pi:      " << wchar_t(960) << std::endl;
  return 0;
}

That works on my system. YMMV. Good luck.


Small side-note: I've run into lots of people who think that wcout automatically writes "wide characters", so that using it will produce UTF-16 or UTF-32 or something. It doesn't. It produces exactly the same encoding as cout. The difference is not what it outputs but what it accepts as input. In fact, it can't really be different from cout because both of them are connected to the same OS stream, which can only have one encoding (at a time).
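
Here's a minimal illustration of the "what it accepts" difference. (I'm only using wcout, because mixing narrow and wide output on the same underlying stream is its own can of worms.)

#include <iostream>
#include <locale>

int main() {
  std::locale::global(std::locale(""));
  std::wcout.imbue(std::locale());

  // wcout happily takes narrow literals and chars (widening them) as well as
  // wide ones; all of it ends up in the same encoding on the terminal.
  std::wcout << "narrow, " << L"wide, " << 'a' << L' ' << wchar_t(225) << std::endl;

  // By contrast, streaming a wchar_t to cout does not print a character:
  // before C++20 it is formatted as its integer value (225 here), and since
  // C++20 that overload is deleted, so it does not even compile.
  // std::cout << wchar_t(225) << std::endl;
  return 0;
}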

You might ask why it is necessary to have two different iostreams. Why couldn't cout have just accepted wchar_t and std::wstring values? I don't actually have an answer for that, but I suspect it is part of the philosophy of not paying for features you don't need. Or something like that. If you figure it out, let me know.



Answer 2:

If for some reason you want to handle this entirely on your own:

// Writes the UTF-8 encoding of `code` into `chars` as a NUL-terminated
// sequence of bytes (at most four bytes plus the terminator).
void GetUnicodeChar(unsigned int code, char chars[5]) {
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F); chars[1] = '\0';
    } else if (code <= 0x7FF) {
        // one continuation byte
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
    } else if (code <= 0xFFFF) {
        // two continuation bytes
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
    } else if (code <= 0x10FFFF) {
        // three continuation bytes
        chars[3] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
    } else {
        // out of range: emit U+FFFD, the Unicode replacement character,
        // whose UTF-8 encoding is EF BF BD
        chars[0] = 0xEF; chars[1] = 0xBF; chars[2] = 0xBD;
        chars[3] = '\0';
    }
}

And then to use it:

char chars[5];
GetUnicodeChar(225, chars);
cout << chars << endl; // á

GetUnicodeChar(0x03A6, chars);
cout << chars << endl; // Φ

GetUnicodeChar(0x110000, chars);
cout << chars << endl; // �

Note that this is just a standard UTF-8 encoding algorithm, so if your platform does not assume UTF-8 it might not render correctly. (Thanks, @EmilioGaravaglia)
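
If you want to check the encoder independently of whatever your terminal does with the bytes, a small hex dump helps; this assumes the GetUnicodeChar above is in scope, and the expected sequences come straight from the UTF-8 definition:

#include <cstdio>

void GetUnicodeChar(unsigned int code, char chars[5]);  // defined above

void DumpUtf8(unsigned int code) {
  char chars[5];
  GetUnicodeChar(code, chars);
  std::printf("U+%04X ->", code);
  for (const char* p = chars; *p; ++p)
    std::printf(" %02X", static_cast<unsigned>(static_cast<unsigned char>(*p)));
  std::printf("\n");
}

int main() {
  DumpUtf8(0x00E1);    // expect C3 A1
  DumpUtf8(0x03A6);    // expect CE A6
  DumpUtf8(0x1F600);   // expect F0 9F 98 80
  DumpUtf8(0x110000);  // out of range: expect EF BF BD (U+FFFD)
  return 0;
}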