Writing Unicode to a file in C++

Posted 2019-03-16 20:05

Question:

I have a problem with writing Unicode to a file in C++. I want to write a few smiley faces, the kind you get by typing ALT+NUMPAD(2), to a file with my own extension. I can display one in CMD by creating a char and assigning it the value '\2', which shows a smiley face, but I can't get it written to a file.

Here is a snippet of code for my program:

ofstream myfile;
myfile.open("C:\\Users\\My Username\\test.exampleCodeFile");
myfile << "\2";
myfile.close();

It will write to the file, but it won't display what I want. I would show you what it displays, but StackOverflow won't let me display the character. Thanks in advance.

Answer 1:

ALT+NUMPAD2 is not the same thing as ASCII character 2, which is what your code is writing to the file. ALT codes are how DOS handles non-ASCII characters. The glyph that the console (cmd.exe) displays for ALT+NUMPAD2 is actually Unicode code point U+263B "BLACK SMILING FACE". Since it is a Unicode character, you are best off encoding the file as UTF-8 or UTF-16, e.g.:

ofstream myfile;
myfile.open("C:\\Users\\My Username\\test.txt");
myfile << "\xEF\xBB\xBF"; // UTF-8 BOM
myfile << "\xE2\x98\xBB"; // U+263B
myfile.close();

Or, as UTF-16 (little-endian):

ofstream myfile;
myfile.open("C:\\Users\\My Username\\test.txt");
myfile << "\xFF\xFE"; // UTF-16 BOM
myfile << "\x3B\x26"; // U+263B
myfile.close();

Both approaches show a smiley face in Notepad (provided you use a font that supports the glyph), because Notepad reads the BOM first and then decodes the bytes that follow accordingly.



Answer 2:

You have to use Unicode to specify the characters you want to display. The character represented by byte 02h in the console is translated by code page 437 (cp437) to the Unicode character U+263B. Using a source file saved in UTF-8 with BOM makes using Unicode easier, because you can paste or type the characters you want without resorting to Unicode escape codes.

For a file stream, the stream needs to be configured for UTF-8. There are various ways to do this, depending on the compiler. The following works with Visual Studio 2012 and a source file saved as UTF-8 with BOM (found with a bit of Googling):

#include <locale>
#include <codecvt>
#include <fstream>
#include <iostream>
#include <io.h>
#include <fcntl.h>
using namespace std;

int main()
{
    const std::locale utf8_locale = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
    wofstream f(L"sample.txt");
    f.imbue(utf8_locale);
    f << L"\u263b我是美国人。我叫马克。" << endl;

    _setmode(_fileno(stdout),_O_U16TEXT);
    wcout << L"\u263b我是美国人。我叫马克。" << endl;
}

Content of sample.txt as viewed in Notepad:

☻我是美国人。我叫马克。

Hex dump (correct UTF-8):

E298BBE68891E698AFE7BE8EE59BBDE4BABAE38082E68891E58FABE9A9ACE5858BE380820D0A

Output to the console, cut and pasted here. Without the right font, the console showed � for each Chinese character, but the characters display correctly when pasted into Stack Overflow or Notepad.

☻我是美国人。我叫马克。


Answer 3:

You are using the exact opposite of Unicode. The console operates with an 8-bit code page; the default on Western machines is code page 437, which matches the character set of the old IBM PC character ROM and is the code page that most legacy DOS programs expect. The first set of character codes, 0 through 8, look like this:

[Image: code page 437 glyphs for codes 0 through 8]

Note the smiley face for code 0x02, the one you saw on your console. You can see the rest of the glyphs in the Wikipedia article on code page 437. A nasty problem with 8-bit character encodings is that there are so many of them. Notepad reads your file with a different code page. By default that's Windows-1252 on machines in Western Europe and the Americas. That page doesn't have any glyphs for the control codes, which is why you didn't see the smiley in Notepad.

Dealing with code pages is a major headache. That's why Unicode was invented.

Switching the console to a Unicode code page is possible. It still has to be an 8-bit encoding, though, another legacy hangover from console programs supporting output redirection, which makes UTF-8 the right choice. You can switch from the console itself by typing chcp 65001 before starting your program, or you can do it in your code by calling SetConsoleOutputCP(CP_UTF8);.

One more unfortunate detail to take care of: you also need to change the font that's used for the console. The default font is Terminal, a legacy font that was designed to display the IBM PC glyphs but doesn't know beans about Unicode. Use the system menu to switch (press Alt+Space, Properties). There is not much to choose from, but Consolas or Lucida Console are suitable.

Now you can display Unicode. How to encode it is a whole other story, one that Remy's answer introduces.