Can std::cout work with UTF-8 on Windows?

Posted 2019-04-13 06:25

Question:

I want to make std::cout print a UTF-8 literal. This seems to be an easy task with gcc, but an extremely difficult one on Windows.

The code that I'm trying to get to work is:

std::cout << "Ελληνικά Русский 你好";

Environment:

  • Windows 10, Visual Studio 2015
  • Default encoding: 1251
  • Console encoding: 866
  • Source encoding: UTF-8 with BOM

Requirements:

  • The line of code itself must not be changed
  • Full Unicode range support
  • Some setup code may be added in the beginning of main()

What I've tried:

  • #pragma execution_character_set("utf-8")
  • SetConsoleCP(CP_UTF8); SetConsoleOutputCP(CP_UTF8);
  • Set console font to Lucida Console system-wide
  • Use Unicode character set in project properties
  • Setup code from this blog

Nothing helped, and no StackOverflow answer solved the problem.

Edit

To get Unicode partially working, do the following:

  • Call initStreams() from the listing below at the start
  • Turn on Use Unicode Character Set in Project Settings
  • Add /utf-8 option

Not working:

  • wprintf
  • cin/wcin
  • Chinese characters

initStreams() implementation:

#include <cassert>         // assert
#include <codecvt>          // std::codecvt_utf8_utf16 (C++11)
#include <stdexcept>        // std::exception
#include <streambuf>        // std::basic_streambuf
#include <iostream>         // std::cout, std::endl
#include <locale>           // std::locale
#include <memory>           // std::unique_ptr (C++11)

#undef  UNICODE
#define UNICODE
#undef  STRICT
#define STRICT
#include <windows.h>    // MultiByteToWideChar

class OutputForwarderBuffer : public std::basic_streambuf<char>
{
public:
    using Base = std::basic_streambuf<char>;
    using Traits = Base::traits_type;
    using StreamBuffer = std::basic_streambuf<char>;
    using WideStreamBuffer = std::basic_streambuf<wchar_t>;
    using Base::int_type;
    using Base::char_type;

    OutputForwarderBuffer(
        StreamBuffer& existingBuffer,
        WideStreamBuffer* pWideStreamBuffer
    )
        : Base(existingBuffer)
        , pWideStreamBuffer_(pWideStreamBuffer)
    {
    }

    OutputForwarderBuffer(OutputForwarderBuffer const&) = delete;
    void operator=(OutputForwarderBuffer const&) = delete;

protected:
    std::streamsize xsputn(char const* s, std::streamsize n) override
    {
        if (n == 0) { return 0; }

        int const sourceSize = static_cast<int>(n);
        int const destinationSize = MultiByteToWideChar(CP_UTF8, 0, s, sourceSize, nullptr, 0);
        wideCharBuffer_.resize(static_cast<size_t>(destinationSize));

        int const nWideCharacters = MultiByteToWideChar(CP_UTF8, 0, s, sourceSize, &wideCharBuffer_[0], destinationSize);
        assert(nWideCharacters > 0 && nWideCharacters == destinationSize);

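        // Report the UTF-8 bytes consumed, not the UTF-16 units written, so the caller does not see a short write.
        std::streamsize const nWritten = pWideStreamBuffer_->sputn(&wideCharBuffer_[0], destinationSize);
        return (nWritten == destinationSize ? n : 0);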
    }

    int_type overflow(int_type c) override
    {
        bool const cIsEOF = Traits::eq_int_type(c, Traits::eof());
        int_type const failureValue = Traits::eof();
        int_type const successValue = (cIsEOF ? Traits::not_eof(c) : c);

        if (!cIsEOF) {
            char_type const ch = Traits::to_char_type(c);
            std::streamsize const nCharactersWritten = xsputn(&ch, 1);

            return (nCharactersWritten == 1 ? successValue : failureValue);
        }
        return successValue;
    }

private:
    WideStreamBuffer* pWideStreamBuffer_;
    std::wstring wideCharBuffer_;
};

void setUtf8Conversion(std::basic_ios<wchar_t>& stream)
{
    stream.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8_utf16<wchar_t>()));
}

bool isConsole(HANDLE streamHandle)
{
    DWORD consoleMode;
    return !!GetConsoleMode(streamHandle, &consoleMode);
}

bool isConsole(DWORD stdStreamId)
{
    return isConsole(GetStdHandle(stdStreamId));
}

void initStreams()
{
    SetConsoleCP(CP_UTF8);
    SetConsoleOutputCP(CP_UTF8);

    setUtf8Conversion(std::wcout);
    setUtf8Conversion(std::wcerr);
    setUtf8Conversion(std::wclog);

    static OutputForwarderBuffer coutBuffer(*std::cout.rdbuf(), std::wcout.rdbuf());
    static OutputForwarderBuffer cerrBuffer(*std::cerr.rdbuf(), std::wcerr.rdbuf());
    static OutputForwarderBuffer clogBuffer(*std::clog.rdbuf(), std::wclog.rdbuf());

    std::cout.rdbuf(&coutBuffer);
    std::cerr.rdbuf(&cerrBuffer);
    std::clog.rdbuf(&clogBuffer);
}

Answer 1:

Here is what I'd do:

  1. make sure your source files are utf-8 encoded and have correct content (open them in another editor, check glyphs and file encoding)

  2. take the console out of the equation -- redirect output to a file and check its content with a utf-8-aware editor (just like with the source code)

  3. use the /utf-8 cmdline option with MSVC2015+ -- this forces the compiler to treat all source files as utf-8 encoded, and the string literals stored in the resulting binary will be utf-8 encoded as well.

  4. take iostreams out of the equation (can't wait for this library to die, tbh) -- use cstdio

  5. at this point output should work (it does for me)

  6. to get console output to work -- use SetConsoleOutputCP(CP_UTF8) and make the console use a TrueType font that supports your Unicode plane (I suspect that for Chinese characters to work in the console you need a font installed on your system that supports the related Unicode plane, and your console should be configured to use it)

  7. not sure about console input (never had to deal with that), but I suspect that SetConsoleCP(CP_UTF8) should make it work with non-wide i/o

  8. discard the idea of using wide i/o (wcout/etc) -- why would you do it anyway? Unicode works just fine with utf-8 encoded char const*

  9. once you have reached this stage -- time to deal with iostreams (if you insist on using them). I'd disregard wcin/wcout for now. If cin/cout don't already work -- try imbue'ing them with a utf-8 locale.

  10. the idea promoted by http://utf8everywhere.org/ is to convert to UTF-16 only when you make a Windows API call. This makes your OutputForwarderBuffer unnecessary.

  11. I guess (if you REALLY insist) now you can try getting wide iostreams to work. Good luck, I guess you'll have to reconfigure the console (which will break non-wide i/o) or somehow get your wcout/wcin to perform UTF-16-to-UTF-8 conversion on the fly (and only if it is connected to a console).
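
Steps 2 to 6 could be sketched roughly as below. This is only an illustration under stated assumptions: hexBytes, useUtf8ConsoleOutput and printUtf8 are hypothetical helper names invented for this sketch, and the Windows-specific call is guarded so the same file also builds elsewhere.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>    // SetConsoleOutputCP, CP_UTF8
#endif

// Hypothetical helper (step 2): renders the bytes of `s` as upper-case hex
// pairs separated by spaces, so the encoding of a string literal can be
// inspected without involving the console at all.
std::string hexBytes(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : s) {
        if (!out.empty()) { out += ' '; }
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}

// Hypothetical helper (step 6): asks the console to interpret output bytes
// as UTF-8. Does nothing on non-Windows platforms.
void useUtf8ConsoleOutput()
{
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);
#endif
}

// Hypothetical helper (step 4): writes the UTF-8 bytes unchanged through
// cstdio. Returns true if every byte was accepted by the stream.
bool printUtf8(const std::string& utf8, std::FILE* stream = stdout)
{
    assert(stream != nullptr);
    return std::fwrite(utf8.data(), 1, utf8.size(), stream) == utf8.size();
}
```

For instance, in a build with /utf-8, hexBytes("你好") should yield E4 BD A0 E5 A5 BD, which is the UTF-8 encoding of those two characters; any other byte sequence suggests the literal was transcoded to a different execution character set. After that check, calling useUtf8ConsoleOutput() once at the start of main() and then printUtf8(...) corresponds to steps 4 to 6.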

Edit: Starting from Windows 10 you also need this:

setvbuf(stderr, NULL, _IOFBF, 1024);    // on Windows 10+ we need buffering or console will get 1 byte at a time (screwing up utf-8 encoding)
setvbuf(stdout, NULL, _IOFBF, 1024);

Unfortunately this also means that there is still a chance of screwing up your output if you fill the buffer completely before the next flush. Proper solution -- flush manually (endl or fflush()) after every string sent to output (assuming each string is shorter than 1024 bytes). If only MS supported line-buffering...
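
The "flush after every string" advice might look like the sketch below. enableFullBuffering and writeAndFlush are hypothetical helper names made up for illustration; the 1024-byte default simply mirrors the setvbuf calls above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: switches `stream` to full buffering, like the setvbuf
// calls above. It has to run before any other operation on the stream.
// Returns true on success.
bool enableFullBuffering(std::FILE* stream, std::size_t size = 1024)
{
    assert(stream != nullptr);
    return std::setvbuf(stream, nullptr, _IOFBF, size) == 0;
}

// Hypothetical helper: writes one UTF-8 string and flushes immediately, so
// that a string shorter than the buffer is handed over in a single piece
// instead of being split in the middle of a multi-byte sequence.
bool writeAndFlush(const std::string& utf8, std::FILE* stream = stdout)
{
    assert(stream != nullptr);
    bool const written =
        std::fwrite(utf8.data(), 1, utf8.size(), stream) == utf8.size();
    bool const flushed = std::fflush(stream) == 0;
    return written && flushed;
}
```

A program would call enableFullBuffering(stdout) once at the very start of main() and then route each string through writeAndFlush. Strings longer than the buffer can still be split, which is the limitation described above.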