How to portably write std::wstring to file?

Posted 2019-02-04 07:18

Question:

I have a wstring declared as such:

// random wstring
std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

The literal would be UTF-8 encoded, because my source file is.

[EDIT: According to Mark Ransom this is not necessarily the case, the compiler will decide what encoding to use - let us instead assume that I read this string from a file encoded in e.g. UTF-8]

I would very much like to get this into a file that reads as follows (when the text editor is set to the correct encoding):

abcàdëefŸg€hhhhhhhµa

but ofstream is not very cooperative (refuses to take wstring parameters), and wofstream supposedly needs to know locale and encoding settings. I just want to output this set of bytes. How does one normally do this?

EDIT: It must be cross platform, and should not rely on the encoding being UTF-8. I just happen to have a set of bytes stored in a wstring, and want to output them. It could very well be UTF-16, or plain ASCII.

Answer 1:

Why not write the file as binary? Just use ofstream with the std::ios::binary flag. The editor should then be able to interpret it. Don't forget the Unicode byte order mark (BOM) 0xFEFF at the beginning. You might be better off writing with a library; try one of these:

http://www.codeproject.com/KB/files/EZUTF.aspx

http://www.gnu.org/software/libiconv/

http://utfcpp.sourceforge.net/
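
A minimal sketch of the binary approach described above, assuming you really do want the wstring's in-memory code units dumped as-is. The helper name write_raw_wide is made up for illustration. Note that the resulting file uses the platform's native wchar_t width and byte order (typically UTF-16 on Windows, UTF-32 on Linux), so it is only "portable" in the sense that the code compiles everywhere.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper (not from the original answer): writes U+FEFF as a
// byte order mark, then the raw code units of text, with no conversion.
bool write_raw_wide(const std::string& path, const std::wstring& text)
{
    std::ofstream out(path.c_str(), std::ios::out | std::ios::binary);
    if (!out) return false;

    const wchar_t bom = 0xFEFF;
    out.write(reinterpret_cast<const char*>(&bom), sizeof bom);
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(text.size() * sizeof(wchar_t)));

    out.close();          // flush now so that write errors are reported
    return !out.fail();
}
```

The BOM lets an editor detect the byte order, but it cannot tell the editor how wide wchar_t was on the machine that wrote the file.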



Answer 2:

For std::wstring you need std::wofstream

std::wofstream f("C:\\some file.txt"); // narrow path: the wchar_t* constructor overload is an MSVC extension
f << str;
f.close();


Answer 3:

std::wstring is for something like UTF-16 or UTF-32, not UTF-8. For UTF-8, you probably just want to use std::string, and write out via std::cout. Just FWIW, C++0x will have Unicode literals, which should help clarify situations like this.



Answer 4:

C++ has a means of converting wide characters to localized ones on output or file write. Use a codecvt facet for that purpose.

You may use the standard std::codecvt_byname, or a non-standard codecvt_facet implementation.

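#include <cwchar>
#include <iostream>
#include <locale>
using namespace std;
// Build a locale whose codecvt facet converts wchar_t to UTF-8 bytes.
// The constructor throws std::runtime_error if the named locale is not installed.
locale utf8locale(locale(), new codecvt_byname<wchar_t, char, mbstate_t>("en_US.UTF-8"));
wcout.imbue(utf8locale);
wcout << L"Hello, wide to multibyte world!" << endl;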

Beware that on some platforms codecvt_byname can only perform conversions for locales that are installed on the system. I therefore recommend searching Stack Overflow for "utf8 codecvt" and choosing from the many custom codecvt implementations referenced there.
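
For file output, here is a sketch that avoids the installed-locale dependency by swapping in std::codecvt_utf8 from <codecvt> in place of codecvt_byname. This is an assumption on my part, not what the answer above uses: the facet exists in C++11 and later, is deprecated since C++17, and is removed in C++26, so check your toolchain. The helper name write_utf8_via_facet is made up for illustration.

```cpp
#include <cassert>
#include <codecvt>   // std::codecvt_utf8: deprecated since C++17, removed in C++26
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes text to path as UTF-8 through a wide stream.
bool write_utf8_via_facet(const std::string& path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening so the conversion applies from the first character.
    // The locale takes ownership of the facet allocated here.
    out.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    out.open(path.c_str(), std::ios::out | std::ios::binary);
    if (!out) return false;

    out << text;
    out.close();          // flush now so that conversion errors are reported
    return !out.fail();
}
```

Where wchar_t is 16 bits wide (Windows), std::codecvt_utf8<wchar_t> treats the input as UCS-2 and does not combine surrogate pairs; std::codecvt_utf8_utf16<wchar_t> is the variant meant for that case.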

EDIT: As the OP states that the string is already encoded, all he needs to do is remove the L and "w" prefixes from every token of his code.



Answer 5:

There is a (Windows-specific) solution that should work for you here. Basically, convert wstring to UTF-8 codepage and then use ofstream.

#include <windows.h>
#include <fstream>
#include <string>

std::string to_utf8(const wchar_t* buffer, int len)
{
        int nChars = ::WideCharToMultiByte(
                CP_UTF8,
                0,
                buffer,
                len,
                NULL,
                0,
                NULL,
                NULL);
        if (nChars == 0) return "";

        std::string newbuffer;
        newbuffer.resize(nChars) ;
        ::WideCharToMultiByte(
                CP_UTF8,
                0,
                buffer,
                len,
                &newbuffer[0],  // writable buffer; avoids casting away const on c_str()
                nChars,
                NULL,
                NULL); 

        return newbuffer;
}

std::string to_utf8(const std::wstring& str)
{
        return to_utf8(str.c_str(), (int)str.size());
}

int main()
{
        std::ofstream testFile;

        testFile.open("demo.xml", std::ios::out | std::ios::binary); 

        std::wstring text =
                L"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                L"<root description=\"this is a naïve example\">\n</root>";

        std::string outtext = to_utf8(text);

        testFile << outtext;

        testFile.close();

        return 0;
}


Answer 6:

Note that wide streams output only char * variables, so maybe you should try using the c_str() member function to convert a std::wstring and then output it to the file. Then it should probably work?



Answer 7:

You should not use a UTF-8 encoded source file if you want to write portable code. Sorry.

  std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

(I am not sure whether this actually violates the standard, but I think it does. Even if it doesn't, to be safe you should avoid it.)

Yes, purely using std::ostream will not work. There are many ways to convert a wstring to UTF-8. My favorite is using the International Components for Unicode. It's a big lib, but it's great. You get a lot of extras and things you might need in the future.



Answer 8:

From my experience of working with different character encodings I would recommend that you only deal with UTF-8 at load and save time. You're in for a world of pain if you try to store the internal representation in UTF-8, since a single character can take anywhere from 1 to 4 bytes. So simple operations like strlen require looking at the bytes to determine the length in characters, rather than just using the size of the allocated buffer (although you can optimize by looking only at the first byte of each character sequence, e.g. 00..7f is a single-byte char, c2..df indicates a 2-byte char, etc.).
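
To illustrate the lead-byte optimization, here is a rough sketch that counts characters (code points) in a UTF-8 string by jumping from lead byte to lead byte. The function names are made up for illustration, and the code does not validate continuation bytes; it only shows the skipping idea.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the sequence introduced by lead byte b,
// or 0 if b cannot start a sequence (continuation byte or invalid lead).
std::size_t utf8_sequence_length(unsigned char b)
{
    if (b <= 0x7F) return 1;               // 00..7f: single-byte char
    if (b >= 0xC2 && b <= 0xDF) return 2;  // c2..df: 2-byte char
    if (b >= 0xE0 && b <= 0xEF) return 3;  // e0..ef: 3-byte char
    if (b >= 0xF0 && b <= 0xF4) return 4;  // f0..f4: 4-byte char
    return 0;
}

// Counts code points by reading only the first byte of each sequence.
// Assumption: an invalid byte is counted as one character and skipped.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        i += (len == 0) ? 1 : len;
        ++count;
    }
    return count;
}
```

Even with this shortcut the count is linear in the number of characters, which is the cost the answer is warning about.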

People quite often refer to 'Unicode strings' when they mean UTF-16 and on Windows a wchar_t is a fixed 2 bytes. In Windows I think wchar_t is simply:

typedef unsigned short wchar_t; // older MSVC headers; with /Zc:wchar_t it is a built-in 16-bit type

The full UTF-32 4-byte representation is rarely required and very wasteful; here is what the Unicode Standard (5.0) has to say on it:

"On average more than 99% of all UTF-16 is expressed using single code units... UTF-16 provides the right mix of compact size with the ability to handle the occasional character outside the BMP"

In short, use wchar_t as your internal representation and do conversions when loading and saving (and don't worry about full Unicode unless you know you need it).

With regard to performing the actual conversion, have a look at the ICU project:

http://site.icu-project.org/



Answer 9:

I had the same problem some time ago, and wrote down the solution I found on my blog. You might want to check it out to see if it helps, especially the function wstring_to_utf8.

http://pileborg.org/b2e/blog5.php/2010/06/13/unicode-utf-8-and-wchar_t
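
For readers who cannot reach the link, here is a hand-rolled sketch of what such a conversion might look like. To be clear, this is not the blog's wstring_to_utf8; the names append_utf8 and wide_to_utf8 are made up for illustration. It assumes the wstring holds UTF-16 code units (16-bit wchar_t, as on Windows) or UTF-32 code points (32-bit wchar_t, as on most Unix systems), and replaces anything invalid with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of code point cp (assumed <= 0x10FFFF) to out.
void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical converter: wstring (UTF-16 or UTF-32 code units) to UTF-8 bytes.
std::string wide_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            unsigned long low = static_cast<unsigned long>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        // Unpaired surrogates and out-of-range values become U+FFFD.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting std::string can then be written with a plain std::ofstream opened in binary mode, which sidesteps locale settings entirely.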