Problem using getline with a Unicode file

Posted 2019-05-25 17:02

Question:

UPDATE: Thank you to @Potatoswatter and @Jonathan Leffler for the comments. Rather embarrassingly, I was caught out by the debugger tooltip not showing the value of a wstring correctly. However, it still isn't quite working for me, and I have updated the question below:

When I have a small multibyte file that I want to read into a string, I use the following trick: I call getline with a delimiter of '\0', e.g.

std::string contents_utf8;
std::ifstream inf1("utf8.txt");
getline(inf1, contents_utf8, '\0');

This reads in the entire file including newlines.
However, if I try to do the same thing with a wide character file, it doesn't work: my wstring only reads up to the end of the first line.

std::wstring contents_wide;
std::wifstream inf2(L"ucs2-be.txt");
getline( inf2, contents_wide, wchar_t(0) ); //doesn't work

For example, if my Unicode file contains the characters A and B separated by CRLF, the hex looks like this:

FE FF 00 41 00 0D 00 0A 00 42
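For anyone who wants to reproduce this, here is a minimal sketch that writes exactly those ten bytes to disk. The function name write_sample_file is hypothetical and is not part of my original code; the file is opened in binary mode so that no newline translation changes the bytes.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes the ten example bytes (byte-order mark,
// 'A', CR, LF, 'B', each as a 16-bit unit with the high byte first).
bool write_sample_file(const std::string& path) {
    static const unsigned char bytes[] = {
        0xFE, 0xFF,  // byte-order mark
        0x00, 0x41,  // 'A'
        0x00, 0x0D,  // CR
        0x00, 0x0A,  // LF
        0x00, 0x42   // 'B'
    };
    // Binary mode keeps the runtime from translating 0x0A on Windows.
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
    out.flush();
    return out.good();
}
```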

Since getline with '\0' reads an entire multibyte file, I believed that getline( inf2, contents_wide, wchar_t(0) ) would read in the entire Unicode file. However, it doesn't: with the example above, my wide string contains the following two wchar_ts: FF FF

(If I remove the wchar_t(0), it reads in the first line as expected, i.e. FE FF 00 41 00 0D 00.)

Why doesn't wchar_t(0) work as a delimiting wchar_t so that getline stops on 00 00 (or reads to the end of the file which is what I want)?
Thank you

Answer 1:

Your UCS-2 decoder is misbehaving. The result of getline( inf2, contents_wide ) on FE FF 00 41 00 0D 00 0A 00 42 should be 0041 0000 = L"A". Assuming you're on Windows, the line ending should be properly converted, and the byte-order mark shouldn't appear in the output.

I suggest double-checking your OS documentation on how to set the locale.

EDIT: Did you set the locale?

locale::global( locale( "something if your system supports UCS-2" ) );

or

locale::global( encoding_support::ucs2_bigendian_encoding );

where encoding_support is some library.
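As one concrete possibility, and not necessarily what the locale names above would give you, the standard header <codecvt> provides std::codecvt_utf16, which can be imbued on the stream itself instead of changing the global locale. This is a sketch under a few assumptions: the header is deprecated since C++17 (so newer compilers may warn), the function name read_utf16_file is made up for illustration, and the file is opened in binary mode so the runtime does not alter the bytes before the facet sees them.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: reads a whole UTF-16 file into a wstring.
// std::consume_header makes the facet read the byte-order mark, use it
// to choose the byte order, and keep it out of the decoded text.
std::wstring read_utf16_file(const char* path) {
    std::wifstream in(path, std::ios::binary);
    // Imbue before any read; the locale takes ownership of the facet.
    in.imbue(std::locale(
        in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

    std::wstring contents;
    // Same trick as in the question: stop only at a NUL character,
    // which in practice means reading to the end of the file.
    std::getline(in, contents, wchar_t(0));
    return contents;
}
```

With the ten bytes from the question, this should produce the four characters A, CR, LF, B. Because the stream is binary, the CRLF pair is not collapsed into a single newline.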



Answer 2:

See this question: Why does wide file-stream in C++ narrow written data by default?, where the poster is surprised by the wchar_t -> char conversion when writing.

The answers given to that question apply to the reading case also. In a nutshell: at the lowest level, file I/O is always done in terms of bytes. A basic_filebuf (what the fstream uses to actually perform the I/O) uses a codecvt facet to translate between the "internal" encoding (the char type seen by the program, and used to instantiate the stream, wchar_t in your case) and the "external" encoding of the file (which is always char).

The codecvt is obtained from the stream's locale. If no locale is imbue()-d on the stream, the global locale is used. By default, the global locale is the "classic" (or "C") locale. That locale's codecvt facet is pretty basic. I don't know what the standard says about it but, in my experience on Windows, it simply "casts" between char and wchar_t, one by one. On Linux, it does this too but fails if the character's value is outside the ASCII range.
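To see what a given locale's facet actually does on your platform, you can call it directly. The following is only a sketch: decode_with_locale is a hypothetical helper, and what it returns for bytes outside the ASCII range depends on the implementation, as described above.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: runs the char -> wchar_t codecvt facet of `loc`
// over `bytes`, the same kind of facet a wide file stream consults.
std::wstring decode_with_locale(const std::string& bytes,
                                const std::locale& loc) {
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(loc);

    std::mbstate_t state{};
    // Each wide character consumes at least one byte, so the output
    // never needs more elements than the input has bytes.
    std::wstring out(bytes.size(), L'\0');
    const char* from_next = bytes.data();
    wchar_t* to_next = &out[0];
    cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
           &out[0], &out[0] + out.size(), to_next);

    // Keep only what was converted; on error this is the prefix that
    // was decoded before the facet gave up.
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Calling it with std::locale::classic() and then with a locale you intend to imbue shows quickly whether the facet decodes your bytes or merely widens them.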

So, if you don't touch the locale (either by imbue()-ing one on the stream or by changing the global one), what probably happens in your case is that chars are read from the file and cast to wchar_t one by one. Given the bytes you posted, it thus first reads FE, then FF, then 00, and getline(..., 0) stops right there.
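The effect can be simulated without touching a file or a locale. The sketch below assumes the one-by-one cast described above; widen_bytes and first_record are hypothetical names. It widens each byte by hand and then runs getline over the result.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: widens every byte to one wchar_t, mimicking a
// facet that simply casts char to wchar_t.
std::wstring widen_bytes(const std::vector<unsigned char>& bytes) {
    std::wstring out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<wchar_t>(b));
    }
    return out;
}

// Hypothetical helper: returns what getline extracts before `delim`.
std::wstring first_record(const std::wstring& text, wchar_t delim) {
    std::wistringstream in(text);
    std::wstring record;
    std::getline(in, record, delim);
    return record;
}
```

For the ten bytes in the question, first_record(widen_bytes(bytes), wchar_t(0)) yields two elements (FE and FF), because the third byte is already the delimiter. With L'\n' as the delimiter it yields seven elements (FE FF 00 41 00 0D 00), which matches the "first line" reported in the question.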



Answer 3:

The file name L"ucs2-be.txt" suggests big endian to me, but the bytes FE FF 00 41 00 0D 00 0A 00 42 look like little endian. I guess this is why the FE FF character was read into your string instead of being skipped over. I can't figure out why the presence or absence of wchar_t(0) affects the results, though.