UPDATE: Thank you to @Potatoswatter and @Jonathan Leffler for the comments - rather embarrassingly, I was caught out by the debugger tooltip not showing the value of a wstring correctly. However, it still isn't quite working for me, and I have updated the question below:
If I have a small multibyte file I want to read into a string, I use the following trick: I call getline with a delimiter of '\0', e.g.
#include <fstream>
#include <string>

std::string contents_utf8;
std::ifstream inf1("utf8.txt");
std::getline(inf1, contents_utf8, '\0');
This reads in the entire file including newlines.
However, if I try to do the same thing with a wide-character file it doesn't work - my wstring only reads up to the first line.
std::wstring contents_wide;
std::wifstream inf2(L"ucs2-be.txt");
std::getline(inf2, contents_wide, wchar_t(0)); // doesn't work
For example, if my Unicode file contains the characters A and B separated by CRLF, the hex looks like this:
FE FF 00 41 00 0D 00 0A 00 42
Based on the fact that getline with '\0' reads an entire multibyte file, I believed that getline(inf2, contents_wide, wchar_t(0)) should read in the entire Unicode file. However, it doesn't - with the example above, my wide string contains only the following two wchar_ts: FF FF.
(If I remove the wchar_t(0), it reads in the first line as expected, i.e. FE FF 00 41 00 0D 00.)
Why doesn't wchar_t(0) work as a delimiting wchar_t, so that getline stops on 00 00 (or reads to the end of the file, which is what I want)?
Thank you
Your UCS-2 decoder is misbehaving. The result of getline(inf2, contents_wide) on FE FF 00 41 00 0D 00 0A 00 42 should be 0041 = L"A". Assuming you're on Windows, the line ending should be properly converted, and the byte-order mark shouldn't appear in the output.
Suggest double-checking your OS documentation with respect to how you set the locale.
EDIT: Did you set the locale?
locale::global( locale( "something if your system supports UCS-2" ) );
or
locale::global( encoding_support::ucs2_bigendian_encoding );
where encoding_support is some library.
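For example, on a C++11 implementation that ships the <codecvt> header (deprecated since C++17 but widely available), a concrete facet can be imbued directly on the stream. A minimal sketch, assuming the file really is UCS-2/UTF-16 big-endian with a leading BOM as shown:

#include <codecvt>  // std::codecvt_utf16 (C++11, deprecated in C++17)
#include <fstream>
#include <locale>
#include <string>

int main()
{
    // Open in binary mode: codecvt_utf16 converts raw bytes, so newline
    // translation must not interfere.
    std::wifstream inf2("ucs2-be.txt", std::ios::binary);
    // codecvt_utf16 defaults to big-endian; consume_header skips the FE FF BOM.
    inf2.imbue(std::locale(inf2.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>));
    std::wstring contents_wide;
    std::getline(inf2, contents_wide, L'\0'); // no 00 code units remain, so this reads to EOF
}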
See this question: Why does wide file-stream in C++ narrow written data by default?, where the poster is surprised by the wchar_t -> char conversion when writing.
The answers given to that question apply to the reading case also. In a nutshell: at the lowest level, file I/O is always done in terms of bytes. A basic_filebuf (what the fstream uses to actually perform the I/O) uses a codecvt facet to translate between the "internal" encoding (the character type seen by the program and used to instantiate the stream - wchar_t in your case) and the "external" encoding of the file (which is always char).
The codecvt is obtained from the stream's locale. If no locale is imbue()-d on the stream, the global locale is used. By default, the global locale is the "classic" (or "C") locale. That locale's codecvt facet is pretty basic. I don't know what the standard says about it, but in my experience on Windows it simply "casts" between char and wchar_t, one by one. On Linux it does this too, but fails if the character's value is outside the ASCII range.
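You can watch that facet at work by feeding it the first few bytes of the file by hand. A diagnostic sketch - the exact output is implementation-specific, per the caveats above:

#include <iostream>
#include <locale>

int main()
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
    const cvt_t& cvt = std::use_facet<cvt_t>(std::locale::classic());

    const char bytes[] = "\xFE\xFF\x00\x41"; // first four bytes of the file
    wchar_t out[4];
    std::mbstate_t state = std::mbstate_t();
    const char* from_next;
    wchar_t* to_next;
    cvt.in(state, bytes, bytes + 4, from_next, out, out + 4, to_next);

    // On Windows this prints one wchar_t per byte (possibly sign-extended);
    // glibc's "C" locale instead fails at 0xFE, the first non-ASCII byte.
    for (wchar_t* p = out; p != to_next; ++p)
        std::cout << std::hex << static_cast<unsigned>(*p) << ' ';
}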
So, if you don't touch the locale (either by imbue()-ing one on the stream or changing the global one), what probably happens in your case is that chars are read from the file and cast to wchar_t one by one. It thus first reads FE, then FF, then 00, and getline(..., 0) stops right there.
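A quick way to confirm this is to dump what actually ends up in the string under the default locale - a diagnostic sketch:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::wifstream inf2("ucs2-be.txt"); // default "C" locale, nothing imbue()-d
    std::wstring contents_wide;
    std::getline(inf2, contents_wide, wchar_t(0));
    // Expect only the widened BOM bytes: the third byte, 00, stops getline.
    for (std::wstring::size_type i = 0; i < contents_wide.size(); ++i)
        std::cout << std::hex << static_cast<unsigned>(contents_wide[i]) << ' ';
}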
L"ucs2-be.txt" looks to me like a flag for big endian, but the array FE FF 00 41 00 0D 00 0A 00 42 looks like little endian. I guess this is why the FE FF character was read into your array instead of being skipped over. I can't figure out why the presence or absence of wchar(0) affects the results though.