I want to implement a codecvt facet using ICU to convert from any character encoding (that ICU supports) to UTF-8 internally. I'm aware that codecvt_byname exists and that it can be used to do part of what I want as shown in this example. The problems with that example are that it (1) uses wide character streams (I want to use "regular", byte-oriented streams) and (2) requires 2 streams to perform the conversion. Instead, I want a single stream like:
locale loc( locale(), new icu_codecvt( "ISO-8859-1" ) );
ifstream ifs;
ifs.imbue( loc );
ifs.open( "/path/to/some/file.txt" );
// data read from ifs here will have been converted from ISO-8859-1 to UTF-8
Hence, I want to do an implementation like this but using ICU rather than iconv.
Given that, my implementation of do_in() is:
icu_codecvt::result icu_codecvt::do_in( state_type &state,
    extern_type const *from, extern_type const *from_end,
    extern_type const *&from_next, intern_type *to,
    intern_type *to_end, intern_type *&to_next ) const {
  from_next = from;
  to_next = to;
  if ( always_noconv_ )
    return noconv;
  our_state *const s = state_store_.get( state );
  UErrorCode err = U_ZERO_ERROR;
  ucnv_convertEx(
    s->utf8_conv_, s->extern_conv_, &to_next, to_end, &from_next, from_end,
    nullptr, nullptr, nullptr, nullptr, false, false, &err
  );
  if ( err == U_TRUNCATED_CHAR_FOUND )
    return partial;
  return U_SUCCESS( err ) ? ok : error;
}
The our_state object maintains two UConverter* pointers, one for the "external" encoding (in this example, ISO-8859-1) and one for the UTF-8 encoding.
My questions are:
- Should I specify nullptr for the "pivot" buffer as above, or supply my own?
- I'm not sure when, if ever, I should set the reset argument (currently the first false above) to true.
- It's not clear how I would know when to set the flush argument (currently the second false above) to true, i.e., how I know when the end of the input has been reached.
A little help?