Using ICU to implement my own codecvt facet

Posted 2019-08-23 10:03

Question:

I want to implement a codecvt facet using ICU to convert from any character encoding (that ICU supports) to UTF-8 internally. I'm aware that codecvt_byname exists and that it can be used to do part of what I want as shown in this example. The problems with that example are that it (1) uses wide character streams (I want to use "regular", byte-oriented streams) and (2) requires 2 streams to perform the conversion. Instead, I want a single stream like:

locale loc( locale(), new icu_codecvt( "ISO-8859-1" ) );
ifstream ifs;
ifs.imbue( loc );
ifs.open( "/path/to/some/file.txt" );
// data read from ifs here will have been converted from ISO-8859-1 to UTF-8

Hence, I want to do an implementation like this but using ICU rather than iconv. Given that, my implementation of do_in() is:

icu_codecvt::result icu_codecvt::do_in( state_type &state,
                                        extern_type const *from, extern_type const *from_end,
                                        extern_type const *&from_next, intern_type *to,
                                        intern_type *to_end, intern_type *&to_next ) const {
  from_next = from;
  to_next = to;
  if ( always_noconv_ )
    return noconv;

  our_state *const s = state_store_.get( state );
  UErrorCode err = U_ZERO_ERROR;
  ucnv_convertEx(
    s->utf8_conv_, s->extern_conv_, &to_next, to_end, &from_next, from_end,
    nullptr, nullptr, nullptr, nullptr, false, false, &err
  );
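  // A full output buffer is not a failure: report partial so the caller
  // can drain it and call again.
  if ( err == U_BUFFER_OVERFLOW_ERROR || err == U_TRUNCATED_CHAR_FOUND )
    return partial;
  return U_SUCCESS( err ) ? ok : error;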
}

The our_state object maintains two UConverter* pointers, one for the "external" encoding (in this example, ISO-8859-1) and one for the UTF-8 encoding.

My questions are:

  1. Should I specify nullptr for the "pivot" buffer as above, or supply my own?
  2. I'm not sure when, if ever, I should set the reset argument (currently the first false above) to true.
  3. It's not clear how I would know when to set the flush argument (currently the second false above) to true, i.e., how I know when the end of the input has been reached.

A little help?

Answer 1:

The codecvt facet is not intended to convert between different encodings. Instead, it converts from an external encoding, where one character may be encoded using multiple external words (typically bytes), into an internal representation where each character is represented by exactly one word (e.g. char, wchar_t, char16_t, etc.).

From this perspective it doesn't make sense to "end" an internal character sequence. If there are no more external words available the conversion is done and if the last character remained incomplete this is an error in the transfer. Thus, there is no need to indicate that the conversion is complete and, correspondingly, no interface. This should clarify that the "flush" argument indeed should always be "false".
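To make that interface concrete, here is a minimal sketch that does not use ICU at all: a hand-written ISO-8859-1 to UTF-8 conversion stands in for the ucnv_convertEx call, and the facet is driven directly through in() rather than through a filebuf. The names latin1_to_utf8 and convert_all are hypothetical. It shows that the caller only ever passes a source range and a destination range; there is no end-of-input flag, and partial here just means the destination filled up.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical facet: a hand-rolled ISO-8859-1 -> UTF-8 conversion stands in
// for the ICU call so the sketch needs only the standard library.
class latin1_to_utf8 : public std::codecvt<char, char, std::mbstate_t> {
protected:
  result do_in(state_type&, const extern_type* from, const extern_type* from_end,
               const extern_type*& from_next, intern_type* to,
               intern_type* to_end, intern_type*& to_next) const override {
    from_next = from;
    to_next = to;
    while (from_next != from_end) {
      const unsigned char c = static_cast<unsigned char>(*from_next);
      const std::ptrdiff_t needed = c < 0x80 ? 1 : 2;
      if (to_end - to_next < needed)
        return partial;  // destination full; caller drains it and calls again
      if (c < 0x80) {
        *to_next++ = static_cast<char>(c);
      } else {
        // Code points U+0080..U+00FF become two UTF-8 bytes.
        *to_next++ = static_cast<char>(0xC0 | (c >> 6));
        *to_next++ = static_cast<char>(0x80 | (c & 0x3F));
      }
      ++from_next;
    }
    return ok;  // source exhausted; nothing is left to flush
  }

  bool do_always_noconv() const noexcept override { return false; }
};

// Drives the facet roughly the way a stream buffer would: convert, drain, repeat.
std::string convert_all(const std::string& in, std::size_t chunk) {
  assert(chunk >= 2 && "destination must hold at least one 2-byte sequence");
  latin1_to_utf8 cvt;
  std::mbstate_t state{};
  std::string out;
  std::string buf(chunk, '\0');
  const char* from = in.data();
  const char* const from_end = from + in.size();
  std::codecvt_base::result r;
  do {
    const char* from_next = from;
    char* to_next = &buf[0];
    r = cvt.in(state, from, from_end, from_next,
               &buf[0], &buf[0] + buf.size(), to_next);
    out.append(&buf[0], static_cast<std::size_t>(to_next - &buf[0]));
    from = from_next;  // resume where the facet stopped
  } while (r == std::codecvt_base::partial);
  return out;
}
```

With a 3-byte destination, convert_all("caf\xE9", 3) takes two calls to in(): the first stops with partial after writing "caf", and the second emits the two bytes 0xC3 0xA9 for é and returns ok.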

I realize that UTF-8 doesn't quite fit the bill of having one word represent one character. However, this will haunt you throughout any UTF-8 processing done with the standard string types. As long as you stay clear of string modifications, things typically work OK, though.

The "reset" parameter is probably intended to deal with seeking within a stream. I think filebuf is supposed to provide a fresh state_type object when seeking. This would probably be an indication that the ICU internals want to be reset. However, I don't know about the ICU interface. Thus, I also don't know if you want to supply a pivot buffer.