Are there any updates of localization support in C

Posted 2019-03-25 11:09

Question:

The more I work with C++ locale facets, the more I realize that they are broken.

  • std::time_get -- is not symmetric with std::time_put (unlike strftime/strptime in C) and does not allow easy parsing of times with AM/PM marks.
  • I discovered recently that simple number formatting may produce illegal UTF-8 under certain locales (like ru_RU.UTF-8).
  • std::ctype is very simplistic: it assumes that to-upper/to-lower conversion can be done on a per-character basis (case conversion may change the number of characters, and it is context dependent).
  • std::collate -- does not support collation strength (case sensitive or insensitive).
  • There is no way to specify a time zone different from the global time zone in time formatting.
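The std::ctype point can be illustrated with a minimal sketch. The helper name `ctype_upper` is made up for this example, and the input is assumed to be UTF-8. Because the facet maps each char to exactly one char, the result always has the same length as the input, so "straße" can never become "STRASSE":

```cpp
#include <locale>
#include <string>

// Hypothetical helper: upper-cases a byte string with std::ctype<char>.
// The facet converts char by char, so the length can never change.
std::string ctype_upper(std::string text,
                        const std::locale& loc = std::locale::classic()) {
    const auto& facet = std::use_facet<std::ctype<char>>(loc);
    if (!text.empty()) {
        // In-place, per-character conversion of the range [begin, end).
        facet.toupper(&text[0], &text[0] + text.size());
    }
    return text;
}
```

Calling `ctype_upper("abc")` gives "ABC", while a UTF-8 "straße" keeps its 7 bytes whatever the locale does with the two bytes of "ß".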

And much more...

  • Does anybody know whether any changes to the standard facets are expected in C++0x?
  • Is there any way to raise awareness of the importance of such changes?

Thanks.

EDIT: Clarifications in case the link is not accessible:

std::numpunct defines the thousands separator as a single char. So when the separator is U+2002 (a different kind of space), it cannot be represented as a single char in UTF-8, only as a multi-byte sequence.

In the C API, struct lconv defines the thousands separator as a string and does not suffer from this problem. In C++, however, when you try to format numbers with non-ASCII separators in a UTF-8 locale, invalid UTF-8 is produced.

To reproduce this bug, write 1234 to a std::ostream imbued with the ru_RU.UTF-8 locale.
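Since ru_RU.UTF-8 may not be installed everywhere, here is a rough sketch that imitates the problem with a custom facet instead. `TruncatedSeparator` and `format_grouped` are names invented for this example, and returning only the first UTF-8 byte of U+2002 (0xE2 0x80 0x82) is an assumption about how a truncation could look, not a claim about what any particular standard library does:

```cpp
#include <clocale>
#include <locale>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>

// The C API stores the separator as a string...
static_assert(std::is_same<decltype(std::lconv::thousands_sep), char*>::value,
              "lconv::thousands_sep is a C string");
// ...while the C++ facet returns exactly one char.
static_assert(
    std::is_same<decltype(std::declval<const std::numpunct<char>&>()
                              .thousands_sep()),
                 char>::value,
    "numpunct<char>::thousands_sep() returns a single char");

// Hypothetical facet: pretends the separator is U+2002, but only the
// first UTF-8 byte (0xE2) fits into the char return type.
struct TruncatedSeparator : std::numpunct<char> {
    char do_thousands_sep() const override { return '\xE2'; }
    std::string do_grouping() const override { return "\3"; }  // groups of 3
};

// Formats a number through a stream imbued with the facet above.
std::string format_grouped(long value) {
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new TruncatedSeparator));
    out << value;
    return out.str();
}
```

With this facet, `format_grouped(1234)` yields the five bytes `1`, 0xE2, `2`, `3`, `4`. A lone 0xE2 lead byte is not valid UTF-8.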

EDIT2: I must admit that the POSIX C localization API works much more smoothly:

  • There is an inverse of strftime -- strptime (strftime does the same as std::time_put::put)
  • No problems with number formatting because of the point I mentioned above.
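As a small sketch of the strftime/strptime symmetry mentioned above, including AM/PM parsing: this assumes a POSIX system (strptime is not part of ISO C or C++) and the default "C" locale. The helper names `parse_12h` and `format_12h` are made up for the example:

```cpp
#include <ctime>
#include <string>
#include <time.h>  // strptime: POSIX only, not ISO C++

// Hypothetical helper: parses "hh:mm AM/PM" with strptime.
// Returns false if the text does not match the format completely.
bool parse_12h(const std::string& text, std::tm& out) {
    out = std::tm{};
    const char* end = strptime(text.c_str(), "%I:%M %p", &out);
    return end != nullptr && *end == '\0';
}

// Hypothetical helper: formats with the very same format string.
std::string format_12h(const std::tm& t) {
    char buf[32];
    std::size_t n = std::strftime(buf, sizeof buf, "%I:%M %p", &t);
    return std::string(buf, n);
}
```

Parsing "03:45 PM" should fill in tm_hour = 15 and tm_min = 45, and formatting that std::tm again with the same format string gives back the original text.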

However, it is still far from being perfect.

EDIT3: According to the latest notes about C++0x, I can see that std::time_get::get is similar to strptime and the opposite of std::time_put::put.
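In the final C++11 standard this pair is also reachable through the std::get_time and std::put_time manipulators, which call std::time_get::get and std::time_put::put. A minimal round-trip sketch, assuming a C++11 standard library and the classic locale (the helper names `parse_time` and `format_time` are invented here):

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: parses text via std::get_time (std::time_get::get).
bool parse_time(const std::string& text, const char* format, std::tm& out) {
    out = std::tm{};
    std::istringstream in(text);
    in.imbue(std::locale::classic());
    in >> std::get_time(&out, format);
    return !in.fail();
}

// Hypothetical helper: formats via std::put_time (std::time_put::put).
std::string format_time(const std::tm& t, const char* format) {
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << std::put_time(&t, format);
    return out.str();
}
```

The same format string, for example "%Y-%m-%d %H:%M", can then be used in both directions. How well less common conversions such as %p are supported may vary between implementations.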

Answer 1:

I agree with you, C++ is lacking proper i18n support.

Does anybody know whether any changes to the standard facets are expected in C++0x?

It is too late in the game, so probably not.

Is there any way to raise awareness of the importance of such changes?

I am very pessimistic about this.

When asked directly, Stroustrup claimed that he does not see any problems with the current status. And another one of the big C++ guys (book author and all) did not even realize that wchar_t can be one byte, if you read the standard.

And some threads on the Boost lists (Boost seems to drive the future direction) show so little understanding of how this works that it is outright scary.

C++0x barely added some Unicode character data types, late in the game and after a lot of struggle. I am not holding my breath for more too soon.

I guess the only chance to see something better is if someone really good/respected in the i18n and C++ worlds gets directly involved with the next version of the standard. No clue who that might be though :-(



Answer 2:

std::numpunct is a template. All specializations try to return the decimal separator character. Obviously, in any locale where that is a wide character, you should use std::numpunct<wchar_t>, as the <char> specialization can't do that.
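A rough sketch of that suggestion, applied to the thousands separator from the question: the facet `WideSeparator` and the function `format_grouped_wide` are hypothetical names, and U+2002 is simply the code point the question mentions. With wchar_t the separator fits into the facet's return type:

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: with wchar_t, U+2002 fits in one character.
struct WideSeparator : std::numpunct<wchar_t> {
    wchar_t do_thousands_sep() const override { return L'\u2002'; }
    std::string do_grouping() const override { return "\3"; }  // groups of 3
};

// Formats a number through a wide stream imbued with the facet above.
std::wstring format_grouped_wide(long value) {
    std::wostringstream out;
    out.imbue(std::locale(std::locale::classic(), new WideSeparator));
    out << value;
    return out.str();
}
```

Here `format_grouped_wide(1234)` is a five-character wide string with U+2002 in the second position. Converting it to UTF-8 afterwards is a separate step that this sketch leaves out.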

That said, C++0x is pretty much done. However, if good improvements continue, the C++ committee is likely to start C++1x. The ISO C++ committee is very likely to accept your help, if offered through your national ISO member organization. I see that Pavel Minaev suggested a Defect Report. That's technically possible, but the problems you describe are, in general, design limitations. In that case, the most reliable course of action is to design a Boost library for this, have it pass the Boost review, submit it for inclusion in the standard, and participate in the ISO C++ meetings to deal with any issues cropping up there.