Question:

I'm looking for suggestions for Unicode-aware replacements for std::string. I have a lot of code that uses std::string, its iterators, etc., and I would now like to support Unicode strings (free or open-source implementations preferred; regex capabilities would be great!).

I'm not sure at this point whether I need a complete rewrite or whether I can get away with dropping in a new string library that supports all of the std::string interfaces. The Unicode world seems very complex, and I just want to enable it in my applications without having to learn every single aspect of it.

By the way, how does the index operator work when it has to pass back a reference to a 1-, 2-, 3- or 4-byte structure, which could in theory then be changed to a 1-, 2-, 3- or 4-byte structure of a different size? If a larger or smaller value is assigned, does the shifting back and forth of the internal data representation occur in situ?

Answer 1:

You don't need a complete rewrite if you are clear about what your std::string contains. For example, you could assume (and convert inputs to be sure) that your std::string objects contain UTF-8 encoded strings (for those that need localization). Don't forget that std::string is only a container of raw data; it's not associated with an encoding (even in C++0x, that's only a possibility, not a requirement).
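
To make the "container of raw data" point concrete, here is a minimal sketch (the function name count_codepoints is made up for illustration, and it assumes the string already holds valid UTF-8). It shows that size() reports bytes, while the number of code points has to be computed by skipping continuation bytes:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in a UTF-8 encoded std::string.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that does
// NOT match that pattern starts a new code point.
// Assumption: the input is valid UTF-8; malformed input is not detected.
std::size_t count_codepoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a string holding "héllo" as UTF-8, size() is 6 (the "é" takes two bytes) while count_codepoints gives 5. std::string itself neither knows nor cares about the difference.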

Then when you pass text to other libraries that require different encodings, you can use libraries like UTF8CPP to convert to the required encoding (but most of the time such libraries will do it themselves).

That keeps things simple: UTF-8 in plain std::string throughout your own code, and Unicode strings passed to everything else (with conversion if necessary).

There has been a lot of discussion about this on the Boost community mailing list. Reading it (if you have enough time...) may help you understand other possible solutions.

Answer 2:

Depending on your needs, use std::wstring or the larger and more complex (but de facto standard) ICU: http://site.icu-project.org/

Answer 3:

Which Unicode encoding do you need? If UTF-8 is OK, you can have a look at Glib::ustring.

Glib::ustring has much the same interface as std::string, but contains Unicode characters encoded as UTF-8.

Answer 4:

Asking for "a type like std::string, but for Unicode" is like asking for "a type like unsigned, but for primes." std::string is perfectly capable of storing Unicode, in many encodings - the most generally useful being UTF-8.

What you need to replace is your iterators, not your storage type. The iterators should iterate over the code points of the string rather than the bytes. That is, ++i should advance by one code point, and *i should return a code point (as a uint32_t) rather than a char.
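
A rough sketch of what such an iterator could look like, hand-rolled rather than taken from any particular library. The names CodepointIterator and decode_all are made up for illustration, it uses char32_t where the text above says uint32_t, and it assumes the underlying std::string holds valid UTF-8 (there is no error handling for malformed input):

```cpp
#include <cstddef>
#include <string>

// Hypothetical read-only iterator over the code points of a UTF-8 std::string.
// Assumption: the string is valid UTF-8; malformed sequences are not detected.
class CodepointIterator {
public:
    CodepointIterator(const std::string& s, std::size_t pos) : s_(&s), pos_(pos) {}

    // Decode the code point that starts at the current byte offset.
    char32_t operator*() const {
        const std::size_t len = sequence_length();
        const unsigned char lead = byte(pos_);
        if (len == 1) return lead;
        // The lead byte carries 5, 4 or 3 payload bits for 2-, 3-, 4-byte forms.
        char32_t cp = lead & (0x7F >> len);
        for (std::size_t i = 1; i < len; ++i) {
            cp = (cp << 6) | (byte(pos_ + i) & 0x3F);  // 6 bits per continuation byte
        }
        return cp;
    }

    // Advance by one code point, which means 1 to 4 bytes.
    CodepointIterator& operator++() {
        pos_ += sequence_length();
        return *this;
    }

    bool operator==(const CodepointIterator& other) const { return pos_ == other.pos_; }
    bool operator!=(const CodepointIterator& other) const { return pos_ != other.pos_; }

private:
    unsigned char byte(std::size_t i) const {
        return static_cast<unsigned char>((*s_)[i]);
    }

    // Length of the encoded sequence, derived from the lead byte.
    std::size_t sequence_length() const {
        const unsigned char lead = byte(pos_);
        if (lead < 0x80) return 1;
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        return 4;
    }

    const std::string* s_;
    std::size_t pos_;
};

// Convenience helper: collect every code point of a UTF-8 string.
std::u32string decode_all(const std::string& utf8) {
    std::u32string out;
    const CodepointIterator end(utf8, utf8.size());
    for (CodepointIterator it(utf8, 0); it != end; ++it) {
        out.push_back(*it);
    }
    return out;
}
```

The storage stays a plain std::string; only the way you walk over it changes. Note that this iterator is read-only, which sidesteps the question of what assigning through *i would have to do when the new code point needs a different number of bytes.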

Answer 5:

I've written my own C++ UTF-8 library, which is a drop-in replacement for std::wstring/string. The data type that is exposed to the user is char32_t, but internally the wide characters are all packed into UTF-8 chars.

The whole thing is quite fast, and it performs best when there are few non-ASCII code points among many ASCII ones. All operations known from std::string are available with this class (except for substring find) and operate on code point indices rather than byte indices.
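
To illustrate the difference between code point indices and byte indices, here is a small sketch. It is not the library's actual code; the function name byte_offset is made up, and it assumes valid UTF-8. It maps a code point index to the byte offset where that code point starts:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset at which the code point with
// the given index starts, or utf8.size() if the index is past the end.
// Assumption: the input is valid UTF-8.
std::size_t byte_offset(const std::string& utf8, std::size_t index) {
    std::size_t pos = 0;
    while (index > 0 && pos < utf8.size()) {
        ++pos;  // step over the lead byte
        // step over any continuation bytes (bit pattern 10xxxxxx)
        while (pos < utf8.size() &&
               (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80) {
            ++pos;
        }
        --index;
    }
    return pos;
}
```

With "aéb" stored as UTF-8, code point index 2 (the "b") maps to byte offset 3, because "é" occupies two bytes. A naive lookup like this is a linear scan, which is the general cost of indexing by code point in a variable-width encoding.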

As a bonus of defensive programming, the whole ANSI range (0-255) can be used without multi-byte sequences :)

Hope this helps!