Unicode Processing in C++

Posted 2019-01-03 22:25

What is the best practice for Unicode processing in C++?

Tags: c++ unicode
9 answers
等我变得足够好
Reply #2 · 2019-01-03 22:30

Although this may not be best practice for everyone, you can write your own C++ Unicode routines if you want!

I just finished doing it over a weekend. I learned a lot, and although I can't guarantee it's 100% bug-free, I did a lot of testing and it seems to work correctly.

My code is under the New BSD license and can be found here:

http://code.google.com/p/netwidecc/downloads/list

It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and standard ASCII. If you throw away the main() code, you've got a nice library for reading and writing Unicode.
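
To give a feel for what such a hand-written routine involves, here is a minimal sketch of encoding a single code point as UTF-8. It is not taken from WSUCONV; the function names encode_utf8 and encode_utf8_demo are made up for illustration, and returning an empty string for invalid input is just one possible design choice.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of WSUCONV): encode one Unicode code point
// as UTF-8. Returns an empty string for values that are not valid scalar
// values (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                                   // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate: invalid
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Usage: call this from your own main() to check a few known encodings.
inline void encode_utf8_demo() {
    assert(encode_utf8(0x41) == "A");                 // U+0041, ASCII
    assert(encode_utf8(0x20AC) == "\xE2\x82\xAC");    // U+20AC, euro sign
    assert(encode_utf8(0xD800).empty());              // lone surrogate rejected
}
```

Decoding is the harder half, because it also has to reject overlong and truncated sequences, which is one reason an established library is usually the safer route.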

看我几分像从前
Reply #3 · 2019-01-03 22:40

Our company (and others) use the open-source International Components for Unicode (ICU) library, originally developed by Taligent.

It handles strings, locales, conversions, dates/times, collation, transformations, etc.

Start with the ICU User Guide.

一纸荒年 Trace。
Reply #4 · 2019-01-03 22:40

Look at the question "Case insensitive string comparison in C++".

That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx

If you look at the left-hand navigation pane on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx).

It has the following subsections:

  • The Code-Page Model
  • Double-Byte Character Sets in Windows
  • Unicode
  • Compatibility Issues in Mixed Environments
  • Unicode Data Conversion
  • Migrating Windows-Based Programs to Unicode
  • Summary
够拽才男人
Reply #5 · 2019-01-03 22:42

Here is a checklist for Windows programming:

  • Enclose all string literals in _T(), e.g. _T("my string")
  • Replace strlen() and similar functions with their TCHAR equivalents, such as _tcslen()
  • Use LPTSTR and LPCTSTR instead of char * and const char *
  • When starting a new project in Dev Studio, always make sure the Unicode option is selected in the project properties
  • For C++ strings, use std::wstring instead of std::string
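
As a rough, portable illustration of why the checklist steers toward wide strings, the sketch below compares code-unit counts for narrow and wide strings. The function names are made up for the example. On Windows with _UNICODE defined, _T("...") expands to L"..." and _tcslen to wcslen, so the wide functions here stand in for what the TCHAR macros would select.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>

// "héllo" stored as UTF-8 in a narrow string: strlen counts bytes,
// and é (U+00E9) occupies two of them.
inline std::size_t narrow_units() {
    return std::strlen("h\xC3\xA9llo");
}

// The same text as a wide string: wcslen counts wchar_t units,
// and é fits in a single unit.
inline std::size_t wide_units() {
    return std::wcslen(L"h\u00E9llo");
}

// Caveat: wchar_t is 16 bits on Windows, so a character outside the Basic
// Multilingual Plane (here U+1F600) still takes two units there.
inline std::size_t wide_units_outside_bmp() {
    return std::wstring(L"\U0001F600").size();
}

// Usage: call this from your own main().
inline void wide_vs_narrow_demo() {
    assert(narrow_units() == 6);
    assert(wide_units() == 5);
}
```

So std::wstring removes the byte-versus-character mismatch for most text, but on Windows it is still a UTF-16 code-unit container rather than a sequence of characters.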
祖国的老花朵
Reply #6 · 2019-01-03 22:43
  • Use ICU for dealing with your data (or a similar library)
  • In your own data store, make sure everything is stored in the same encoding
  • Make sure you are always using your Unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like isalpha unless that is the definition you want.
  • I can't say it enough: never iterate over the indices of a string if you care about correctness. Always use your Unicode library for this.
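
A small sketch of why raw indices are unreliable, assuming the string holds well-formed UTF-8. The helper count_code_points is hypothetical, not an ICU function; it only counts code points, which is still not the same as user-perceived characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a well-formed UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
// No validation is done; malformed input gives a meaningless count.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Usage: call this from your own main().
inline void index_pitfall_demo() {
    const std::string word = "na\xC3\xAFve";   // "naïve" in UTF-8
    assert(word.size() == 6);                  // bytes, not characters
    assert(count_code_points(word) == 5);      // code points
    // word[2] and word[3] are the two halves of ï; indexing splits it.

    const std::string combined = "e\xCC\x81";  // e + U+0301 combining acute
    assert(count_code_points(combined) == 2);  // displayed as one character
}
```

The second case is the one a hand-rolled loop cannot handle: grouping code points into what a reader sees as one character needs the segmentation rules that a library such as ICU implements.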
Reply #7 · 2019-01-03 22:46

If you don't care about backwards compatibility with previous C++ standards, the current C++11 standard has built-in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf

So the truly best practice for Unicode processing in C++ would be to use the built-in facilities for it. That isn't always a possibility with older code bases though, with the standard being so new at present.

EDIT: To clarify, C++11 is Unicode aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now then you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.
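
For reference, a minimal sketch of the C++11 literals and string types mentioned in the edit. The function names are made up for the example. It shows only storage, since the standard library offers nothing here for case mapping, collation, or segmentation.

```cpp
#include <cassert>
#include <string>

// U+1F600 as a UTF-16 string: one code point, stored as a surrogate pair.
inline std::u16string grinning_face_utf16() {
    return u"\U0001F600";
}

// The same code point as a UTF-32 string: a single code unit.
inline std::u32string grinning_face_utf32() {
    return U"\U0001F600";
}

// Usage: call this from your own main().
inline void cpp11_literals_demo() {
    const std::u16string s16 = grinning_face_utf16();
    const std::u32string s32 = grinning_face_utf32();
    assert(s16.size() == 2);                          // two char16_t units
    assert(s16[0] == 0xD83D && s16[1] == 0xDE00);     // high and low surrogate
    assert(s32.size() == 1 && s32[0] == 0x1F600);     // one char32_t unit
}
```

Note that size() still reports code units, so even with these types the length of a std::u16string is not a character count.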
