I've read and heard that C++11 supports Unicode. A few questions on that:

- How well does the C++ standard library support Unicode?
- Does `std::string` do what it should?
- How do I use it?
- Where are potential problems?
Does `std::string` do what it should?
However, there is a pretty useful library called tiny-utf8, which is basically a drop-in replacement for `std::string`/`std::wstring`. It aims to fill the gap left by the still-missing UTF-8 string container class. This might be the most comfortable way of dealing with UTF-8 strings (that is, without Unicode normalization and similar operations): you comfortably operate on code points, while your string stays stored as a variable-length-encoded sequence of `char`s.
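A minimal sketch of what that looks like. The header path and the names `tiny_utf8::string`, `length()` and `size()` are my recollection of the project's README, so double-check them against the version you actually install:

```cpp
// Assumed API: tiny_utf8::string exposes code-point-based length() and
// iteration while storing the text as UTF-8 bytes internally. Header path
// and member names may differ between versions of the library.
#include <tinyutf8/tinyutf8.h>
#include <cstdint>
#include <iostream>

int main() {
    tiny_utf8::string s = "h\xC3\xA9llo";     // "héllo" written as raw UTF-8 bytes

    std::cout << s.length() << '\n';           // 5 code points
    std::cout << s.size()   << '\n';           // 6 bytes in the underlying buffer

    for (char32_t cp : s)                      // iteration yields code points
        std::cout << "U+" << std::hex << static_cast<std::uint32_t>(cp) << ' ';
    std::cout << '\n';
}
```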
You can safely store UTF-8 in a `std::string` (or in a `char[]` or `char*`, for that matter), because a Unicode NUL (U+0000) is a null byte in UTF-8, and that is the only way a null byte can occur in UTF-8. Hence, your UTF-8 strings will be properly terminated according to all of the C and C++ string functions, and you can sling them around with C++ iostreams (including `std::cout` and `std::cerr`, so long as your locale is UTF-8).

What you cannot do with `std::string` for UTF-8 is get the length in code points. `std::string::size()` will tell you the string length in bytes, which is only equal to the number of code points when you're within the ASCII subset of UTF-8.

If you need to operate on UTF-8 strings at the code point level (not just store and print them), or if you're dealing with UTF-16, which is likely to have many internal null bytes, you need to look into the wide character string types.
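If all you need is the code point count of a UTF-8 `std::string`, you can compute it yourself by skipping continuation bytes. Here is a minimal sketch (a hand-rolled helper, not a standard facility, and it assumes the input is already valid UTF-8):

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Counts code points by counting bytes that are NOT UTF-8 continuation bytes
// (continuation bytes have the bit pattern 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}

int main() {
    std::string s = "h\xC3\xA9llo";            // "héllo": 5 code points, 6 bytes
    std::cout << s.size() << '\n';              // 6  (bytes)
    std::cout << utf8_code_points(s) << '\n';   // 5  (code points)
}
```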
Unicode is not supported by the standard library (for any reasonable meaning of "supported").

`std::string` is no better than `std::vector<char>`: it is completely oblivious to Unicode (or any other representation/encoding) and simply treats its contents as a blob of bytes. If you only need to store and concatenate blobs, it works pretty well; but as soon as you wish for Unicode functionality (number of code points, number of graphemes, ...) you are out of luck.
The only comprehensive library I know of for this is ICU. The C++ interface was derived from the Java one though, so it's far from being idiomatic.
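For a sense of what that looks like in practice, here is a minimal sketch using `icu::UnicodeString` (ICU's C++ string class, which stores text as UTF-16 internally). The calls shown are part of ICU's documented C++ API, but treat the details as illustrative and check the documentation of the ICU version you link against (e.g. `-licuuc`):

```cpp
#include <unicode/unistr.h>   // icu::UnicodeString
#include <iostream>
#include <string>

int main() {
    // Build a UnicodeString from UTF-8 input ("héllo 🍌" spelled out in bytes).
    icu::UnicodeString ustr =
        icu::UnicodeString::fromUTF8("h\xC3\xA9llo \xF0\x9F\x8D\x8C");

    std::cout << ustr.length()      << '\n';   // 8: UTF-16 code units
    std::cout << ustr.countChar32() << '\n';   // 7: code points

    std::string utf8;
    ustr.toUTF8String(utf8);                   // convert back to UTF-8
    std::cout << utf8 << '\n';
}
```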
Terribly.
A quick scan through the library facilities that might provide Unicode support gives me this list:

- Strings library
- Localization library
- Input/output library
- Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.
Yes. According to the C++ standard, this is what `std::string` and its siblings should do:

> The class template `basic_string` describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, `std::string` does that just fine. Does that provide any Unicode-specific functionality? No. Should it? Probably not.
`std::string` is fine as a sequence of `char` objects. That's useful; the only annoyance is that it is a very low-level view of text, and standard C++ doesn't provide a higher-level one.

Use it as a sequence of `char` objects; pretending it is something else is bound to end in pain.
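As a concrete example of that pain, here is a minimal sketch (assuming the source file and terminal are both UTF-8) of how a byte-oriented operation like `substr` happily cuts a multi-byte sequence in half:

```cpp
#include <iostream>
#include <string>

int main() {
    std::string s = "na\xC3\xAFve";      // "naïve": 5 code points, 6 bytes ('ï' is 0xC3 0xAF)

    std::cout << s.size() << '\n';        // 6 -- size() counts bytes, not characters

    std::string cut = s.substr(0, 3);     // slices at a byte offset: "na" plus half of 'ï'
    std::cout << cut << '\n';             // prints an invalid UTF-8 sequence (mojibake)
}
```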
All over the place? Let's see...

Strings library
The strings library provides us `basic_string`, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: `c16rtomb`/`mbrtoc16` and `c32rtomb`/`mbrtoc32`.
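A minimal sketch of that C bridge in action, using `std::mbrtoc32` to walk a multibyte string code point by code point. It assumes a UTF-8 locale name that may not exist on every system, and that your standard library actually ships `<cuchar>` (older toolchains did not):

```cpp
#include <cuchar>    // std::mbrtoc32
#include <cwchar>    // std::mbstate_t
#include <clocale>
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    // mbrtoc32 converts from the *current locale's* multibyte encoding,
    // so a UTF-8 locale must be selected; the name below is an assumption.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char* text = "h\xC3\xA9llo";    // "héllo" spelled out as UTF-8 bytes
    const char* end  = text + std::strlen(text);

    std::mbstate_t state{};
    char32_t cp;
    const char* p = text;
    while (p < end) {
        std::size_t rc = std::mbrtoc32(&cp, p, static_cast<std::size_t>(end - p), &state);
        if (rc == 0 || rc > static_cast<std::size_t>(end - p))
            break;                         // null terminator, error, or incomplete sequence
        std::cout << "U+" << std::hex << static_cast<std::uint32_t>(cp) << '\n';
        p += rc;
    }
}
```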
Localization library
The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.
Consider, for example, what the standard calls "convenience interfaces" in the `<locale>` header: `isspace`, `isprint`, `iscntrl`, `isupper`, `islower`, `isalpha`, `isdigit`, `ispunct`, `isxdigit`, `isalnum`, `isgraph`, `toupper` and `tolower`, each templated on the character type. How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in `u8"🍌"`?
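To make that concrete, here is a minimal sketch of the problem: each convenience-interface call receives a single `char` (one UTF-8 code unit), so when a string holds the four-byte encoding of U+1F34C it is asked about four meaningless byte values, never about the character itself:

```cpp
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::string banana = "\xF0\x9F\x8D\x8C";  // UTF-8 encoding of U+1F34C
    std::locale loc;                           // the current global locale

    for (char c : banana) {
        // Classifies one code unit at a time; it cannot know it is looking
        // at one quarter of an emoji.
        std::cout << std::isalpha(c, loc) << ' ';
    }
    std::cout << '\n';                         // prints "0 0 0 0"
}
```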
C++11 has a couple of new literal string types for Unicode.

Unfortunately, the support in the standard library for variable-width encodings (like UTF-8) is still bad. For example, there is no nice way to get the length (in code points) of a UTF-8 string.
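For reference, a minimal sketch of those new literal types and the character types they produce (note that `u8""` yields plain `char` in C++11/14/17 but `char8_t` since C++20):

```cpp
#include <string>

int main() {
    const char*    utf8  = u8"z\u00df\u6c34";  // UTF-8 encoded (char in C++11)
    std::u16string utf16 = u"z\u00df\u6c34";   // char16_t, UTF-16 encoded
    std::u32string utf32 = U"z\u00df\u6c34";   // char32_t, UTF-32 encoded
    std::wstring   wide  = L"z\u00df\u6c34";   // wchar_t, implementation-defined encoding

    // Note: size()/length() on all of these count code units, not code points.
    (void)utf8; (void)utf16; (void)utf32; (void)wide;
}
```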