UnicodeString w/ String Literals vs Hex Values

Posted 2019-08-22 02:39

Question:

Is there any conceivable reason why I would see different results using Unicode string literals versus the actual hex value for the UChar?

UnicodeString s1(0x0040); // @ sign
UnicodeString s2("\u0040");

s1 isn't equivalent to s2. Why?

Answer 1:

AFAIK, how a \u escape inside a narrow string literal gets encoded is implementation-defined (it depends on the compiler's execution character set), so it's hard to say why the two are not equivalent without knowing the details of your particular compiler. That said, it's simply not a safe way of doing things.

UnicodeString has a constructor taking a UChar and one for UChar32. I'd be explicit when using them:

UnicodeString s(static_cast<UChar>(0x0040));
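To see why the explicit cast matters, here is a minimal sketch of the overload choice. MockString, MockUChar and MockUChar32 are hypothetical stand-ins, not ICU types; the sketch assumes UChar is a 16-bit code unit type and UChar32 is int32_t, and that int32_t is plain int on the platform (true on the usual desktop compilers).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for ICU's typedefs (assumptions, not ICU code).
typedef char16_t MockUChar;        // 16-bit code unit
typedef std::int32_t MockUChar32;  // full code point

// Hypothetical class with the same pair of single-character constructors.
struct MockString {
    int unitBits;  // records which constructor ran
    MockString(MockUChar) : unitBits(16) {}
    MockString(MockUChar32) : unitBits(32) {}
};

void checkOverloads() {
    // A bare hex literal is an int, so it is an exact match for the 32-bit overload.
    assert(MockString(0x0040).unitBits == 32);
    // The cast states the intent and selects the 16-bit overload.
    assert(MockString(static_cast<MockUChar>(0x0040)).unitBits == 16);
}
```

Either overload yields the same one-character string for U+0040; the cast just removes any doubt about which constructor the compiler picks.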

UnicodeString also provides an unescape() method that's fairly handy:

UnicodeString s = UNICODE_STRING_SIMPLE("\\u4ECA\\u65E5\\u306F").unescape(); // 今日は
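Conceptually, unescaping turns ASCII-only source text into code units at run time, so the compiler's character set never gets involved. The following is a rough sketch of that idea for \uXXXX escapes only. unescapeBmp and hexValue are hypothetical helpers written for illustration, not ICU's implementation, and ICU's real unescape() understands more escape forms than this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: expands \uXXXX escapes in ASCII text to UTF-16 code units.
std::u16string unescapeBmp(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '\\' && i + 6 <= in.size() && in[i + 1] == 'u') {
            unsigned value = 0;
            bool ok = true;
            for (std::size_t k = i + 2; k < i + 6; ++k) {
                const int h = hexValue(in[k]);
                if (h < 0) { ok = false; break; }
                value = value * 16 + static_cast<unsigned>(h);
            }
            if (ok) {
                out.push_back(static_cast<char16_t>(value));
                i += 6;
                continue;
            }
        }
        // Anything else is assumed to be ASCII and is copied through unchanged.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(in[i])));
        ++i;
    }
    return out;
}

void checkUnescape() {
    assert(unescapeBmp("\\u4ECA\\u65E5\\u306F") == u"\u4ECA\u65E5\u306F");
}
```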


Answer 2:

Couldn't reproduce on ICU 4.8.1.1:

#include <stdio.h>
#include "unicode/unistr.h"

int main(int argc, const char *argv[]) {
  UnicodeString s1(0x0040); // @ sign
  UnicodeString s2("\u0040");
  printf("s1==s2: %s\n", (s1==s2)?"T":"F");
  //  printf("s1.equals s2: %d\n", s1.equals(s2));
  printf("s1.length: %d  s2.length: %d\n", s1.length(), s2.length());
  printf("s1.charAt(0)=U+%04X s2.charAt(0)=U+%04X\n", s1.charAt(0), s2.charAt(0));
  return 0;
}

Output:

s1==s2: T
s1.length: 1  s2.length: 1
s1.charAt(0)=U+0040 s2.charAt(0)=U+0040

Built with gcc 4.4.5 on RHEL 6.1 x86_64.



Answer 3:

For anyone else who finds this, here's what I found in ICU's documentation [1]:

The compiler's and the runtime character set's codepage encodings are not specified by the C/C++ language standards and are usually not a Unicode encoding form. They typically depend on the settings of the individual system, process, or thread. Therefore, it is not possible to instantiate a Unicode character or string variable directly with C/C++ character or string literals. The only safe way is to use numeric values. It is not an issue for User Interface (UI) strings that are translated.

[1] http://userguide.icu-project.org/strings
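As an illustration of "use numeric values", here is a small sketch that builds UTF-16 text purely from code point numbers, including one above U+FFFF that needs a surrogate pair. appendCodePoint is a hypothetical helper using only the standard library; as far as I know, UnicodeString::append(UChar32) does the equivalent for you in ICU.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends one code point, given as a number, as UTF-16.
void appendCodePoint(std::u16string& out, char32_t cp) {
    // Only valid scalar values are accepted: no surrogates, nothing past U+10FFFF.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // BMP code points are a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary code points become a high/low surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}
```

Because every value is written as a number, the result is the same no matter what source or execution character set the compiler uses.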



Answer 4:

The double quotes in your \u constant are the problem: they make it a string literal rather than a character constant. Written as a character constant, this evaluated properly:

wchar_t m1( 0x0040 );
wchar_t m2( '\u0040' );
bool equal = ( m1 == m2 );

equal was true.
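If a C++11 compiler is available, typed literals are another way to compare against numeric values, because char16_t literals are UTF-16 by definition rather than being tied to the execution character set. The sketch below uses only the standard library; literalMatchesNumbers and checkLiterals are hypothetical names. Whether such literals can be passed straight to UnicodeString depends on the ICU version (I believe UChar became char16_t in ICU 59, but treat that as an assumption).

```cpp
#include <cassert>
#include <string>

// A char16_t character literal holds the code point value itself.
static_assert(u'@' == 0x0040, "u'@' is U+0040");

// True when a UTF-16 string literal equals the same text spelled as numbers.
bool literalMatchesNumbers() {
    const std::u16string fromLiteral = u"@\u4ECA\u65E5";
    const char16_t fromNumbers[] = {0x0040, 0x4ECA, 0x65E5};
    return fromLiteral == std::u16string(fromNumbers, 3);
}

void checkLiterals() {
    assert(literalMatchesNumbers());
}
```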