C / C++ UTF-8 upper/lower case conversions

Posted 2019-01-23 13:38

The Problem: There is a method with a corresponding test-case that works on one machine and fails on the other (details below). I assume there's something wrong with the code, causing it to work by chance on the one machine. Unfortunately I cannot find the problem.

Please note that the usage of std::string and utf-8 encoding are requirements I have no real influence on. Using C++ methods would be totally fine, but unfortunately I failed to find anything. Hence the use of C-functions.

The method:

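#include <cassert>
#include <cctype>
#include <climits>
#include <cstdlib>
#include <cwchar>
#include <cwctype>
#include <string>

std::string firstCharToUpperUtf8(const std::string& orig) {
  std::string retVal;
  retVal.reserve(orig.size());
  std::mbstate_t state = std::mbstate_t();
  // MB_CUR_MAX is not a compile-time constant, so size the buffer with
  // MB_LEN_MAX instead of relying on a variable-length array.
  char buf[MB_LEN_MAX + 1];
  size_t i = 0;
  if (orig.size() > 0) {
    // Compare as unsigned char: plain char may be signed or unsigned.
    if (static_cast<unsigned char>(orig[i]) < 0x80) {
      retVal += static_cast<char>(toupper(static_cast<unsigned char>(orig[i])));
      ++i;
    } else {
      wchar_t wChar;
      // Limit the read to the bytes that are actually left in the string.
      int len = static_cast<int>(mbrtowc(&wChar, &orig[i], orig.size() - i, &state));
      // If this assertion fails, there is an invalid multi-byte character.
      // However, this usually means that the locale is not utf8.
      // Note that the default locale is always C. Main classes need to set it
      // to utf8, even if the system's default is utf8 already.
      assert(len > 0 && len <= static_cast<int>(MB_CUR_MAX));
      i += len;
      int ret = wcrtomb(buf, towupper(wChar), &state);
      assert(ret > 0 && ret <= static_cast<int>(MB_CUR_MAX));
      buf[ret] = 0;
      retVal += buf;
    }
  }
  for (; i < orig.size(); ++i) {
    retVal += orig[i];
  }
  return retVal;
}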

The test:

TEST(StringUtilsTest, firstCharToUpperUtf8) {
  setlocale(LC_CTYPE, "en_US.utf8");
  ASSERT_EQ("Foo", firstCharToUpperUtf8("foo"));
  ASSERT_EQ("Foo", firstCharToUpperUtf8("Foo"));
  ASSERT_EQ("#foo", firstCharToUpperUtf8("#foo"));
  ASSERT_EQ("ßfoo", firstCharToUpperUtf8("ßfoo"));
  ASSERT_EQ("Éfoo", firstCharToUpperUtf8("éfoo"));
  ASSERT_EQ("Éfoo", firstCharToUpperUtf8("Éfoo"));
}

The failed test (only happens on one of two machines):

Failure
Value of: firstCharToUpperUtf8("ßfoo")
  Actual: "\xE1\xBA\x9E" "foo"
Expected: "ßfoo"

Both machines have the locale en_US.utf8 installed. However, they use different versions of libc. It works on the machine with GLIBC_2.14, independent of where it was compiled, and doesn't work on the other machine, although it can only be compiled there, because otherwise it lacks the proper libc version.

Either way, there is a machine that compiles this code and runs it while it fails. There has to be something wrong with the code and I wonder what. Pointing to C++ methods (STL in particular), would also be great. Boost and other libraries should be avoided due to other outside requirements.

5 answers
Melony?
#2 · 2019-01-23 14:17

What do you expect the upper-case version of the German ß character to be, for that test case?

In other words, your basic assumptions are wrong.

Note that the Wikipedia article linked in the comment states:

Sharp s is nearly unique among the letters of the Latin alphabet in that it has no traditional upper case form (one of the few other examples is kra, ĸ, which was used in Greenlandic). This is because it never occurs initially in German text, and traditional German printing (which used blackletter) never used all-caps. When using all-caps, the current spelling rules require the replacement of ß with SS.[1] However, in 2010 its use became mandatory in official documentation when writing geographical names in all-caps.[2]

So the basic test case, with the sharp s occurring as an initial, violates the rules of German. I still think I have a point, in that the original poster's premise is wrong: strings cannot in general be freely converted between upper and lower case in all languages.
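
To make the point concrete, here is a minimal sketch of the "replace ß with SS" rule quoted above. The function name `upperWithSharpSExpansion` is hypothetical, and the sketch assumes UTF-8 input; it only uppercases ASCII letters and expands ß, copying every other byte unchanged, so it is an illustration of why uppercasing is not a one-character-to-one-character mapping rather than a general solution.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of any library): uppercases ASCII letters
// and expands UTF-8 "ß" (bytes 0xC3 0x9F) to "SS".
// All other bytes are copied unchanged.
std::string upperWithSharpSExpansion(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(in[i]);
    if (c == 0xC3 && i + 1 < in.size() &&
        static_cast<unsigned char>(in[i + 1]) == 0x9F) {
      out += "SS";  // one character becomes two
      ++i;          // skip the second byte of ß
    } else if (c >= 'a' && c <= 'z') {
      out += static_cast<char>(c - 'a' + 'A');
    } else {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

Calling it on "straße" gives "STRASSE": six characters in, seven characters out, which no per-character toupper can produce.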

ら.Afraid
#3 · 2019-01-23 14:22

The following C++11 code works for me (disregarding for a moment the question of how the sharp s should be translated; it's left unchanged. It's slowly being phased out from German anyway).

Optimizations and uppercasing the first letter only are left as an exercise.

Edit: As pointed out, codecvt appears to have been deprecated. It should remain in the standard, however, until a suitable replacement is defined. See Deprecated header <codecvt> replacement

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

std::locale const utf8("en_US.UTF-8");

// Convert UTF-8 byte string to wstring
std::wstring to_wstring(std::string const& s) {
  std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
  return conv.from_bytes(s);
}

// Convert wstring to UTF-8 byte string
std::string to_string(std::wstring const& s) {
  std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
  return conv.to_bytes(s);
}

// Converts a UTF-8 encoded string to upper case
std::string tou(std::string const& s) {
  auto ss = to_wstring(s);
  for (auto& c : ss) {
    c = std::toupper(c, utf8);
  }
  return to_string(ss);
}

void test_utf8(std::ostream& os) {
  os << tou("foo" ) << std::endl;
  os << tou("#foo") << std::endl;
  os << tou("ßfoo") << std::endl;
  os << tou("Éfoo") << std::endl;
}    

int main() {
  test_utf8(std::cout);
}
forever°为你锁心
#4 · 2019-01-23 14:33

Lowercase sharp s: ß; uppercase sharp s: ẞ. Did you use the uppercase version in your assert? It seems that glibc 2.14 follows pre-Unicode 5.1 behavior, with no uppercase version of sharp s, while the libc on the other machine uses Unicode 5.1, where ẞ = U+1E9E ...
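
One way to check which behavior a given machine has is to ask its libc directly. The following is a small diagnostic sketch; the function name `upperOfSharpS` is made up for illustration, and it assumes that wchar_t holds Unicode code points (as it does with glibc). Note that it changes the process-wide LC_CTYPE locale as a side effect.

```cpp
#include <cassert>
#include <clocale>
#include <cwctype>

// Hypothetical diagnostic: returns the code point that towupper yields for
// U+00DF (ß) under the named LC_CTYPE locale, or 0 if the locale cannot be
// set (for example because it is not installed on this machine).
// Assumption: wchar_t values are Unicode code points.
unsigned long upperOfSharpS(const char* localeName) {
  if (std::setlocale(LC_CTYPE, localeName) == nullptr) {
    return 0;
  }
  return static_cast<unsigned long>(
      std::towupper(static_cast<std::wint_t>(0x00DF)));
}
```

Running `upperOfSharpS("en_US.utf8")` on both machines should show the difference: a result of 0xDF means ß is left unchanged, while 0x1E9E matches the "\xE1\xBA\x9E" bytes seen in the failing test.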

祖国的老花朵
#5 · 2019-01-23 14:34

Maybe someone would use it (maybe for tests)

With this you could make simple converter :) No additional libs :)

http://pastebin.com/fuw4Uizk

1482 letters

Example

Ь <> ь
Э <> э
Ю <> ю
Я <> я
Ѡ <> ѡ
Ѣ <> ѣ
Ѥ <> ѥ
Ѧ <> ѧ
Ѩ <> ѩ
Ѫ <> ѫ
Ѭ <> ѭ
Ѯ <> ѯ
Ѱ <> ѱ
Ѳ <> ѳ
Ѵ <> ѵ
Ѷ <> ѷ
Ѹ <> ѹ
Ѻ <> ѻ
Ѽ <> ѽ
Ѿ <> ѿ
Ҁ <> ҁ
Ҋ <> ҋ
Ҍ <> ҍ
Ҏ <> ҏ
Ґ <> ґ
Ғ <> ғ
Ҕ <> ҕ
Җ <> җ
Ҙ <> ҙ
Қ <> қ
Ҝ <> ҝ
Ҟ <> ҟ
Ҡ <> ҡ
Ң <> ң
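
A converter built from such a list could look roughly like the sketch below. The function name `upperFromTable` and the use of char32_t code points are assumptions made for illustration; only a handful of the pairs shown above are filled in, and decoding UTF-8 into code points is left out.

```cpp
#include <cassert>
#include <map>

// Hypothetical table-driven mapping from lowercase to uppercase code points.
// Only a few pairs from the list are included; a real table would hold all
// of them. Code points that are not in the table are returned unchanged.
char32_t upperFromTable(char32_t c) {
  static const std::map<char32_t, char32_t> kUpper = {
      {0x044C, 0x042C},  // ь -> Ь
      {0x044D, 0x042D},  // э -> Э
      {0x044E, 0x042E},  // ю -> Ю
      {0x044F, 0x042F},  // я -> Я
      {0x0461, 0x0460},  // ѡ -> Ѡ
  };
  const auto it = kUpper.find(c);
  return it == kUpper.end() ? c : it->second;
}
```

Because the table is fixed at compile time, the result does not depend on the installed locale or the libc version; the price is that the table has to be maintained by hand.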
姐就是有狂的资本
#6 · 2019-01-23 14:36

The issue is that the locales on which the assert does not fire are compliant, while the locales on which the assert does fire are non-compliant.

Technical Report N897 required in B.1.2[LC_CTYPE Rationale]:

As the LC_CTYPE character classes are based on the C Standard character-class definition, the category does not support multicharacter elements. For instance, the German character ß is traditionally classified as a lowercase letter. There is no corresponding uppercase letter; in proper capitalization of German text the ß will be replaced by SS; i.e., by two characters. This kind of conversion is outside the scope of the toupper and tolower keywords.

This Technical Report was published on 25 December 2001. But according to: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

In 2010, the use of the capital ẞ became mandatory in official documentation in Germany when writing geographical names in all-caps

But the topic has not been revisited by the standard committee, so technically independent of what the German government says, the standardized behavior of toupper should be to make no changes to the ß character.

The reason this works inconsistently over machines is setlocale:

Installs the specified system locale or its portion as the new C locale

So it is the non-compliant system locale, en_US.utf8, that is instructing toupper to modify the ß character. Unfortunately, the specialization member ctype<char>::classic_table is not available on ctype<wchar_t>, so you cannot modify the behavior. That leaves you with 2 options:

  1. Create a const map<wchar_t, wchar_t> for conversion from every possible lowercase wchar_t to the corresponding uppercase wchar_t
  2. Add a check for an L'ß' like this:

    int ret = wcrtomb(buf, wChar == L'ß' ? L'ẞ' : towupper(wChar), &state);
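
The second option can also be wrapped in a small helper, sketched here under a few assumptions: the name `upperWithPinnedSharpS` is hypothetical, ß is written as the code point 0x00DF so that the comparison does not depend on the source file's encoding, and the caller decides what ß should map to (0x1E9E for ẞ, or 0x00DF to leave it unchanged).

```cpp
#include <cassert>
#include <cwctype>

// Hypothetical wrapper around towupper: U+00DF (ß) always maps to the
// caller-supplied sharpSUpper, regardless of what the active locale says.
// Every other character is passed through towupper unchanged.
// Assumption: wchar_t values are Unicode code points.
wchar_t upperWithPinnedSharpS(wchar_t c, wchar_t sharpSUpper) {
  const wchar_t kSharpS = 0x00DF;
  if (c == kSharpS) {
    return sharpSUpper;
  }
  return static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(c)));
}
```

In the method from the question this would replace the bare towupper(wChar) call, for example `wcrtomb(buf, upperWithPinnedSharpS(wChar, 0x00DF), &state)` if the test should keep expecting "ßfoo" on every machine.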
    

Live Example
