The Problem: There is a method with a corresponding test-case that works on one machine and fails on the other (details below). I assume there's something wrong with the code, causing it to work by chance on the one machine. Unfortunately I cannot find the problem.
Please note that the usage of std::string and utf-8 encoding are requirements I have no real influence on. Using C++ methods would be totally fine, but unfortunately I failed to find anything. Hence the use of C-functions.
The method:
std::string firstCharToUpperUtf8(const std::string& orig) {
  std::string retVal;
  retVal.reserve(orig.size());
  std::mbstate_t state = std::mbstate_t();
  char buf[MB_CUR_MAX + 1];
  size_t i = 0;
  if (orig.size() > 0) {
    if (orig[i] > 0) {
      retVal += toupper(orig[i]);
      ++i;
    } else {
      wchar_t wChar;
      int len = mbrtowc(&wChar, &orig[i], MB_CUR_MAX, &state);
      // If this assertion fails, there is an invalid multi-byte character.
      // However, this usually means that the locale is not utf8.
      // Note that the default locale is always C. Main classes need to set it
      // to utf8, even if the system's default is utf8 already.
      assert(len > 0 && len <= static_cast<int>(MB_CUR_MAX));
      i += len;
      int ret = wcrtomb(buf, towupper(wChar), &state);
      assert(ret > 0 && ret <= static_cast<int>(MB_CUR_MAX));
      buf[ret] = 0;
      retVal += buf;
    }
  }
  for (; i < orig.size(); ++i) {
    retVal += orig[i];
  }
  return retVal;
}
The test:
TEST(StringUtilsTest, firstCharToUpperUtf8) {
  setlocale(LC_CTYPE, "en_US.utf8");
  ASSERT_EQ("Foo", firstCharToUpperUtf8("foo"));
  ASSERT_EQ("Foo", firstCharToUpperUtf8("Foo"));
  ASSERT_EQ("#foo", firstCharToUpperUtf8("#foo"));
  ASSERT_EQ("ßfoo", firstCharToUpperUtf8("ßfoo"));
  ASSERT_EQ("Éfoo", firstCharToUpperUtf8("éfoo"));
  ASSERT_EQ("Éfoo", firstCharToUpperUtf8("Éfoo"));
}
The failed test (only happens on one of two machines):
Failure
Value of: firstCharToUpperUtf8("ßfoo")
Actual: "\xE1\xBA\x9E" "foo"
Expected: "ßfoo"
Both machines have the locale en_US.utf8 installed. However, they use different versions of libc. The test passes on the machine with GLIBC_2.14, independent of where the binary was compiled. It fails on the other machine, and the binary for that machine can only be compiled there, because otherwise it lacks the proper libc version.
Either way, there is a machine that compiles this code and runs it, and the test fails there. There has to be something wrong with the code, and I wonder what. Pointers to C++ methods (the STL in particular) would also be great. Boost and other libraries should be avoided due to other outside requirements.
What do you expect the upper-case version of the German ß character to be, for that test case?
In other words, your basic assumptions are wrong.
Note that the Wikipedia article linked in the comment states:
So the basic test case, with the sharp s occurring as an initial, violates the rules of German. I still think I have a point, in that the original poster's premise is wrong: strings cannot in general be freely converted between upper and lower case, for all languages.
The following C++11 code works for me (disregarding for a moment the question of how the sharp s should be translated; it is left unchanged. It is slowly being phased out of German anyway).
Optimizations and uppercasing the first letter only are left as an exercise.
Edit: As pointed out, codecvt appears to have been deprecated. It should remain in the standard, however, until a suitable replacement is defined. See "Deprecated header <codecvt> replacement".
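For readers who want something concrete, here is a minimal sketch of the codecvt-based approach, applied to the first character only. It is not the code referred to above; the name upperFirstUtf8 is hypothetical. It assumes std::wstring_convert with std::codecvt_utf8<wchar_t> (both deprecated since C++17, so expect warnings) and relies on towupper, whose result for non-ASCII characters depends on the LC_CTYPE locale set via setlocale.

```cpp
#include <codecvt>
#include <cwctype>
#include <locale>
#include <string>

// Hypothetical helper: uppercase the first code point of a UTF-8 string.
// The UTF-8 <-> wide conversion itself does not depend on the global locale;
// only std::towupper does.
std::string upperFirstUtf8(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    // from_bytes throws std::range_error if the input is not valid UTF-8.
    std::wstring wide = conv.from_bytes(utf8);
    if (!wide.empty()) {
        wide[0] = static_cast<wchar_t>(std::towupper(wide[0]));
    }
    return conv.to_bytes(wide);
}
```

With the default C locale this should at least handle ASCII, e.g. upperFirstUtf8("foo") gives "Foo". What happens to é or ß depends on the locale tables of the libc in use, which is exactly the portability problem discussed in this thread.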
Lowercase sharp s: ß; uppercase sharp s: ẞ. Did you use the uppercase version in your assert? It seems that glibc 2.14 follows a pre-Unicode 5.1 standard, which has no uppercase version of sharp s, while the libc on the other machine uses Unicode 5.1, where ẞ = U+1E9E.
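To check that the bytes in the failure message really are the capital sharp s, one can encode U+1E9E by hand. The sketch below uses a hypothetical helper, encodeUtf8, which is not part of the question's code; it does not reject surrogate code points, which is fine for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
std::string encodeUtf8(char32_t cp) {
    assert(cp <= 0x10FFFF);  // must lie within the Unicode code space
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

encodeUtf8(0x1E9E) yields the bytes E1 BA 9E, which is the "\xE1\xBA\x9E" shown as the actual value in the failed test, while encodeUtf8(0xDF), the lowercase ß, yields C3 9F.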
Maybe someone will find this useful (maybe for tests).
With this you could make a simple converter :) No additional libs :)
http://pastebin.com/fuw4Uizk
1482 letters
Example
The issue is that the locales on which the assertion does not fire are compliant, while the locales on which it does fire are non-compliant.
Technical Report N897 required in B.1.2 [LC_CTYPE Rationale]:
This Technical Report was published on Dec-25-'01. But according to https://en.wikipedia.org/wiki/Capital_%E1%BA%9E:
But the topic has not been revisited by the standard committee, so technically, independent of what the German government says, the standardized behavior of toupper should be to make no changes to the ß character.
The reason this works inconsistently across machines is setlocale:
So it is the non-compliant system locale, en_US.utf8, that is instructing toupper to modify the ß character. Unfortunately, the specialization ctype<char>::classic_table is not available on ctype<wchar_t>, so you cannot modify the behavior. That leaves you with 2 options:
1. Use a const map<wchar_t, wchar_t> for conversion from every possible lowercase wchar_t to the corresponding uppercase wchar_t
2. Add a check for an L'ß'
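The two options might be sketched roughly as follows. The function names toUpperFromTable and toUpperKeepSharpS are hypothetical, and the table holds only a few sample entries; a real table would need every lowercase/uppercase pair you care about.

```cpp
#include <cwctype>
#include <map>

// Option 1 (sketch): explicit lookup table, independent of the libc locale data.
// Only a few illustrative entries are listed; ß is simply absent, so it is
// returned unchanged.
wchar_t toUpperFromTable(wchar_t c) {
    static const std::map<wchar_t, wchar_t> table = {
        {L'a', L'A'},
        {L'f', L'F'},
        {L'\u00E9', L'\u00C9'},  // é -> É
    };
    const auto it = table.find(c);
    return it == table.end() ? c : it->second;
}

// Option 2 (sketch): special-case U+00DF (ß) and delegate everything else
// to towupper, which still depends on the current LC_CTYPE locale.
wchar_t toUpperKeepSharpS(wchar_t c) {
    if (c == L'\u00DF') {
        return c;  // leave ß untouched regardless of the locale
    }
    return static_cast<wchar_t>(std::towupper(c));
}
```

In the question's code, option 2 would amount to replacing towupper(wChar) with a call such as toUpperKeepSharpS(wChar). Option 1 removes the dependency on the installed locale entirely, at the cost of maintaining the table yourself.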