I know this question has been asked quite a few times here, and I did read some of the answers, but several different solutions are suggested and I'm trying to figure out which of them is best.
I'm writing a C99 app that basically receives XML text encoded in UTF-8.
Part of its job is to copy and manipulate that string (finding a substring, concatenating, etc.).
As I would rather not use an external, non-standard library right now, I'm trying to implement it using wchar_t.
Currently, I'm using mbstowcs to convert the text to wchar_t for easy manipulation, and for the input I tried in several languages it worked fine.
However, I have read that some people had issues with UTF-8 and mbstowcs, so I would like to hear whether this use is acceptable.
Another option I came across was using iconv with the WCHAR_T parameter. However, I'm working on a platform (not a PC) whose locale support is limited to the ANSI C locale only. Would that work there?
I also came across a very popular C++ library, but I'm limited to a C99 implementation.
Also, I will be compiling this code on another platform where sizeof(wchar_t) is different (2 bytes, versus 4 bytes on my machine). How can I overcome that? By using fixed-size character containers? But then, which manipulation functions should I use instead?
Happy to hear some thoughts. Thanks.
C does not define what encoding the `char` and `wchar_t` types are, and the standard library only mandates some functions that translate between the two without saying how. If the implementation-dependent encoding of `char` is not UTF-8 then `mbstowcs` will result in data corruption.
As noted in the rationale for the C99 standard:
Sourced from here.
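To make the locale dependence concrete, here is a minimal sketch of a wrapper that runs `mbstowcs` under a named locale. The helper name `convert_in_locale` is hypothetical, and resetting the locale to "C" afterwards is a simplification I am assuming is acceptable; which locale names exist (if any besides "C") is entirely up to the platform.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert `src` to wide characters while the LC_CTYPE
 * locale is set to `locale_name`. Returns the number of wide characters
 * written (excluding the terminator), or (size_t)-1 if the locale is not
 * available or the bytes are not valid in that locale's encoding. */
size_t convert_in_locale(const char *locale_name, const char *src,
                         wchar_t *dst, size_t dst_len)
{
    size_t n;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;              /* locale not supported here */

    /* mbstowcs interprets src according to the current LC_CTYPE locale,
     * so the same bytes can decode differently (or fail) elsewhere. */
    n = mbstowcs(dst, src, dst_len);

    setlocale(LC_CTYPE, "C");           /* simplification: back to default */
    return n;
}
```

Plain ASCII converts the same way everywhere, but whether UTF-8 bytes above 0x7F survive depends on whether the platform offers a UTF-8 locale at all. On a target that only has the "C" locale, this route gives no guarantee for non-ASCII input.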
So, if you have UTF-8 data in your `char`s there isn't a standard API way to convert that to `wchar_t`s.
In my opinion `wchar_t` should usually be avoided unless necessary - you might need it if you're using WIN32 APIs for example. I am not convinced it will simplify string manipulation. `wchar_t` is always UTF-16LE on Windows so you may still need to have more than one `wchar_t` to represent a single Unicode code point anyway.
I suggest you investigate the ICU project - at least from an educational standpoint.
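If pulling in ICU is not an option, one locale-independent route is to decode the UTF-8 bytes by hand into fixed-width code points. The following is a sketch under my own assumptions, not something from the standard library: the names `utf8_decode` and `utf8_to_utf32` are hypothetical, and `uint32_t` is assumed to be available on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s (at most len bytes) into *out.
 * Returns the number of bytes consumed, or 0 on malformed or truncated
 * input. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t cp;

    if (len == 0) return 0;
    if (s[0] < 0x80)                { *out = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else return 0;      /* stray continuation byte or invalid lead byte */

    if (len < need) return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (cp < min_for_len[need] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return need;
}

/* Decode src_len bytes into dst (capacity dst_cap code points).
 * Returns the number of code points written, or (size_t)-1 on invalid
 * input or if dst is too small. */
size_t utf8_to_utf32(const char *src, size_t src_len,
                     uint32_t *dst, size_t dst_cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t i = 0, n = 0;

    while (i < src_len) {
        size_t used;
        if (n == dst_cap) return (size_t)-1;
        used = utf8_decode(p + i, src_len - i, &dst[n]);
        if (used == 0) return (size_t)-1;
        i += used;
        n++;
    }
    return n;
}
```

Because the result is an array of `uint32_t`, one element is always one code point regardless of how wide `wchar_t` happens to be on the target. The `wcs*` functions cannot be used on such an array, though, so searching and concatenation have to be written as plain loops or `memcpy`/`memcmp` calls over the elements.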
You could do that with conditional typedefs like this:
This will define the typedefs `CHAR16` and `CHAR32` to use the new C++11 character types if available, but otherwise fall back to using `wchar_t` when possible and fixed-width unsigned integers otherwise.
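A sketch of how such typedefs might be laid out is shown here. The exact preprocessor conditions are my assumption and may differ from the snippet described above: `wchar_t` is picked by testing `WCHAR_MAX`, and `uint16_t`/`uint32_t` from `<stdint.h>` are assumed to exist on the target.

```c
#include <stdint.h>   /* uint16_t, uint32_t */
#include <wchar.h>    /* wchar_t, WCHAR_MAX */

#if defined(__cplusplus) && __cplusplus >= 201103L
  /* C++11 or later: use the dedicated character types. */
  typedef char16_t CHAR16;
  typedef char32_t CHAR32;
#elif WCHAR_MAX >= 0x10FFFF
  /* wchar_t can hold any code point: use it as the 32-bit type. */
  typedef uint16_t CHAR16;
  typedef wchar_t  CHAR32;
#elif WCHAR_MAX >= 0xFFFF
  /* wchar_t holds one 16-bit unit (e.g. a 2-byte unsigned wchar_t). */
  typedef wchar_t  CHAR16;
  typedef uint32_t CHAR32;
#else
  /* wchar_t is too narrow for either role: use plain integers. */
  typedef uint16_t CHAR16;
  typedef uint32_t CHAR32;
#endif
```

With this in place, code can store code points in `CHAR32` on both the 2-byte and the 4-byte `wchar_t` platform. The `wcs*` functions only apply to whichever typedef happens to resolve to `wchar_t`, so portable code should not rely on them for either type.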