I'm currently working on a hobby project (C/C++) which is supposed to work on both Windows and Linux, with full support for Unicode. Sadly, Windows and Linux use different default encodings, which makes our lives more difficult.
In my code I try to keep the data in as universal a form as possible, so that it is easy to handle on both Windows and Linux. On Windows, wchar_t is encoded as UTF-16 by default, and on Linux as UCS-4 (correct me if I'm wrong).
My software opens files (_wfopen with UTF-16 paths on Windows, fopen with UTF-8 paths on Linux) and writes data to them in UTF-8. So far it was all doable, until I decided to use SQLite.
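To make that concrete, the file-opening split currently looks roughly like the sketch below (open_file is a made-up helper name, just to illustrate the platform split):

```cpp
#include <cstdio>

// Rough sketch of the platform split (open_file is a hypothetical helper name).
// On Windows the path is a UTF-16 wchar_t string, on Linux a UTF-8 char string.
#ifdef _WIN32
FILE* open_file(const wchar_t* path)
{
    return _wfopen(path, L"wb");    // UTF-16 path, Windows CRT
}
#else
FILE* open_file(const char* path)
{
    return std::fopen(path, "wb");  // UTF-8 path, Linux
}
#endif
```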
SQLite's C/C++ interface allows for one- or two-byte encoded strings, i.e. UTF-8 or UTF-16 (click). Of course this does not work with wchar_t on Linux, since wchar_t on Linux is 4 bytes by default. Therefore, reading from and writing to SQLite requires a conversion on Linux.
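If I read the SQLite docs correctly, the choice is between the UTF-8 and the UTF-16 variants of each call, e.g. sqlite3_bind_text vs sqlite3_bind_text16; a 4-byte wchar_t matches neither. A minimal sketch (the table and column names are made up):

```cpp
#include <sqlite3.h>

// UTF-8 variant: works directly with char* strings on both platforms.
void insert_utf8(sqlite3 *db, const char *name_utf8)
{
    sqlite3_stmt *stmt = NULL;
    // "items" and "name" are hypothetical table/column names.
    sqlite3_prepare_v2(db, "INSERT INTO items(name) VALUES(?)", -1, &stmt, NULL);
    sqlite3_bind_text(stmt, 1, name_utf8, -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}

// UTF-16 variant: only matches wchar_t on Windows, where wchar_t is 2 bytes.
void insert_utf16(sqlite3 *db, const void *name_utf16)
{
    sqlite3_stmt *stmt = NULL;
    sqlite3_prepare_v2(db, "INSERT INTO items(name) VALUES(?)", -1, &stmt, NULL);
    sqlite3_bind_text16(stmt, 1, name_utf16, -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}
```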
Currently the code is getting cluttered with special cases for Windows/Linux. I was hoping to stick to the standard idea of storing data in wchar_t:
- wchar_t on Windows: file paths work without a problem, reading/writing to SQLite works without a problem. Writing data to a file should be done in UTF-8 anyway.
- wchar_t on Linux: a special case for file paths because of their UTF-8 encoding, a conversion before reading/writing wchar_t to SQLite, and a conversion when writing data to a file, which Windows needs as well (see the sketch after this list).
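The Linux-side conversion I mean is roughly the following, a minimal sketch using std::wstring_convert (C++11, deprecated since C++17 but still available; on Windows, with its 2-byte wchar_t, codecvt_utf8_utf16 would be the matching facet):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Minimal sketch: convert between 4-byte wchar_t (UCS-4) and UTF-8 on Linux.
std::string wide_to_utf8(const std::wstring &w)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(w);
}

std::wstring utf8_to_wide(const std::string &s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);
}
```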
After reading (here) I was convinced I should stick to wchar_t on Windows. But once I got all of that working, the trouble began when porting to Linux.
Currently I'm thinking of redoing it all to stick with plain char (UTF-8), because that works on both Windows and Linux, keeping in mind that I would need to WideCharToMultiByte every string on Windows to get UTF-8. Using plain char*-based strings would greatly reduce the number of special cases for Linux/Windows.
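Concretely, the Windows-side conversion I have in mind is something like this sketch (error handling mostly omitted):

```cpp
#include <windows.h>
#include <string>

// Sketch: convert a UTF-16 wchar_t string to a UTF-8 std::string on Windows.
// For CP_UTF8 the last two arguments of WideCharToMultiByte must be NULL.
std::string utf16_to_utf8(const std::wstring &w)
{
    // First call: ask for the required buffer size in bytes (includes the NUL).
    int size = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1,
                                   NULL, 0, NULL, NULL);
    if (size <= 0)
        return std::string();
    std::string out(size, '\0');
    // Second call: perform the actual conversion.
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1,
                        &out[0], size, NULL, NULL);
    out.resize(size - 1);  // drop the trailing NUL written by the conversion
    return out;
}
```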
Do you have any experience with Unicode in cross-platform code? Any thoughts about the idea of simply storing all data in UTF-8 instead of using wchar_t?