Visualisation of uft-8 (Polish) not working proper

2019-07-25 14:46发布

问题:

My software supports multiple languages (English, German, Polish, Russian, ...).
For this reason I have some language specific files with the dialog texts in the specific language (Encoded as UTF-8). In my mfc application I open and read those files and insert the text into my AfxMessageBoxes and other UI-Windows.

// Get the codepage number. 65001 = UTF-8
// In the real code this is a parameter in the function I call (just for clarification)
LANGID languageID = 65001;
TCHAR szCodepage[10];
GetLocaleInfo (MAKELCID (languageID, SORT_DEFAULT), LOCALE_IDEFAULTANSICODEPAGE, szCodepage, 10);
int nAnsiCodePage = _ttoi (szCodepage);

// Open the file
CFile file;
CString filename = getName();

if (!file.Open(FileName, CFile::modeRead, NULL))
{
    //Check if everything is fine, else break
}

// Read the file
CString inString;
int len = file.GetLength ();
UINT n = file.Read (inString.GetBuffer(len), len);
inString.ReleaseBuffer ();
int size = MultiByteToWideChar (CP_ACP, 0, strAllItems, -1, NULL, 0);

WCHAR *ubuf = new WCHAR[size + 1]; 
MultiByteToWideChar ((UINT) nAnsiCodePage, (nAnsiCodePage == CP_UTF8 ?
                                     0 : MB_PRECOMPOSED), inString, -1, ubuf, (int) size);

outString = ubuf;
file.Close ();


Result:

This mechanism is working fine for special letters of russian and german, but not for polish. I already checked the utf-8 site (http://www.utf8-chartable.de/unicode-utf8-table.pl?number=1024) and the polish characters are part of it.
I also checked the hex values of my CString and everything seems to be alright, but it is not visualized in the correct way. Just for testing I changed the used codepage from utf-8 to 1250 (Eastern Europe, Polish included) and it also did not work. What am I doing wrong?

EDIT:
When I use:

MultiByteToWideChar (CP_UTF8 , 0, inString, -1, ubuf, (int) size);

The hex-values are shortend to the "best match" letters. Meaning my result is: mezczyzna

I am using windows 7 with the english language selected.

回答1:

Well, you have two options:

A. Make your application Unicode. You don't tell us whether it actually is, but I conclude it's not. This is the 'best" solution technically, but it may require a lot of effort, and it may even not be feasible at all (eg use of non-Unicode libraries).

B. If your app is non-Unicode, you have some limitations:
- Your application will only be capable of displaying correctly one codepage using the non-unicode APIs & messages, and this unfortunately cannot be set per application, it's globally set in Windows with the "Language for non-Unicode programs" option, and requires a reboot.
- To display correctly strings containing characters not in the default codepage, you need to convert them to Unicode and use the "wide" versions of APIs & messages explicitly, to display them (eg MessageBoxW()). A little cumbersome, but doable, if the operation concerns only a small number of controls.

The machine you're working on has some western european language as the "Language for non-Unicode programs", and I come to this conclusion because "This mechanism is working fine for special letters of russian and german" and "Using MessageBoxA(0, "mężczyzna", 0, 0) does not work", as you said (though i'm not sure at all about russian, as it's a different codepage).

Apart from this, as IInspectable said, int size = MultiByteToWideChar (CP_ACP, 0, strAllItems, -1, NULL, 0); makes not sense at all, as the string is known to be UTF-8, and not of the default codepage. You may also need to remove the UTF-8 BOM header, if your file contains it.