Windows API: ANSI and Wide-Character Strings — Is

Posted 2019-01-30 19:20

Question:

I'm not quite pro with encodings, but here's what I think I know (though it may be wrong):

  1. ASCII is a 7-bit, fixed-length encoding, with the characters you can find in ASCII charts.
  2. UTF-8 is an 8-bit, variable-length encoding. All characters can be written in UTF-8.
  3. UCS-2 LE/BE are fixed-length, 16-bit encodings that support most common characters.
  4. UTF-16 is a 16-bit, variable-length encoding. All characters can be written in UTF-16.

Are those above all correct?

Now, for the questions:

  1. Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
  2. Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
  3. In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
  4. Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
  5. Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
  6. Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
  7. What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
  8. (I had more questions, but this is enough... I forgot some of them anyway...)

These are a lot of questions, so any links to explanations about how all these connect (aside from reading the Unicode standard, which won't help with the Windows API anyway) would also be greatly appreciated.

Thank you!

Answer 1:

Are those above all correct?

Yes, if you don't assume the existence of characters not encoded in Unicode (for most practical applications, this assumption is fine).
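For a concrete feel, here is a small Python sketch that counts how many code units each encoding needs per character. Python is used purely for illustration, and the sample characters and the helper name `code_units` are my own picks.

```python
# Count code units per character in UTF-8 (8-bit units) and UTF-16 (16-bit units).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # illustrative picks: A, e-acute, euro sign, an emoji

def code_units(ch):
    utf8_units = len(ch.encode("utf-8"))            # one unit = one byte
    utf16_units = len(ch.encode("utf-16-le")) // 2  # one unit = two bytes
    return utf8_units, utf16_units

for ch in samples:
    u8, u16 = code_units(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {u8} unit(s), UTF-16 = {u16} unit(s)")
```

Only the last sample lies outside the BMP, so it is the only one that UCS-2 cannot represent and the only one that needs two UTF-16 code units.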

Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?

They take byte strings (i.e., strings whose code unit is a byte, which is always an octet on Windows) encoded in the current "ANSI"/MBCS/legacy encoding. "ANSI" is the historical term for these encodings, but it is not a correct one. On Western Windows systems, this encoding is usually Windows-1252.
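To see why the current code page matters, here is a rough Python sketch (not a Windows API call) that decodes the same two bytes under two different legacy encodings. The sample text and the helper name `codepoints` are just my choices for illustration.

```python
# The same byte string means different text under different "ANSI" code pages.
raw = "\u00e9\u20ac".encode("cp1252")  # e-acute + euro sign, as stored on a Western system

as_western = raw.decode("cp1252")   # interpreted as Windows-1252
as_cyrillic = raw.decode("cp1251")  # interpreted as Windows-1251

def codepoints(text):
    """Format a string as a list of U+XXXX labels for unambiguous printing."""
    return [f"U+{ord(c):04X}" for c in text]

print(raw.hex())
print(codepoints(as_western))
print(codepoints(as_cyrillic))
```

An "A" function only sees the bytes, so which characters they stand for depends entirely on the code page in effect.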

Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.

Since Windows 2000, most of them support UTF-16. The name "wide" and the rest of the Microsoft terminology (e.g., "Unicode" meaning "UTF-16" or "UCS") were chosen before the modern Unicode standard unified the terminology.
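The practical difference between UCS-2 and UTF-16 is the surrogate pair. A minimal Python sketch of the UTF-16 encoding rule follows; the function name `to_utf16_units` is mine, not anything from the Windows API.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point."""
    if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]          # BMP: one unit, identical to UCS-2
    offset = code_point - 0x10000    # 20 bits remain
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low (trail) surrogate
    return [high, low]

print([hex(u) for u in to_utf16_units(0x20AC)])   # a BMP character
print([hex(u) for u in to_utf16_units(0x1F600)])  # a non-BMP character
```

Code that assumes UCS-2 treats the two surrogates as two separate "characters", which is where the two encodings diverge.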

In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?

Every other encoding that WideCharToMultiByte supports is a "multi-byte encoding" in this context, including Windows-1251 and UTF-8.
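As a rough analogue of how the CodePage parameter selects the target encoding, here is a Python sketch. The function `wide_to_multibyte` and the `CODECS` table are hypothetical stand-ins of mine, not the real API; only the code page numbers (1251, 1252, and 65001 for CP_UTF8) are the actual Windows identifiers.

```python
# Hypothetical stand-in for WideCharToMultiByte's CodePage parameter:
# map a few real Windows code page numbers to Python codec names.
CODECS = {1251: "cp1251", 1252: "cp1252", 65001: "utf-8"}  # 65001 is CP_UTF8

def wide_to_multibyte(code_page, utf16le_bytes):
    """Convert UTF-16LE input to the byte encoding selected by code_page."""
    text = utf16le_bytes.decode("utf-16-le")
    return text.encode(CODECS[code_page])

wide = "\u0416".encode("utf-16-le")           # Cyrillic capital Zhe as a wide string
print(wide_to_multibyte(1251, wide).hex())    # one byte in Windows-1251
print(wide_to_multibyte(65001, wide).hex())   # two bytes in UTF-8
```

Note one difference: this sketch raises an exception for a character the target code page cannot represent (try 1252 here), whereas the real function can substitute a default character instead.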

Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)

LPWSTR is a pointer to wchar_t, which is always a 16-bit unsigned integer on Windows. Which characters can be displayed is unrelated to the encoding, as long as that encoding can encode all Unicode characters. Windows is generally able to display non-BMP characters, but not everywhere (e.g., the console cannot).

Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?

I don't really know, but I don't think they differ much. I suppose you could just try converting some non-BMP character to UTF-8 and see whether the result is correct.
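If you run that experiment, it helps to know what the right and wrong answers look like. This Python sketch (the test character is my own pick) produces the correct UTF-8 bytes and, for comparison, the bytes a UCS-2-minded converter would emit if it encoded each surrogate separately.

```python
ch = "\U0001F600"  # a non-BMP test character (illustrative pick)

# Correct result: one 4-byte UTF-8 sequence for the whole code point.
correct = ch.encode("utf-8")

# What a UCS-2-minded converter would produce: each surrogate encoded
# on its own as a 3-byte sequence (6 bytes total, not valid UTF-8).
high, low = "\ud83d", "\ude00"
broken = (high.encode("utf-8", "surrogatepass")
          + low.encode("utf-8", "surrogatepass"))

print(correct.hex(), broken.hex())
```

If the function under test returns the 4-byte form, it handles UTF-16 properly; the 6-byte form (or a substitution character) suggests it does not.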

Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?

File paths are indeed opaque arrays of UTF-16 code units, meaning that Windows doesn't perform any kind of translation when storing or reading file names (like Linux and unlike Mac OS X). But Windows still has its weird, mostly undefined case-insensitive behavior, which causes a lot of trouble because file names that are treated as equivalent aren't necessarily equal. That breaks many invariants. For example, on Linux, without interference from other threads, if you successfully create two files A and a in some directory, you'll end up with two distinct files, while on Windows you get only one file (and in general, an unpredictable number of files).
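The A/a example can be mimicked with a toy model in Python. This is only a sketch under a loud assumption: `str.upper()` stands in for the file system's own case-folding rules, which are not the same thing, and `create_files` is a made-up helper, not a real file API.

```python
def create_files(names, case_sensitive):
    """Toy directory: return the distinct entries that exist after creating names."""
    directory = {}
    for name in names:
        # Assumption: str.upper() stands in for the file system's own
        # case-folding table, which differs from Python's rules.
        key = name if case_sensitive else name.upper()
        directory.setdefault(key, name)  # first creator wins; later ones reopen it
    return set(directory.values())

print(len(create_files(["A", "a"], case_sensitive=True)))   # Linux-like directory
print(len(create_files(["A", "a"], case_sensitive=False)))  # Windows-like directory
```

The unpredictability comes from the folding step: two names count as "the same file" only if the system's table says so, and that table is not something a portable program can know in advance.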

What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?

ANSI is the American standardization organization. Using this word when referring to encodings is a misnomer, but a frequent one, so you should be aware of it. I prefer the term legacy 8-bit encoding, because I think that's essentially what it is: a non-Unicode encoding that is kept only for compatibility with legacy (Windows 9x) applications. On Western systems, this is usually Windows-1252, which is a proper superset of ASCII.
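The superset claim is easy to check with Python's built-in codecs (used here only as a convenient reference implementation of the two encodings):

```python
ascii_range = bytes(range(128))

# Every ASCII byte decodes to the same character under Windows-1252 ...
same_in_both = ascii_range.decode("ascii") == ascii_range.decode("cp1252")

# ... but Windows-1252 also assigns characters above 0x7F, which ASCII lacks.
euro = b"\x80".decode("cp1252")
try:
    b"\x80".decode("ascii")
    ascii_accepts_0x80 = True
except UnicodeDecodeError:
    ascii_accepts_0x80 = False

print(same_in_both, f"U+{ord(euro):04X}", ascii_accepts_0x80)
```

So pure ASCII text is byte-for-byte identical in Windows-1252, and the differences only show up once bytes above 0x7F appear.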



Answer 2:

  1. *A functions use the active ANSI codepage.

  2. *W functions use UTF-16.

  3. Multi-byte refers to whatever is passed in the CodePage parameter. It is most commonly either the active ANSI codepage or UTF-8.

  4. LPWSTR is a pointer to a UTF-16 string, which may or may not be null-terminated (see MSDN).

  5. I don't know anything about wcstombs, I always use WideCharToMultiByte.

  6. File paths are in UTF-16. In fact, NT-based Windows handles text as UTF-16 internally.

  7. For ANSI encoding you will need to read up on that in some detail. You could do worse than to start with Wikipedia and follow the links from there.

I hope that helps. If I've got anything wrong, anyone who knows more, please do edit this to correct the errors!



Answer 3:

Wide strings used to be UCS-2. From Windows 2000, wide strings are UTF-16. Good to know if you need to maintain some old legacy system.



Answer 4:

First of all you'll find plenty of information in this SO topic.

ASCII is a charset, not an encoding. Now, there are a number of 8-bit charsets, one of which is set as the system default (you can change it in Regional Settings). *A functions accept 8-bit characters in that charset. UTF-8 is not a charset, but an encoding of the Unicode charset. *W functions, as I understand it, use UTF-16 rather than UCS-2.