I'm a bit confused about the differences between unsigned char (which is also BYTE in WinAPI) and char pointers.
Currently I'm working with some ATL-based legacy code and I see a lot of expressions like the following:
CAtlArray<BYTE> rawContent;
CALL_THE_FUNCTION_WHICH_FILLS_RAW_CONTENT(rawContent);
return ArrayToUnicodeString(rawContent);
// or return ArrayToAnsiString(rawContent);
Now, the implementations of ArrayToXXString look like this:
CStringA ArrayToAnsiString(const CAtlArray<BYTE>& array)
{
    CAtlArray<BYTE> copiedArray;
    copiedArray.Copy(array);
    copiedArray.Add('\0');
    // Casting from BYTE* -> LPCSTR (const char*).
    return CStringA((LPCSTR)copiedArray.GetData());
}

CStringW ArrayToUnicodeString(const CAtlArray<BYTE>& array)
{
    CAtlArray<BYTE> copiedArray;
    copiedArray.Copy(array);
    copiedArray.Add('\0');
    copiedArray.Add('\0');
    // Same here.
    return CStringW((LPCWSTR)copiedArray.GetData());
}
So, the questions:
Is the C-style cast from BYTE* to LPCSTR (const char*) safe for all possible cases?
Is it really necessary to add double null-termination when converting array data to wide-character string?
The conversion routine CStringW((LPCWSTR)copiedArray.GetData()) seems invalid to me; is that true?
Is there any way to make all this code easier to understand and to maintain?
The C standard is kind of weird when it comes to the definition of a byte. You do have a couple of guarantees though.
- A byte will always be one char in size
- sizeof(char) is always 1
- A byte will be at least 8 bits in size
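This definition doesn't mesh well with older platforms where a byte was 6 or 7 bits long, but it does mean that BYTE* and char* point to objects of the same size. They are still distinct pointer types, so a cast is required, but reading BYTE data through a char* is well-defined.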
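A minimal sketch of those guarantees, assuming a plain C++11 compiler. BYTE is redeclared locally so the example builds without <windows.h>, and ByteStringLength is a hypothetical helper, not something from the original code:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>

// BYTE is the WinAPI typedef for unsigned char; redeclared here (assumption)
// so the sketch does not depend on <windows.h>.
typedef unsigned char BYTE;

static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(BYTE) == sizeof(char), "all char types have the same size");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: measures a zero-terminated BYTE buffer by viewing it
// as a narrow string.
std::size_t ByteStringLength(const BYTE* bytes)
{
    // BYTE* and char* are distinct types, so the cast is required, but
    // reading any object through a char pointer is permitted.
    const char* text = reinterpret_cast<const char*>(bytes);
    return std::strlen(text);
}
```

The buffer still has to be zero-terminated for this to be safe; the cast itself changes nothing about the data.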
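Two zero bytes are needed at the end of a wide (UTF-16) string because its terminator is a full 16-bit zero character. A single zero byte is not enough, since many valid UTF-16 characters contain a zero byte (for example, 'A' is stored as the bytes 0x41 0x00 in little-endian order).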
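To illustrate, here is a rough sketch that scans the same byte buffer twice: once as a narrow string and once as 16-bit code units. It uses char16_t as a portable stand-in for the 16-bit wchar_t on Windows, and both function names are made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: length of the buffer when scanned byte by byte,
// stopping at the first zero byte.
std::size_t NarrowLength(const unsigned char* bytes)
{
    return std::strlen(reinterpret_cast<const char*>(bytes));
}

// Hypothetical helper: number of 16-bit code units before the first code
// unit whose value is zero (i.e. two consecutive, aligned zero bytes).
std::size_t Utf16Length(const unsigned char* bytes)
{
    std::size_t count = 0;
    for (;;) {
        char16_t unit;
        // memcpy avoids any alignment assumption about the byte buffer.
        std::memcpy(&unit, bytes + count * sizeof unit, sizeof unit);
        if (unit == 0) {
            return count;
        }
        ++count;
    }
}
```

For the bytes 0x41 0x00 0x42 0x00 0x00 0x00 ("AB" in UTF-16LE), the byte-wise scan stops after the first byte, while the 16-bit scan only stops at the final pair of zero bytes.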
As for making the code easier to read, that is completely a matter of style. This code appears to be written in a style used by a lot of old C Windows code, which has definitely fallen out of favor. There are probably a ton of ways to make it clearer for you, but how to make it clearer has no clear answer.
- If the BYTE* behaves like a proper string (i.e. the last BYTE is 0), you can cast a BYTE* to an LPCSTR, yes. Functions working with LPCSTR assume zero-terminated strings.
- I think the second zero is only necessary for wide (two-byte) strings, whose terminator is a 16-bit zero. The most common 8-bit encodings (like ordinary Windows Western and also UTF-8) need only a single zero byte.
- CString is Microsoft's best attempt at user-friendly strings. For instance, its constructor can handle both char and wchar_t input, regardless of whether the CString itself is wide or not, so you don't have to worry much about the conversion.
Edit: wait, now I see that they are abusing a BYTE array for storing wide chars in. I couldn't recommend that.
An LPCWSTR points to a string with 2 bytes per character, whereas a char string has one byte per character. That means that if the array holds 8-bit characters, you cannot convert it with a C-style cast: each character has to be widened in memory (every standard ASCII character padded with a zero byte), whereas a C-style cast merely reads the same bytes in a different way.
So I would say the cast is not really safe.
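As a rough illustration of what "adjusting the memory" means, the sketch below widens 8-bit text one character at a time. WidenLatin1 is a hypothetical helper, char16_t stands in for the Windows wchar_t, and the approach is only correct for ASCII/Latin-1 input; text in other code pages needs a real conversion routine:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: produces a new 16-bit string in which every input
// byte becomes one code unit. Valid for ASCII/Latin-1 only (assumption).
std::u16string WidenLatin1(const std::string& narrow)
{
    std::u16string wide;
    wide.reserve(narrow.size());
    for (char c : narrow) {
        // Go through unsigned char so bytes above 0x7F do not sign-extend.
        wide.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    }
    return wide;
}
```

A pointer cast, by contrast, would treat each pair of input bytes as a single 16-bit character, which is only meaningful if the buffer already contains wide-character data.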
The double null termination: each wide character occupies 2 bytes, so the end-of-string marker must be 2 bytes long as well.
To make that code easier to understand, have a look at lexical_cast in Boost (http://www.boost.org/doc/libs/1_48_0/doc/html/boost_lexical_cast.html).
Another way would be to use the std::string family (e.g. std::basic_string), which lets you apply the usual string operations.
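A possible sketch of the std::basic_string route, assuming the raw content lives in a std::vector<unsigned char> rather than a CAtlArray<BYTE>. Both function names are hypothetical. Because the length is passed explicitly, no temporary copy with appended terminators is needed:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: treats every byte as one narrow character.
std::string BytesToNarrowString(const std::vector<unsigned char>& bytes)
{
    return std::string(bytes.begin(), bytes.end());
}

// Hypothetical helper: treats the buffer as 16-bit code units in the
// machine's native byte order. A trailing odd byte, if any, is ignored.
std::u16string BytesToUtf16String(const std::vector<unsigned char>& bytes)
{
    std::u16string text(bytes.size() / sizeof(char16_t), u'\0');
    if (!text.empty()) {
        // Copy instead of casting the pointer, so the byte buffer does not
        // have to be aligned for 16-bit access.
        std::memcpy(&text[0], bytes.data(), text.size() * sizeof(char16_t));
    }
    return text;
}
```

Whether the wide variant is appropriate still depends on the bytes really being UTF-16 data in the expected byte order; the sketch does not check that.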