Difference between unsigned char and char pointers

Posted 2019-06-26 07:17

Question:

I'm a bit confused about the differences between unsigned char pointers (unsigned char is also BYTE in WinAPI) and char pointers.

Currently I'm working with some ATL-based legacy code and I see a lot of expressions like the following:

CAtlArray<BYTE> rawContent;
CALL_THE_FUNCTION_WHICH_FILLS_RAW_CONTENT(rawContent);
return ArrayToUnicodeString(rawContent);
// or return ArrayToAnsiString(rawContent);

Now, the implementations of ArrayToXXString look like this:

CStringA ArrayToAnsiString(const CAtlArray<BYTE>& array)
{
    CAtlArray<BYTE> copiedArray;
    copiedArray.Copy(array);
    copiedArray.Add('\0');

    // Casting from BYTE* -> LPCSTR (const char*).
    return CStringA((LPCSTR)copiedArray.GetData());
}

CStringW ArrayToUnicodeString(const CAtlArray<BYTE>& array)
{
    CAtlArray<BYTE> copiedArray;
    copiedArray.Copy(array);

    copiedArray.Add('\0');
    copiedArray.Add('\0');

    // Same here: casting from BYTE* -> LPCWSTR (const wchar_t*).
    return CStringW((LPCWSTR)copiedArray.GetData());
}

So, the questions:

  • Is the C-style cast from BYTE* to LPCSTR (const char*) safe for all possible cases?

  • Is it really necessary to add double null-termination when converting array data to wide-character string?

  • The conversion routine CStringW((LPCWSTR)copiedArray.GetData()) seems invalid to me, is that true?

  • Any way to make all this code easier to understand and to maintain?

Answer 1:

The C standard is kind of weird when it comes to the definition of a byte. You do have a couple of guarantees though.

  • A byte will always be one char in size
    • sizeof(char) is always 1
  • A byte will be at least 8 bits in size

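This definition doesn't mesh well with older platforms where a byte was 6 or 7 bits long, but it does mean that BYTE (unsigned char) and char are guaranteed to have the same size, so a BYTE* can be cast to a char* and the bytes read through it unchanged. They are still distinct types, which is why the cast has to be written explicitly.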
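To make those guarantees concrete, here is a minimal portable sketch. It uses plain unsigned char instead of BYTE so that it compiles without the Windows headers, and byte_string_length is a hypothetical helper name, not something from ATL or the question's code.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>

// Guarantees the C and C++ standards give about character types.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(unsigned char) == 1, "unsigned char has the same size as char");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: measures a zero-terminated byte buffer by viewing it
// through a const char*. The caller must guarantee the terminator is present.
inline std::size_t byte_string_length(const unsigned char* bytes) {
    assert(bytes != nullptr);
    return std::strlen(reinterpret_cast<const char*>(bytes));
}
```

The cast only changes how the same bytes are viewed. It does not add a terminator, so it is only as safe as the buffer it points to.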

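Two null bytes are needed at the end of a wide (UTF-16) string because the terminator is a whole 16-bit code unit equal to zero. A single zero byte is not enough to mark the end, since many valid characters contain a zero byte: 'A', for example, is stored as the bytes 0x41 0x00.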
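A small sketch of why one zero byte cannot act as the terminator. Both helper names are hypothetical, and the byte layout assumed is little-endian UTF-16, which is what Windows uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of bytes before the first zero *byte*,
// which is what a char-based scan such as strlen would report.
inline std::size_t bytes_before_first_zero_byte(const std::vector<unsigned char>& bytes) {
    std::size_t i = 0;
    while (i < bytes.size() && bytes[i] != 0) {
        ++i;
    }
    return i;
}

// Hypothetical helper: number of 16-bit units before the first unit whose
// two bytes are both zero, which is where a wide string actually ends.
inline std::size_t units_before_zero_unit(const std::vector<unsigned char>& bytes) {
    assert(bytes.size() % 2 == 0);  // whole code units only
    std::size_t unit = 0;
    while (2 * unit + 1 < bytes.size() &&
           (bytes[2 * unit] != 0 || bytes[2 * unit + 1] != 0)) {
        ++unit;
    }
    return unit;
}
```

For the six bytes 0x41 0x00 0x42 0x00 0x00 0x00 (the text "AB" plus a wide terminator), the byte-wise scan stops after one byte, while the unit-wise scan correctly finds two characters.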

As for making the code easier to read, that is completely a matter of style. This code appears to be written in a style used by a lot of old C Windows code, which has definitely fallen out of favor. There are probably a ton of ways to make it clearer for you, but how to make it clearer has no clear answer.



Answer 2:

  • Yes, it is safe as long as the buffer is null-terminated, because both pointers refer to an array of single-byte memory locations.
    LPCSTR: Long Pointer to Const (single-byte char) String
    LPCWSTR: Long Pointer to Const Wide (wchar_t) String
    LPCTSTR: Long Pointer to Const context-dependent String (char or wchar_t, depending on whether UNICODE is defined)

  • In wide-character strings on Windows, every wchar_t occupies 2 bytes of memory, so the size of the buffer holding the string must be a multiple of 2. If you want to add a wide '\0' to the end of a string stored as bytes, you therefore have to add two zero bytes.

  • Sorry for this part, I do not know ATL and I cannot help you on this part, but actually I see no complexity here, and I think it is easy to maintain. What code do you really want to make easier to understand and maintain?
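The "multiple of 2" point above can be checked in code instead of being assumed. The following is a portable sketch, not the ATL way: it uses std::u16string and char16_t as stand-ins for CStringW and wchar_t, bytes_to_u16string is a hypothetical name, and it assumes the bytes are UTF-16 in the machine's native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: copies raw bytes into a std::u16string. Assumes the
// bytes hold UTF-16 code units in the machine's native byte order.
inline std::u16string bytes_to_u16string(const std::vector<unsigned char>& bytes) {
    assert(bytes.size() % sizeof(char16_t) == 0);  // no trailing half unit
    std::u16string result(bytes.size() / sizeof(char16_t), u'\0');
    if (!bytes.empty()) {
        // memcpy avoids the alignment and aliasing questions that
        // casting a byte pointer to a wide-character pointer raises.
        std::memcpy(&result[0], bytes.data(), bytes.size());
    }
    return result;
}
```

Because the length comes from the buffer size, no terminator bytes have to be appended and no temporary copy of the array is needed.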



Answer 3:

  1. If the BYTE* behaves like a proper string (i.e. the last BYTE is 0), you can cast a BYTE* to a LPCSTR, yes. Functions working with LPCSTR assume zero-terminated strings.
  2. I think the multiple zeroes are only necessary when dealing with some multibyte character sets. The most common 8-bit encodings (like ordinary Windows Western and also UTF-8) don't require them.
  3. CString is Microsoft's best attempt at user-friendly strings. For instance, its constructors accept both char and wchar_t input, regardless of whether the CString itself is wide or not, so you don't have to worry much about the conversion.
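On point 1, the zero-terminator requirement disappears if the length is passed along with the pointer. A minimal sketch with std::string follows; bytes_to_string is a hypothetical helper, and as far as I know CStringA offers a similar (pointer, length) constructor, but check the ATL documentation before relying on that.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds a std::string from a byte buffer using an
// explicit length, so the buffer does not need to end in a zero byte.
inline std::string bytes_to_string(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size == 0) {
        return std::string();
    }
    return std::string(reinterpret_cast<const char*>(data), size);
}
```

This also handles buffers that contain embedded zero bytes, which a terminator-based constructor would silently truncate.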

Edit: wait, now I see that they are abusing a BYTE array to store wide chars. I wouldn't recommend that.



Answer 4:

An LPCWSTR points to a string with 2 bytes per character, whereas a char string has 1 byte per character. That means you cannot turn narrow text into wide text with a C-style cast: the memory has to be rewritten (a zero byte added to each plain ASCII character), not just read in a different way, which is all a C-style cast does. So I would say the cast is not really safe.

The double null-termination: one character always takes 2 bytes, so the end-of-string marker must be 2 bytes long as well.
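To illustrate the difference between converting and merely reinterpreting, here is a minimal sketch that widens plain ASCII text one character at a time. widen_ascii is a hypothetical helper, std::u16string stands in for a Windows wide string, and anything outside 7-bit ASCII would need a real conversion routine instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widens 7-bit ASCII text one character at a time.
// Each 1-byte char becomes one 16-bit code unit with the same value.
inline std::u16string widen_ascii(const std::string& narrow) {
    std::u16string wide;
    wide.reserve(narrow.size());
    for (unsigned char c : narrow) {
        assert(c < 0x80);  // only plain ASCII maps one-to-one
        wide.push_back(static_cast<char16_t>(c));
    }
    return wide;
}
```

The result occupies twice as many bytes as the input, which is exactly the rewrite of memory that a pointer cast does not perform.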

To make that code easier to understand, take a look at lexical_cast in Boost (http://www.boost.org/doc/libs/1_48_0/doc/html/boost_lexical_cast.html).

Another way would be to use the standard string classes (specializations of std::basic_string), which let you perform string operations on the data directly.