How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?
Or, more generally, how do I get the right byte size of a TCHAR string?
Solution:
_tcslen(_T("TCHAR string")) * sizeof(TCHAR)
EDIT:
I was talking about null-terminated strings only.
According to MSDN, _tcslen corresponds to strlen when _MBCS is defined. strlen will return the number of bytes in the string. If you use _tcsclen, that corresponds to _mbslen, which returns the number of multibyte characters.
Also, multibyte strings do not (AFAIK) contain embedded nulls, no.
I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.
Let's see if I can clear this up:
"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically means "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as you do, the number of characters in the string != the number of bytes in the string.
Let's take UTF-8 as an example, even though it is not the usual multibyte encoding on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:
text: t  h  é     \0
mem:  74 68 c3 a9 00
This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:
struct my_string
{
    size_t length;
    char *data;
};
... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
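As a rough sketch of why the explicit length matters, the following stores five bytes with a null in the middle. The names my_string_equal, raw and embedded are made up for illustration; they are not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Counted string: the length is stored explicitly, so data may contain 0 bytes. */
struct my_string
{
    size_t length;  /* number of bytes in data, embedded nulls included */
    char *data;     /* not required to be null-terminated */
};

/* Hypothetical helper: compares two counted strings byte for byte. */
int my_string_equal(struct my_string a, struct my_string b)
{
    return a.length == b.length && memcmp(a.data, b.data, a.length) == 0;
}

/* Five bytes with a null in the middle: 'a' 'b' 0 'c' 'd'. */
static char raw[] = { 'a', 'b', '\0', 'c', 'd' };
struct my_string embedded = { sizeof raw, raw };
```

Here embedded.length is 5, while strlen(embedded.data) stops at the embedded null and reports 2, which is why null-terminated functions cannot be used on such data.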
For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters.) strlen just counts the number of bytes before it sees a 0 byte -- nothing fancy.
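To make the bytes-versus-characters difference concrete for the UTF-8 "thé" example, here is a minimal sketch. utf8_char_count is a hypothetical helper written for this illustration (it is not a CRT function), and it assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts UTF-8 characters by skipping continuation
   bytes, which always have the bit pattern 10xxxxxx.
   Assumes valid, null-terminated UTF-8 input. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

/* "thé" spelled out byte by byte, so the source file encoding does not matter. */
const char the_utf8[] = "th\xC3\xA9";
```

With this data, strlen(the_utf8) gives 4 (bytes), utf8_char_count(the_utf8) gives 3 (characters), and sizeof the_utf8 is 5 because it includes the terminator.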
Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters, because characters outside the Basic Multilingual Plane take two 16-bit units, a surrogate pair.) Let's look at thé again:
text:   t      h      é      \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem:    74 00  68 00  e9 00  00 00
That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes.
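The same idea can be sketched portably with uint16_t. wchar_t is 16 bits wide on Windows but usually 32 bits elsewhere, so this sketch uses hypothetical stand-ins (u16_unit_count, u16_char_count) instead of wcslen, and it assumes well-formed UTF-16.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for wcslen with a 16-bit wchar_t:
   counts 16-bit code units before the terminating 0x0000 unit. */
size_t u16_unit_count(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

/* Hypothetical helper: counts characters (code points). A low surrogate
   (0xDC00-0xDFFF) is the second half of a pair, so it is not counted. */
size_t u16_char_count(const uint16_t *s)
{
    size_t count = 0;
    for (; *s != 0; ++s) {
        if (*s < 0xDC00 || *s > 0xDFFF) {
            ++count;
        }
    }
    return count;
}

const uint16_t the_utf16[]  = { 0x0074, 0x0068, 0x00E9, 0x0000 };  /* "thé" */
const uint16_t clef_utf16[] = { 0xD834, 0xDD1E, 0x0000 };          /* U+1D11E */
```

For the_utf16, the unit count is 3, so the string occupies 3 * sizeof(uint16_t) = 6 bytes before the terminator. For clef_utf16, the unit count is 2 (4 bytes) but the character count is 1, which is the case where bytes / 2 does not equal the number of characters.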
Lastly, you have TCHARs, which are one of the above two cases, depending on whether UNICODE (and _UNICODE, which is what the C runtime's tchar.h checks) is defined. _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. TCHAR was created to ease the move to UTF-16 in the Windows world.
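Putting that together for the original question, a minimal sketch might look like the following. tchar_byte_size and tchar_buffer_size are hypothetical helper names, and the #else branch contains stand-in definitions (an assumption made only so the sketch compiles off Windows; they mimic a non-Unicode build).

```c
#include <stddef.h>

#ifdef _WIN32
#include <tchar.h>   /* real TCHAR, _T and _tcslen */
#else
/* Stand-ins for non-Windows compilers, mimicking the non-Unicode build. */
#include <string.h>
typedef char TCHAR;
#define _T(x) x
#define _tcslen strlen
#endif

/* Hypothetical helper: bytes occupied by the string, terminator excluded. */
size_t tchar_byte_size(const TCHAR *s)
{
    return _tcslen(s) * sizeof(TCHAR);
}

/* Hypothetical helper: bytes needed to copy the string, terminator included. */
size_t tchar_buffer_size(const TCHAR *s)
{
    return (_tcslen(s) + 1) * sizeof(TCHAR);
}
```

tchar_byte_size(_T("TCHAR string")) yields 12 * sizeof(TCHAR), that is 12 in a char build and 24 when TCHAR is a 2-byte wchar_t. The second helper is the one to use when sizing a buffer, since _tcslen does not count the terminating null.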