I'm having some trouble figuring out the exact semantics of std::string.length().
The documentation explicitly points out that length() returns the number of characters in the string and not the number of bytes. I was wondering in which cases this actually makes a difference.
In particular, is this only relevant to non-char instantiations of std::basic_string<>, or can I also get into trouble when storing UTF-8 strings with multi-byte characters? Does the standard allow for length() to be UTF-8-aware?
When dealing with non-char instantiations of std::basic_string<>, sure, length may not equal number of bytes. This is particularly evident with std::wstring, whose wchar_t elements typically occupy more than one byte each.
But std::string is about char characters; there is no such thing as a multi-byte character as far as std::string is concerned, whether you crammed one in at a high level or not. So, std::string.length() is always the number of bytes represented by the string. Note that if you're cramming multibyte "characters" into an std::string, then your definition of "character" suddenly becomes at odds with that of the container and of the standard.
cplusplus.com is not "the documentation" for std::string; it's a poor quality site full of poor quality information. The C++ standard defines it very clearly:
21.1 [strings.general] ¶1
21.4.4 [string.capacity] ¶1
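
As a rough illustration of the difference between element count and byte count discussed above, here is a minimal sketch. The helper name byte_size is hypothetical (it is not part of the standard library); it just multiplies length() by the size of the element type.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes occupied by the character data of
// any basic_string instantiation (the terminating null is not counted).
template <typename CharT>
std::size_t byte_size(const std::basic_string<CharT>& s) {
    // length() counts CharT elements, not bytes.
    return s.length() * sizeof(CharT);
}
```

For std::string("hello"), both length() and byte_size() give 5. For std::wstring(L"hello"), length() is still 5, while byte_size() is 5 * sizeof(wchar_t), which depends on the platform.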
If we are talking specifically about std::string, then length() does return the number of bytes.
This is because a std::string is a basic_string of chars, and the C++ Standard defines the size of one char to be exactly one byte.
Note that the Standard doesn't say how many bits are in a byte, but that's another story entirely and you probably don't care.
EDIT: The Standard does say that an implementation shall provide a definition for CHAR_BIT which says how many bits are in a byte.
By the way, if you go down a road where you do care how many bits are in a byte, you might consider reading this.
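
If the bit count ever does matter, something along these lines should work. This is only a sketch; bit_size is a hypothetical name, while CHAR_BIT comes from <climits>.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>
#include <string>

// sizeof(char) is 1 by definition, so length() is also the byte count.
static_assert(sizeof(char) == 1, "a char is exactly one byte");
// A byte is required to have at least 8 bits, but may have more.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: number of bits occupied by the character data.
std::size_t bit_size(const std::string& s) {
    return s.length() * CHAR_BIT;
}
```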
A std::string is std::basic_string<char>, so s.length() * sizeof(char) = byte length. Also, std::string knows nothing of UTF-8, so you're going to get the byte size even if that's not really what you're after.
If you have UTF-8 data in a std::string, you'll need to use something else such as ICU to get the "real" length.
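
For simple cases, a hand-rolled count of code points may be enough. The sketch below is not ICU and makes two loud assumptions: the input is already valid UTF-8 (nothing is validated), and "length" means code points, not user-perceived characters. The function name count_code_points is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is NOT a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: s holds valid UTF-8; malformed input is not detected.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a code point
        }
    }
    return count;
}
```

For the string "na\xC3\xAFve" (the UTF-8 encoding of "naïve"), length() returns 6 while count_code_points() returns 5. Combining sequences and other grapheme clusters still count as several code points, which is where a library such as ICU comes in.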