I am currently working on an application for a low-memory platform that requires an std::set of many short strings (>100,000 strings of 4-16 characters each). I recently transitioned this set from std::string to const char * to save memory and I was wondering whether I was really avoiding all that much overhead per string.
I tried using the following:
std::string sizeTest = "testString";
std::cout << sizeof(sizeTest) << " bytes";
But it just gave me an output of 4 bytes, indicating that the string contains a pointer. I'm well aware that strings store their data in a char * internally, but I thought the string class would have additional overhead.
Does the GCC implementation of std::string incur more overhead than sizeof(std::string) would indicate? More importantly, is it significant over this size of data set?
Here are the sizes of relevant types on my platform (it is 32-bit and has 8 bits per byte):
char: 1 bytes
void *: 4 bytes
char *: 4 bytes
std::string: 4 bytes
Well, at least with GCC 4.4.5, which is what I have handy on this
machine, std::string
is a typdef for std::basic_string<char>
, and
basic_string
is defined in
/usr/include/c++/4.4.5/bits/basic_string.h
. There's a lot of
indirection in that file, but what it comes down to is that nonempty
std::string
s store a pointer to one of these:
struct _Rep_base
{
size_type _M_length;
size_type _M_capacity;
_Atomic_word _M_refcount;
};
Followed in-memory by the actual string data. So std::string
is
going to have at least three words of overhead for each string, plus
any overhead for having a higher capacity
than `length
(probably
not, depending on how you construct your strings -- you can check by
asking the capacity()
method).
There's also going to be overhead from your memory allocator for doing
lots of small allocations; I don't know what GCC uses for C++, but
assuming it's similar to the dlmalloc
allocator it uses for C, that
could be at least two words per allocation, plus some space to align
the size to a multiple of at least 8 bytes.
I'm going to guess you are on a 32 bit, 8 bit per byte platform. I'm also going to guess that at least on the gcc version you are using, that they are using a reference counted implementation for std::string. The 4 byte sizeof you see is a pointer to a structure containing the reference count and the string data (and any allocator state if applicable).
In this design of gcc's the only "short" string has size == 0, in which case it can share a representation with every other empty string. Otherwise you get a refcounted COW string.
To investigate this yourself, code up an allocator that keeps track of how much memory it allocates and deallocates, and how many times. Use this allocator to investigate the implementation of the container you're interested in.
If it's guaranteed that ">100,000 strings of 4-16 characters each", then don't use std::string. Instead, write your own ShortString class. It's interesting that "sizeof(std::string) == 4", how is that possible? What are sizeof(char) and sizeof(void *)?
I've performed some comparisons about std::string overhead. In general it is about 48 bytes! Take a look at the article on my blog:
http://jovislab.com/blog/?p=76