Implementing a “string pool” that is guaranteed no

2019-01-26 08:57发布

问题:

I need a "string pool" object into which I can repeatedly insert a "sequence of chars" (I use this phrase to mean "string" without confusing it with std::string or a C string), obtain a pointer to the sequence, and be guaranteed that the pointer will not become invalidated if/when the pool needs to grow. Using a simple std::string as the pool won't work, because of the possibility for the string to be reallocated when it outgrows its initial capacity, thus invalidating all previous pointers into it.

The pool will not grow without bound -- there are well-defined points at which I will call a clear() method on it -- but I don't want to reserve any maximum capacity on it, either. It should be able to grow, without moving.

One possibility I'm considering is inserting each new sequence of chars into a forward_list<string> and obtaining begin()->c_str(). Another is inserting into an unordered_set<string>, but I'm having a hard time finding out what happens when an unordered_set has to grow. The third possibility I'm considering (less enthusiastically) is rolling my own chain of 1K buffers into which I concatenate the sequence of chars. That has the advantage (I guess) of having the highest performance, which is a requirement for this project.

I'd be interested in hearing how others would recommend approaching this.

UPDATE 1: edited to clarify my use of the phrase "sequence of chars" to be equivalent to the general notion of a "string" without implying either std::string or null-terminated char array.

回答1:

I've used this approach in the past:

using Atom = const char*;

Atom make_atom(string const& value)
{
    static set<string> interned;
    return interned.insert(value).first->c_str();
}

Obviously, if you want/need to clear the set, you'd make it available in some wider scope.

For even more efficiency move/emplace the strings into the set.

Update I've added this approach for completeness. See it Live on Coliru

#include <string>
#include <set>
using namespace std;

using Atom = const char*;

template <typename... Args>
typename enable_if<
    is_constructible<string, Args...>::value, Atom
>::type emplace_atom(Args&&... args)
{
    static set<string> interned;
    return interned.emplace(forward<Args>(args)...).first->c_str();
}

#include <iostream>

int main() {
    cout << emplace_atom("Hello World\n");
    cout << emplace_atom(80, '=');
}


回答2:

Yes, you're going to have to write a list of buffers. No, don't do all the hard work yourself.

The underlying datastructure should be a std::vector<std::string>. Using a (forward) list doesn't buy you a whole lot. When the vector is resized, the strings are moved efficiently. std::forward_list<std::string>. Even if the list is resized, the strings themselves remain in place. Iterating the list is only needed for a .clear so the list performance is not critical.

The wrapper class should abstract away the addition of new strings. A new string should be added when the capacity of the last string isn't enough to add the new string. When you add a new string, reserve all the memory a chunk will need - this ensures the capacity will be large enough to prevent reallocations later on.

This setup may waste some space when a large new allocation forces the use of a new chunk, leaving part of an older chunk unused. You could of course remember the size remaining in the last N blocks, for a small value of N such that those chunks might still be in cache. But it's quite possible that in your app N=5 would already be too big.



回答3:

Recapping, your requirements are:

  • Be able to push elements
  • Be able to obtain an iterator to the beginning of the sequence
  • Iterators should not get invalidated if the sequence grows
  • Be able to clear the sequence
  • Don't reserve maximum capacity

It seems that std::list<char> fits perfectly into this list of requirements. Of course you might need a wrapper around the class to make it behave exactly like std::string, but that really depends on how you manipulate the data.

And here's how well it fits the requirements:

  • To push elements, you can use the push_back and emplace_back member functions.

  • std::begin(container) or the member function begin will retrieve the iterator to the first element of the sequence.

  • Addition, removal and moving the elements within the list or across several lists does not invalidate the iterators. An iterator is invalidated only when the corresponding element is deleted.

  • To clear the sequence you can use the member function clear.

  • Most of the time it is implemented as a doubly-linked list, therefore no capacity is reserved.

Since std::list seems memory inefficient (even though the standard doesn't specify the size of it nor its implementation), it's correct to add that you can also use std::deque<char> with almost the same interface as above. The only difference being that std::deque might reserve unused memory.