I am working in a library in which some elements, that do not matter here, must be identified with a name (i.e. values are associated to names). Names are strings for the user, whatever their internal representation is, and should behave transparently.
- Names are constant and initialized with strings literals. They are known at compile-time.
- Two names that were initialized with different strings must compare different, whatever their internal representation is.
- Names may be of arbitrary length. My library does not set any restriction.
- Virtually, there must not be a limit on possible names. The implementation's constraints must not affect the interface, so, again, no restrictions from my side.
Considering that frequent lookup will happen, I have thought about using an unordered map.
Unordered associative containers store their elements, whatever their type is, by numbers (typically of type std::size_t
), which are obtained with a hash function. This means that:
- For types whose number of possible values is less or equal to that of the hash value, no collisions should happen.
- For types whose number of possible values is greater than that of the hash value, collisions may happen, since some data is lost in the hashing process.
I have thought about two solutions.
Hashing by value
Using the data itself to compute the hash value. Considerations:
- Possibly computed at compile-time. Since the names are constructed from string literals, a
constexpr
hashing function could be invoked by the constructor (which would beconstexpr
itself) and the hash value stored in the class itself, for quick retrieval later (by the hasher object). - How often would collisions happen? Which would be the best algorithm?
Hashing by order
The Boost.Log library, as explained here, maintains a global (i.e. static) table that associates names with their hash value. A possible implementation would be the following:
- When a name is constructed (from a string literal), the table is looked up (performing exact comparisons).
- If it isn't found, it is registered at the end of the container.
- Its entry's offset in the table becomes its hash value.
Considerations:
- Very slow. For every name constructed, as much string comparisons as names registered would have to be performed. This is not much better than a traditional
std::map
, is it? - Thread unsafe. The table would have to be protected.
- Forcibly done at runtime.
Questions
- Is it right to use an unordered map under these conditions? Would it be better to use a
std::map
instead? - If 1 is 'yes', which approach is better, and why? The one used in Boost.Log seems really unefficient, why is it used instead of the other I explained, even if strings are not necessarily known at compile-time?
Note: I have not added the c++14
tag even though I have access to the experimental support offered by gcc and clang. Please, do not hesitate using features included in the upcoming specification.
If you don't need the entries to be ordered, it is typically more efficient to use an
unordered_map
than amap
. Since both have a nearly identical interface, this is quite easy to measure, of course (which you should do).You should read the Boost documentation a bit better. I didn't read anything about linear complexity lookups. The description of
attribute_set
suggests an associative container is used (I would expect std::unordered_map, but you can check the source code for yourself). The reason to use an identifier rather than a string is also clearly mentioned in the documentation:"Working with identifiers is much more efficient than with strings. For example, copying does not involve dynamic memory allocation and comparison operators are very lightweight."
Whether this is beneficial in your case depends on the way you use these data structures. Since you indicate that string identifiers can be represented as string literals (but consider if you'll need to translate these strings), you'd only need to pass around a pointer to copy a string identifier. However, comparisons would still be slower than with
boost::attribute_name
s.While collisions may happen for types whose number of possible values is greater than that of the hash value, when they do, the container will notice that there is already a guest in the bucket identified by the hash value, and directly compare the keys. Thus, different keys will never collide. Try with a hash function that always returns a fixed value, and see what happens when you insert keys -- it will become slow, which is why the hashing algorithm is important.
Using a
std::unordered_map
is therefore a good option, if, as you mention, frequent lookup will happen, and order is not required. You should still measure it againststd::map
to be sure, though, as D Drmmr advised.If you were concerned about different keys colliding because their hash values were equal, then fear not; as explained above, that is not a problem. You should therefore opt for the first method, since it allows compile-time hashing, and does not suffer from all the problems of the second.
A possible implementation:
You would have to implement an interface for hashing algorithms to access the data underlying your
name
class. Furthermore, as stated in the example, the methods should beconstexpr
-declared; otherwise, they could not be called from yourconstexpr
-enabled hashing function. As for hashing algorithms, there are many, each appropiated in some circumstances. This page elaborates on the topic and presents an implementation of X65599 that, however, does not useconstexpr
. You could try it first and check how it behaves in your situation.