I wrote a class to act as a wrapper around a sequential container (std::vector/std::queue/std::list) to have the interface of a std::map, for performance when using small numbers of small objects. The coding was all remarkably simple given the algorithms that already existed. This code is obviously highly trimmed from my full code, but it shows the problem.
template <class key_,
          class mapped_,
          class traits_ = std::less<key_>,
          class undertype_ = std::vector<std::pair<key_, mapped_> >
         >
class associative
{
public:
    typedef traits_ key_compare;
    typedef key_ key_type;
    typedef mapped_ mapped_type;
    typedef std::pair<const key_type, mapped_type> value_type;
    typedef typename undertype_::allocator_type allocator_type;
    typedef typename allocator_type::template rebind<value_type>::other value_allocator_type;
    typedef typename undertype_::const_iterator const_iterator;

    class value_compare {
        key_compare pred_;
    public:
        inline value_compare(key_compare pred = key_compare()) : pred_(pred) {}
        inline bool operator()(const value_type& left, const value_type& right) const {return pred_(left.first, right.first);}
        inline bool operator()(const value_type& left, const key_type& right) const {return pred_(left.first, right);}
        inline bool operator()(const key_type& left, const value_type& right) const {return pred_(left, right.first);}
        inline bool operator()(const key_type& left, const key_type& right) const {return pred_(left, right);}
        inline key_compare key_comp() const {return pred_;}
    };

    class iterator {
    public:
        typedef typename value_allocator_type::difference_type difference_type;
        typedef typename value_allocator_type::value_type value_type;
        typedef typename value_allocator_type::reference reference;
        typedef typename value_allocator_type::pointer pointer;
        typedef std::bidirectional_iterator_tag iterator_category;
        inline iterator(const typename undertype_::iterator& rhs) : data(rhs) {}
        inline reference operator*() const {return reinterpret_cast<reference>(*data);}
        inline pointer operator->() const {return reinterpret_cast<pointer>(structure_dereference_operator(data));}
        operator const_iterator&() const {return data;}
    protected:
        typename undertype_::iterator data;
    };

    template<class input_iterator>
    inline associative(input_iterator first, input_iterator last) : internal_(first, last), comp_()
    {if (std::is_sorted(internal_.begin(), internal_.end(), comp_) == false) std::sort(internal_.begin(), internal_.end(), comp_);}

    inline iterator find(const key_type& key) {
        iterator i = std::lower_bound(internal_.begin(), internal_.end(), key, comp_);
        return (comp_(key, *i) ? internal_.end() : i);
    }
protected:
    undertype_ internal_;
    value_compare comp_;
};
SSCCE at http://ideone.com/Ufn7r, full code at http://ideone.com/MQr0Z (note: the resulting times at IdeOne are highly erratic, probably due to server load, and do not clearly show the results in question). I tested with std::string, and with PODs from 4 to 128 bytes, ranging from 8 to 2000 elements, with MSVC10.
I expected higher performance for (1) creating from a range for small objects, (2) random insertion/erasure for small numbers of small objects, and (3) lookup for all objects. Surprisingly, the vector was significantly faster for creating from a range in all tests, and faster for random erasure depending on size up to about 2048 bytes (512 4-byte objects, or 128 16-byte objects, etc.). Most shocking of all, though, was that the std::vector using std::lower_bound was slower than std::map::find for all PODs. The difference was minuscule for 4- and 8-byte PODs, but for 128-byte PODs, std::vector was up to 36% slower! For std::string, however, the std::vector was 6% faster on average.
I feel that std::lower_bound on a sorted std::vector should have outperformed std::map due to better cache locality and a smaller memory footprint, and because the map can be imperfectly balanced; in the worst case it should match std::map. I can't for the life of me think of any reason std::map should be faster. My only thought is that the predicate is somehow slowing it down, but I can't figure out how. So the question is: how can std::lower_bound on a sorted std::vector be outperformed by a std::map (in MSVC10)?
[EDIT] I've confirmed that std::lower_bound on std::vector<std::pair<4BYTEPOD,4BYTEPOD>> uses fewer comparisons on average than std::map<4BYTEPOD,4BYTEPOD>::find (by 0-0.25), but my implementation is still up to 26% slower.
[POST-ANSWER-EDIT] I made an SSCCE at http://ideone.com/41iKt that removes all unneeded fluff and clearly shows that find on the sorted vector is slower than that of the map, by ~15%.
This is a somewhat more interesting nut to crack! Before discussing my findings so far, let me point out that the associative::find() function behaves differently from std::map::find(): if the key isn't found, the former returns the lower bound while the latter returns end(). To fix this, associative::find() needs to be changed to something like this:
auto rc = std::lower_bound(this->internal_.begin(), this->internal_.end(), key, this->comp_);
return rc != this->internal_.end() && !this->comp_(key, rc->first) ? rc : this->internal_.end();
Now that we are more likely to be comparing apples to apples (I haven't verified whether the logic is really correct now), let's go on to investigate the performance. I'm not quite convinced that the approach used to test performance really holds water, but I'm sticking with it for now, and I could definitely improve the performance of the associative container. I don't think I have found all the performance issues in the code yet, but at least I made some progress. The biggest one is noticing that the comparison function used in associative is pretty bad because it keeps making copies, which puts this container at a disadvantage. If you check the comparator now, you probably don't see it, because it looks as if this comparator is taking its arguments by reference! The issue is actually rather subtle: the underlying container has a value_type of std::pair<key_type, mapped_type>, but the comparator takes std::pair<key_type const, mapped_type> as its argument! Fixing this gives the associative container quite a bit of a performance boost.
To implement a comparator class which has no opportunity to fail to match the arguments exactly, I'm using a simple helper to detect whether a type is a std::pair<L, R>:
template <typename> struct is_pair { enum { value = false }; };
template <typename F, typename S> struct is_pair<std::pair<F, S>> { enum { value = true }; };
... and then I replaced the comparator with this slightly more complicated one:
class value_compare {
    key_compare pred_;
public:
    inline value_compare(key_compare pred = key_compare()) : pred_(pred) {}

    template <typename L, typename R>
    inline typename std::enable_if<is_pair<L>::value && is_pair<R>::value, bool>::type
    operator()(L const& left, R const& right) const {
        return pred_(left.first, right.first);
    }
    template <typename L, typename R>
    inline typename std::enable_if<is_pair<L>::value && !is_pair<R>::value, bool>::type
    operator()(L const& left, R const& right) const {
        return pred_(left.first, right);
    }
    template <typename L, typename R>
    inline typename std::enable_if<!is_pair<L>::value && is_pair<R>::value, bool>::type
    operator()(L const& left, R const& right) const {
        return pred_(left, right.first);
    }
    template <typename L, typename R>
    inline typename std::enable_if<!is_pair<L>::value && !is_pair<R>::value, bool>::type
    operator()(L const& left, R const& right) const {
        return pred_(left, right);
    }

    inline key_compare key_comp() const {return pred_;}
};
This generally gets the two approaches quite a bit closer together. Given that I would expect the std::vector<T> with lower_bound() approach to be a lot better than using std::map<K, T>, I feel the investigation isn't over yet.
Addendum:
Rethinking the exercise a bit more, I spotted why I felt uncomfortable with the implementation of the predicate class: it is way too complex! This can be done a lot more simply by not using std::enable_if for a change, which nicely reduces the code to something much easier to read. The key is to get the key:
template <typename Key>
Key const& get_key(Key const& value) { return value; }
template <typename Key, typename Value>
Key const& get_key(std::pair<Key, Value> const& pair) { return pair.first; }
With this implementation to get hold of a "key" from either a value or a pair of values, the predicate object can just define one very simple function call operator:
template <typename L, typename R>
bool operator()(L const& l, R const& r)
{
    return this->pred_(get_key<key_type>(l), get_key<key_type>(r));
}
There is a slight trick in this as well, though: the expected key_type needs to be passed to the get_key() function. Without this, the predicate wouldn't work in cases where the key_type is itself a std::pair<F, S> of objects.
I have a guess. First, lower_bound must do log2(n) comparisons regardless, which means there is never a time (as there would be for find) when it can stop early. Second, for data types larger than a certain size, a multiply operation is involved in any pointer arithmetic for the vector, whereas for the map it's simply a pointer load of a 4-byte (or 8-byte on 64-bit) value from memory.
The x86 has some nice instructions for doing very fast multiplies by powers of two during indexing calculations. But they only work for small powers of two as they are designed for indexing arrays of integer-like entities. For larger numbers it has to actually use an integer multiply instruction that's significantly slower.
When you do lower_bound, you have to do exactly log2(n) of these multiplies. But for find, it can be cut off at a smaller number for half the values. This means their effect is going to be much larger for lower_bound than for any other method.
As an aside... in my opinion, ::std::map should be implemented as a B-tree where each node is a page in size. Virtual memory sets things up so that basically every program with a significantly large data structure will end up having parts of that structure paged out under memory pressure. Having each node store only one value risks creating a near-worst case in which you have to page in a whole page for every comparison to a depth of log2(n), whereas with a B-tree the worst paging case would be logx(n) pages, where x is the number of values per node.
This also has the nice side effect of lessening the bad effects of cache-line boundaries. There is an LCM of the (key, value) tuple size and the cache-line size. Having multiple (key, value) pairs in a node makes it much more likely that this LCM is reached and that X pairs will take exactly Y cache lines. But if each node contains only one pair, this will basically never happen unless the node size is an exact multiple of the cache-line size.