Data structure for storing elements in sorted orde

2019-06-05 04:50发布

问题:

I need a sorted and indexed container. I want to build it based on std:set or std::vector. By using set, I need expensive std::distance to calculate the index of its element. By using vector, I know adding or deleting an element is expensive. Let's say my element is a pointer (small object). I know two operations' complexity is the same. But which one is faster? Thanks.

回答1:

If you need a data structure that supports sorted order and lookup by index, you should definitely look into the order statistic tree, which is a data structure specifically designed to support exactly these operations. It supports insertions and deletions in O(log n) time, lookup of an element's index in O(log n) time, and lookup by value or by index in O(log n) time as well, so it should be dramatically faster than either the vector or the set approaches.

Unfortunately, the STL doesn't have order statistic trees built it, so you would have to search for a third party library (this earlier question and answer provides an example of one). That said, the speedup you should expect from the order statistic tree should make it well worth the investment.

Hope this helps!



回答2:

According to you guys' suggestions, I measured it as the bottom code (flat_set is added). The results for debug vesion is

for set: 4.720976s wall, 4.711230s user + 0.000000s system = 4.711230s CPU (99.8%) (also tested set::insert which takes little time)

for vector: 1.407571s wall, 1.404009s user + 0.000000s system = 1.404009s CPU (99.7%)

for flat_set: 0.327714s wall, 0.327602s user + 0.000000s system = 0.327602s CPU (100.0%)

And release version (I need to write out results to let the compiler optimization not to overkill my code) gives each about 10 times speedup.

And my conclusion is vectors are 2-3 times faster, and boost flat_set is the best. And for number of entries less than several hundreds (such as 200), flat_set insert is not slower than std::set (that w/o index calculation).

int N = 10000;
{
    boost::timer::auto_cpu_timer t;
    std::set<int> s;
    for (int i = 0; i < N; ++i)
    {
        auto it = s.insert(i);
        int index = std::distance(s.begin(), it.first);
    }
}

{
    boost::timer::auto_cpu_timer t;
    std::vector<int> v;
    for (int i = 0; i < N; ++i)
    {
        v.insert(v.begin(), i);
    }
}

{
    boost::timer::auto_cpu_timer t;
    boost::container::flat_set<int> s;
    for (int i = 0; i < N; ++i)
    {
        auto it = s.insert(-i); // negative sign is used to make insertion
                                // (at the beginning) expensive
        int index = std::distance(s.begin(), it.first);
    }
}


回答3:

Separate the storage and the index:

having a vector of integers {I} that index the storage vector of your object type, {T}.

{I} is sorted by comparison between the objects it points to in {T}. insert/delete to {I} is cheaper than those to {T}.

Whenever you add a new item to {T} vector, you insert its index to the sorted {I}.

when you delete an item, remove the index in {I}, but you may leave the object in {T} untouched, only push_back the index just removed into a reuse vector {I'}. next time when you add a new item, you can pop_back the index in {I'} and reuse the storage.

If you know the number of items ever used, you can call resize() on {T} at startup to avoid (re)allocation at run time.

This method is similar as using vector of pointers, the advantage is less dynamic allocation and more cache friendly, as object storage are in contiguous memory.