std stack performance issues [closed]

2019-09-22 04:02发布

问题:

Recently I was trying to do some performance benchmarks, comparing std::stack<int, std::vector<int>> and my own simple implementation of stack (that use pre-allocated memory). Now I’m experiencing some strange behavior.

First thing I want to ask is this line in stack benchmark code:

//  std::vector<int> magicVector(10);

When I uncomment this line, performance rise about 17% (benchmark time drops from 6.5 to 5.4 seconds). But the line should have no impact on the rest of the program because it's not modifying any others members. Besides, it doesn't matter if it is vector of int or vector of double...

Second thing I want to ask is big performance difference between my stack implementation and std::stack. I was told that std::stack should be as fast as my stack but results shows that my "FastStack" is twice as fast.

Results (with uncommented performance increasing line):
stack 5.38979
stack 5.34406
stack 5.32404
stack 5.30519
FastStack 2.59635
FastStack 2.59204
FastStack 2.59713
FastStack 2.64814

These results come from release build from VS2010 with /O2, /Ot, /Ob2 and others default optimizations. My CPU is Intel i5 3570k with default clock (3.6 GHz for one thread).

I put all code in one file so anyone can easily test it.

#define _SECURE_SCL 0

#include <iostream>
#include <vector>
#include <stack>
#include <Windows.h>

using namespace std;

//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    High Resolution Timer
//---------------------------------------------------------------------------------

class HRTimer
{
public:
    HRTimer();
    double GetFrequency(void);
    void Start(void) ;
    double Stop(void);
    double GetTime();

private:
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    double frequency;
};

HRTimer::HRTimer()
{
    frequency = this->GetFrequency();
}

double HRTimer::GetFrequency(void)
{
    LARGE_INTEGER proc_freq;
    if (!::QueryPerformanceFrequency(&proc_freq))
        return -1;
    return proc_freq.QuadPart;
}

void HRTimer::Start(void)
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&start);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
}

double HRTimer::Stop(void)
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&stop);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
    return ((stop.QuadPart - start.QuadPart) / frequency);
} 

double HRTimer::GetTime()
{
    LARGE_INTEGER time;
    ::QueryPerformanceCounter(&time);
    return time.QuadPart / frequency;
}

//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    Should be faster than std::stack
//---------------------------------------------------------------------------------

template <class T>

class FastStack
{
public:
    T* st;
    int allocationSize;
    int lastIndex;

public:
    FastStack(int stackSize);
    ~FastStack();

    inline void resize(int newSize);
    inline void push(T x);
    inline void pop();
    inline T getAndRemove();
    inline T getLast();
    inline void clear();
};

template <class T>
FastStack<T>::FastStack( int stackSize )
{
    st = NULL;
    this->allocationSize = stackSize;
    st = new T[stackSize];
    lastIndex = -1;
}

template <class T>
FastStack<T>::~FastStack()
{
    delete [] st;
}

template <class T>
void FastStack<T>::clear()
{
    lastIndex = -1;
}

template <class T>
T FastStack<T>::getLast()
{
    return st[lastIndex];
}

template <class T>
T FastStack<T>::getAndRemove()
{
    return st[lastIndex--];
}

template <class T>
void FastStack<T>::pop()
{
    --lastIndex;
}

template <class T>
void FastStack<T>::push( T x )
{
    st[++lastIndex] = x;
}

template <class T>
void FastStack<T>::resize( int newSize )
{
    if (st != NULL)
        delete [] st;
    st = new T[newSize];
}
//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    Benchmark of std::stack and FastStack
//---------------------------------------------------------------------------------


int main(int argc, char *argv[])
{
#if 1
    for (int it = 0; it < 4; it++)
    {
        std::stack<int, std::vector<int>> bStack;
        int x;

        for (int i = 0; i < 100; i++)   // after this two loops, bStack's capacity will be 141 so there will be no more reallocating
            bStack.push(i);
        for (int i = 0; i < 100; i++)
            bStack.pop();
    //  std::vector<int> magicVector(10);           // when you uncomment this line, performance will magically rise about 18%

        HRTimer timer;
        timer.Start();

        for (int i = 0; i < 2000000000; i++)
        {
            bStack.push(i);
            x = bStack.top();
            if (i % 100 == 0 && i != 0)
                for (int j = 0; j < 100; j++)
                    bStack.pop();
        }

        double totalTime = timer.Stop();
        cout << "stack " << totalTime << endl;
    }
#endif

    //------------------------------------------------------------------------------------

#if 1
    for (int it = 0; it < 4; it++)
    {
        FastStack<int> fstack(200);
        int x;

        HRTimer timer;
        timer.Start();

        for (int i = 0; i < 2000000000; i++)
        {
            fstack.push(i);
            x = fstack.getLast();
            if (i % 100 == 0 && i != 0)
                for (int j = 0; j < 100; j++)
                    fstack.pop();
        }

        double totalTime = timer.Stop();
        cout << "FastStack " << totalTime << endl;
    }
#endif

    cout << "Done";
    cin.get();
    return 0;
}

.
EDIT: Since everybody talks about my really bad implementation of my stack I want to set things right. I created that stack in few minutes and I implemented just few features that I currently needed. It was never meant to be a replacement of std::stack :) or save to use in all cases. The only goal was to achieve maximum speed and correct results. I'm sorry about this misunderstanding… I just want to know few answers…

回答1:

The many comments (and even answers) focus on the risks in your implementation. Yet the question stands.

As directly demonstrated below rectifying the perceived code shortcomings would not change anything significant about the performance.

Here is the OP's code modified to be (A) safe, and (B) supporting the same operations as std::stack, and (C) reserving buffer space also for the std::stack, in order to clarify things for those who mistakenly believe that this stuff matters for the performance:

#define _SECURE_SCL 0
#define _SCL_SECURE_NO_WARNINGS

#include <algorithm>        // std::swap
#include <iostream>
#include <vector>
#include <stack>
#include <stddef.h>         // ptrdiff_t
#include <type_traits>      // std::is_pod
using namespace std;

#undef UNICODE
#define UNICODE
#include <Windows.h>

typedef ptrdiff_t   Size;
typedef Size        Index;

template< class Type, class Container >
void reserve( Size const newBufSize, std::stack< Type, Container >& st )
{
    struct Access: std::stack< Type, Container >
    {
        static Container& container( std::stack< Type, Container >& st )
        {
            return st.*&Access::c;
        }
    };

    Access::container( st ).reserve( newBufSize );
}

class HighResolutionTimer
{
public:
    HighResolutionTimer();
    double GetFrequency() const;
    void Start() ;
    double Stop();
    double GetTime() const;

private:
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    double frequency;
};

HighResolutionTimer::HighResolutionTimer()
{
    frequency = GetFrequency();
}

double HighResolutionTimer::GetFrequency() const
{
    LARGE_INTEGER proc_freq;
    if (!::QueryPerformanceFrequency(&proc_freq))
        return -1;
    return static_cast< double >( proc_freq.QuadPart );
}

void HighResolutionTimer::Start()
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&start);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
}

double HighResolutionTimer::Stop()
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&stop);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
    return ((stop.QuadPart - start.QuadPart) / frequency);
} 

double HighResolutionTimer::GetTime() const
{
    LARGE_INTEGER time;
    ::QueryPerformanceCounter(&time);
    return time.QuadPart / frequency;
}

template< class Type, bool elemTypeIsPOD = !!std::is_pod< Type >::value >
class FastStack;

template< class Type >
class FastStack< Type, true >
{
private:
    Type*   st_;
    Index   lastIndex_;
    Size    capacity_;

public:
    Size const size() const { return lastIndex_ + 1; }
    Size const capacity() const { return capacity_; }

    void reserve( Size const newCapacity )
    {
        if( newCapacity > capacity_ )
        {
            FastStack< Type >( *this, newCapacity ).swapWith( *this );
        }
    }

    void push( Type const& x )
    {
        if( size() == capacity() )
        {
            reserve( 2*capacity() );
        }
        st_[++lastIndex_] = x;
    }

    void pop()
    {
        --lastIndex_;
    }

    Type top() const
    {
        return st_[lastIndex_];
    }

    void swapWith( FastStack& other ) throw()
    {
        using std::swap;
        swap( st_, other.st_ );
        swap( lastIndex_, other.lastIndex_ );
        swap( capacity_, other.capacity_ );
    }

    void operator=( FastStack other )
    {
        other.swapWith( *this );
    }

    ~FastStack()
    {
        delete[] st_;
    }

    FastStack( Size const aCapacity = 0 )
        : st_( new Type[aCapacity] )
        , capacity_( aCapacity )
    {
        lastIndex_ = -1;
    }

    FastStack( FastStack const& other, int const newBufSize = -1 )
    {
        capacity_ = (newBufSize < other.size()? other.size(): newBufSize);
        st_ = new Type[capacity_];
        lastIndex_ = other.lastIndex_;
        copy( other.st_, other.st_ + other.size(), st_ );   // Can't throw for POD.
    }
};

template< class Type >
void reserve( Size const newCapacity, FastStack< Type >& st )
{
    st.reserve( newCapacity );
}

template< class StackType >
void test( char const* const description )
{
    for( int it = 0; it < 4; ++it )
    {
        StackType st;
        reserve( 200, st );

        // after this two loops, st's capacity will be 141 so there will be no more reallocating
        for( int i = 0; i < 100; ++i ) { st.push( i ); }
        for( int i = 0; i < 100; ++i ) { st.pop(); }

        // when you uncomment this line, std::stack performance will magically rise about 18%
        // std::vector<int> magicVector(10);

        HighResolutionTimer timer;
        timer.Start();

        for( Index i = 0; i < 1000000000; ++i )
        {
            st.push( i );
            (void) st.top();
            if( i % 100 == 0 && i != 0 )
            {
                for( int j = 0; j < 100; ++j ) { st.pop(); }
            }
        }

        double const totalTime = timer.Stop();
        wcout << description << ": "  << totalTime << endl;
    }
}

int main()
{
    typedef stack< Index, vector< Index > > SStack;
    typedef FastStack< Index >              FStack;

    test< SStack >( "std::stack" );
    test< FStack >( "FastStack" );

    cout << "Done";
}

Results on this slow-as-molasses Samsung RC530 laptop:

[D:\dev\test\so\12704314]
> a
std::stack: 3.21319
std::stack: 3.16456
std::stack: 3.23298
std::stack: 3.20854
FastStack: 1.97636
FastStack: 1.97958
FastStack: 2.12977
FastStack: 2.13507
Done
[D:\dev\test\so\12704314]
> _

And similarly for Visual C++.

Now let's look at a typical implementation of std::vector::push_back, which is called by std::stack<T, std::vector<T>>::push (in passing, I know of only 3 programmers who have ever used this indentation style, namely PJP, Petzold and myself; I now, since 1998 or thereabouts, think it's horrible!):

void push_back(const value_type& _Val)
    {   // insert element at end
    if (_Inside(_STD addressof(_Val)))
        {   // push back an element
        size_type _Idx = _STD addressof(_Val) - this->_Myfirst;
        if (this->_Mylast == this->_Myend)
            _Reserve(1);
        _Orphan_range(this->_Mylast, this->_Mylast);
        this->_Getal().construct(this->_Mylast,
            this->_Myfirst[_Idx]);
        ++this->_Mylast;
        }
    else
        {   // push back a non-element
        if (this->_Mylast == this->_Myend)
            _Reserve(1);
        _Orphan_range(this->_Mylast, this->_Mylast);
        this->_Getal().construct(this->_Mylast,
            _Val);
        ++this->_Mylast;
        }
    }

I suspect that the measured inefficiency lies at least partly in all the stuff going on there, and perhaps it's also a matter of automatically generated safety checks.

For a debug build the std::stack performance is so extremely ungood that I gave up waiting for any result.


EDIT: following Xeo’s comment below I updated push to check for "self-push" in the case of buffer reallocation, by factoring that out as a separate function:

void push( Type const& x )
{
    if( size() == capacity() )
    {
        reserveAndPush( x );
    }
    st_[++lastIndex_] = x;
}

Mysteriously, although reserveAndPush is never called in this testing, it affects the performance – due to code size not fitting cache?

[D:\dev\test\so\12704314]
> a
std::stack: 3.21623
std::stack: 3.30501
std::stack: 3.24337
std::stack: 3.27711
FastStack: 2.52791
FastStack: 2.44621
FastStack: 2.44759
FastStack: 2.47287
Done
[D:\dev\test\so\12704314]
> _


EDIT 2: DeadMG showed that the code must be buggy. I believe the problem was a missing return, plus the expression computing new size (twice zero is still zero). He also pointed out that I forgot to show reserveAndPush. Should be:

void reserveAndPush( Type const& x )
{
    Type const xVal = x;
    reserve( capacity_ == 0? 1 : 2*capacity_ );
    push( xVal );
}

void push( Type const& x )
{
    if( size() == capacity() )
    {
        return reserveAndPush( x );    // <-- The crucial "return".
    }
    st_[++lastIndex_] = x;
}


回答2:

Your method implementations are all broken. Ignoring the copy constructor and other missing operations, your push invokes UB if you push too much, and your resize is plainly broken as it does not copy over the previous data and it's not exception safe and your push isn't exception safe and you invoke too many copies and your getAndRemove isn't exception safe and you don't destruct popped off elements and you don't construct new elements properly, only assign them and you needlessly default-construct when created, and there are probably more I haven't found.

Basically, your class is extremely and hideously unsafe in every imaginable respect, destroys the user's data at the drop of a hat, calls all the wrong functions on T, and will go crying in a corner the instant an exception is thrown anywhere.

It's a giant pile of bad and the fact that it's "faster" than std::stack is, well, entirely irrelevant, since all you've proven is that if you don't have to meet the requirements, you can go as fast as you like, which we all already knew.

Fundamentally, as sbi said, you clearly don't understand the semantics of std::stack, nor important C++ aspects like exception safety, and the ways in which your code fails to work correctly is what makes it execute faster. You've got a long way to go, my friend.



回答3:

Contrary to a std::stack using std::vector, your stack does not reallocate when it runs out of space, but simply blows up the planet. Allocation, however, is a huge drain on performance, so skipping on that will certainly gain you performance.

However, in your place I'd grab one of the well-aged static_vector implementations floating on the web and stuff that into std::stack in place of std::vector. That way, you skip all the performance-hungry dynamic memory handling, but you have a valid stack implementation with a container for memory handling underneath that's very likely to be much better than what you come up with.