Why are the atomics much slower than the lock in this case?

Posted 2019-05-26 04:52

Question:

I wrote something using atomics rather than locks and, perplexed at it being so much slower in my case, I put together the following mini test:

#include <pthread.h>
#include <vector>

struct test
{
    test(size_t size) : index_(0), size_(size), vec2_(size)
        {
            vec_.reserve(size_);
            pthread_mutexattr_init(&attrs_);
            pthread_mutexattr_setpshared(&attrs_, PTHREAD_PROCESS_PRIVATE);
            pthread_mutexattr_settype(&attrs_, PTHREAD_MUTEX_ADAPTIVE_NP);

            pthread_mutex_init(&lock_, &attrs_);
        }

    void lockedPush(int i);
    void atomicPush(int* i);

    size_t              index_;
    size_t              size_;
    std::vector<int>    vec_;
    std::vector<int>    vec2_;
    pthread_mutexattr_t attrs_;
    pthread_mutex_t     lock_;
};

void test::lockedPush(int i)
{
    pthread_mutex_lock(&lock_);
    vec_.push_back(i);
    pthread_mutex_unlock(&lock_);
}

void test::atomicPush(int* i)
{
    int ii       = (int) (i - &vec2_.front());
    size_t index = __sync_fetch_and_add(&index_, 1);
    vec2_[index & (size_ - 1)] = ii;
}

int main(int argc, char** argv)
{
    const size_t N = 1048576;
    test t(N);

//     for (int i = 0; i < N; ++i)
//         t.lockedPush(i);

    for (int i = 0; i < N; ++i)
        t.atomicPush(&i);
}

If I uncomment the lockedPush loop (and comment out the atomicPush loop) and run the test with time(1), I get output like so:

real    0m0.027s
user    0m0.022s
sys     0m0.005s

and if I run the loop calling the atomic thing (the seemingly unnecessary pointer arithmetic is there because I want the function to look as much as possible like what my bigger code does), I get output like so:

real    0m0.046s
user    0m0.043s
sys     0m0.003s

I'm not sure why this is happening as I would have expected the atomic to be faster than the lock in this case...

When I compile with -O3 I see the following timings for the lock and the atomic versions:

lock:
    real    0m0.024s
    user    0m0.022s
    sys     0m0.001s

atomic:    
    real    0m0.013s
    user    0m0.011s
    sys     0m0.002s

In my larger app, though, the lock (in single-threaded testing) still performs better regardless.

Answer 1:

An uncontended mutex is extremely fast to lock and unlock. With an atomic variable, you're always paying a certain memory synchronisation penalty (especially since you're not even using relaxed ordering).
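
For a sense of what relaxed ordering looks like, here is a minimal sketch using C++11's <atomic> (an assumption on my part; the question uses the older GCC __sync builtins, which always act as a full barrier):

#include <atomic>
#include <cstddef>

// Counter bumped with the default sequentially consistent ordering,
// roughly the guarantee __sync_fetch_and_add gives (a full barrier).
std::atomic<std::size_t> index_seq{0};

// The same increment with relaxed ordering: still atomic, but it places
// no ordering constraint on surrounding memory operations.
std::atomic<std::size_t> index_relaxed{0};

void bump()
{
    index_seq.fetch_add(1);                                // seq_cst by default
    index_relaxed.fetch_add(1, std::memory_order_relaxed); // atomicity only
}

Whether relaxed ordering is safe depends on how the counter is used elsewhere; it only removes the synchronisation cost, not the atomic read-modify-write itself.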

Your test case is simply too naive to be useful. You have to test a heavily contended data access scenario.
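
As an illustration, a contended version of the benchmark might look like the sketch below (illustrative only: it reuses the question's test struct and lockedPush, picks an arbitrary thread count, and uses std::thread for brevity; compile with -pthread):

#include <cstddef>
#include <thread>
#include <vector>

int main()
{
    const std::size_t N = 1048576;
    const int THREADS   = 4;            // arbitrary choice for illustration
    test t(N * THREADS);                // room for every thread's pushes

    std::vector<std::thread> workers;
    for (int w = 0; w < THREADS; ++w)
        workers.emplace_back([&t, N] {
            for (std::size_t i = 0; i < N; ++i)
                t.lockedPush(static_cast<int>(i));  // swap in the atomic variant to compare
        });

    for (auto& th : workers)
        th.join();
}

With several threads hammering the same structure, the mutex actually becomes contended and the relative cost of the two approaches changes.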

Generally, atomics are slow (they get in the way of clever internal reordering, pipelining, and caching), but they allow for lock-free code which ensures that the entire program can make some progress. By contrast, if you get swapped out while holding a lock, everyone has to wait.



Answer 2:

Just to add to the first answer: when you do a __sync_fetch_and_add you actually enforce a specific ordering of code. From the documentation:

A full memory barrier is created when this function is invoked

A memory barrier is a type of instruction that causes

a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction

Chances are that even though your operation is atomic, you are losing compiler optimizations by forcing an ordering of instructions.
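
If only the atomicity of the increment matters and the surrounding code does not rely on the implied barrier, GCC's newer __atomic builtins let you say so explicitly. A minimal sketch (assuming GCC 4.7 or later, and assuming relaxed ordering really is sufficient for the rest of the program):

#include <cstddef>

std::size_t index_ = 0;

// __sync_fetch_and_add always implies a full barrier.  __atomic_fetch_add
// takes an explicit memory-order argument, so the increment can stay atomic
// without constraining the ordering of surrounding memory operations.
std::size_t nextIndex()
{
    return __atomic_fetch_add(&index_, 1, __ATOMIC_RELAXED);
}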