atomicMax + AtomicCAS(atomicExch)

2019-03-01 17:45发布

问题:

I would like to ask to you guys if there is a better way to combine 2 atomics.

My goal is to find the highest results for a set of K equations (more than 32) under a list of J parameters values (very similar a 2-way input) and to save the value and the j index.

if (atomicMax(&max_k[id], t_max) < t_max) atomicExch(&indexMax[id],t_pos);

Initially we used the approach described above, but, since we do expect an even higher value for every thread, it is possible to have in the same warp the thread B > C > A (thread B has the highest value and thread C has a value higher than A). I'm not sure but the atomicExch can be executed in another thread order than the atomicMax was (is that correct?), so we tried a critical section, but it led to a deadlock. After all the solutions bellow seems to work.

Is there a better way or there is any issue in the following code?

__device__ int atomicMaxCAS(int* addressMax, int valMax, int* addressCAS, int valCas) {
        int oldCas = *addressCAS, assumedCas;
        int oldMax = *addressMax, assumedMax;
        do {
            assumedCas = oldCas;
            assumedMax = oldMax;
            oldMax = atomicMax(addressMax, valMax);
            if (oldMax < valMax) oldCas = atomicCAS(addressCAS, assumedCas, valCas);
        } while (assumedCas != oldCas || assumedMax != oldMax);
        return (oldMax);
    }

Thanks in advance! I was able to start writing CUDA due to all this posts about!

回答1:

there is any issue in the following code?

Yes, you can't use two atomics like that and expect coherent results. You have set up a possible race condition.

Suppose thread A does the atomicMax and replaces the old value with 100. Then thread B does the atomicMax and replaces the 100 value with 110. Then suppose thread B does the atomicCAS, and replaces its index. Then thread A does the atomicCAS, and replaces thread B index with thread A index. You now have a max value of 110 with an index corresponding to thread A.

Even within a single warp, there is no stated order of execution of atomic operations.

Is there a better way?

  1. since your values are both 32-bit quantities, you might be interested in using a custom 64-bit atomic operation like this to update a value and an index at the same time, atomically.

  2. For large scale usage (lots of threads) you may want to explore a classical paraellel reduction. There are questions here on the CUDA tag such as this one and this one that discuss how to do an index+value reduction.

Global atomics on Kepler are pretty fast, so depending on your exact code and reduction "density" a global atomic reduction might not be a big problem performance-wise.



标签: cuda atomic