atomicMax + AtomicCAS(atomicExch)

I would like to ask to you guys if there is a better way to combine 2 atomics.

My goal is to find the highest results for a set of K equations (more than 32) under a list of J parameters values (very similar a 2-way input) and to save the value and the j index.

if (atomicMax(&max_k[id], t_max) < t_max) atomicExch(&indexMax[id],t_pos);

Initially we used the approach described above, but, since we do expect an even higher value for every thread, it is possible to have in the same warp the thread B > C > A (thread B has the highest value and thread C has a value higher than A). I'm not sure but the atomicExch can be executed in another thread order than the atomicMax was (is that correct?), so we tried a critical section, but it led to a deadlock. After all the solutions bellow seems to work.

Is there a better way or there is any issue in the following code?

__device__ int atomicMaxCAS(int* addressMax, int valMax, int* addressCAS, int valCas) {
        int oldCas = *addressCAS, assumedCas;
        int oldMax = *addressMax, assumedMax;
        do {
            assumedCas = oldCas;
            assumedMax = oldMax;
            oldMax = atomicMax(addressMax, valMax);
            if (oldMax < valMax) oldCas = atomicCAS(addressCAS, assumedCas, valCas);
        } while (assumedCas != oldCas || assumedMax != oldMax);
        return (oldMax);
    }

Thanks in advance! I was able to start writing CUDA due to all this posts about!

there is any issue in the following code?

Yes, you can't use two atomics like that and expect coherent results. You have set up a possible race condition.

Suppose thread A does the atomicMax and replaces the old value with 100. Then thread B does the atomicMax and replaces the 100 value with 110. Then suppose thread B does the atomicCAS, and replaces its index. Then thread A does the atomicCAS, and replaces thread B index with thread A index. You now have a max value of 110 with an index corresponding to thread A.

Even within a single warp, there is no stated order of execution of atomic operations.

Is there a better way?

since your values are both 32-bit quantities, you might be interested in using a custom 64-bit atomic operation like this to update a value and an index at the same time, atomically.
For large scale usage (lots of threads) you may want to explore a classical paraellel reduction. There are questions here on the CUDA tag such as this one and this one that discuss how to do an index+value reduction.

Global atomics on Kepler are pretty fast, so depending on your exact code and reduction "density" a global atomic reduction might not be a big problem performance-wise.