How does warp work with atomic operation?

2019-04-29 08:28发布

问题:

The threads in a warp run physically parallel, so if one of them (called, thread X) start an atomic operation, what other will do? Wait? Is it mean, all threads will be waiting while thread X is pushed to the atomic-queue, get the access (mutex) and do some stuff with memory, which was protected with that mutex, and realese mutex after?

Is there any way to take other threads for some work, like reads some memory, so the atomic operation will hide it's latency? I mean, 15 idle threads it's.. not well, I guess. Atomic is really slow, does it? How can I accelerate it? Is there any pattern to work with it?

Does atomic operation with shared memory lock for a bank or whole memory? For example (without mutexs), there is __shared__ float smem[256];

  • Thread1 runs atomicAdd(smem, 1);
  • Thread2 runs atomicAdd(smem + 1, 1);

Those threads works with different banks, but in general shared memory. Does they run parralel or they will be queued? Is there any difference with this example, if Thread1 and Thread2 are from separated warps or general one?

回答1:

I count something like 10 questions. It makes it quite difficult to answer. It's suggested you ask one question per question.

Generally speaking, all threads in a warp are executing the same instruction stream. So there are two cases we can consider:

  1. without conditionals (e.g. if...then...else) In this case, all threads are executing the same instruction, which happens to be an atomic instruction. Then all 32 threads will execute an atomic, although not necessarily on the same location. All of these atomics will get processed by the SM, and to some extent will serialize (they will completely serialize if they are updating the same location).
  2. with conditionals For example, suppose we had if (!threadIdx.x) AtomicAdd(*data, 1); Then thread 0 would execute the atomic, and others wouldn't. It might seem like we could get the others to do something else, but the lockstep warp execution doesn't allow this. Warp execution is serialized such that all threads taking the if (true) path will execute together, and all threads executing the if (false) path will execute together, but the true and false paths will be serialized. So again, we can't really have different threads in a warp executing different instructions simultaneously.

The net of it is, within a warp, we can't have one thread do an atomic while others do something else simultaneously.

A number of your other questions seem to expect that memory transactions are completed at the end of the instruction cycle that they originated in. This is not the case. With global and with shared memory, we must take special steps in the code to ensure that previous write transactions are visible to other threads (which could be argued as the evidence that the transaction completed.) One typical way to do this is to use barrier instructions, such as __syncthreads() or __threadfence() But without those barrier instructions, threads are not "waiting" for writes to complete. A (an operation dependent on a) read can stall a thread. A write generally cannot stall a thread.

Now lets see about your questions:

so if one of them start an atomic operation, what other will do? Wait?

No, they don't wait. The atomic operation gets dispatched to a functional unit on the SM that handles atomics, and all threads continue, together, in lockstep. Since an atomic generally implies a read, yes, the read can stall the warp. But the threads do not wait until the atomic operation is completed (i.e, the write). However, a subsequent read of this location could stall the warp, again, waiting for the atomic (write) to complete. In the case of a global atomic, which is guaranteed to update global memory, it will invalidate the L1 in the originating SM (if enabled) and the L2, if they contain that location as an entry.

Is there any way to take other threads for some work, like reads some memory, so the atomic operation will hide it's latency?

Not really, for the reasons I stated at the beginning.

Atomic is really slow, does it? How can I accelerate it? Is there any pattern to work with it?

Yes, atomics can make a program run much more slowly if they dominate the activity (such as naive reductions or naive histogramming.) Generally speaking, the way to accelerate atomic operations is to not use them, or use them sparingly, in a way that doesn't dominate program activity. For example, a naive reduction would use an atomic to add every element to the global sum. A smart parallel reduction will use no atomics at all for the work done in the threadblock. At the end of the threadblock reduction, a single atomic might be used to update the threadblock partial sum into the global sum. This means that I can do a fast parallel reduction of an arbitrarily large number of elements with perhaps on the order of 32 atomic adds, or less. This sparing use of atomics will basically not be noticeable in the overall program execution, except that it enables the parallel reduction to be done in a single kernel call rather than 2.

Shared memory: Does they run parralel or they will be queued?

They will be queued. The reason for this is that there are a limited number of functional units that can process atomic operations on shared memory, not enough to service all the requests from a warp in a single cycle.

I've avoided trying to answer questions that relate to the throughput of atomic operations, because this data is not well specified in the documentation AFAIK. It may be that if you issue enough simultaneous or nearly-simultaneous atomic operations, that some warps will stall on the atomic instruction, due to the queues that feed the atomic functional units being full. I don't know this to be true and I can't answer questions about it.