Is it safe to use __syncthreads()
in a block where I have purposefully dropped threads using return
?
The documentation states that __syncthreads()
must be called by every thread in the block or else it will lead to a deadlock, but in practice I have never experienced such behavior.
Sample code:
__global__ void kernel(float* data, size_t size) {
// Drop excess threads if user put too many in kernel call.
// After the return, there are `size` active threads.
if (threadIdx.x >= size) {
return;
}
// ... do some work ...
__syncthreads(); // Is this safe?
// For the rest of the kernel, we need to drop one excess thread
// After the return, there are `size - 1` active threads
if (threadIdx.x + 1 == size) {
return;
}
// ... do more work ...
__syncthreads(); // Is this safe?
}
Compute Capability 7.x (Volta) update:
With the introduction of Independent Thread Scheduling among threads in a warp, CUDA is finally more strict in practice, now matching documented behavior. From the Programming Guide:
Below is the previous answer, which rambled about pre-Volta behavior.
Update: This answer may not add anything on top of talonmies' (depending on your understanding of the subject, I suppose), but at the risk of being too verbose I'm presenting the information that helped me understand this better. Also, if you are not interested in how things might work "under the hood" or what might be possible beyond the official documentation, there's nothing to see here. That all said, I still don't recommend making assumptions beyond what is officially documented, especially in an environment that hopes to support multiple or future architectures. I primarily wanted to point out that while this is explicitly called out as bad practice by the CUDA Programming Guide, the actual behavior of
__syncthreads()
may be somewhat different from how it is described and to me that is interesting. The last thing I want is to spread misinformation, so I'm open to discussion and revising my answer!A few important facts
There is no TL;DR for this answer as there is too much potential for misinterpretation, but here are some relevant facts to start:
__syncthreads()
behaves like a barrier for warps in a block rather than all of the threads in a block, although when used as advised it amounts to the same thing.bar
instruction (e.g. from_syncthreads
), it is as if all the threads in the warp have.bar.sync
is called (as generated by the instrinsic__syncthreads()
), the arrival count for that block and barrier are incremented by the warp size. This is how the previous points are achieved.__syncthreads()
. The instruction will not cause the warp to stall and wait for the threads on divergent paths. Branch execution is serialized, so only when the branches rejoin or the code terminates do the threads in the warp then resynchronize. Until that, the branches run in sequence and independently. Again, only one thread in each warp of the block needs to hit__syncthreads()
for execution to continue.These statements are supported by official documentation and other sources.
Interpretation and documentation
Since
__syncthreads()
acts as a barrier for warps in a block rather than all of the threads in a block, as it is described in the Programming Guide, it seems that a simple early exit would be fine if at least one thread in each warp hits the barrier. (But that is not to say you can't cause deadlocks with the intrinsic!) This also supposes that__syncthreads()
will always generate a simplebar.sync a;
PTX instruction and that the semantics of that will not change either, so don't do this in production.One interesting study that I came across actually investigates what happens when you go against the recommendations of the CUDA Programming Guide, and they found that while it is indeed possible to cause a deadlock by abusing
__syncthreads()
in conditional blocks, not all use of the intrinsic in conditional code will do so. From Section D.1 in the paper:This statement is concordant with the bit of the PTX documentation quoted by talonmies. Specifically:
It is clear from this why the optional thread count
b
in thebar.sync a{, b};
instruction must be a multiple of warp size -- whenever a single thread in a warp executes abar
instruction the arrival count is incremented by the warp size, not the number of threads in the warp that actually hit the barrier. Threads that terminate early (followed a different path) were effectively counted as arrived anyway. Now, the next sentence in the quoted passage does then say not to use__syncthreads()
in conditional code unless "it is known that all threads evaluate the condition identically (the warp does not diverge)." This seems to be an overly strict recommendation (for current architecture), meant to ensure that the arrival count actually reflects the real number of threads that hit the barrier. If at least one thread hitting the barrier increments the arrival count for the entire warp, you might really have a little more flexibility.There is no ambiguity in the PTX documentation that the
bar.sync a;
instruction generated by__syncthreads()
waits for all threads in the current cooperative thread array (block) to reach barriera
. However, the point is that how "all threads" is presently determined by incrementing the arrival count in multiples of warp size whenever the barrier is hit (by default whenb
is not specified). This part is not undefined behavior, at least not with Parallel Thread Execution ISA Version 4.2.Keep in mind that there may be inactive threads in a warp even without a conditional -- "the last threads of a block whose number of threads is not a multiple of the warp size." (SIMT architecture notes). Yet
__syncthreads()
is not forbidden in such blocks.Examples
Early exit version 1:
This will not deadlock if at least one thread per warp hits the sync, but a possible issue is order of serialization of the execution of divergent code paths. You can change around the above kernel to effectively swap the branches.
Early exit version 2:
Still no deadlock if you have at least one thread in the warp hit the barrier, but is the order of branch execution important in this case? I don't think so, but it's probably a bad idea to require a particular execution order.
The paper demonstrates this in a more involved example compared to a trivial early exit that also reminds us to be cautious around warp divergence. Here the first half of the warp (thread id
tid
on [0,15]) writes to some shared memory and executes__syncthreads()
, while the other half (thread idtid
on [16,31]) also executes__syncthreads()
but now reads from the shared memory locations written by the first half of the warp. Ignoring the shared memory test at first, you might expect a deadlock at either barrier.There is no deadlock, indicating that
__syncthreads()
does not synchronize diverged threads within a warp. Divergent code paths are serialized in a warp and it only takes one thread in a code path to make the call to__syncthreads()
work at the per-warp level.However, the shared memory bit shows where some unpredictable behavior can enter into this. The second half of the warp does not get the updated values from the first half because branch divergence serialized execution of the warp and the else block was executed first. So the function doesn't work right, but it also show that
__syncthreads()
does not synchronize divergent threads in a warp.Summary
__syncthreads()
does not wait for all threads in a warp, and the arrival of a single thread in a warp effectively counts the entire warp as having reached the barrier. (Present architecture).It can be dangerous to use
__syncthreads()
in conditional code because of how divergent thread execution is serialized.Use the intrinsic in conditional code only if you understand how it works and how branch divergence (which occurs within a warp) is handled.
Note that I didn't say to go ahead and use
__syncthreads()
in a way inconsistent with how it is documented.The answer to the short question is "No". Warp level branch divergence around a
__syncthreads()
instruction will cause a deadlock and result in a kernel hang. Your code example is not guaranteed to be safe or correct. The correct way to implement the code would be like this:so that the
__syncthreads()
instructions are executed unconditionally.EDIT: Just to add a bit of additional information which confirms this assertion,
__syncthreads()
calls get compiled into the PTXbar.sync
instruction on all architectures. The PTX2.0 guide (p133) documentsbar.sync
and includes the following warning:So despite any assertions to the contrary, it is not safe to have conditional branching around a
__syncthreads()
call unless you can be 100% certain that every thread in any given warp follows the same code path and no warp divergence can occur.