I have two pieces of code. One written in C and the corresponding operation written in CUDA.
Please help me understand how __syncthreads()
works in context of the following programs. As per my understanding, __syncthreads()
ensures synchronization of threads limited to one block.
C program :
{
for(i=1;i<10000;i++)
{
t=a[i]+b[i];
a[i-1]=t;
}
}
`
The equivalent CUDA program : `
__global__ void kernel0(int *b, int *a, int *t, int N)
{
int b0=blockIdx.x;
int t0=threadIdx.x;
int tid=b0*blockDim.x+t0;
int private_t;
if(tid<10000)
{
private_t=a[tid]+b[tid];
if(tid>1)
a[tid-1]=private_t;
__syncthreads();
if(tid==9999)
*t=private_t;
}
}
Kernel Dimensions:
dim3 k0_dimBlock(32);
dim3 k0_dimGrid(313);
kernel0 <<<k0_dimGrid, k0_dimBlock>>>
The surprising fact is output from C and CUDA program are identical. Given the nature of problem, which has dependency of a[] onto itself, a[i] is loaded by thrad-ID i and written to a[i-1] by the same thread. Now the same happens for thread-ID i-1. Had the problem size been lesser than 32, the output is obvious. But for a problem of size 10000 with 313 blocks and blocks, how does the dependency gets respected ?
You're right.
__syncthreads()
is a synchronization barrier in the context of a block. Therefore, it is useful, for instance, when you must to ensure that all your data is updated before starting the next stage of your algorithm.Just imagine the thread 2 reach the if statement, since it matches the condition it enters to the statement. Now that threads do the following:
Witch is equivalent to:
As you pointed, it is data dependency on array
a
. Since you can't control the order of execution of the warps at some point you'll be using an updated version of thea
array. In my mind, you need to add an extra__syncthreads()
statement:In this way, every thread gets its own version of
private_t
variable using the original arraya
, then the array is updated in parallel.About the *t value:
If you're only looking at the value of
*t
, you'll not notice the effect of this random scheduling depending on the launching parameters, that's because the thread withtid==9999
could be in the last warp along with the threadtid==9998
. Since the two array positions needed to create theprivate_t
value and you already had that synchronization barrier the answer should be right