I have two pieces of code: one written in C, and the corresponding operation written in CUDA. Please help me understand how __syncthreads() works in the context of the following programs. As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.
C program:

/* a and b are int arrays of size 10000 (declared elsewhere) */
int i, t;
for (i = 1; i < 10000; i++)
{
    t = a[i] + b[i];
    a[i - 1] = t;
}
The equivalent CUDA program:
__global__ void kernel0(int *b, int *a, int *t, int N)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int tid = b0 * blockDim.x + t0;
    int private_t;

    if (tid < 10000)
    {
        private_t = a[tid] + b[tid];
        if (tid > 1)
            a[tid - 1] = private_t;
        __syncthreads();
        if (tid == 9999)
            *t = private_t;
    }
}
Kernel dimensions:

dim3 k0_dimBlock(32);
dim3 k0_dimGrid(313);
kernel0<<<k0_dimGrid, k0_dimBlock>>>(b, a, t, N);
The surprising fact is that the output from the C and CUDA programs is identical. Given the nature of the problem, which has a dependency of a[] onto itself, a[i] is loaded by thread-ID i and written to a[i-1] by the same thread. The same then happens for thread-ID i-1. Had the problem size been less than 32, the output would be obvious. But for a problem of size 10000 with 313 blocks of 32 threads each, how does the dependency get respected?
As per my understanding, __syncthreads() ensures synchronization of
threads limited to one block.
You're right. __syncthreads() is a synchronization barrier in the context of a block. Therefore, it is useful, for instance, when you must ensure that all your data is updated before starting the next stage of your algorithm.
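As a minimal illustration of that typical pattern (this sketch is mine, not from the original post; the kernel and variable names are made up): every thread of a block stages data in shared memory, __syncthreads() guarantees all those writes have completed, and only then do threads read each other's entries.

__global__ void reverse_within_block(int *data)
{
    __shared__ int buf[32];                  // one slot per thread, block size of 32 as in the question
    int i = threadIdx.x;
    int g = blockIdx.x * blockDim.x + i;

    buf[i] = data[g];                        // stage 1: every thread of the block writes its slot
    __syncthreads();                         // barrier: all writes above are visible block-wide
    data[g] = buf[blockDim.x - 1 - i];       // stage 2: safe to read a slot written by another thread
}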
Given the nature of problem, which has dependency of a[] onto itself,
a[i] is loaded by thread-ID i and written to a[i-1] by the same thread.
Just imagine that thread 2 reaches the if statement; since it matches the condition, it enters the body. That thread then does the following:

private_t = a[2] + b[2];
a[1] = private_t;

Which is equivalent to:

a[1] = a[2] + b[2];
As you pointed out, there is a data dependency on array a: thread 1, for instance, is simultaneously computing a[0] = a[1] + b[1], so whether it reads the old or the already-updated value of a[1] depends on which warp happens to run first. Since you can't control the order of execution of the warps, at some point you'll be reading an updated version of the a array. In my mind, you need to add an extra __syncthreads() statement:
if (tid > 0 && tid < 10000)
{
    private_t = a[tid] + b[tid];   // read the original values of a and b
    __syncthreads();               // block-level barrier: the reads above finish before any write below
    a[tid - 1] = private_t;        // overwrite using the value read before the barrier
    __syncthreads();
    if (tid == 9999)
        *t = private_t;
}
In this way, every thread gets its own copy of the private_t variable computed from the original array a, and only then is the array updated in parallel.
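For completeness, here is a sketch (mine, not part of the original answer) of what the whole kernel could look like with that extra barrier. Note that the barrier is hoisted out of the conditional, because __syncthreads() must be reached by every thread of the block; the name kernel0_fixed and the active flag are just illustrative.

__global__ void kernel0_fixed(int *b, int *a, int *t, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int private_t = 0;
    bool active = (tid > 0 && tid < N);

    if (active)
        private_t = a[tid] + b[tid];   // every active thread reads the original values
    __syncthreads();                   // reached by all threads of the block, active or not
    if (active)
    {
        a[tid - 1] = private_t;        // safe to overwrite: reads within this block are done
        if (tid == N - 1)
            *t = private_t;            // tid == 9999 when N == 10000, as in the question
    }
}

Keep in mind, as you said, that this barrier only covers threads of the same block; an element whose reader and writer land in different blocks is not protected by it.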
About the *t value:

If you're only looking at the value of *t, you may not notice the effect of this random scheduling, depending on the launch parameters. That's because the thread with tid==9999 could be in the last warp along with the thread with tid==9998; since the two array positions needed to create its private_t value are read there, and you already had that synchronization barrier, the answer should come out right.
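If you want to check this yourself, here is a hedged host-side sketch (mine, not from the original post; the initialization values are arbitrary) that runs the sequential C loop and the kernel0 from the question on the same data and counts mismatches:

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// assumes kernel0 from the question is defined above in the same .cu file
#define N 10000

int main()
{
    static int h_a[N], h_b[N], h_ref[N];
    int t_ref = 0, h_t = 0;
    for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 2 * i; }   // arbitrary test data
    memcpy(h_ref, h_a, sizeof(h_a));

    // sequential reference, exactly the C loop from the question
    for (int i = 1; i < N; ++i) { t_ref = h_ref[i] + h_b[i]; h_ref[i - 1] = t_ref; }

    int *d_a, *d_b, *d_t;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMalloc(&d_t, sizeof(int));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);

    kernel0<<<313, 32>>>(d_b, d_a, d_t, N);                       // same launch configuration as the question
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_t, d_t, sizeof(int), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < N; ++i) mismatches += (h_a[i] != h_ref[i]);
    printf("mismatches = %d, t (CPU) = %d, t (GPU) = %d\n", mismatches, t_ref, h_t);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_t);
    return 0;
}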