I have two pieces of code: one written in C, and the corresponding operation written in CUDA. Please help me understand how __syncthreads() works in the context of the following programs. As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.
C program:

/* a and b are int arrays of size 10000 (declared elsewhere) */
int i, t;
for (i = 1; i < 10000; i++)
{
    t = a[i] + b[i];
    a[i - 1] = t;
}
The equivalent CUDA program:
__global__ void kernel0(int *b, int *a, int *t, int N)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int tid = b0 * blockDim.x + t0;
    int private_t;

    if (tid < 10000)
    {
        private_t = a[tid] + b[tid];
        if (tid > 1)
            a[tid - 1] = private_t;
        __syncthreads();
        if (tid == 9999)
            *t = private_t;
    }
}
Kernel dimensions:

dim3 k0_dimBlock(32);
dim3 k0_dimGrid(313);
kernel0<<<k0_dimGrid, k0_dimBlock>>>(b, a, t, N);
The surprising fact is that the output from the C and CUDA programs is identical. Given the nature of the problem, which has a dependency of a[] onto itself, a[i] is loaded by thread-ID i and written to a[i-1] by the same thread. The same then happens for thread-ID i-1. Had the problem size been less than 32, the output would be obvious. But for a problem of size 10000 with 313 blocks of 32 threads each, how does the dependency get respected?
As per my understanding, __syncthreads() ensures synchronization of
threads limited to one block.
You're right. __syncthreads() is a synchronization barrier in the context of a block. Therefore, it is useful, for instance, when you must ensure that all your data is updated before starting the next stage of your algorithm.
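As a minimal illustration of that typical pattern (this sketch is mine, not from the original post; the kernel and variable names are made up): every thread of a block stages data in shared memory, __syncthreads() guarantees all those writes have completed, and only then do threads read each other's entries.

__global__ void reverse_within_block(int *data)
{
    __shared__ int buf[32];                  // one slot per thread, block size of 32 as in the question
    int i = threadIdx.x;
    int g = blockIdx.x * blockDim.x + i;

    buf[i] = data[g];                        // stage 1: every thread of the block writes its slot
    __syncthreads();                         // barrier: all writes above are visible block-wide
    data[g] = buf[blockDim.x - 1 - i];       // stage 2: safe to read a slot written by another thread
}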
Given the nature of problem, which has dependency of a[] onto itself,
a[i] is loaded by thread-ID i and written to a[i-1] by the same thread.
Just imagine that thread 2 reaches the if statement; since it matches the condition, it enters the body. That thread then does the following:

private_t = a[2] + b[2];
a[1] = private_t;

Which is equivalent to:

a[1] = a[2] + b[2];
As you pointed out, there is a data dependency on array a: thread 1, for instance, is simultaneously computing a[0] = a[1] + b[1], so whether it reads the old or the already-updated value of a[1] depends on which warp happens to run first. Since you can't control the order of execution of the warps, at some point you'll be reading an updated version of the a array. In my mind, you need to add an extra __syncthreads() statement:
if (tid > 0 && tid < 10000)
{
    private_t = a[tid] + b[tid];   // read the original values of a and b
    __syncthreads();               // block-level barrier: the reads above finish before any write below
    a[tid - 1] = private_t;        // overwrite using the value read before the barrier
    __syncthreads();
    if (tid == 9999)
        *t = private_t;
}
In this way, every thread gets its own copy of the private_t variable computed from the original array a, and only then is the array updated in parallel.
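For completeness, here is a sketch (mine, not part of the original answer) of what the whole kernel could look like with that extra barrier. Note that the barrier is hoisted out of the conditional, because __syncthreads() must be reached by every thread of the block; the name kernel0_fixed and the active flag are just illustrative.

__global__ void kernel0_fixed(int *b, int *a, int *t, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int private_t = 0;
    bool active = (tid > 0 && tid < N);

    if (active)
        private_t = a[tid] + b[tid];   // every active thread reads the original values
    __syncthreads();                   // reached by all threads of the block, active or not
    if (active)
    {
        a[tid - 1] = private_t;        // safe to overwrite: reads within this block are done
        if (tid == N - 1)
            *t = private_t;            // tid == 9999 when N == 10000, as in the question
    }
}

Keep in mind, as you said, that this barrier only covers threads of the same block; an element whose reader and writer land in different blocks is not protected by it.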
About the *t value:

If you're only looking at the value of *t, you may not notice the effect of this random scheduling, depending on the launch parameters. That's because the thread with tid==9999 could be in the last warp along with the thread with tid==9998; since the two array positions needed to create its private_t value are read there, and you already had that synchronization barrier, the answer should come out right.
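If you want to check this yourself, here is a hedged host-side sketch (mine, not from the original post; the initialization values are arbitrary) that runs the sequential C loop and the kernel0 from the question on the same data and counts mismatches:

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// assumes kernel0 from the question is defined above in the same .cu file
#define N 10000

int main()
{
    static int h_a[N], h_b[N], h_ref[N];
    int t_ref = 0, h_t = 0;
    for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 2 * i; }   // arbitrary test data
    memcpy(h_ref, h_a, sizeof(h_a));

    // sequential reference, exactly the C loop from the question
    for (int i = 1; i < N; ++i) { t_ref = h_ref[i] + h_b[i]; h_ref[i - 1] = t_ref; }

    int *d_a, *d_b, *d_t;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMalloc(&d_t, sizeof(int));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);

    kernel0<<<313, 32>>>(d_b, d_a, d_t, N);                       // same launch configuration as the question
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_t, d_t, sizeof(int), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < N; ++i) mismatches += (h_a[i] != h_ref[i]);
    printf("mismatches = %d, t (CPU) = %d, t (GPU) = %d\n", mismatches, t_ref, h_t);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_t);
    return 0;
}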