CUDA Array Reduction

I'm aware that there are multiple questions similar to this one already answered but I've been unable to piece together anything very helpful from them other than that I'm probably incorrectly indexing something.

I'm trying to preform a sequential addressing reduction on input vector A into output vector B.

The full code is available here http://pastebin.com/7UGadgjX, but this is the kernel:

__global__ void vectorSum(int *A, int *B, int numElements) {
  extern __shared__ int S[];
  // Each thread loads one element from global to shared memory
  int tid = threadIdx.x;
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < numElements) {
    S[tid] = A[i];
    __syncthreads();
    // Reduce in shared memory
    for (int t = blockDim.x/2; t > 0; t>>=1) {
      if (tid < t) {
        S[tid] += S[tid + t];
      }
      __syncthreads();
    }
    if (tid == 0) B[blockIdx.x] = S[0];
  }
}

and these are the kernel launch statements:

// Launch the Vector Summation CUDA Kernel
  int threadsPerBlock = 256;
  int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
  vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);

I'm getting a unspecified launch error which I've read is similar to a segfault. I've been following the nvidia reduction documentation closely and tried to keep my kernel within the bounds of numElements but I seem to be missing something key considering how simple the code is.

Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is out of bounds/illegal shared memory access which aborts the kernel.

In CUDA runtime API syntax, the kernel launch statement has four arguments. The first two are the grid and block dimensions for the launch. The latter two are optional with zero default values, but specify the dynamically allocated shared memory size and stream.

To fix this, change the launch code as follows:

// Launch the Vector Summation CUDA Kernel
  int threadsPerBlock = 256;
  int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
  size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
  vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);

[disclaimer: code written in browser, not compiled or tested, use at own risk]

This should at least fix the most obvious problem with your code.