I am trying to run a vector step-addition function in CUDA C++, but even for large float arrays of size 5,000,000, it runs slower than my CPU version. Below is the relevant CUDA and CPU code I am talking about:
#define THREADS_PER_BLOCK 1024

typedef float real;

__global__ void vectorStepAddKernel2(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size)
    {
        x[i*xstep] = alpha * y[i*ystep] + beta * z[i*zstep];
    }
}
cudaError_t vectorStepAdd2(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    cudaError_t cudaStatus;
    int threadsPerBlock = THREADS_PER_BLOCK;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    vectorStepAddKernel2<<<blocksPerGrid, threadsPerBlock>>>(x, y, z, alpha, beta, size, xstep, ystep, zstep);

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching vectorStepAddKernel!\n", cudaStatus);
        exit(1);
    }
    return cudaStatus;
}
//CPU function:
void vectorStepAdd3(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    for (int i = 0; i < size; i++)
    {
        x[i*xstep] = alpha * y[i*ystep] + beta * z[i*zstep];
    }
}
Calling vectorStepAdd2 results in slower computation than vectorStepAdd3 when each of the 3 arrays is of size 5,000,000 and size = 50,000 (i.e., 50,000 elements are combined in this step-wise manner).
Any ideas on what I can do to speed up the GPU code?
My device is a Tesla M2090 GPU.
Thanks.
Responding to your question "Any ideas on what I can do to speed up the GPU code?":
First let me preface this with the observation that the proposed operation, X = alpha * Y + beta * Z, has very little compute intensity per byte of data transferred: roughly three floating-point operations for every 12 bytes the kernel touches. As a result, I was not able to beat the CPU time on this particular code. However, it may be instructive to cover two ideas for speeding up this code:
1. Use page-locked (pinned) memory for the data transfer operations. This netted roughly a 2x reduction in data transfer times for the GPU version, and those transfers dominate the overall GPU execution time.
2. Use a strided copying technique with cudaMemcpy2D, as proposed by @njuffa here. The result is two-fold: we transfer only the data that is actually required for the computation, and we can then rewrite the kernel to operate on the data contiguously, as suggested in the comments (again by njuffa). This netted roughly a further 3x improvement in data transfer time and about a 10x improvement in kernel compute time. (A minimal sketch of both ideas follows this list.)
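Before the full test program, here is a minimal, self-contained sketch of the two ideas in isolation. It uses the same sizes and strides as the program below; error checking is abbreviated and the data values are placeholders:

#include <stdio.h>
#include <cuda_runtime.h>

typedef float real;
#define DSIZE 5000000
#define WSIZE 50000
#define YSTEP 43

int main() {
    real *h_y, *d_y1;

    // idea 1: page-locked (pinned) host allocation in place of malloc,
    // so host<->device copies can run as full-speed DMA transfers
    cudaHostAlloc((void **)&h_y, DSIZE*sizeof(real), cudaHostAllocDefault);
    cudaMalloc((void **)&d_y1, WSIZE*sizeof(real));
    for (int i = 0; i < DSIZE; i++) h_y[i] = (real)i;

    // idea 2: gather every YSTEP-th element of h_y into the compact device
    // buffer d_y1 with one strided 2D copy. Each "row" is a single element
    // (width = sizeof(real)), there are WSIZE rows (height), the source rows
    // are YSTEP*sizeof(real) apart (spitch), and the destination rows are
    // packed (dpitch = sizeof(real)).
    cudaMemcpy2D(d_y1, sizeof(real),        // dst, dst pitch
                 h_y, YSTEP*sizeof(real),   // src, src pitch
                 sizeof(real), WSIZE,       // width in bytes, height in rows
                 cudaMemcpyHostToDevice);
    printf("copy status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}

The key point is that the pitch parameters of cudaMemcpy2D let a single call express the strided gather, so no host-side packing loop is needed.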
The full test program below demonstrates both of these operations:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // for fabsf
#define THREADS_PER_BLOCK 1024
#define DSIZE 5000000
#define WSIZE 50000
#define XSTEP 47
#define YSTEP 43
#define ZSTEP 41
#define TOL 0.00001f
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
typedef float real;
__global__ void vectorStepAddKernel2(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size)
    {
        x[i*xstep] = alpha * y[i*ystep] + beta * z[i*zstep];
    }
}

// same operation, but on contiguous (already-gathered) data
__global__ void vectorStepAddKernel2i(real *x, real *y, real *z, real alpha, real beta, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size)
    {
        x[i] = alpha * y[i] + beta * z[i];
    }
}
void vectorStepAdd2(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    int threadsPerBlock = THREADS_PER_BLOCK;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    vectorStepAddKernel2<<<blocksPerGrid, threadsPerBlock>>>(x, y, z, alpha, beta, size, xstep, ystep, zstep);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel2 fail");
}

void vectorStepAdd2i(real *x, real *y, real *z, real alpha, real beta, int size)
{
    int threadsPerBlock = THREADS_PER_BLOCK;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    vectorStepAddKernel2i<<<blocksPerGrid, threadsPerBlock>>>(x, y, z, alpha, beta, size);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel3 fail");
}
//CPU function:
void vectorStepAdd3(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    for (int i = 0; i < size; i++)
    {
        x[i*xstep] = alpha * y[i*ystep] + beta * z[i*zstep];
    }
}
int main() {
    real *h_x, *h_y, *h_z, *c_x, *h_x1;
    real *d_x, *d_y, *d_z, *d_x1, *d_y1, *d_z1;
    int dsize = DSIZE;
    int wsize = WSIZE;
    int xstep = XSTEP;
    int ystep = YSTEP;
    int zstep = ZSTEP;
    real alpha = 0.5f;
    real beta = 0.5f;
    float et;

    /* original pageable host allocations, kept for reference:
    h_x = (real *)malloc(dsize*sizeof(real));
    if (h_x == 0){printf("malloc1 fail\n"); return 1;}
    h_y = (real *)malloc(dsize*sizeof(real));
    if (h_y == 0){printf("malloc2 fail\n"); return 1;}
    h_z = (real *)malloc(dsize*sizeof(real));
    if (h_z == 0){printf("malloc3 fail\n"); return 1;}
    c_x = (real *)malloc(dsize*sizeof(real));
    if (c_x == 0){printf("malloc4 fail\n"); return 1;}
    h_x1 = (real *)malloc(dsize*sizeof(real));
    if (h_x1 == 0){printf("malloc5 fail\n"); return 1;}
    */

    // page-locked (pinned) host allocations: idea 1
    cudaHostAlloc((void **)&h_x, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cuda Host Alloc 1 fail");
    cudaHostAlloc((void **)&h_y, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cuda Host Alloc 2 fail");
    cudaHostAlloc((void **)&h_z, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cuda Host Alloc 3 fail");
    cudaHostAlloc((void **)&c_x, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cuda Host Alloc 4 fail");
    cudaHostAlloc((void **)&h_x1, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cuda Host Alloc 5 fail");

    cudaMalloc((void **)&d_x, dsize*sizeof(real));
    cudaCheckErrors("cuda malloc1 fail");
    cudaMalloc((void **)&d_y, dsize*sizeof(real));
    cudaCheckErrors("cuda malloc2 fail");
    cudaMalloc((void **)&d_z, dsize*sizeof(real));
    cudaCheckErrors("cuda malloc3 fail");
    // compact device buffers for the improved version: idea 2
    cudaMalloc((void **)&d_x1, wsize*sizeof(real));
    cudaCheckErrors("cuda malloc4 fail");
    cudaMalloc((void **)&d_y1, wsize*sizeof(real));
    cudaCheckErrors("cuda malloc5 fail");
    cudaMalloc((void **)&d_z1, wsize*sizeof(real));
    cudaCheckErrors("cuda malloc6 fail");

    for (int i = 0; i < dsize; i++){
        h_x[i] = 0.0f;
        h_x1[i] = 0.0f;
        c_x[i] = 0.0f;
        h_y[i] = (real)(rand()/(real)RAND_MAX);
        h_z[i] = (real)(rand()/(real)RAND_MAX);
    }

    cudaEvent_t t_start, t_stop, k_start, k_stop;
    cudaEventCreate(&t_start);
    cudaEventCreate(&t_stop);
    cudaEventCreate(&k_start);
    cudaEventCreate(&k_stop);
    cudaCheckErrors("event fail");
    // first test original GPU version
    cudaEventRecord(t_start);
    cudaMemcpy(d_x, h_x, dsize * sizeof(real), cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 1 fail");
    cudaMemcpy(d_y, h_y, dsize * sizeof(real), cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 2 fail");
    cudaMemcpy(d_z, h_z, dsize * sizeof(real), cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 3 fail");
    cudaEventRecord(k_start);
    vectorStepAdd2(d_x, d_y, d_z, alpha, beta, wsize, xstep, ystep, zstep);
    cudaEventRecord(k_stop);
    cudaMemcpy(h_x, d_x, dsize * sizeof(real), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cuda memcpy 4 fail");
    cudaEventRecord(t_stop);
    cudaEventSynchronize(t_stop);
    cudaEventElapsedTime(&et, t_start, t_stop);
    printf("GPU original version total elapsed time is: %f ms.\n", et);
    cudaEventElapsedTime(&et, k_start, k_stop);
    printf("GPU original kernel elapsed time is: %f ms.\n", et);

    // now test CPU version
    cudaEventRecord(t_start);
    vectorStepAdd3(c_x, h_y, h_z, alpha, beta, wsize, xstep, ystep, zstep);
    cudaEventRecord(t_stop);
    cudaEventSynchronize(t_stop);
    cudaEventElapsedTime(&et, t_start, t_stop);
    printf("CPU version total elapsed time is: %f ms.\n", et);

    for (int i = 0; i < dsize; i++)
        if (fabsf((float)(h_x[i]-c_x[i])) > TOL) {
            printf("cpu/gpu results mismatch at i = %d, cpu = %f, gpu = %f\n", i, c_x[i], h_x[i]);
            return 1;
        }
    // now test improved GPU version
    cudaEventRecord(t_start);
    // x is write-only in the kernel, so its initial contents need not be copied in:
    // cudaMemcpy2D(d_x1, sizeof(real), h_x, xstep * sizeof(real), sizeof(real), wsize, cudaMemcpyHostToDevice);
    // cudaCheckErrors("cuda memcpy 5 fail");
    cudaMemcpy2D(d_y1, sizeof(real), h_y, ystep * sizeof(real), sizeof(real), wsize, cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 6 fail");
    cudaMemcpy2D(d_z1, sizeof(real), h_z, zstep * sizeof(real), sizeof(real), wsize, cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 7 fail");
    cudaEventRecord(k_start);
    vectorStepAdd2i(d_x1, d_y1, d_z1, alpha, beta, wsize);
    cudaEventRecord(k_stop);
    // scatter the compact result back to its strided positions in h_x1
    cudaMemcpy2D(h_x1, xstep*sizeof(real), d_x1, sizeof(real), sizeof(real), wsize, cudaMemcpyDeviceToHost);
    cudaCheckErrors("cuda memcpy 8 fail");
    cudaEventRecord(t_stop);
    cudaEventSynchronize(t_stop);
    cudaEventElapsedTime(&et, t_start, t_stop);
    printf("GPU improved version total elapsed time is: %f ms.\n", et);
    cudaEventElapsedTime(&et, k_start, k_stop);
    printf("GPU improved kernel elapsed time is: %f ms.\n", et);

    for (int i = 0; i < dsize; i++)
        if (fabsf((float)(h_x[i]-h_x1[i])) > TOL) {
            printf("gpu/gpu improved results mismatch at i = %d, gpu = %f, gpu imp = %f\n", i, h_x[i], h_x1[i]);
            return 1;
        }
printf("Results:i CPU GPU GPUi \n");
for (int i = 0; i< 20*xstep; i+=xstep)
printf(" %d %f %f %f %f %f\n",i, c_x[i], h_x[i], h_x1[i]);
return 0;
}
As mentioned, I still was not able to beat the CPU time, and I attribute this either to my own lack of coding skill or to the fact that this operation fundamentally does not have enough compute complexity to be interesting on the GPU. Nevertheless, here are some sample results:
GPU original version total elapsed time is: 13.352256 ms.
GPU original kernel elapsed time is: 0.195808 ms.
CPU version total elapsed time is: 2.599584 ms.
GPU improved version total elapsed time is: 4.228288 ms.
GPU improved kernel elapsed time is: 0.027392 ms.
Results:i CPU GPU GPUi
0 0.617285 0.617285 0.617285
47 0.554522 0.554522 0.554522
94 0.104245 0.104245 0.104245
....
We can see that the improved version achieved an overall reduction of about 3x in total time compared to the original version, almost all of which was due to the reduction in data copy time. This reduction came from the fact that, with the 2D memcpy, we only copy the data we actually use (without the page-locked memory, these transfer times would be approximately twice as long). We can also see that the kernel computation is about 10x faster than the CPU computation for the original kernel, and about 100x faster for the improved kernel (the improved kernel moves 3 x 50,000 x 4 bytes, about 600 KB, in roughly 0.027 ms, i.e. around 22 GB/s). Nevertheless, with the data transfer times factored in, we are not able to overcome the CPU speed.
One final comment is that the "cost" of the cudaMemcpy2D operation is still pretty high. For a 100x reduction in vector size, we see only about a 3x reduction in time to copy, so strided access remains a relatively expensive way to feed the GPU. If we were simply transferring contiguous vectors of 50,000 elements, we would expect a nearly linear 100x reduction in copy time (compared to copying the original 5,000,000-element vectors). That would put the copy time under 1 ms, and the GPU version would then be faster than the CPU, at least compared to this naive single-threaded CPU code.
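If you want to measure that gap directly on your own hardware, a hypothetical standalone test along these lines (same sizes and stride as above; error checking omitted, and the buffer contents are irrelevant for timing) compares the strided gather against a contiguous copy of the same 50,000 elements:

#include <stdio.h>
#include <cuda_runtime.h>

typedef float real;
#define DSIZE 5000000
#define WSIZE 50000
#define YSTEP 43

int main() {
    real *h_y, *d_y;
    float et_strided, et_contig;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaHostAlloc((void **)&h_y, DSIZE*sizeof(real), cudaHostAllocDefault);
    cudaMalloc((void **)&d_y, WSIZE*sizeof(real));

    // strided gather of WSIZE elements, as in the improved version above
    cudaEventRecord(start);
    cudaMemcpy2D(d_y, sizeof(real), h_y, YSTEP*sizeof(real),
                 sizeof(real), WSIZE, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et_strided, start, stop);

    // contiguous copy of the same number of elements
    cudaEventRecord(start);
    cudaMemcpy(d_y, h_y, WSIZE*sizeof(real), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et_contig, start, stop);

    printf("strided copy: %f ms, contiguous copy: %f ms\n", et_strided, et_contig);
    return 0;
}

On a system where the contiguous copy approaches full PCIe bandwidth, the ratio printed by this test should show how much of the remaining GPU-version time is attributable to the strided access pattern rather than to the raw volume of data.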