128-bit vector addition with Cuda, performance iss

I want to add 128-bit vectors with carry. My 128-bit version (addKernel128 in the code below) is twice slower than the basic 32-bit version (addKernel32 below). Do I have memory coalescing problems ? How can I get better performance ?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>

#define UADDO(c, a, b) asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
#define UADDC(c, a, b) asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));

__global__ void addKernel32(unsigned int *c, const unsigned int *a, const unsigned int *b, const int size)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  while (tid < size)
  {
    c[tid] = a[tid] + b[tid];
    tid += blockDim.x * gridDim.x;
  }
}

__global__ void addKernel128(unsigned *c, const unsigned *a, const unsigned *b, const int size)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  while (tid < size / 4)
  {
    uint4 a4 = ((const uint4 *)a)[tid],
          b4 = ((const uint4 *)b)[tid],
          c4;

    UADDO(c4.x, a4.x, b4.x)
    UADDC(c4.y, a4.y, b4.y) // add with carry
    UADDC(c4.z, a4.z, b4.z) // add with carry
    UADDC(c4.w, a4.w, b4.w) // add with carry (no overflow checking for clarity)

    ((uint4 *)c)[tid] = c4;

    tid += blockDim.x * gridDim.x;
  }
}

int main()
{
  const int size = 10000000; // 10 million

  unsigned int *d_a, *d_b, *d_c;

  cudaMalloc((void**)&d_a, size * sizeof(int));
  cudaMalloc((void**)&d_b, size * sizeof(int));
  cudaMalloc((void**)&d_c, size * sizeof(int));

  cudaMemset(d_a, 1, size * sizeof(int)); // dummy init just for the example
  cudaMemset(d_b, 2, size * sizeof(int)); // dummy init just for the example
  cudaMemset(d_c, 0, size * sizeof(int));

  int nbThreads = 512;
  int nbBlocks = 1024; // for example

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);

  addKernel128<<<nbBlocks, nbThreads>>>(d_c, d_a, d_b, size);

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float m = 0;
  cudaEventElapsedTime(&m, start, stop);

  cudaFree(d_c);
  cudaFree(d_b);
  cudaFree(d_a);
  cudaDeviceReset();
  printf("Elapsed = %g\n", m);
  return 0;
}

Timing CUDA code on a WDDM GPU can be quite difficult for a variety of reasons. Most of these revolve around the fact that the GPU is being managed as a display device by Windows, and this can introduce a variety of artifacts into the timing. One example is that the windows driver and WDDM will batch work for the GPU, and may interleave display work in the middle of CUDA GPU work.

if possible, time your cuda code on linux, or else on a windows GPU in TCC mode.
for performance, always build without the -G switch. In visual studio, this usually corresponds to building the release, not the debug version of the project.
To get a good performance comparison, it's usually advisable to do some "warm up runs" before actually measuring the timing results. These will eliminate "start-up" and other one-time measurement issues, are you are more likely to get sensible results. You may also wish to run your code a number of times and average the results.
It's also usually advisable to compile with an arch flag that corresponds to your GPU, so for example -arch=sm_20 for a cc2.0 GPU.