I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler, however, reports that I am performing inefficient global memory accesses. My kernel code is:
__global__ void update_particles_kernel
(
    float4 *pos,
    float4 *vel,
    float4 *acc,
    float dt,
    int numParticles
)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = 0;
    while(index + offset < numParticles)
    {
        vel[index + offset].x += dt*acc[index + offset].x; // line 247
        vel[index + offset].y += dt*acc[index + offset].y;
        vel[index + offset].z += dt*acc[index + offset].z;
        pos[index + offset].x += dt*vel[index + offset].x; // line 251
        pos[index + offset].y += dt*vel[index + offset].y;
        pos[index + offset].z += dt*vel[index + offset].z;
        offset += blockDim.x * gridDim.x;
    }
}
In particular the profiler reports the following:
The CUDA Best Practices Guide says:
"For devices of compute capability 2.x, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp. By default, all accesses are cached through L1, which has 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter 32-byte segments (see the CUDA C Programming Guide).
For devices of compute capability 3.x, accesses to global memory are cached only in L2; L1 is reserved for local memory accesses. Some devices of compute capability 3.5, 3.7, or 5.2 allow opt-in caching of globals in L1 as well."
Now, in my kernel, based on this information I would expect that 16 accesses would be required to service a 32-thread warp, because float4 is 16 bytes and on my card (a 770M, compute capability 3.0) reads from the L2 cache are performed in 32-byte chunks (16 bytes * 32 threads / 32-byte cache lines = 16 accesses). Indeed, as you can see, the profiler reports that I am doing 16 accesses. What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247 and only 4 L2 transactions per access for the remaining lines. Can someone explain what I am missing here?
To take one example, your float4 vel array is stored in memory like this:
0.x 0.y 0.z 0.w 1.x 1.y 1.z 1.w 2.x 2.y 2.z 2.w 3.x 3.y 3.z 3.w ...
^               ^               ^               ^               ...
thread0         thread1         thread2         thread3
So when you do this:
vel[index + offset].x += ...; // line 247
you are accessing (storing) at the locations (.x) that I have marked above. The gaps in between each ^ mark indicate an inefficient access pattern, which is what the profiler is pointing out. (It does not matter that in the very next line of code you are storing to the .y locations.)
There are at least two solutions, one of which would be a classical AoS -> SoA reorganization of your data, with appropriate code adjustments. This is well documented (e.g. here on the cuda tag and elsewhere) in terms of what it means and how to do it, so I will let you look that up.
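For illustration only, here is a minimal sketch of what an SoA version of this kernel could look like; the separate per-component arrays and the kernel name below are hypothetical, not something from your code:

// Hypothetical SoA layout: one float array per component, so consecutive
// threads in a warp read consecutive floats (fully coalesced accesses).
__global__ void update_particles_soa_kernel
(
    float *posx, float *posy, float *posz,
    float *velx, float *vely, float *velz,
    float *accx, float *accy, float *accz,
    float dt,
    int numParticles
)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = 0;
    while(index + offset < numParticles)
    {
        int i = index + offset;
        velx[i] += dt*accx[i];
        vely[i] += dt*accy[i];
        velz[i] += dt*accz[i];
        posx[i] += dt*velx[i];
        posy[i] += dt*vely[i];
        posz[i] += dt*velz[i];
        offset += blockDim.x * gridDim.x;
    }
}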
The other typical solution is to load a float4 quantity per thread when you need it, and store a float4 quantity per thread when you need to. Your code can be trivially reworked to do this, which should give improved profiling results:
//preceding code need not change
while(index + offset < numParticles)
{
    // load the full float4 once per thread: adjacent threads read adjacent
    // 16-byte quantities, which coalesces cleanly
    float4 my_vel = vel[index + offset];
    float4 my_acc = acc[index + offset];
    my_vel.x += dt*my_acc.x;
    my_vel.y += dt*my_acc.y;
    my_vel.z += dt*my_acc.z;
    // store the full float4 back in one coalesced write
    vel[index + offset] = my_vel;
    float4 my_pos = pos[index + offset];
    my_pos.x += dt*my_vel.x;
    my_pos.y += dt*my_vel.y;
    my_pos.z += dt*my_vel.z;
    pos[index + offset] = my_pos;
    offset += blockDim.x * gridDim.x;
}
Even though you might think that this code is "less efficient" than your code, because your code "appears" to be only loading and storing .x, .y, and .z, whereas mine "appears" to also load and store .w, in fact there is essentially no difference, due to the way a GPU loads and stores to/from global memory. Although your code does not appear to touch .w, in the process of accessing the adjacent elements the GPU will load the .w elements from global memory, and also (eventually) store the .w elements back to global memory.
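As a usage note, the reworked kernel launches exactly as your original did. A hypothetical host-side setup might look like the following; the allocation size, launch configuration, and d_* pointer names here are illustrative assumptions, not something from your post. Because of the grid-stride loop, the grid does not need to cover numParticles exactly:

// Illustrative host code; names and sizes are assumptions.
int numParticles = 1 << 20;
float dt = 0.01f;
float4 *d_pos, *d_vel, *d_acc;
cudaMalloc(&d_pos, numParticles * sizeof(float4));
cudaMalloc(&d_vel, numParticles * sizeof(float4));
cudaMalloc(&d_acc, numParticles * sizeof(float4));
// ... initialize d_pos / d_vel / d_acc ...
int blockSize = 256;
int gridSize  = 64;   // any reasonable size; the grid-stride loop covers the rest
update_particles_kernel<<<gridSize, blockSize>>>(d_pos, d_vel, d_acc, dt, numParticles);
cudaDeviceSynchronize();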
What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247
For line 247 in your original code, you are accessing one float quantity per thread for the load operation of acc.x, and one float quantity per thread for the load operation of vel.x. A float quantity per thread by itself should require 128 bytes for a warp, which is four 32-byte L2 cachelines. The two loads together would require 8 L2 cacheline loads. This is the ideal case, which assumes that the quantities are packed together nicely (SoA). But that is not what you have (you have AoS).
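If it helps, here is the same arithmetic spelled out as a tiny host-side calculation; the constants are just the warp size, sizeof(float), and the 32-byte L2 segment size discussed above:

#include <stdio.h>

int main(void)
{
    const int warp_size       = 32;  // threads per warp
    const int bytes_per_float = 4;   // sizeof(float)
    const int l2_segment      = 32;  // bytes per L2 transaction
    const int loads_on_247    = 2;   // line 247 reads vel[...].x and acc[...].x

    // Ideal (SoA, fully coalesced) case: one float per thread
    int bytes_per_load = warp_size * bytes_per_float;   // 128 bytes per warp
    int tx_per_load    = bytes_per_load / l2_segment;   // 4 L2 transactions
    int ideal_tx       = loads_on_247 * tx_per_load;    // 8 L2 transactions

    printf("ideal L2 transactions for line 247: %d\n", ideal_tx);
    return 0;
}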