I ran the NVIDIA Visual Profiler on a CUDA application of mine. The application calls a single kernel repeatedly when the data is too large to process in one launch. The kernel has no branching.
The profiler reports a high instruction replay overhead of 83.6% and a high global memory instruction replay overhead of 83.5%.
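For context, the host side looks roughly like this; the function name, buffer names, chunk size, and launch configuration are placeholders, not my exact code:

// Illustrative host-side driver (all names and sizes are placeholders):
// the kernel is launched once per chunk when the data exceeds one chunk.
void decrypt_all(uint8_t *d_in, uint8_t *d_out, size_t totalBytes)
{
    const size_t chunkBytes = 1 << 20;                      // assumed chunk size
    dim3 block(256, 1);
    dim3 grid((unsigned)(chunkBytes / 4 / block.x), 1);     // one thread per 32-bit word
    for (size_t off = 0; off < totalBytes; off += chunkBytes) {
        // the kernel reads 4 words past the chunk, so the input
        // buffer is assumed to be padded accordingly
        dev_decrypt<<<grid, block>>>(d_in + off, d_out + off);
    }
    cudaDeviceSynchronize();
}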
Here is how the kernel generally looks:
// Decryption kernel
__global__ void dev_decrypt(uint8_t *in_blk, uint8_t *out_blk){
    __shared__ volatile word sdata[256];   // 'word' is a 32-bit type defined elsewhere
    register uint32_t data;
    // Global thread ID
    #define xID (threadIdx.x + blockIdx.x * blockDim.x)
    #define yID (threadIdx.y + blockIdx.y * blockDim.y)
    uint32_t tid = xID + yID * blockDim.x * gridDim.x;
    #undef xID
    #undef yID
    register uint32_t pos4   = tid%4;      // position within a 4-word block
    register uint32_t pos256 = tid%256;    // position within the shared buffer
    uint32_t blk = pos256&0xFC;            // start of this thread's 4-word block
    // Indices: rotate within the 4-word block
    // (for pos4 = 0..3, index0 selects words 3,0,1,2 and index1 selects 2,3,0,1)
    register uint32_t index0 = blk + (pos4+3)%4;
    register uint32_t index1 = blk + (pos4+2)%4;
    // Read from global memory at thread ID + 4 words (16 bytes), XOR with key
    sdata[pos256] = ((word*)in_blk)[tid+4] ^ dev_key[pos4];
    // Three rounds of table lookups through shared memory
    data = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    sdata[pos256] = data ^ tab2[pos4];
    data = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    sdata[pos256] = data ^ tab2[2*pos4];
    data = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    data ^= tab2[3*pos4];
    // Write result to global memory at thread ID
    ((uint32_t*)out_blk)[tid] = data + ((uint32_t*)in_blk)[tid];
}
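The kernel uses a few things defined elsewhere in my code; roughly, these are their shapes (the exact definitions are omitted here, and the tab expansion below is only illustrative):

typedef uint32_t word;            // 32-bit word type used throughout
__constant__ word dev_key[4];     // key material, assumed to live in constant memory
__constant__ word tab2[10];       // small lookup table (indexed up to 3*pos4 = 9)
// tab(n, x) is a table-lookup macro, something along the lines of:
// #define tab(n, x) (dev_tab##n[(x) & 0xFF])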
As you can see, there are no branches. Each thread initially reads from global memory at its thread ID plus 16 bytes. It then writes to the output buffer at its thread ID, after combining the result with a second word read from global memory at that same thread ID.
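To make the global traffic concrete in isolation, the bare access pattern with all the table work stripped out would be something like this (a minimal sketch, assuming word is uint32_t; I have not profiled this version separately):

// Minimal sketch: just the global read/write pattern of dev_decrypt,
// with the shared-memory rounds removed.
__global__ void access_pattern_only(uint8_t *in_blk, uint8_t *out_blk)
{
    uint32_t tid = (threadIdx.x + blockIdx.x * blockDim.x)
                 + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;
    uint32_t v = ((uint32_t*)in_blk)[tid + 4];                 // read at tid + 16 bytes
    ((uint32_t*)out_blk)[tid] = v + ((uint32_t*)in_blk)[tid];  // read and write at tid
}

If this stripped version showed the same replay numbers, I would expect the cause to be the global access pattern itself rather than the shared-memory rounds.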
Any ideas why this kernel would have so much overhead?