I ran the NVIDIA Visual Profiler on a CUDA application of mine. The application calls a single kernel repeatedly when the data is too large to process in one launch. The kernel has no branching.
The profiler reports a high instruction replay overhead of 83.6% and a high global memory instruction replay overhead of 83.5%.
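For context, the host side looks roughly like this; the function name, buffer names, chunk size, and launch configuration are placeholders, not my exact code:

// Illustrative host-side driver (all names and sizes are placeholders):
// the kernel is launched once per chunk when the data exceeds one chunk.
void decrypt_all(uint8_t *d_in, uint8_t *d_out, size_t totalBytes)
{
    const size_t chunkBytes = 1 << 20;                      // assumed chunk size
    dim3 block(256, 1);
    dim3 grid((unsigned)(chunkBytes / 4 / block.x), 1);     // one thread per 32-bit word
    for (size_t off = 0; off < totalBytes; off += chunkBytes) {
        // the kernel reads 4 words past the chunk, so the input
        // buffer is assumed to be padded accordingly
        dev_decrypt<<<grid, block>>>(d_in + off, d_out + off);
    }
    cudaDeviceSynchronize();
}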
Here is how the kernel generally looks:
// Decryption kernel
__global__ void dev_decrypt(uint8_t *in_blk, uint8_t *out_blk){
    __shared__ volatile word sdata[256];   // 'word' is a 32-bit type defined elsewhere
    register uint32_t data;
    // Global thread ID
    #define xID (threadIdx.x + blockIdx.x * blockDim.x)
    #define yID (threadIdx.y + blockIdx.y * blockDim.y)
    uint32_t tid = xID + yID * blockDim.x * gridDim.x;
    #undef xID
    #undef yID
    register uint32_t pos4   = tid%4;      // position within a 4-word block
    register uint32_t pos256 = tid%256;    // position within the shared buffer
    uint32_t blk = pos256&0xFC;            // start of this thread's 4-word block
    // Indices: rotate within the 4-word block
    // (for pos4 = 0..3, index0 selects words 3,0,1,2 and index1 selects 2,3,0,1)
    register uint32_t index0 = blk + (pos4+3)%4;
    register uint32_t index1 = blk + (pos4+2)%4;
    // Read from global memory at thread ID + 4 words (16 bytes), XOR with key
    sdata[pos256] = ((word*)in_blk)[tid+4] ^ dev_key[pos4];
    // Three rounds of table lookups through shared memory
    data = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    sdata[pos256] = data ^ tab2[pos4];
    data = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    sdata[pos256] = data ^ tab2[2*pos4];
    data = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    data ^= tab2[3*pos4];
    // Write result to global memory at thread ID
    ((uint32_t*)out_blk)[tid] = data + ((uint32_t*)in_blk)[tid];
}
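The kernel uses a few things defined elsewhere in my code; roughly, these are their shapes (the exact definitions are omitted here, and the tab expansion below is only illustrative):

typedef uint32_t word;            // 32-bit word type used throughout
__constant__ word dev_key[4];     // key material, assumed to live in constant memory
__constant__ word tab2[10];       // small lookup table (indexed up to 3*pos4 = 9)
// tab(n, x) is a table-lookup macro, something along the lines of:
// #define tab(n, x) (dev_tab##n[(x) & 0xFF])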
As you can see, there are no branches. Each thread initially reads from global memory at its thread ID plus 16 bytes. It then writes to the output buffer at its thread ID, after combining the result with a second word read from global memory at that same thread ID.
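To make the global traffic concrete in isolation, the bare access pattern with all the table work stripped out would be something like this (a minimal sketch, assuming word is uint32_t; I have not profiled this version separately):

// Minimal sketch: just the global read/write pattern of dev_decrypt,
// with the shared-memory rounds removed.
__global__ void access_pattern_only(uint8_t *in_blk, uint8_t *out_blk)
{
    uint32_t tid = (threadIdx.x + blockIdx.x * blockDim.x)
                 + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;
    uint32_t v = ((uint32_t*)in_blk)[tid + 4];                 // read at tid + 16 bytes
    ((uint32_t*)out_blk)[tid] = v + ((uint32_t*)in_blk)[tid];  // read and write at tid
}

If this stripped version showed the same replay numbers, I would expect the cause to be the global access pattern itself rather than the shared-memory rounds.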
Any ideas why this kernel would have so much overhead?