If I understand correctly, for compute capability 2.x devices there's a 63-register limit per thread. Does anyone know what the per-thread register limit is for devices of compute capability 1.3?
I have a big kernel which I'm testing on a GTX260. I'm pretty sure I'm using a lot of registers, since the kernel is very complex and I need a lot of local variables. According to the CUDA profiler my register usage is 63 (static smem is 68, although I'm not so sure what that means, and dynamic smem is 0). Since I'm pretty sure I have more than 63 local variables, I figure the compiler is either reusing registers or spilling them to local memory.
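In case it's relevant, this is roughly how I'd check the register and local-memory usage at compile time (the flags are from the nvcc/ptxas toolchain; the file name is just a placeholder):

    nvcc -arch=sm_13 -Xptxas -v mykernel.cu

If ptxas reports any lmem bytes for the kernel, I'd take that to mean registers are actually being spilled to local memory rather than just reused.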
Now, I thought devices of compute capability 1.3 had a higher per-thread register limit than the 2.x devices. My guess was that the compiler was choosing the 63 limit because I'm using blocks of 256 threads: 256*63 is 16128, while 256*64 is 16384, which is the total number of registers per SM on this device. So I figured that if I lowered the number of threads per block, the compiler could use more registers per thread. I ran the kernel with blocks of 192 threads, but the profiler still shows 63 registers, even though 63*192 is 12096 and 64*192 is 12288, both well inside the SM's 16384-register limit.
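For reference, the only knobs I know of for steering the per-thread register budget are the --maxrregcount compiler flag and the __launch_bounds__ qualifier; a rough sketch of both (the kernel below is just a placeholder to show the syntax, not my actual code):

    // cap the register count for the whole compilation unit:
    //   nvcc -arch=sm_13 --maxrregcount=80 -Xptxas -v mykernel.cu

    // or tell the compiler the launch configuration per kernel, so it can
    // budget registers for at most 192 threads per block:
    __global__ void __launch_bounds__(192) myKernel(float *out, const float *in)
    {
        out[threadIdx.x] = in[threadIdx.x];
    }

But I'm not sure either of these can push the compiler past whatever limit it's hitting in my case.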
So, any idea why the compiler is still limiting itself to 63 registers? Could it all be down to register reuse, or is it still spilling registers?