Is there a rule of thumb for keeping the compiler happy when it looks at a kernel and assigns registers?
The compiler has a lot of flexibility, but I worry that it might start using excessive local memory if I create, say, 500 variables in my kernel... or a single very long line with a ton of operations.
I know the only way my program could really examine register use on a specific device is by using the AMD SDK or the NVIDIA SDK (or comparing the assembly code to the Device's architecture). Unfortunately, I am using PyOpenCL, so working with those SDKs would be impractical.
My program generates semi-random kernels, and I'm trying to prevent it from doing things that would choke the compiler and cause it to start dumping registers into local memory.
The compiler keeps track of each private variable's live range: it is not the number of variables you declare that matters, but how they are used.
In the following example only 2 registers are needed, even though 5 private variables are declared:
//Notice that a register is occupied from the point where a value has to be
// stored, not where the variable is declared. A variable that is declared
// but never used is simply optimized away by the compiler.
R1 | R2 | Code
a | - | int a = 1;
a | b | int b = 3;
a | b | int c;
c | b | c = a + b;
c | b | int d;
c | d | d = c + b;
c | d | int e;
e | - | e = c + d;
- | - | out[idx] = e; //Global memory output
It all depends on the scope of the variable (when it is needed, whether it is needed, and for how long).
The only critical thing is NOT to allocate more private memory than needed when the compiler cannot predict the access pattern:
int a[100];
//Initialize a with some value
int b;
b = a[global_index];
Because the index is not known at compile time, the compiler cannot predict which values you will use, so it has to keep all 100 values live and will spill them to memory if needed. For that kind of access it is better to create a lookup table, or even do a single read from a global table.
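A hypothetical sketch of the two variants, written as PyOpenCL-style kernel source strings (the kernel names, the modulo guard, and the 100-entry table are my own illustration, not from the answer): the first indexes a private array with a runtime value and risks spilling; the second reads from a `__constant` table instead.

```python
# Two OpenCL kernel sources held as Python strings, as you would pass them
# to pyopencl.Program. Names and sizes are illustrative only.

# Risky: a 100-element private array indexed with a runtime value.
# The compiler must keep all 100 values live and may spill them.
KERNEL_PRIVATE_ARRAY = """
__kernel void lookup_private(__global const int *idx_in, __global int *out) {
    int gid = get_global_id(0);
    int a[100];
    for (int i = 0; i < 100; ++i)
        a[i] = i * i;                 // initialize a with some value
    out[gid] = a[idx_in[gid] % 100];  // runtime index -> possible spill
}
"""

# Better: keep the table in __constant (or __global) memory, so each
# work-item does a single read instead of holding 100 private values.
KERNEL_CONSTANT_TABLE = """
__constant int table[100] = {0};  // filled with real data in practice

__kernel void lookup_constant(__global const int *idx_in, __global int *out) {
    int gid = get_global_id(0);
    out[gid] = table[idx_in[gid] % 100];  // one read, no private array
}
"""
```

For a generator that emits semi-random kernels, the practical rule this suggests is: never emit a private array that is indexed by anything the compiler cannot fold to a constant.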
There is an option for NVIDIA platforms that even works programmatically, without the SDK. (Maybe there is something similar for AMD cards?)
You can pass "-cl-nv-verbose" as the build options string when calling clBuildProgram. This generates log information that can later be obtained via the program build log:
clBuildProgram(program, 0, NULL, "-cl-nv-verbose", NULL, NULL);
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, ...);
(sorry, I'm not sure about the python syntax for this).
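Since the asker is on PyOpenCL, here is a minimal sketch of the equivalent calls (assuming PyOpenCL is installed and an NVIDIA platform is available; the helper name is my own, and on non-NVIDIA platforms the option may be ignored or rejected):

```python
def print_build_log(kernel_src, options=("-cl-nv-verbose",)):
    """Build an OpenCL program and print the per-device build log.

    On NVIDIA platforms the log contains the ptxas register/spill
    statistics shown below; behavior elsewhere is platform-specific.
    """
    import pyopencl as cl  # imported lazily so the sketch parses anywhere

    ctx = cl.create_some_context()
    prg = cl.Program(ctx, kernel_src).build(options=list(options))
    for dev in ctx.devices:
        # Same query as clGetProgramBuildInfo(..., CL_PROGRAM_BUILD_LOG, ...)
        print(prg.get_build_info(dev, cl.program_build_info.LOG))
```

Calling `print_build_log(src)` on a kernel source string should print one log per device in the context.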
The result should be a string containing the desired information. For a simple vector-addition kernel, this shows:
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function 'sampleKernel' for 'sm_21'
ptxas : info : Function properties for sampleKernel
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 4 registers, 44 bytes cmem[0], 4 bytes cmem[16]
You can also use the "-cl-nv-maxrregcount=..." option to specify the maximum register count, but of course all of this is device- and platform-specific and should therefore be used with care.