I understand how #pragma unroll works, but if I have the following example:
__global__ void
test_kernel(const float* B, const float* C, float* A_out)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x;
    if (j < array_size) {
#pragma unroll
        for (int i = 0; i < LIMIT; i++) {
            A_out[i] = B[i] + C[i];
        }
    }
}
I want to determine the optimal value for LIMIT in the kernel above, which will be launched with x threads and y blocks. LIMIT can be anywhere from 2 to 1<<20. Since 1 million seems like a very big number for the variable (unrolling 1 million iterations will cause register pressure, and I am not sure the compiler will do that unroll), what is a "fair" number, if any? And how do I determine that limit?
CUDA takes advantage of thread-level parallelism, which you expose by splitting work into multiple threads, and instruction-level parallelism, which CUDA finds by searching for independent instructions in your compiled code.
@talonmies' result, showing that your loop might be unrolled somewhere between 4096 and 8192 iterations, was surprising to me because loop unrolling has sharply diminishing returns on a modern CPU, where most iteration overhead has been optimized away with techniques such as branch prediction and speculative execution.
On a CPU, I doubt there would be much to gain from unrolling more than, say, 10-20 iterations, and an unrolled loop takes up more room in the instruction cache, so there is a cost to unrolling as well. The CUDA compiler will be weighing that cost/benefit tradeoff when deciding how much unrolling to do. So the question is: what might be the benefit of unrolling 4096+ iterations? I think it is that it gives the GPU more code in which it can search for independent instructions that it can then run concurrently, using instruction-level parallelism.
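To make that concrete, here is a hand-unrolled sketch of your loop body (my own illustration, assuming LIMIT is a multiple of 4). Each group of loads, adds and stores is independent of the others, so the hardware is free to overlap their memory latencies instead of waiting on each one in turn:

// Hand-unrolled by 4 (illustration only, assumes LIMIT is a multiple of 4).
// The four load/add/store groups are independent of each other, so the GPU
// can issue them without waiting for the previous group to finish.
for (int i = 0; i < LIMIT; i += 4) {
    float b0 = B[i],     c0 = C[i];
    float b1 = B[i + 1], c1 = C[i + 1];
    float b2 = B[i + 2], c2 = C[i + 2];
    float b3 = B[i + 3], c3 = C[i + 3];
    A_out[i]     = b0 + c0;
    A_out[i + 1] = b1 + c1;
    A_out[i + 2] = b2 + c2;
    A_out[i + 3] = b3 + c3;
}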
The body of your loop is A_out[i] = B[i] + C[i];. Since the logic in your loop does not access external variables and does not access results from earlier iterations of the loop, each iteration is independent of all other iterations. So i doesn't have to increase sequentially. The end result would be the same even if the loop iterated over each value of i between 0 and LIMIT - 1 in completely random order. That property makes the loop a good candidate for parallel optimization.

But there is a catch, and it's what I mentioned in the comment. The iterations of your loop are only independent if the A buffer is stored separately from your B and C buffers. If your A buffer partially or fully overlaps the B and/or C buffers in memory, a connection between different iterations is created: one iteration may change the B and C input values for another iteration by writing to A, so you get different results depending on which of the two iterations runs first.

Multiple pointers pointing to the same locations in memory is called pointer aliasing. In general, pointer aliasing can create "hidden" connections between sections of code that appear to be separate, because writes done by one section of code through one pointer can alter values read by another section of code reading from another pointer. By default, CPU compilers take possible pointer aliasing into account and generate code that yields the correct result regardless. The question is what CUDA does, because, coming back to talonmies' test results, the only reason I can see for such a large amount of unrolling is that it opens the code up for instruction-level parallelism. But that then means that CUDA does not take pointer aliasing into account in this particular situation.
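As an aside: if you know the buffers never overlap, you can say so explicitly. A sketch (assuming you are free to change the kernel signature) using the __restrict__ qualifier, which promises the compiler that the pointers do not alias:

__global__ void
test_kernel(const float* __restrict__ B,
            const float* __restrict__ C,
            float* __restrict__ A_out)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x;
    if (j < array_size) {
        // __restrict__ promises that A_out never overlaps B or C, so the
        // compiler may reorder and overlap iterations without worrying that
        // one iteration's store changes another iteration's loads.
#pragma unroll
        for (int i = 0; i < LIMIT; i++) {
            A_out[i] = B[i] + C[i];
        }
    }
}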
Regarding your question about running more than a single thread: a regular serial program does not automatically become a parallel program when you increase the number of threads. You have to identify the portions of the work that can run in parallel and then express that in your CUDA kernel. That's what's called thread-level parallelism, and it's the main source of performance increase for your code. In addition, CUDA will search for independent instructions in each kernel and may run those concurrently, which is instruction-level parallelism. Advanced CUDA programmers may keep instruction-level parallelism in mind and write code that facilitates it, but we mortals should just focus on thread-level parallelism. That means you should look at your code again and consider what might be able to run in parallel. Since we already concluded that the body of your loop is a good candidate for parallelization, your job becomes rewriting the serial loop in your kernel so that CUDA can run separate iterations in parallel, as sketched below.
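A sketch of what that rewrite could look like (the kernel name and the explicit n parameter are my own choices): each thread handles one iteration of the original loop, and the loop itself disappears:

__global__ void
add_kernel(const float* B, const float* C, float* A_out, int n)
{
    // One element (one iteration of the original serial loop) per thread.
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        A_out[i] = B[i] + C[i];
}

You would launch it with enough threads to cover n, for example add_kernel<<<(n + 255) / 256, 256>>>(B, C, A, n);.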
Your example kernel is completely serial and not in any way a useful real-world use case for loop unrolling, but let's restrict ourselves to the question of how much loop unrolling the compiler will perform.
Here is a compilable version of your kernel with a bit of template decoration:
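(A sketch along these lines; the instantiation values below are just the probe points I chose, and any set of LIMIT values works.)

template <int LIMIT>
__global__ void
test_kernel(const float* B, const float* C, float* A_out, int array_size)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x;
    if (j < array_size) {
#pragma unroll
        for (int i = 0; i < LIMIT; i++) {
            A_out[i] = B[i] + C[i];
        }
    }
}

// Explicit instantiations so each unroll depth gets its own kernel in the PTX.
template __global__ void test_kernel<4>(const float*, const float*, float*, int);
template __global__ void test_kernel<64>(const float*, const float*, float*, int);
template __global__ void test_kernel<256>(const float*, const float*, float*, int);
template __global__ void test_kernel<1024>(const float*, const float*, float*, int);
template __global__ void test_kernel<4096>(const float*, const float*, float*, int);
template __global__ void test_kernel<8192>(const float*, const float*, float*, int);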
You can compile this to PTX and see for yourself that (at least with the CUDA 7 release compiler and the default compute capability 2.0 target architecture) the kernels with up to LIMIT=4096 are fully unrolled. The LIMIT=8192 case is not unrolled. If you have more patience than I do, you can probably play around with the templating to find the exact compiler limit for this code, although I doubt that is particularly instructive to know.

You can also see for yourself via the compiler that all of the heavily unrolled versions use the same number of registers (because of the trivial nature of your kernel).