I need to do a data reduction (find the k-max number) on a vector of N numbers. The problem is that I don't know N beforehand (before compilation), and I am not sure whether I'm doing it right by launching two kernels: one with `(int)(N / block_size)` blocks, and a second with a single block of `N % block_size` threads.
Is there a better way in CUDA to process a count of numbers that is not divisible by block_size?
A typical approach is like this (1-D grid example):
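As a minimal sketch of that launch pattern (not necessarily the exact code the answer had in mind): the grid size is rounded up so that a single launch covers all N elements, with the last block only partially used. The names `scaleKernel`, `blocksFor`, and `launchScale` and the block size of 256 are assumptions made up for this illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel standing in for the real work.
__global__ void scaleKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {           // guard for the partially filled last block
        data[idx] *= 2.0f;
    }
}

// Number of blocks needed to cover n elements (ceiling division).
inline int blocksFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Launch one kernel over all n elements; n is only known at run time.
void launchScale(float *d_data, int n) {
    const int blockSize = 256;   // assumed; any valid block size works
    if (n <= 0) return;          // a launch with zero blocks is invalid
    scaleKernel<<<blocksFor(n, blockSize), blockSize>>>(d_data, n);
}
```

With N = 1000 and a block size of 256 this launches 4 blocks (1024 threads), and the last 24 threads simply do nothing.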
This assumes your kernel has a "thread check" at the beginning like this:
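A sketch of such a check, under some assumptions. Since the question is about a reduction, this version does not return early: out-of-range threads load a neutral value (`-FLT_MAX` for a maximum) so that every thread still reaches `__syncthreads()`. The names `blockMaxKernel` and `maxOnDevice` and the constant `kBlockSize` are hypothetical, the block size is assumed to be a power of two, and error checking is left out for brevity.

```cuda
#include <algorithm>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed; must be a power of two for the loop below

// Each block writes the maximum of its slice of `in` to blockMax[blockIdx.x].
__global__ void blockMaxKernel(const float *in, float *blockMax, int n) {
    __shared__ float cache[kBlockSize];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Thread check: threads past the end contribute a neutral element
    // instead of reading out of bounds.
    cache[threadIdx.x] = (idx < n) ? in[idx] : -FLT_MAX;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] = fmaxf(cache[threadIdx.x], cache[threadIdx.x + stride]);
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) blockMax[blockIdx.x] = cache[0];
}

// Host wrapper: per-block maxima on the GPU, final maximum on the CPU.
float maxOnDevice(const std::vector<float> &host) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return -FLT_MAX;
    const int blocks = (n + kBlockSize - 1) / kBlockSize;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockMaxKernel<<<blocks, kBlockSize>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return *std::max_element(partial.begin(), partial.end());
}
```

For a kernel without any block-level synchronization, the simpler form `if (idx >= n) return;` at the top of the kernel does the same job.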
@RobertCrovella's answer describes the standard way of handling this situation, and there is typically no need to worry about the extra `if` conditional that is needed in the kernel. However, another alternative is to allocate the input and output buffers with padding up to a size that is divisible by the block size, run the kernel (without the `if`), and then ignore the extra results, for instance by not copying them back to the CPU.
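A rough sketch of the padding idea, for an element-wise kernel operating in place. The names `scaleNoCheck`, `paddedSize`, and `scaleWithPadding` and the block size of 256 are made up for this example, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel with no bounds check; safe only because the
// buffer length is a multiple of the block size.
__global__ void scaleNoCheck(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;
}

// Round n up to the next multiple of blockSize.
inline int paddedSize(int n, int blockSize) {
    return ((n + blockSize - 1) / blockSize) * blockSize;
}

std::vector<float> scaleWithPadding(const std::vector<float> &host) {
    const int blockSize = 256;   // assumed block size
    const int n = static_cast<int>(host.size());
    std::vector<float> result(n);
    if (n == 0) return result;

    const int padded = paddedSize(n, blockSize);
    float *d_data = nullptr;
    cudaMalloc(&d_data, padded * sizeof(float));
    cudaMemset(d_data, 0, padded * sizeof(float));  // give the padding defined contents
    cudaMemcpy(d_data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scaleNoCheck<<<padded / blockSize, blockSize>>>(d_data);

    // Copy back only the first n results; the padded tail is ignored.
    cudaMemcpy(result.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return result;
}
```

One caveat for the reduction in the question: there the padded elements get mixed into the result, so they cannot simply be ignored afterwards. The padding would have to be filled with a neutral value (for a maximum, something like `-FLT_MAX`) before the kernel runs.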