Is there a better way to process “undividable coun

Posted 2019-06-13 16:31

I need to do a data reduction (find the k-max number) on a vector of N numbers. The problem is that I don't know N beforehand (at compile time), and I am not sure whether I'm doing it right by launching two kernels: one with (int)(N / block_size) blocks, and a second with a single block of N % block_size threads.

Is there a better way to process "undividable" count of numbers by block_size in CUDA?

2 answers
叼着烟拽天下
#2 · 2019-06-13 16:58

A typical approach is like this (1-D grid example):

#define DATA_SIZE ...   // this is some arbitrary number
#define NUM_THREADS_PER_BLOCK ...  // this is block size, usually a multiple of 32
                                  // typical choices are 256, 512, 1024 (pick one)

unsigned int N = DATA_SIZE;  
unsigned int nTPB = NUM_THREADS_PER_BLOCK; 
my_kernel<<<(N + nTPB - 1)/nTPB, nTPB>>>(...);

This assumes your kernel has a "thread check" at the beginning like this:

unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < DATA_SIZE){
   // kernel code goes here
}
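As a fuller illustration of the pattern above, here is a minimal sketch in which the element count is passed to the kernel as a run-time argument instead of a compile-time macro, which is the case the question describes. The names scale_kernel, grid_size, and launch_scale are hypothetical and chosen only for this example, the per-element work (doubling each value) is a placeholder, and CUDA error checking is left out for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element. n is a run-time argument.
__global__ void scale_kernel(const float *in, float *out, unsigned int n)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) {                  // threads past the end of the data do nothing
        out[idx] = 2.0f * in[idx];
    }
}

// Ceiling division: the smallest number of blocks that covers n elements.
inline unsigned int grid_size(unsigned int n, unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host wrapper (hypothetical): one launch handles any n, divisible or not.
std::vector<float> launch_scale(const std::vector<float> &h_in,
                                unsigned int threads_per_block)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;       // a grid of zero blocks is not a valid launch

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_kernel<<<grid_size(n, threads_per_block), threads_per_block>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With n = 1000 and 256 threads per block, grid_size gives 4 blocks (1024 threads), and the last 24 threads fail the idx < n test and simply exit, so no second kernel is needed for the remainder.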
Melony?
#3 · 2019-06-13 17:04

@RobertCrovella's answer describes the standard way of handling this situation, and there is typically no need to worry about the extra if conditional that it requires in the kernel.

However, another alternative is to allocate the input and output buffers with padding up to a number that is divisible by the block size, run the kernel (without the if) and then ignore the extra results, for instance by not copying them back to the CPU.
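The padding approach could look roughly like the sketch below. The names scale_nocheck, padded_size, and scale_padded are hypothetical, the per-element work is again a placeholder, and error checking is omitted. One assumption worth stating: for a reduction such as a maximum, the padding elements would need a neutral value (for example a very negative number) so that they cannot affect the result. Here they are zeroed only so that the extra threads read defined data.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel with no bounds check: valid only when the buffers
// hold a whole number of blocks' worth of elements.
__global__ void scale_nocheck(const float *in, float *out)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    out[idx] = 2.0f * in[idx];
}

// Round n up to the next multiple of threads_per_block.
inline unsigned int padded_size(unsigned int n, unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    return ((n + threads_per_block - 1) / threads_per_block) * threads_per_block;
}

// Host wrapper (hypothetical): allocate padded buffers, copy back only n results.
std::vector<float> scale_padded(const std::vector<float> &h_in,
                                unsigned int threads_per_block)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int n_pad = padded_size(n, threads_per_block);
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n_pad * sizeof(float));
    cudaMalloc(&d_out, n_pad * sizeof(float));
    cudaMemset(d_in, 0, n_pad * sizeof(float));   // give the padding defined values
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_nocheck<<<n_pad / threads_per_block, threads_per_block>>>(d_in, d_out);

    // Copy back only the first n results; the padded tail is ignored.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The trade-off is a little extra device memory (at most one block's worth of elements per buffer) in exchange for dropping the branch from the kernel.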
