Is there a better way to process “undividable coun

Posted 2019-06-13 16:31

I need to perform a data reduction (finding the k-max number) on a vector of N numbers. The problem is that I don't know N beforehand (at compile time), and I'm not sure I'm doing it right by launching two kernels: one with (int)(N / block_size) blocks, and a second with a single block of N % block_size threads.

Is there a better way to process "undividable" count of numbers by block_size in CUDA?

Tags: cuda parallel-processing

2 answers

叼着烟拽天下

#2 · 2019-06-13 16:58

A typical approach is like this (1-D grid example):

#define DATA_SIZE ...   // this is some arbitrary number
#define NUM_THREADS_PER_BLOCK ...  // this is block size, usually a multiple of 32
                                  // typical choices are 256, 512, 1024 (pick one)

unsigned int N = DATA_SIZE;
unsigned int nTPB = NUM_THREADS_PER_BLOCK;
// round the block count up so a final, partially filled block covers the remainder
my_kernel<<<(N + nTPB - 1)/nTPB, nTPB>>>(...);

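This assumes your kernel has a "thread check" at the beginning like this (if N is only known at run time, pass it to the kernel as an argument and compare against that instead of DATA_SIZE):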

unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < DATA_SIZE){
   // kernel code goes here
}
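
For the case in the question, where N is only known at run time, the same pattern might look roughly like the sketch below. The names scale_kernel, blocks_for and scale_on_gpu are hypothetical, the element-doubling body is just a stand-in for the real work, and error checking of the CUDA calls is left out for brevity:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: doubles each element. n arrives as a runtime argument,
// so nothing about the data size has to be known at compile time.
__global__ void scale_kernel(const float* in, float* out, unsigned int n)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) {  // thread check: threads past the end of the data do nothing
        out[idx] = 2.0f * in[idx];
    }
}

// Hypothetical helper: block count rounded up, i.e. ceil(n / nTPB).
unsigned int blocks_for(unsigned int n, unsigned int nTPB)
{
    return (n + nTPB - 1) / nTPB;
}

// Hypothetical host wrapper: one launch covers all n elements,
// whether or not n is a multiple of the block size.
std::vector<float> scale_on_gpu(const std::vector<float>& host_in,
                                unsigned int nTPB = 256)
{
    unsigned int n = static_cast<unsigned int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) return host_out;  // a grid with zero blocks is not a valid launch

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_kernel<<<blocks_for(n, nTPB), nTPB>>>(d_in, d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

With N = 1000 and 256 threads per block this launches 4 blocks (1024 threads), and the last 24 threads simply fail the check and exit, so the second "remainder" kernel is not needed.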

Melony?

#3 · 2019-06-13 17:04

@RobertCrovella's answer describes the standard way of handling this situation, and there is typically no need to worry about the extra if conditional that it requires in the kernel.

However, an alternative is to allocate the input and output buffers with padding up to a multiple of the block size, run the kernel (without the if), and then ignore the extra results, for instance by not copying them back to the CPU.
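
A rough sketch of the padding idea, under the same assumptions as any toy example: padded_size, scale_kernel_nocheck and scale_padded are hypothetical names, the doubling kernel is a stand-in, and error checking is omitted. The padding is zeroed here only so the extra threads read defined values. For an actual reduction such as a max, the padding would presumably need a value that cannot change the result.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel without a bounds check. This is only safe because
// both buffers are allocated with padded_size() elements.
__global__ void scale_kernel_nocheck(const float* in, float* out)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    out[idx] = 2.0f * in[idx];
}

// Hypothetical helper: n rounded up to the next multiple of nTPB.
unsigned int padded_size(unsigned int n, unsigned int nTPB)
{
    return ((n + nTPB - 1) / nTPB) * nTPB;
}

std::vector<float> scale_padded(const std::vector<float>& host_in,
                                unsigned int nTPB = 256)
{
    unsigned int n = static_cast<unsigned int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) return host_out;

    unsigned int n_pad = padded_size(n, nTPB);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n_pad * sizeof(float));
    cudaMalloc(&d_out, n_pad * sizeof(float));

    // Zero the whole input buffer, then overwrite the first n elements.
    cudaMemset(d_in, 0, n_pad * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // n_pad is an exact multiple of nTPB, so every thread has a valid element.
    scale_kernel_nocheck<<<n_pad / nTPB, nTPB>>>(d_in, d_out);

    // Copy back only the n real results; the padded tail is ignored.
    cudaMemcpy(host_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

The trade-off is a little extra memory and wasted work on the padding in exchange for a kernel with no conditional, which, as noted above, is rarely a measurable win.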
