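I have a matrix stored as a one-dimensional array on the GPU, and I'm trying to write an OpenCL kernel that performs a reduction over each row of this matrix. For example:
Say my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6]. What I want to do is: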
[1, 2, 3] = [ 6]
[4, 5, 6] [15]
Obviously, since I'm talking about a reduction, the kernel may actually return more than one element per row:
[1, 2, 3] = [3, 3]
[4, 5, 6] [9, 6]
The final calculation I can then do in another kernel or on the CPU.
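To make the expected result concrete, here is a minimal CPU reference sketch in plain C of the per-row partial reduction described above. The function names (`chunks_per_row`, `row_partial_sums`, `row_final_sums`) and the `chunk` parameter, which stands in for the work-group size, are assumptions made for illustration and are not part of the kernel. Row-major storage is assumed as well.

```c
#include <stddef.h>

/* Hypothetical CPU reference, not the OpenCL kernel itself.
 * Number of partial sums produced per row: ceil(cols / chunk). */
size_t chunks_per_row(size_t cols, size_t chunk)
{
    return (cols + chunk - 1) / chunk;
}

/* First pass: sum each row of a row-major rows x cols matrix in pieces of
 * `chunk` elements, the way one work-group would produce one partial sum.
 * `out` must hold rows * chunks_per_row(cols, chunk) floats. */
void row_partial_sums(const float *in, size_t rows, size_t cols,
                      size_t chunk, float *out)
{
    size_t groups = chunks_per_row(cols, chunk);
    for (size_t r = 0; r < rows; ++r) {
        for (size_t g = 0; g < groups; ++g) {
            size_t begin = g * chunk;
            size_t end = (begin + chunk < cols) ? begin + chunk : cols;
            float acc = 0.0f;
            for (size_t c = begin; c < end; ++c)
                acc += in[r * cols + c];
            out[r * groups + g] = acc;
        }
    }
}

/* Second pass: collapse the partial sums of each row into one value.
 * `out` must hold `rows` floats. */
void row_final_sums(const float *partial, size_t rows, size_t groups,
                    float *out)
{
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t g = 0; g < groups; ++g)
            acc += partial[r * groups + g];
        out[r] = acc;
    }
}
```

With the 2x3 example and `chunk = 2`, `row_partial_sums` yields [3, 3] for the first row and [9, 6] for the second, and `row_final_sums` then gives [6] and [15].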
So far, what I have is a kernel that does the reduction, but over all the elements of the array at once, like this:
[1, 2, 3] = [21]
[4, 5, 6]
The kernel that actually does this reduction is the following (I got it from here on Stack Overflow, actually):
__kernel void
sum2(__global float *inVector, __global float *outVector,
     const unsigned int inVectorSize, __local float *resultScratch)
{
    const unsigned int localId = get_local_id(0);
    const unsigned int workGroupSize = get_local_size(0);

    /* Load one element per work-item into local memory; pad with 0 past the end. */
    if (get_global_id(0) < inVectorSize)
        resultScratch[localId] = inVector[get_global_id(0)];
    else
        resultScratch[localId] = 0;

    /* Tree reduction in local memory; assumes the work-group size is a power of two. */
    for (unsigned int a = workGroupSize >> 1; a > 0; a >>= 1)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (a > localId)
            resultScratch[localId] += resultScratch[localId + a];
    }

    /* Work-item 0 writes this work-group's partial sum. */
    if (localId == 0)
        outVector[get_group_id(0)] = resultScratch[0];
    barrier(CLK_LOCAL_MEM_FENCE);
}