Due to some performance issues with the Thrust libraries (see this page for more details), I am planning on re-factoring a CUDA application to use CUB instead of Thrust. Specifically, to replace the thrust::sort_by_key and thrust::inclusive_scan calls). In a particular point in my application I need to sort 3 arrays by key. This is how I did this with thrust:
thrust::sort_by_key(key_iter, key_iter + numKeys, indices);
thrust::gather_wrapper(indices, indices + numKeys,
thrust::make_zip_iterator(thrust::make_tuple(values1Ptr, values2Ptr, values3Ptr)),
thrust::make_zip_iterator(thrust::make_tuple(valuesOut1Ptr, valuesOut2Ptr, valuesOut3Ptr))
);
where
key iter
is a thrust::device_ptr that points to the keys i want to sort byindices
point to a sequence (from 0 to numKeys-1) in device memoryvalues{1,2,3}Ptr
are device_ptrs to the values i want to sortvalues{1,2,3}OutPtr
are device_ptrs to the sorted values
With the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. Problem is I don't see any CUB "gather-like" utilities. Suggestions?
EDIT:
I suppose I could implement my own gather kernel, but is there any better way to do this other than:
template <typename Index, typename Value>
__global__ void gather_kernel(const unsigned int N, const Index * map,
const Value * src, Value * dst)
{
unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N)
{
dst[i] = src[map[i]];
}
}
The non-coalesed loads and stores make me chringe, but it probably unavoidable without a known structure on map
.