This question already has an answer here:
-
Thrust inside user written kernels
4 answers
The Thrust library can be used to sort data. The call might look like this (with a keys and a values vector):
thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_values.begin());
called on the CPU, with d_keys
and d_values
being in the CPU memory; and the bulk of the execution happens on the GPU.
However, my data is already on the GPU? How can I use the Thrust library to perform efficient sorting directly on the GPU, i.e., to call the sort_by_key
function from a kernel?
Also, my data consists of keys that are either
unsigned long long int
or unsigned int
and data that is always unsigned int
. How should I make the thrust call for these types?
As stated in the question Talonmies linked, you cannot call Thrust from a CUDA function (e.g. __device__
or __global__
). However, this doesn't mean you can't use data you already have in device memory with Thrust. Rather, you call the desired Thrust functions from the host using Thrust vectors wrapping your raw data. e.g.
//raw pointer to device memory
unsigned int * raw_data;
unsigned int * raw_keys;
//allocate device memory for data and keys
cudaMalloc((void **) &raw_data, N_data * sizeof(int));
cudaMalloc((void **) &raw_keys, N_keys * sizeof(int));
//populate your device pointers in your kernel
kernel<<<...>>>(raw_data, raw_keys, ...);
...
//wrap raw pointer with a device_ptr to use with Thrust functions
thrust::device_ptr<unsigned int> dev_data_ptr(raw_data);
thrust::device_ptr<unsigned int> dev_keys_ptr(raw_keys);
//use the device memory with a thrust call
thrust::sort_by_key(d_keys, d_keys + N_keys, dev_data_ptr);
The device memory pointed to by raw_data
and raw_keys
are still in device memory when you wrap them with Thrust::device_ptr
, so while you're calling the Thrust function from the host, it doesn't have to copy any memory from host to device or vice versa. That is, you're sorting directly on the GPU, using device memory; the only overhead you would have is in launching the Thrust kernel(s) and wrapping the raw device pointers.
And of course, you can get your raw pointers back if you need to use them in a regular CUDA kernel afterward:
unsigned int * raw_ptr = thrust::raw_pointer_cast(dev_data_ptr);
As for using either unsigned long long int
or unsigned int
as your keys with data that's unsigned int
, this isn't a problem, as Thrust is templated. That is, the signature for sort_by_key
is
template<typename RandomAccessIterator1 , typename RandomAccessIterator2 >
void thrust::sort_by_key(
RandomAccessIterator1 keys_first,
RandomAccessIterator1 keys_last,
RandomAccessIterator2 values_first )
meaning that you can have different types for the keys and data. As long as all of your key-types are homogenous for a given call, Thrust should be able to infer the types automatically and you won't have to do anything special. Hopefully that makes sense