I am a newbie to Thrust. I see that all Thrust presentations and examples only show host code.
I would like to know: can I pass a device_vector to my own kernel? How? If yes, what operations are permitted on it inside kernel/device code?
If you mean using the data allocated/processed by Thrust, yes you can: just get the raw pointer to the allocated data.

If you want to allocate Thrust vectors inside a kernel, I have never tried it, but I don't think it will work, and even if it did, I don't think it would provide any benefit.
As it was originally written, Thrust is purely a host-side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel, as in the sketch below, and you can also use device memory not allocated by Thrust within Thrust algorithms by instantiating a thrust::device_ptr with the bare CUDA device memory pointer.
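A minimal sketch of both directions (the kernel someKernelCall and the buffer sizes are placeholders for your own code):

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Hypothetical user kernel: it receives a bare pointer, never a device_vector.
__global__ void someKernelCall(float *data, size_t n) { /* ... */ }

int main()
{
    // Thrust-allocated memory -> your own kernel: extract the raw pointer.
    thrust::device_vector<float> d_vec(1024, 1.0f);
    float *raw = thrust::raw_pointer_cast(d_vec.data());
    someKernelCall<<<4, 256>>>(raw, d_vec.size());

    // cudaMalloc'ed memory -> Thrust algorithms: wrap it in a device_ptr.
    float *d_raw;
    cudaMalloc(&d_raw, 1024 * sizeof(float));
    cudaMemset(d_raw, 0, 1024 * sizeof(float));
    thrust::device_ptr<float> d_ptr(d_raw);
    float sum = thrust::reduce(d_ptr, d_ptr + 1024);
    printf("sum = %f\n", sum);

    cudaFree(d_raw);
    return 0;
}
```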
Edited four and a half years later to add that, as per @JackOLantern's answer, Thrust 1.8 adds a sequential execution policy, which means you can run single-threaded versions of Thrust's algorithms on the device. Note that it still isn't possible to directly pass a Thrust device vector to a kernel, and device vectors can't be directly used in device code.
Note that it is also possible, in some cases, to use the thrust::device execution policy to have parallel Thrust execution launched by a kernel as a child grid. This requires separate compilation/device linkage and hardware which supports dynamic parallelism. I am not certain whether this is actually supported in all Thrust algorithms or not, but it certainly works with some.

I would like to provide an updated answer to this question.
Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread). An example is reported below.

If you want parallel execution within a thread, then you may consider using CUB, which provides reduction routines that can be called from within a threadblock, provided that your card enables dynamic parallelism.
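For illustration, here is a minimal sketch of a CUB block-wide reduction (the kernel name blockSumKernel and the fixed size of 128 threads are arbitrary choices for this example):

```cuda
#include <cstdio>
#include <cub/cub.cuh>

// Each block of 128 threads cooperatively reduces 128 floats to one sum.
__global__ void blockSumKernel(const float *d_in, float *d_out)
{
    // Specialize BlockReduce for 128 threads, each owning one float.
    typedef cub::BlockReduce<float, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    float thread_value = d_in[blockIdx.x * blockDim.x + threadIdx.x];
    float block_sum = BlockReduce(temp_storage).Sum(thread_value);

    if (threadIdx.x == 0) d_out[blockIdx.x] = block_sum; // one partial sum per block
}

int main()
{
    const int N = 128;
    float h_in[N];
    for (int i = 0; i < N; ++i) h_in[i] = 1.0f; // expected block sum: 128

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<1, 128>>>(d_in, d_out);

    float h_out;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Block sum = %f\n", h_out);
    return 0;
}
```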
Here is the example with Thrust:
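A minimal sketch, assuming a single-thread launch so the thrust::seq reduction runs entirely inside one CUDA thread (the array contents are arbitrary):

```cuda
#include <cstdio>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

__global__ void sequentialReduceKernel(const float *d_A, int N)
{
    // Runs sequentially within this one thread; no child kernels are launched.
    float sum = thrust::reduce(thrust::seq, d_A, d_A + N);
    printf("Device-side result = %f\n", sum);
}

int main()
{
    const int N = 16;
    float h_A[N], sum = 0.0f;
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; sum += h_A[i]; }
    printf("Host-side result   = %f\n", sum);

    float *d_A;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);

    sequentialReduceKernel<<<1, 1>>>(d_A, N);
    cudaDeviceSynchronize();
    return 0;
}
```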
This is an update to my previous answer.
Starting from Thrust 1.8.1, CUDA Thrust primitives can be combined with the thrust::device execution policy to run in parallel within a single CUDA thread, exploiting CUDA dynamic parallelism. An example is reported below.
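A minimal sketch, assuming a small row-major matrix with one CUDA thread per row (the kernel name row_sums and the matrix sizes are illustrative); building it requires relocatable device code and device linkage, e.g. nvcc -arch=sm_35 -rdc=true ... -lcudadevrt:

```cuda
#include <cstdio>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// One CUDA thread per matrix row; each thread asks Thrust to reduce its row.
// With thrust::device the reduction may be launched as a child grid via
// dynamic parallelism; with thrust::seq it runs sequentially in the thread.
__global__ void row_sums(const float *d_matrix, float *d_sums, int Nrows, int Ncols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= Nrows) return;
    d_sums[row] = thrust::reduce(thrust::device,
                                 d_matrix + row * Ncols,
                                 d_matrix + (row + 1) * Ncols);
}

int main()
{
    const int Nrows = 4, Ncols = 8;
    float h_matrix[Nrows * Ncols];
    for (int i = 0; i < Nrows * Ncols; ++i) h_matrix[i] = 1.0f; // each row sums to Ncols

    float *d_matrix, *d_sums;
    cudaMalloc(&d_matrix, sizeof(h_matrix));
    cudaMalloc(&d_sums, Nrows * sizeof(float));
    cudaMemcpy(d_matrix, h_matrix, sizeof(h_matrix), cudaMemcpyHostToDevice);

    row_sums<<<1, Nrows>>>(d_matrix, d_sums, Nrows, Ncols);
    cudaDeviceSynchronize();

    float h_sums[Nrows];
    cudaMemcpy(h_sums, d_sums, sizeof(h_sums), cudaMemcpyDeviceToHost);
    for (int i = 0; i < Nrows; ++i) printf("row %d sum = %f\n", i, h_sums[i]);
    return 0;
}
```

Swapping thrust::device for thrust::seq in the kernel gives the sequential variant used in the performance comparison below.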
The example performs reductions of the rows of a matrix in the same sense as Reduce matrix rows with CUDA, but it is done differently from that post, namely, by calling CUDA Thrust primitives directly from user-written kernels. The example also serves to compare the performance of the same operations under the two execution policies, namely, thrust::seq and thrust::device. Below, some graphs show the difference in performance. The performance has been evaluated on a Kepler K20c and on a Maxwell GeForce GTX 850M.