I would like to copy memory from the host to the device using thrust as in
thrust::host_vector<float> h_vec(1 << 28);
thrust::device_vector<float> d_vec(1 << 28);
thrust::copy(h_vec.begin(), h_vec.end(), d_vec.begin());
using CUDA streams analogously to how you would copy memory from the device to the device using streams:
cudaStream_t s;
cudaStreamCreate(&s);
thrust::device_vector<float> d_vec1(1 << 28), d_vec2(1 << 28);
thrust::copy(thrust::cuda::par.on(s), d_vec1.begin(), d_vec1.end(), d_vec2.begin());
cudaStreamSynchronize(s);
cudaStreamDestroy(s);
The problem is that I can't use the CUDA execution policy (thrust::cuda::par.on(s)) to specify the stream when copying from the host to the device, because in that case thrust assumes that both vectors are stored on the device. Is there a way to get around this problem? I'm using the latest thrust version from GitHub (it says 1.8 in the version.h file).
Here's a worked example using thrust::cuda::experimental::pinned_allocator<T>:
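A minimal sketch along those lines, assuming a pinned host vector, an asynchronous host-to-device copy on a stream, and a device-to-host read-back to make the asynchrony visible (the vector size and the round trip are illustrative choices, not the original listing):

#include <iostream>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

int main()
{
    const size_t N = 1 << 24;

    // Pinned (page-locked) host storage so cudaMemcpyAsync can be truly asynchronous.
    thrust::host_vector<float, thrust::cuda::experimental::pinned_allocator<float> > h_in(N, 1.0f);
    thrust::host_vector<float, thrust::cuda::experimental::pinned_allocator<float> > h_out(N, 0.0f);
    thrust::device_vector<float> d_vec(N);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Asynchronous host-to-device copy, then device-to-host copy, both issued into stream s.
    cudaMemcpyAsync(thrust::raw_pointer_cast(d_vec.data()), thrust::raw_pointer_cast(h_in.data()),
                    N * sizeof(float), cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(thrust::raw_pointer_cast(h_out.data()), thrust::raw_pointer_cast(d_vec.data()),
                    N * sizeof(float), cudaMemcpyDeviceToHost, s);

    // The synchronize step: without it, the host read below races with the in-flight copies.
    cudaStreamSynchronize(s);

    std::cout << h_out[N - 1] << std::endl;  // prints 1 after the sync; typically 0 if the sync is commented out

    cudaStreamDestroy(s);
    return 0;
}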
Comment out the synchronize step and you should get 0 printed to the console due to the async memory transfer.

As indicated in the comments, I don't think this will be possible directly with thrust::copy. However, we can use cudaMemcpyAsync in a thrust application to achieve the goal of asynchronous copies and overlap of copy with compute. Here is a worked example:
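A sketch in the spirit of that example, assuming pinned host vectors, two streams, and a dummy busy-work functor to keep each thrust kernel resident long enough to observe overlap (KSIZE, the loop count, and the two-stream structure are illustrative assumptions):

#include <iostream>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

#define KSIZE 64   // keep this small so each thrust kernel is a single threadblock

typedef thrust::host_vector<int, thrust::cuda::experimental::pinned_allocator<int> > pinnedVector;

struct busy_work
{
    // Busy-work with a loop-carried dependence so each kernel stays resident
    // long enough to observe overlap in the profiler.
    __host__ __device__ void operator()(int &x) const
    {
        for (int i = 0; i < 100000; i++)
            x = (x * 75 + 74) % 65537;
    }
};

int main()
{
    pinnedVector h1(KSIZE, 1), h2(KSIZE, 2);
    thrust::device_vector<int> d1(KSIZE), d2(KSIZE);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Asynchronous host-to-device copies, one per stream.
    cudaMemcpyAsync(thrust::raw_pointer_cast(d1.data()), thrust::raw_pointer_cast(h1.data()),
                    KSIZE * sizeof(int), cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(thrust::raw_pointer_cast(d2.data()), thrust::raw_pointer_cast(h2.data()),
                    KSIZE * sizeof(int), cudaMemcpyHostToDevice, s2);

    // Thrust kernels launched into the same streams; each waits only on the copy
    // in its own stream, so the two streams can overlap with each other.
    thrust::for_each(thrust::cuda::par.on(s1), d1.begin(), d1.end(), busy_work());
    thrust::for_each(thrust::cuda::par.on(s2), d2.begin(), d2.end(), busy_work());

    // Asynchronous device-to-host copies of the results.
    cudaMemcpyAsync(thrust::raw_pointer_cast(h1.data()), thrust::raw_pointer_cast(d1.data()),
                    KSIZE * sizeof(int), cudaMemcpyDeviceToHost, s1);
    cudaMemcpyAsync(thrust::raw_pointer_cast(h2.data()), thrust::raw_pointer_cast(d2.data()),
                    KSIZE * sizeof(int), cudaMemcpyDeviceToHost, s2);

    // Everything above was issued asynchronously; control returns to the host
    // immediately, and this call waits for all of it to finish.
    cudaDeviceSynchronize();

    std::cout << "h1[0] = " << h1[0] << ", h2[0] = " << h2[0] << std::endl;

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}

Compile with nvcc and run it under a profiler such as nvprof or nvvp to inspect the timeline.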
For my test case, I used RHEL 5.5, a Quadro 5000, and CUDA 6.5RC. This example is designed to have thrust create very small kernels (only a single threadblock, as long as KSIZE is small, say 32 or 64), so that the kernels thrust creates from thrust::for_each are able to run concurrently. When I profile this code, I see proper overlap both between thrust kernels and between copy operations and thrust kernels, as well as asynchronous data copying at the completion of the kernels. Note that the cudaDeviceSynchronize() operation "fills" the timeline, indicating that all the async operations (data copying, thrust functions) were issued asynchronously and control returned to the host thread before any of the operations were underway. All of this is expected, proper behavior for full concurrency between host, GPU, and data copying operations.