I refer you to the following page: http://code.google.com/p/thrust/wiki/QuickStartGuide#Vectors. Please see the second paragraph, where it says:
Also note that individual elements of a device_vector can be accessed using the standard bracket notation. However, because each of these accesses requires a call to cudaMemcpy, they should be used sparingly. We'll look at some more efficient techniques later.
I searched all over the document, but I could not find the more efficient techniques. Does anyone know the fastest way to do this, i.e. the fastest way to access a device vector or device pointer from the host?
The "more efficient techniques" the guide alludes to are the Thrust algorithms. It's more efficient to access (or copy across the PCI-E bus) millions of elements at once than it is to access a single element because the fixed cost of CPU/GPU communication is amortized.
There's no faster way to copy data from the GPU to the CPU than by calling cudaMemcpy, because it is the most primitive way for a CUDA programmer to implement the task.
If you have a device_vector that needs more processing, try to keep the data on the device and process it with Thrust algorithms or your own kernels. If you need to read only a few values from the device_vector, just access them directly with bracket notation. If you need to access more than a few values, copy the device_vector over to a host_vector and read the values from there.