Why does the transposed matrix look different when converted to a pycuda.gpuarray?
Can you reproduce this? What could cause this? Am I using the wrong approach?
Example code
from pycuda import gpuarray
import pycuda.autoinit
import numpy

data = numpy.random.randn(2, 4).astype(numpy.float32)
data_gpu = gpuarray.to_gpu(data.T)

print("data")
print(data)
print("data_gpu.get()")
print(data_gpu.get())
print("data.T")
print(data.T)
Output
data
[[ 0.70442784 0.08845157 -0.84840715 -1.81618035]
[ 0.55292499 0.54911566 0.54672164 0.05098847]]
data_gpu.get()
[[ 0.70442784 0.08845157]
[-0.84840715 -1.81618035]
[ 0.55292499 0.54911566]
[ 0.54672164 0.05098847]]
data.T
[[ 0.70442784 0.55292499]
[ 0.08845157 0.54911566]
[-0.84840715 0.54672164]
[-1.81618035 0.05098847]]
The basic reason is that the numpy transpose only creates a view, which has no effect on the underlying array storage; it is that storage which PyCUDA accesses directly when copying to device memory. The solution is to use the copy method when doing the transpose, which creates an array whose host memory holds the data in transposed order, then copy that to the device:
data_gpu = gpuarray.to_gpu(data.T.copy())
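A quick numpy-only check of the contiguity flags makes the difference visible: the transposed view is not C-contiguous, while the copy is.

import numpy

data = numpy.random.randn(2, 4).astype(numpy.float32)

print(data.T.flags['C_CONTIGUOUS'])         # False: just a strided view
print(data.T.copy().flags['C_CONTIGUOUS'])  # True: storage really is transposed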
In numpy, data.T doesn't do anything to the underlying 1D array. It simply manipulates the strides to obtain the transpose. This makes it a constant-time and constant-memory operation.
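For example, with the (2, 4) float32 array from the question, the transpose leaves the buffer untouched and only swaps the shape and strides:

import numpy

data = numpy.random.randn(2, 4).astype(numpy.float32)

print(data.shape, data.strides)      # (2, 4) (16, 4): 4-byte floats, row-major
print(data.T.shape, data.T.strides)  # (4, 2) (4, 16): same buffer, swapped strides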
It would appear that gpuarray.to_gpu() isn't respecting the strides and is simply copying the underlying 1D array. This would produce the exact behaviour you're observing.
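That hypothesis is consistent with the output above: reinterpreting the original row-major buffer with the transposed shape reproduces exactly what data_gpu.get() returned. A small numpy-only sketch of the check:

import numpy

data = numpy.random.randn(2, 4).astype(numpy.float32)

# Copy the flat row-major buffer, then apply the transposed shape --
# this reproduces what data_gpu.get() printed above.
buffer_as_copied = data.ravel().reshape(data.T.shape)
print(buffer_as_copied)  # matches data_gpu.get(), not data.T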
In my view there is nothing wrong with your code. Rather, I would consider this a bug in pycuda.
I've googled around and found a thread that discusses this issue in detail.
As a workaround, you could try passing numpy.ascontiguousarray(data.T) to gpuarray.to_gpu(). This will, of course, create a second copy of the data in host RAM.
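A minimal sketch of that workaround, using the same imports and data as the question (it needs a working CUDA device to run):

from pycuda import gpuarray
import pycuda.autoinit
import numpy

data = numpy.random.randn(2, 4).astype(numpy.float32)

# ascontiguousarray materialises the strided view as a C-contiguous host copy,
# so to_gpu receives data already laid out in transposed order.
data_gpu = gpuarray.to_gpu(numpy.ascontiguousarray(data.T))

assert numpy.array_equal(data_gpu.get(), data.T)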