I'm using pyOpenCL to do some complex calculations. It runs fine on CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB). I'm working on Mac OS X Lion (10.7.5)
The strange thing is that this error does not always show up. It seems to occur when my calculations use larger numbers (resulting in larger iterations) but only when run on GPU.
I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item.
I simplified my OpenCL code as much as possible, and from what was left created some very simple code with extremely weird behavior that causes the pyopencl.LogicError
. It consists of 2 nested loops in which a couple of assignments are made to the result
array. This assignment need not even depend on the state of the loop.
This is run on a single thread (or work item, shape = (1,)
) on the GPU.
__kernel void weirdError(__global unsigned int* result){
unsigned int outer = (1<<30)-1;
for(int i=20; i--; ){
unsigned int inner = 0;
while(inner != outer){
result[0] = 1248;
result[1] = 1337;
inner++;
}
outer++;
}
}
The strange part is that removing either one of the assignments to the result array removes the error. Also, decreasing the initial value for outer (down to (1<<20)-1
for example) also removes the error. In these cases, the code returns normally, with the correct result available in the corresponding buffer.
On CPU, it never raises an error.
The OpenCL code is run from Python using PyOpenCL.
Nothing fancy in the setup:
platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
program = cl.Program(context, getProgramCode()).build()
queue = cl.CommandQueue(context)
In this Python code I set the result_buf
to 0
, then I run the calculation in OpenCL that will set its values in a large iteration. Afterwards I try to collect this value from the device memory, but that's where it goes wrong:
result = numpy.zeros(2, numpy.uint32)
result_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=result)
shape = (1,)
program.weirdError(queue, shape, None, result_buf)
cl.enqueue_copy(queue, result, result_buf)
The last line gives me:
pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
How can this repeated assignment cause an error?
And more importantly: how can it be avoided?
I understand that this problem is probably platform dependent, and thus perhaps hard to reproduce. But this is the only machine I have access to, so the code should work on this machine.
DISCLAIMER: I have never worked with OpenCL (or CUDA) before. I wrote the code on a machine where the GPU did not support OpenCL. I always tested it on CPU. Now that I switched to GPU, I find it frustrating that errors do not occur consistently and I have no idea why.