OpenCL: read a variable-size result buffer from the GPU

Posted 2019-03-04 06:56

I have an OpenCL 1.1 search algorithm that works well with a small amount of data:

1.) build the inputData array and pass it to the GPU

2.) create a very big resultData container (e.g. 200000 * sizeof(cl_uint)) and pass it too

3.) create the resultSize container (initialized to zero), which can be accessed via atomic operations (at least I assume so)

When one of my workers has a result, it copies it into the resultData buffer and increments resultSize with an atomic increment operation (until the buffer is full).

Here is a code example (OpenCL code):

// reserve 5 slots: atomic_add returns the counter value *before* the addition
lastPosition = atomic_add(resultBufferSize, 5);
// spin until the host has emptied the buffer and reset the counter
while (lastPosition > RESULT_BUFFER_SIZE)
{
    lastPosition = atomic_add(resultBufferSize, 5);
}

And on the host side I read the buffer and set resultBufferSize to zero:

resultBufferSize = 0;
oclErr |= clEnqueueWriteBuffer(gpuAcces.getCqCommandQueue(), cm_resultBufferSize,  CL_TRUE, 0,  sizeof(cl_uint), (void*)&resultBufferSize, 0, NULL, NULL);
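
The read-back before the reset would look roughly like this (a minimal sketch; cm_resultData and resultData are assumed names for the result buffer and its host-side copy, they do not appear in the original code):

cl_uint resultBufferSize = 0;
// read how many entries the kernels produced (blocking read)
oclErr |= clEnqueueReadBuffer(gpuAcces.getCqCommandQueue(), cm_resultBufferSize, CL_TRUE, 0, sizeof(cl_uint), (void*)&resultBufferSize, 0, NULL, NULL);
// the workers may have pushed the counter past the capacity, so clamp it
if (resultBufferSize > RESULT_BUFFER_SIZE) resultBufferSize = RESULT_BUFFER_SIZE;
// read only the entries that were actually written
oclErr |= clEnqueueReadBuffer(gpuAcces.getCqCommandQueue(), cm_resultData, CL_TRUE, 0, resultBufferSize * sizeof(cl_uint), (void*)resultData, 0, NULL, NULL);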

Now my problem is:

I have many more results than resultData can store, and anyway I have no idea in advance how large the result will be (e.g. how many paths I will find).

My idea:

From time to time I would empty (or process) the container on the host side and reset resultSize when the buffer is full, while the workers wait in a while loop.

I like this idea because I can process the data in parallel on the host too.

But I have not been able to implement a solution for this yet:

1.) NVIDIA cannot cope with an endless while loop, or at least I cannot use one: when I try an endless loop, the card crashes.

2.) barrier() and mem_fence() can manage synchronization issues, but not this one.

Do you have any robust idea how I can handle non-fixed result sizes (e.g. in search problems)? I am almost sure there must be a good pattern, but I cannot find it.

Is there any sleep in NVIDIA OpenCL? I would put it into the endless loop; maybe that could help me a bit.

I guess variable-size results are an old issue and there must be good patterns for it. I had a similar issue in an earlier post (but the context was different).

4 Answers
smile是对你的礼貌 · 2019-03-04 07:07

I had a similar problem regarding variable problem sizes. One way could be to simply implement a divide-and-conquer approach and split up your data on the host. You could then process your data blocks one after the other on the device.

BTW: are you sure about the comparison

while (lastPosition > RESULT_BUFFER_SIZE)
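
Since atomic_add returns the old value and each worker reserves 5 slots starting at lastPosition, the reserved block already overflows once lastPosition + 5 exceeds the capacity. Presumably the guard should be something like this (my reading of the hint, not spelled out above):

lastPosition = atomic_add(resultBufferSize, 5);
// the reserved block is [lastPosition, lastPosition + 5), so test its upper end
while (lastPosition + 5 > RESULT_BUFFER_SIZE)
{
    lastPosition = atomic_add(resultBufferSize, 5);
}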

chillily · 2019-03-04 07:10

You should use OpenCL 2.0 and Pipes; they are perfect for this kind of problem.
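
For illustration, a minimal sketch of the pipe approach (names are illustrative; it requires an OpenCL 2.0 device and compiling with -cl-std=CL2.0, and note that the host cannot read a pipe directly, so a consumer kernel has to drain it):

// producer kernel: each work-item tries to push its result into the pipe
__kernel void search(__global const uint *inputData, __write_only pipe uint results)
{
    uint value = inputData[get_global_id(0)]; // stand-in for a real result
    // write_pipe returns 0 on success and a negative value when the pipe is full
    if (write_pipe(results, &value) != 0)
    {
        // pipe full: this result is lost unless the kernel handles it
    }
}

// host side: create a pipe holding up to 200000 uint packets
cl_int err;
cl_mem resultPipe = clCreatePipe(context, 0, sizeof(cl_uint), 200000, NULL, &err);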

成全新的幸福 · 2019-03-04 07:21

You have not clearly indicated that you are using Windows as your OS, but I assume so since your question has the VS2013 tag.

The Nvidia card does not crash. On Windows, Timeout Detection & Recovery (TDR) in the WDDM driver restarts the GPU driver if it becomes unresponsive. You can disable this "feature" easily with Nsight. However, be aware that this may cause problems with your desktop environment, so make sure to write kernels that finish in a tolerable amount of time. Then you can run your very long kernels even on Windows with Nvidia's OpenCL implementation.

欢心 · 2019-03-04 07:23

Why not use addr = atomic_add(&addr_counter, 1); on a global variable, and use the returned address to write to another global buffer: buffer[addr*2] = X; buffer[addr*2+1] = Y;.

You can easily check when you run out of space: the returned address will be bigger than the buffer size.
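
A minimal kernel sketch of this pattern (a sketch under assumed names; max_pairs is the buffer capacity in X/Y pairs):

// each work-item reserves one slot through the global counter
__kernel void collect(__global uint *buffer, __global uint *addr_counter, const uint max_pairs)
{
    uint X = get_global_id(0);     // stand-ins for a real result pair
    uint Y = get_global_id(0) + 1;
    uint addr = atomic_add(addr_counter, 1); // returns the value before the add
    if (addr < max_pairs) // write only while there is space left
    {
        buffer[addr * 2]     = X;
        buffer[addr * 2 + 1] = Y;
    }
    // when addr >= max_pairs the buffer is full; the host sees the overshoot
    // in *addr_counter and knows that results were dropped
}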

EDIT: What you want is parallel kernel execution and data access, and that is not possible with OpenCL 1.1. You should go for OpenCL 2.0, which has that feature (SVM or pipes).

Keeping the kernels in a while loop that checks a variable, without a mechanism to empty it (access the variable) from the host side, will make your kernels deadlock and crash your graphics driver.

If you want to stick to OpenCL 1.1, the only way is to run many small kernels and then check the results. You can launch further kernels in parallel while you process that data on the CPU, as in the sketch below.
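
A host-side sketch of that pattern (assumed names throughout; error checking and the per-block reset of the result counter are omitted for brevity):

// launch the kernel per input block and overlap the CPU processing of the
// previous block's results with the GPU work on the current block
for (size_t block = 0; block < numBlocks; ++block)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBlocks[block]);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &blockSize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, cm_resultData, CL_FALSE, 0, RESULT_BUFFER_SIZE * sizeof(cl_uint), hostResults[block], 0, NULL, &readDone[block]);
    if (block > 0)
    {
        clWaitForEvents(1, &readDone[block - 1]);
        processOnCpu(hostResults[block - 1]); // CPU work runs while the GPU continues
    }
}
clWaitForEvents(1, &readDone[numBlocks - 1]);
processOnCpu(hostResults[numBlocks - 1]);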
