Large execution time of OpenCL Kernel causes crash

I'm currently building a ray marcher to look at things like the Mandelbox. It works great. However, my current program uses one work-item per ray projected from the eye, which means each work-item does a large amount of work. When I look at a very complex object or render with high enough precision, my display driver crashes because the kernel takes too long to execute. I'm trying to avoid changing registry values to lengthen the timeout, as I want this application to work on multiple computers.

Is there any way to remedy this? As it stands, each work-item executes completely independently of its neighbours. I've contemplated allocating a buffer on the GPU that would store the current progress of each ray, and executing only a small number of iterations per kernel call. Then I would call the kernel over and over, and the result would hopefully refine a bit more each time. The problem is that I'm unsure how to deal with branching rays (e.g. reflection and refraction) unless I set a maximum number of each in advance.

Does anyone have any pointers on what I should do? I'm quite the greenhorn with OpenCL and have had this issue for some time. I feel as though I'm doing something wrong or fundamentally misusing OpenCL, since each of my work-items has a lot of logic behind it, but I don't know how to split the task, as it is just a series of steps, checks and adjustments.

The crash you are experiencing is caused by NVIDIA's hardware watchdog timer. The OS may also detect that the GPU is not responding and reset it (Windows 7, at least, does this).

You can avoid it in several ways:

Improve/optimize your kernel code to take less time
Buy faster Hardware ($$$$)
Disable the watchdog timer (this is not an easy task, and not all devices support it)
Reduce the amount of work queued to the device at a time by launching multiple small kernels (NOTE: each kernel launch adds a small overhead)

The easiest and most straightforward solution is the last one. But if you can, try the first one as well.

As an example, a call like this (global size 1000x1000 = 1M work-items):

clEnqueueNDRangeKernel(queue, kernel, 2, NDRange(0,0)/*Offset*/, NDRange(1000,1000)/*Global*/, ... );

can be split into many smaller calls (10x10 launches of 100x100 work-items each = 1M). Since the global size of each launch is now 100 times smaller, the watchdog should not be triggered:

for(int i=0; i<10; i++)
    for(int j=0; j<10; j++)
        clEnqueueNDRangeKernel(queue, kernel, 2, NDRange(i*100,j*100)/*Offset*/, NDRange(100,100)/*Global*/, ... );
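The loop above is shorthand. With the plain C API, the offset and global size are passed as size_t arrays, and a non-NULL global offset requires OpenCL 1.1 or later (get_global_id() then already includes the offset, so the kernel itself should not need to change). Below is a minimal sketch of the tiling logic in C. The names Tile, TileFn, for_each_tile and count_items are made up for illustration and are not part of OpenCL. The actual enqueue call is only shown in a comment so that the sketch compiles without an OpenCL SDK. It also clamps the last tile, for image sizes that are not a multiple of the tile size.

```c
#include <assert.h>
#include <stddef.h>

/* One rectangular piece of the full 2D NDRange (hypothetical helper type). */
typedef struct {
    size_t offset[2]; /* would be passed as global_work_offset */
    size_t size[2];   /* would be passed as global_work_size   */
} Tile;

/* Hypothetical callback type: handles one tile, returns 0 on success. */
typedef int (*TileFn)(const Tile *tile, void *user);

/* Visit a width x height range in tiles of at most tile_w x tile_h,
 * clamping the last tile in each dimension. Returns the number of tiles
 * visited, or -1 if the callback reported an error. */
static long for_each_tile(size_t width, size_t height,
                          size_t tile_w, size_t tile_h,
                          TileFn fn, void *user)
{
    long count = 0;
    assert(tile_w > 0 && tile_h > 0 && fn != NULL);
    for (size_t y = 0; y < height; y += tile_h) {
        for (size_t x = 0; x < width; x += tile_w) {
            Tile t;
            t.offset[0] = x;
            t.offset[1] = y;
            t.size[0] = (width - x < tile_w) ? width - x : tile_w;
            t.size[1] = (height - y < tile_h) ? height - y : tile_h;
            if (fn(&t, user) != 0)
                return -1;
            ++count;
        }
    }
    return count;
}

/* Stand-in callback that only counts work-items. A real callback could
 * instead launch the kernel for this tile, for example:
 *
 *   clEnqueueNDRangeKernel(queue, kernel, 2, t->offset, t->size,
 *                          NULL, 0, NULL, NULL);
 *   clFinish(queue);
 */
static int count_items(const Tile *t, void *user)
{
    *(size_t *)user += t->size[0] * t->size[1];
    return 0;
}
```

Calling for_each_tile(1000, 1000, 100, 100, count_items, &total) with total initialised to 0 visits 100 tiles covering 1,000,000 work-items, matching the loop above. The clFinish after each launch in the comment is an assumption on my part: some drivers may batch queued kernels together, so waiting for each tile is the conservative way to keep every submission short, at the cost of some throughput.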