CUDA: Stop all other threads

Posted 2019-04-02 02:24

Question:

I have a problem that is seemingly solvable by enumerating all possible solutions and then finding the best. In order to do so, I devised a backtracking algorithm that enumerates and stores the best solution if found. It works fine so far.

Now, I wanted to port this algorithm to CUDA. Therefore, I created a procedure that generates some distinct basic cases. These basic cases should be processed in parallel on the GPU. If one of the CUDA-threads finds an optimal solution, all the other threads can - of course - stop their work.

What I want is roughly this: the thread that finds the optimal solution should stop all running CUDA threads of my program, thus finishing the calculation.

After a quick search, I found that threads can only communicate if they are in the same block. (So I suppose it's impossible to stop other blocks' threads.)

The only method I could think of is a dedicated flag, optimum_found, which is checked at the beginning of every kernel. If an optimal solution is found, this flag is set to 1, so all future threads know that they do not have to work. But of course, threads that are already running will not notice the flag unless they check it at every iteration.

So, is there a possibility to stop all remaining CUDA-threads?

Answer 1:

I think that your method of having a dedicated flag could work provided that it was a memory location in global memory. That way you can check this, as you said, at the beginning of each kernel call.
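To make the flag idea concrete, here is a minimal sketch, not a definitive implementation. It assumes the search space can be split into contiguous ranges of integer candidates, and every name in it (`optimum_found`, `best_candidate`, `is_optimal`, `search`, `run_search`) is hypothetical. The flag lives in global memory, and each thread re-reads it on every iteration so that threads in other blocks can bail out early:

```cuda
#include <cuda_runtime.h>

// Hypothetical names throughout; the optimality test is a stand-in.
__device__ int optimum_found;          // flag in global memory, visible to all blocks
__device__ long long best_candidate;   // result slot, -1 means "nothing found"

// Placeholder for the real check: a candidate is "optimal" if it equals target.
__device__ bool is_optimal(long long c, long long target) { return c == target; }

__global__ void search(long long per_thread, long long target) {
    long long tid = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long begin = tid * per_thread;
    // volatile makes the compiler re-read the flag from memory each iteration
    volatile int *flag = &optimum_found;
    for (long long c = begin; c < begin + per_thread; ++c) {
        if (*flag) return;              // some other thread already succeeded
        if (is_optimal(c, target)) {
            best_candidate = c;
            __threadfence();            // make the result visible before the flag
            atomicExch(&optimum_found, 1);
            return;
        }
    }
}

// Host wrapper: resets the flag, launches 64 x 128 threads with 1024
// candidates each (8,388,608 in total), and returns the hit or -1.
long long run_search(long long target) {
    int zero = 0;
    long long none = -1;
    cudaMemcpyToSymbol(optimum_found, &zero, sizeof(zero));
    cudaMemcpyToSymbol(best_candidate, &none, sizeof(none));
    search<<<64, 128>>>(1024, target);
    cudaDeviceSynchronize();
    long long result = -1;
    cudaMemcpyFromSymbol(&result, best_candidate, sizeof(result));
    return result;
}
```

Calling `run_search(123456)` should return the matching candidate, and a target outside the searched range should give -1. Note that this only lets threads exit at their next check; it does not interrupt a thread in the middle of an iteration.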

Kernel calls should generally be relatively short anyway, so letting the other threads in a batch finish even though one of them has already found an optimal solution shouldn't affect your performance too much.

That said, I am fairly sure there is no CUDA call that can kill off other actively executing threads.
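The short-kernel approach could look roughly like the sketch below. It is an illustration under assumptions, not the asker's actual code: candidates are plain integers, one per thread, and the names `check_batch` and `search_in_batches` are made up. The host launches one small batch at a time and stops launching as soon as a batch reports a hit, so at most one batch of work is wasted:

```cuda
#include <cuda_runtime.h>

// Each thread tests exactly one candidate. The comparison with target is a
// stand-in for the real optimality test, so at most one thread writes.
__global__ void check_batch(long long offset, long long total,
                            long long target, long long *found) {
    long long c = offset + blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (c < total && c == target) *found = c;
}

// Hypothetical host driver: returns the matching candidate, or -1 if none.
long long search_in_batches(long long target, long long total) {
    const int threads = 256, blocks = 256;
    const long long batch = (long long)threads * blocks;
    long long h_found = -1;
    long long *d_found = nullptr;
    cudaMalloc(&d_found, sizeof(long long));
    cudaMemcpy(d_found, &h_found, sizeof(long long), cudaMemcpyHostToDevice);

    // Stop launching new batches once a previous batch has reported a hit.
    for (long long offset = 0; offset < total && h_found < 0; offset += batch) {
        check_batch<<<blocks, threads>>>(offset, total, target, d_found);
        // Blocking copy: waits for the kernel, then reads the result slot.
        cudaMemcpy(&h_found, d_found, sizeof(long long), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_found);
    return h_found;
}
```

For example, `search_in_batches(100000, 1 << 20)` should stop after the second batch. If several threads can find a solution in the same batch, the plain write to `found` would need to become an atomic operation.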



Answer 2:

I think Ian has the right idea here. Optimum performance would come from minimal memory transfers and branching. Writing to global memory and checking flags (branching) goes against the CUDA best practices guide and will reduce your speedup.



Answer 3:

You might want to look at callbacks. The main CPU thread can make sure all threads run in the right order. CPU callback threads (read: postprocessing) can take on the additional overhead, call the related API functions, and dispose of all the sub-thread data... This feature is demonstrated in the CUDA samples and compiles for compute capability 2. Hope this helps.
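As a rough illustration of the callback idea (my own sketch, not code from the samples), the host can enqueue a function with `cudaLaunchHostFunc`, which needs CUDA 10 or newer, so that it runs after each batch finishes and decides whether more batches should be launched. The kernel `process_batch`, the struct `SearchState`, and the other names are hypothetical, and the kernel simply pretends that batch number `hit` contains the optimum:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: pretends that batch number `hit` contains the optimum.
__global__ void process_batch(int batch, int hit, int *d_found) {
    if (blockIdx.x == 0 && threadIdx.x == 0 && batch == hit) *d_found = 1;
}

struct SearchState {
    int *h_found;        // pinned host copy of the device flag
    volatile int stop;   // set by the callback, read by the launch loop
};

// Runs on a runtime-owned CPU thread after all earlier work in the stream.
// A host function like this must not call CUDA API functions itself.
void CUDART_CB on_batch_done(void *user_data) {
    SearchState *s = static_cast<SearchState *>(user_data);
    if (*s->h_found) s->stop = 1;
}

// Returns how many batches were launched before the callback asked to stop.
int batches_launched_until_stop(int hit, int max_batches) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int *d_found = nullptr;
    SearchState state = {nullptr, 0};
    cudaMalloc(&d_found, sizeof(int));
    cudaMallocHost(&state.h_found, sizeof(int));
    *state.h_found = 0;
    cudaMemsetAsync(d_found, 0, sizeof(int), stream);

    int launched = 0;
    while (launched < max_batches && !state.stop) {
        process_batch<<<32, 128, 0, stream>>>(launched, hit, d_found);
        cudaMemcpyAsync(state.h_found, d_found, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);
        cudaLaunchHostFunc(stream, on_batch_done, &state);
        ++launched;
        // Synchronizing every batch keeps the sketch deterministic; a real
        // pipeline would likely keep a few batches in flight instead.
        cudaStreamSynchronize(stream);
    }
    cudaFreeHost(state.h_found);
    cudaFree(d_found);
    cudaStreamDestroy(stream);
    return launched;
}
```

Like the other approaches, this only prevents further batches from starting. It does not stop threads that are already executing.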