I'm trying to use a ForkJoinPool to parallelize my CPU-intensive calculations.
My understanding of a ForkJoinPool is that it keeps working as long as any task is available to be executed. Unfortunately, I frequently observed worker threads idling or waiting, so not all CPUs were kept busy. Sometimes I even observed additional worker threads being spawned.
I did not expect this, as I strictly tried to use non-blocking tasks.
My observation is very similar to the one described in ForkJoinPool seems to waste a thread.
After a lot of debugging inside ForkJoinPool, I have a guess:
I used invokeAll() to distribute work over a list of subtasks. After invokeAll() has executed the first task itself, it starts joining the other ones. This works fine as long as the task to be joined next is on top of the executing thread's queue. Unfortunately, I had also submitted additional tasks asynchronously, without joining them. I expected the fork/join framework to continue executing those tasks first and then come back to joining any remaining tasks.
But it does not seem to work this way. Instead, the worker thread stalls in wait() until the task it is waiting for becomes ready (presumably executed by another worker thread). I did not verify this, but it seems to be a general flaw of calling join().
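Distilled to a minimal sketch (class names and the dummy work are invented for illustration, with the effect of invokeAll() reduced to plain fork()/join() calls), the situation looks like this:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    public class JoinStallSketch {

        static class SubTask extends RecursiveAction {     // forked and joined
            @Override protected void compute() { /* CPU-bound work */ }
        }

        static class AsyncTask extends RecursiveAction {   // forked, never joined
            @Override protected void compute() { /* CPU-bound work */ }
        }

        static class MainTask extends RecursiveAction {
            @Override protected void compute() {
                SubTask sub = new SubTask();
                sub.fork();                  // push the subtask
                new AsyncTask().fork();      // push extra work, never joined;
                                             // it now sits on top of the queue
                sub.join();                  // expects sub on top, finds the
                                             // AsyncTask, and suspends instead
                                             // of simply executing it
            }
        }

        public static void main(String[] args) {
            new ForkJoinPool().invoke(new MainTask());
        }
    }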
ForkJoinPool provides an asyncMode, but this is a pool-wide parameter and cannot be chosen for individual submissions. Yet I would like to see my asynchronously forked tasks executed soon.
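For reference, asyncMode is fixed when the pool is constructed (via the standard four-argument constructor), so it applies to every task in the pool:

    import java.util.concurrent.ForkJoinPool;

    // asyncMode is a pool-wide construction flag, not a per-task option.
    ForkJoinPool asyncPool = new ForkJoinPool(
            Runtime.getRuntime().availableProcessors(),
            ForkJoinPool.defaultForkJoinWorkerThreadFactory,
            null,   // default uncaught-exception handling
            true);  // asyncMode: FIFO order for forked, never-joined tasks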
So, why does ForkJoinTask.doJoin() not simply execute whatever task is on top of its queue until the task being joined becomes ready (whether executed by the thread itself or stolen by another worker)?
You are dead right about join(). Two years ago I wrote this article, which points out the problem with join().
As I said there, the framework cannot execute newly submitted requests until it finishes the earlier ones. And each worker thread cannot steal until its current request finishes, which results in the wait().
The additional threads you see are "continuation threads." Since join() eventually issues a wait(), these threads are needed so the entire framework doesn't stall.
Since nobody else seems to understand my question, I'll try to explain what I found after some nights of debugging:
The current implementation of ForkJoinTask works well if all fork/join calls are strictly paired. Illustrating a fork by an opening bracket and a join by a closing one, a perfect binary fork/join pattern may look like this:
{([][]) ([][])} {([][]) ([][])}
If you use invokeAll(), you may also submit a list of subtasks like this:
{([][][][]) ([][][][]) ([][][][])}
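In code, the strictly paired pattern is the classic recursive decomposition; a schematic example (not my actual code) with the brackets marked:

    import java.util.concurrent.RecursiveTask;

    // Strictly paired fork/join: every fork() has a matching join().
    class Sum extends RecursiveTask<Long> {
        private final long[] data;
        private final int from, to;

        Sum(long[] data, int from, int to) {
            this.data = data; this.from = from; this.to = to;
        }

        @Override protected Long compute() {
            if (to - from <= 1000) {           // leaf: compute directly -> []
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }
            int mid = (from + to) >>> 1;
            Sum left = new Sum(data, from, mid);
            left.fork();                                        // (
            long rightSum = new Sum(data, mid, to).compute();   // inline half
            return left.join() + rightSum;                      // )
        }
    }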
What I did, however, looks like this pattern:
{([) ([)} ... ]]
You may argue this looks ill-formed or is a misuse of the fork/join framework. But the only constraint is that the tasks' completion dependencies must be acyclic, otherwise you may run into a deadlock. As long as my [] tasks do not depend on the () tasks, I don't see any problem with it. The offending ]]'s just express that I never wait for them explicitly; they may finish some day, and it does not matter to me (at that point).
Indeed, the current implementation is able to execute my interlocked tasks, but only by spawning additional helper threads, which is quite inefficient.
The flaw seems to be in the current implementation of join(): joining a ) expects to see its corresponding ( on top of its execution queue, but finds a [ instead and is perplexed. Instead of simply executing [] to get rid of it, the current thread suspends (calling wait()) until someone else comes around to execute the unexpected task. This causes a drastic performance breakdown.
My primary intent was to put additional work onto the queue to prevent the worker thread from suspending when the queue runs empty. Unfortunately, the opposite happens :-(
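For illustration, this is roughly the join behavior I expected, written as a helper that a task could call instead of a plain join(). It is only a sketch of the idea, not the real doJoin(); it relies on the protected static helper ForkJoinTask.pollNextLocalTask():

    import java.util.concurrent.ForkJoinTask;
    import java.util.concurrent.RecursiveAction;

    // Sketch only: keep executing whatever sits on top of the local queue
    // (even an "unexpected" [ task) until the task being joined is done.
    abstract class HelpingAction extends RecursiveAction {
        protected void helpJoin(ForkJoinTask<?> task) {
            while (!task.isDone()) {
                ForkJoinTask<?> top = pollNextLocalTask();
                if (top == null) {
                    break;        // local queue empty: only waiting/stealing left
                }
                top.invoke();     // execute it instead of suspending
            }
            task.join();          // done by now, or we really have to wait
        }
    }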
You’re not using this framework for the very narrow purpose for which it was intended.
The framework started life as the experiment in the 2000 research paper. It has been modified since then, but the basic design, fork-and-join over large arrays, remains the same. Its basic purpose is to teach undergraduates how to walk down the leaves of a balanced tree. When people use it for anything other than simple array processing, weird things happen. What it is doing in Java 7 is beyond me; that is the subject of the article.
The problems only get worse in Java 8. There it is the engine that drives all parallel stream work. Have a read of part two of that article. The lambda interest lists are filled with reports of thread stalls, stack overflows, and out-of-memory errors.
You use it at your own risk when you don’t use it for pure recursive decomposition of large data structures. Even then, the excessive threads it creates can cause havoc. I’m not going to pursue this discussion any further.