I am designing a C++ data structure (for graphs) which is to be used by parallel code (using OpenMP).
Suppose I want to have a method which enables iteration over all elements (nodes). Of course, this iteration is going to be parallelized.
Is it possible to use an iterator for this purpose? How should an iterator look like that enables parallel access? Would you advise for or against using iterators in this case?
OpenMP parallel loops don't play nicely with iterators. You'll want to implement an indexing mechanism (operator[]
taking an integral argument) on your graph class.
If you do want to use OpenMP 3.0 iterator support, make sure you have a random access iterator. Implementing it as an pointer to a node or edge is the simplest choice.
Let me expand on my comment. Unless you target cross-platform compatibility and you'd like your code to also compile and work with MS Visual C++, you can offset the complexity of providing "linear" iterators to graph objects by using explicit OpenMP tasks. Explicit tasking was introduced in OpenMP 3.0 (hence it is not supported by MSVC which only conforms to a much earlier specification, even in 2012). Tasks are blocks of code that can execute in parallel. They are created by the task
construct:
... other code ...
#pragma omp task
{
... task code ...
}
... other code ...
Whenever the execution flow reaches the marked region, a new task object is created and put into a task queue. Then at certain points in time, idle threads grab one task from the queue and start executing it. Tasks are much like OpenMP sections in that they inherit their environment and that they can run in different order than in the serial version of the code.
With tasks one can implement recursive algorithms and also can easily work with C++ containers that do not provide random iterators. For example traversal of a binary tree can be performed like this with tasks:
// Helper routine to traverse a tree and create one task for each node
void traverse_and_make_tasks(tree_node *tn)
{
if (tn == NULL)
return;
// Create a task to process the node
#pragma omp task
{
very_long_computation(tn->value);
}
traverse_and_make_tasks(tn->left);
traverse_and_make_tasks(tn->right);
}
... main code ...
// Disable dynamic teams
omp_set_dynamic(0);
#pragma omp parallel
{
#pragma omp single nowait
{
traverse_and_make_tasks(tree->root_node);
}
}
(disabling dynamic teams is needed in order to prevent the OpenMP run-time from being too smart and executing the parallel
region single-threadedly)
This is a very common tasking pattern known as the single/serial producer. Whenever the execution enters the parallel
region, a single thread executes the code in the single
construct. It calls traverse_and_make_tasks
with the root node of the three. traverse_and_make_tasks
walks the three and creates one task for each node. The task
construct only works when used inside a parallel
region (static scoping) or when used in code called (directly or indirectly) from inside a parallel
region (dynamic scoping). As traverse_and_make_tasks
walks the tree, it produces tasks that get queued. At the end of the parallel
region there is an implicit task scheduling point, which roughly means that the execution would not resume past the end of the region until all tasks have been executed. One can also put explicit points inside the parallel region using #pragma omp taskwait
. It works similarly to barrier
- execution "blocks" until all tasks have been processed.
Another common pattern is the parallel producer where tasks are generated in parallel. The example code above can be easily transformed into a parallel producer by a simple modification on traverse_and_make_tasks
:
void traverse_and_make_tasks(tree_node *tn)
{
if (tn == NULL)
return;
#pragma omp task
traverse_and_make_tasks(tn->left);
#pragma omp task
traverse_and_make_tasks(tn->right);
// Create a task to process the node
very_long_computation(tn->value);
}
This version of the code creates two task at each node - one to process the left descendant and one to process the right descendant. If this was serial code it would traverse the tree from bottom up. However in the parallel case task queueing would result in more or less top to bottom traversal.
There are many other possible scenarios of using tasks. One can also use them in a non-recursive case to process containers that do not provide random iterators, required by the worksharing construct for
:
typedef container_type::iterator citer;
container_type container;
... push some values in the container ...
#pragma omp parallel
{
#pragma omp single nowait
{
for (citer it = container.begin(); it != container.end(); it++)
#pragma omp task
process(*it);
}
#pragma omp taskwait
// process more
#pragma omp single nowait
{
for (citer it = container.begin(); it != container.end(); it++)
#pragma omp task
process_more(*it);
}
}
This example also illustrates the use of explicit task synchronisation inside the parallel
region.
This is a variation of the reader/writer problem.
It depends on whether that structure will be mutable or not. If not, then off you go, read in parallel as much as you want.
But if it will be mutable, then iterators will likely clash with changes made to the structure. For example, the iterator might arrive on an element that is currently being deleted. One solution is to make a read-only, immutable copy of the structure for each iterator, but then that iterator will not register any changes made to the structure after the iterator was created. Second solution is to make a copy-on-write implementation, which will cause all writes to the structure create a new object, with currently running iterators operating on the old one.
You need to decide what writes to that structure mean to your program, algorithm-wise, and then implement the read/write locking appropriately.
If they are trees, you will probably want to think more in terms of scans on Euler Tour Traversals than "iterators". http://en.wikipedia.org/wiki/Euler_tour_technique
I wish I had Stepanov's book in front of me. I remember him briefly touching on it.
I've had exactly the same problem in Java. A solution which I have implemented uses a "hashmap of hashmaps". I still don't understand WHY the standard libraries do not let us do multi-threaded iterators...
You can read my question, along with my answer (with link to Java code) here:
Scalable way to access every element of ConcurrentHashMap<Element, Boolean> exactly once