The code I'm trying to parallelize in OpenMP is a Monte Carlo simulation that boils down to something like this:
int seed = 0;
std::mt19937 rng(seed);
double result = 0.0;
int N = 1000;
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    result += rng();
}
std::cout << result << std::endl;
I want to make sure that the state of the random number generator is shared across threads, and the addition to the result is atomic.
Is there a way to replace this code with something from thrust::omp? From the research I've done so far, it looks like thrust::omp is more of a directive to use multiple CPU threads rather than the GPU for some standard thrust operations.
Yes, it's possible to use thrust to do something similar, with (parallel) execution on the host CPU using OMP threads underneath the thrust OMP backend. Here's one example:
$ cat t535.cpp
#include <random>
#include <iostream>
#include <thrust/system/omp/execution_policy.h>
#include <thrust/system/omp/vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
int main(int argc, char *argv[]){
unsigned N = 1;
int seed = 0;
if (argc > 1) N = atoi(argv[1]);
if (argc > 2) seed = atoi(argv[2]);
std::mt19937 rng(seed);
unsigned long result = 0;
thrust::omp::vector<unsigned long> vec(N);
thrust::generate(thrust::omp::par, vec.begin(), vec.end(), rng);
result = thrust::reduce(thrust::omp::par, vec.begin(), vec.end());
std::cout << result << std::endl;
return 0;
}
$ g++ -std=c++11 -O2 -I/usr/local/cuda/include -o t535 t535.cpp -fopenmp -lgomp
$ time ./t535 100000000
214746750809749347
real 0m0.700s
user 0m2.108s
sys 0m0.600s
$
For this test I used Fedora 20, with CUDA 6.5 RC, running on a 4-core Xeon CPU (netting about a 3x speedup based on the time results). There are probably further "optimizations" that could be made to this particular code, but I think they would unnecessarily clutter the idea, and I assume your actual application is more complicated than just summing random numbers.
Much of what I show here was lifted from the thrust direct system access page, but there are several comparable methods to access the OMP backend, depending on whether you want flexible, retargetable code or code that specifically uses the OMP backend (this example specifically targets the OMP backend).
The thrust::reduce operation guarantees the "atomicity" you are looking for: specifically, it guarantees that two threads never try to update a single location at the same time. However, the use of std::mt19937 in a multithreaded OMP app is outside the scope of my answer, I think. If I create an ordinary OpenMP app using the code you provided, I observe variability in the results, due (I think) to some interaction arising from the use of the std::mt19937 rng in multiple OMP threads. This is not something thrust can sort out for you.
Thrust also has random number generators, which are designed to work with it.