Thrust equivalent of Open MP code

2020-04-21 03:47发布

问题:

The code i'm trying to parallelize in open mp is a Monte Carlo that boils down to something like this:

int seed = 0;
std::mt19937 rng(seed); 
double result = 0.0;
int N = 1000;

#pragma omp parallel for
for(i=0; x < N; i++)
{
    result += rng()
}
std::cout << result << std::endl;

I want to make sure that the state of the random number generator is shared across threads, and the addition to the result is atomic.

Is there a way of replacing this code with something from thrust::omp. From the research that I did so far it looks like thrust::omp is more of a directive to use multiple CPU threads rather than GPU for some standard thrust operations.

回答1:

Yes, it's possible to use thrust to do something similar, with (parallel) execution on the host CPU using OMP threads underneath the thrust OMP backend. Here's one example:

$ cat t535.cpp
#include <random>
#include <iostream>
#include <thrust/system/omp/execution_policy.h>
#include <thrust/system/omp/vector.h>
#include <thrust/reduce.h>

int main(int argc, char *argv[]){
  unsigned N = 1;
  int seed = 0;
  if (argc > 1)  N = atoi(argv[1]);
  if (argc > 2)  seed = atoi(argv[2]);
  std::mt19937 rng(seed);
  unsigned long result = 0;

  thrust::omp::vector<unsigned long> vec(N);
  thrust::generate(thrust::omp::par, vec.begin(), vec.end(), rng);
  result = thrust::reduce(thrust::omp::par, vec.begin(), vec.end());
  std::cout << result << std::endl;
  return 0;
}
$ g++ -std=c++11 -O2 -I/usr/local/cuda/include -o t535 t535.cpp -fopenmp -lgomp
$ time ./t535 100000000
214746750809749347

real    0m0.700s
user    0m2.108s
sys     0m0.600s
$

For this test I used Fedora 20, with CUDA 6.5RC, running on a 4-core Xeon CPU (netting about a 3x speedup based on time results). There are probably some further "optimizations" that could be made for this particular code, but I think they will unnecessarily clutter the idea, and I assume that your actual application is more complicated than just summing random numbers.

Much of what I show here was lifted from the thrust direct system access page but there are several comparable methods to access the OMP backend, depending on whether you want to have a flexible, retargettable code, or you want one that specifically uses the OMP backend (this one specifically targets OMP backend).

The thrust::reduction operation guarantees the "atomicity" you are looking for. Specifically, it guarantees that two threads are not trying to update a single location at the same time. However the use of std::mt19937 in a multithreaded OMP app is outside the scope of my answer, I think. If I create an ordinary OMP app using the code you provided, I observe variability in the results due (I think) to some interaction between the use of the std::mt19937 rng in multiple OMP threads. This is not something thrust can sort out for you.

Thrust also has random number generators, which are designed to work with it.