Sequential operation in GPU implementation

I have to implement the following algorithm in GPU

for(int I = 0; I < 1000; I++){
    VAR1[I+1] = VAR1[I] + VAR2[2*K+(I-1)];//K is a constant
}

Each iteration is dependent on previous so the parallelizing is difficult. I am not sure if atomic operation is valid here. What can I do?

EDIT:

The VAR1 and VAR2 both are 1D array.

VAR1[0] = 1

标签： cuda parallel-processing gpu

1条回答

爷的心禁止访问

2楼-- · 2019-06-13 17:32

This is in a category of problems called recurrence relations. Depending on the structure of the recurrence relation, there may exist closed form solutions that describe how to compute each element individually (i.e. in parallel, without recursion). One of the early seminal papers (on parallel computation) was Kogge and Stone, and there exist recipes and strategies for parallelizing specific forms.

Sometimes recurrence relations are so simple that we can identify a closed-form formula or algorithm with a little bit of "inspection". This short tutorial gives a little bit more treatment of this idea.

In your case, let's see if we can spot anything just by mapping out what the first few terms of VAR1 should look like, substituting previous terms into newer terms:

i      VAR1[i]
___________________
0        1
1        1 + VAR2[2K-1]
2        1 + VAR2[2K-1] + VAR2[2K]
3        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1]
4        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1] + VAR2[2K+2]
...

Hopefully what jumps out at you is that the VAR2[] terms above follow a pattern of a prefix sum.

This means one possible solution method could be given by:

VAR1[i] = 1+prefix_sum(VAR2[2K + (i-2)])   (for i > 0) notes:(1) (2)
VAR1[i] = 1                                (for i = 0)

Now, a prefix sum can be done in parallel (this is not truly a fully independent operation, but it can be parallelized. I don't want to argue too much about terminology or purity here. I'm offering one possible method of parallelization for your stated problem, not the only way to do it.) To do a prefix sum in parallel on the GPU, I would use a library like CUB or Thrust. Or you can write your own although I wouldn't recommend it.

Notes:

the use of -1 or -2 as an offset to i for the prefix sum may be dictated by your use of an inclusive or exclusive scan or prefix sum operation.
VAR2 must be defined over an appropriate domain to make this sensible. However that requirement is implicit in your problem statement.

Here is a trivial worked example. In this case, since the VAR2 indexing term 2K+(I-1) just represents a fixed offset to I (2K-1), we are simply using an offset of 0 for demonstration purposes, so VAR2 is just a simple array over the same domain as VAR1. And I am defining VAR2 to just be an array of all 1, for demonstration purposes. The gpu parallel computation occurs in the VAR1 vector, the CPU equivalent computation is just computed on-the-fly in the cpu variable for validation purposes:

$ cat t1056.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>

const int dsize = 1000;
using namespace thrust::placeholders;
int main(){

  thrust::device_vector<int> VAR2(dsize, 1);  // initialize VAR2 array to all 1's
  thrust::device_vector<int> VAR1(dsize);
  thrust::exclusive_scan(VAR2.begin(), VAR2.end(), VAR1.begin(), 0); // put prefix sum of VAR2 into VAR1
  thrust::transform(VAR1.begin(), VAR1.end(), VAR1.begin(),  _1 += 1);   // add 1 to every term
  int cpu = 1;
  for (int i = 1; i < dsize; i++){
    int gpu = VAR1[i];
    cpu += VAR2[i];
    if (cpu != gpu) {std::cout << "mismatch at: " << i << " was: " << gpu << " should be: " << cpu << std::endl; return 1;}
    }
  std::cout << "Success!" << std::endl;
  return 0;
}

$ nvcc -o t1056 t1056.cu
$ ./t1056
Success!
$

For an additional reference particular to the usage of scan operations to solve linear recurrence problems, refer to Blelloch's paper here section 1.4.

0人赞添加讨论(0) 举报

Sequential operation in GPU implementation

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间