matlab if statements with CUDA

2020-04-10 21:48发布

I have the following matlab code:

randarray = gpuArray(rand(N,1));
N = 1000;

tic
g=0;
for i=1:N

    if randarray(i)>10
        g=g+1;
    end

end
toc

secondrandarray = rand(N,1);
 g=0;

 tic 
for i=1:N

    if secondrandarray(i)>10
        g=g+1;
    end

end
toc



Elapsed time is 0.221710 seconds.
Elapsed time is 0.000012 seconds.

1) Why is the if clause so slow on the GPU? It is slowing down all my attempts at optimisation

2) What can I do to get around this limitation?

Thanks

标签: matlab cuda
4条回答
做个烂人
2楼-- · 2020-04-10 22:03

Using MATLAB R2011b and Parallel Computing Toolbox on a now rather old GPU (Tesla C1060), here's what I see:

>> g = 100*parallel.gpu.GPUArray.rand(1, 1000);
>> tic, sum(g>10); toc
Elapsed time is 0.000474 seconds.

Operating on scalar elements of a gpuArray one at a time is always going to be slow, so using the sum method is much quicker.

查看更多
爷、活的狠高调
3楼-- · 2020-04-10 22:05

No expert on the Matlab gpuArray implementation, but I would suspect that each randarray(i) access in the first loop triggers a PCI-e transaction to retrieve a value from GPU memory, which will incur a very large latency penalty. You might be better served by calling gather to transfer the whole array in a single transaction instead and then loop over a local copy in host memory.

查看更多
Root(大扎)
4楼-- · 2020-04-10 22:10

This is typically a bad thing to do no matter if you are doing it on the cpu or the gpu.

The following would be a good way to do the operation you are looking at.

N = 1000;
randarray = gpuArray(100 * rand(N,1));
tic
g = nnz(randarray > 10);
toc

I do not have PCT and I can not verify if this actually works (number of functions supported on GPU are fairly limited).

However if you had Jacket, you would definitely be able to do the following.

N = 1000;
randarray = gdouble(100 * rand(N, 1));
tic
g = nnz(randarray > 10);
toc

Full disclosure: I am one of the engineers developing Jacket.

查看更多
▲ chillily
5楼-- · 2020-04-10 22:18

I cannot comment on a prior solution because I'm too new, but extending on the solution from Pavan. The nnz function is (not yet) implemented for gpuArrays, at least on the Matlab version I'm using (R2012a).

In general, it is much better to vectorize Matlab code. However, in some cases looped code can run fast in Matlab bercause of the JIT compilation.

Check the results from

N = 1000;
randarray_cpu = rand(N,1);
randarray_gpu = gpuArray(randarray_cpu);
threshold     = 0.5;

% CPU: looped
g=0;
tic
for i=1:N
    if randarray_cpu(i)>threshold
        g=g+1;
    end
end
toc

% CPU: vectorized
tic
g = nnz(randarray_cpu>threshold);
toc

% GPU: looped
tic
g=0;
for i=1:N
    if randarray_gpu(i)>threshold
        g=g+1;
    end
end
toc

% GPU: vectorized
tic
g_d = sum(randarray_gpu > threshold);
g = gather(g_d); % I'm assuming that you want this in the CPU at some point
toc

Which is (on my core i7+ GeForce 560Ti):

Elapsed time is 0.000014 seconds.
Elapsed time is 0.000580 seconds.
Elapsed time is 0.310218 seconds.
Elapsed time is 0.000558 seconds.

So what we see from this case is:

Loops in Matlab are not considered good praxis, but in your particular case, it does run fast because Matlab somehow "precompiles" it internally. I changed your threshold from 10 to 0.5, as rand will never give you a value higher than 1.

The looped GPU version performs horribly because at each loop iteration, a kernel is launched (or data is read from the GPU, however TMW implemented that...), which is slow. A lot of small memory transfers while calculating basically nothing are the worst thing one could do on the GPU.

From the last (best) GPU result the answer would be: unless the data is already on the GPU, it doesn't make sense to calculate this on the GPU. Since the arithmetic complexity of your operation is basically nonexistent, the memory transfer overhead does not pay off in any way. If this is part of a bigger GPU calculation, it's OK. If not... better stick to the CPU ;)

查看更多
登录 后发表回答