For loops seem to be extremely slow, so I was wondering if the nested loops in the code shown next could be vectorized using bsxfun and maybe GPU could be introduced too.

Code

%// Paramaters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;

%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;

%// Nested Loops - I 
for x = 1:n1
    for y = 1:n1
        num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
        LInc(x, y) = L1(x, y) + (num/denom);
        LInc(y, x) = LInc(x, y);
    end
end

%// Nested Loops - II
for x = 1:n1
    for y = 1:n2
        num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
        LInc(x, n1+y) = num/denom;
        LInc(n1+y, x) = LInc(x, n1+y);
    end
end

Edit 1: n and denom could be assumed as constants too.

标签： performance matlab gpu vectorization bsxfun

2条回答

爱情/是我丢掉的垃圾

2楼-- · 2019-01-18 18:03

Here are vectorized CPU and GPU codes and I am hoping that I am using at least good practices for the GPU code and the benchmarking later on.

CPU Code

%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;

%// Calculate num/denom value for stage 1 and 2
nd1 = L1 + (((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - n2*n*bsxfun(@plus,L1(:,i),L1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*L1(:,i),n2*n*L2(:,j).') - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1))))./denom; %//'

%// Plug in the values in the output matrix
LInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
LInc(n1+1:end,1:n1) = nd2.';  %//'
LInc(1:n1,n1+1:end) = nd2;

GPU Code

%// Pre-allocate for output
gLInc = zeros(n1+n2,n1+n2,'gpuArray');

%// Convert to gpu arrays
gL1 = gpuArray(L1);
gL2 = gpuArray(L2);

%// Calculate num/denom value for stage 1 and 2
nd1 = gL1 + (((n2 ^ 2) * (gL1(i, i) + gL2(j, j) + 1)) - n2*n*bsxfun(@plus,gL1(:,i),gL1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*gL1(:,i),n2*n*gL2(:,j).') - ((n1 * n2 * (gL1(i, i) + gL2(j, j) + 1))))./denom; %//'

%// Plug in the values in the output matrix
gLInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
gLInc(n1+1:end,1:n1) = nd2.';  %//'
gLInc(1:n1,n1+1:end) = nd2;

%// Gather data from GPU back to CPU
LInc = gather(gLInc);

Benchmarking

GPU benchmarking tips were taken from Measure and Improve GPU Performance.

%// Warm up GPU call with insignificant small scalar inputs, just in case
%// gputimeit doesn't do the same
temp1 = modp2(1,1,1,1,1,1,1,1); %// This is vectorized GPU code

i = 1;
j = 3;
n = 1000; %// Assumed
denom = 1e6;  %// Assumed

N_arr = [50 100 200 500 1000 1500]; %// array elements for N (datasize)
timeall = zeros(3,numel(N_arr));

for k1 = 1:numel(N_arr)
    N = N_arr(k1);
    n1 = N;  %// n1, n2 are assumed identical for less-complicated benchmarking
    n2 = N;

    L1 = rand(n1,n1);
    L2 = rand(n2,j);

    f = @() modp0(i,j,n1,n2,L1,L2,n,denom);%// Original CPU w/ preallocation
    timeall(1,k1) = timeit(f);
    clear f

    f = @() modp1(i,j,n1,n2,L1,L2,n,denom);%// Vectorzied CPU code
    timeall(2,k1) = timeit(f);
    clear f

    f = @() modp2(i,j,n1,n2,L1,L2,n,denom);%// Vectorized GPU(GTX 750Ti) code
    timeall(3,k1) = gputimeit(f);
    clear f
end

%// Display benchmark results
figure,hold on, grid on
plot(N_arr,timeall(1,:),'-b.')
plot(N_arr,timeall(2,:),'-ro')
plot(N_arr,timeall(3,:),'-kx')
legend('Original CPU','Vectorized CPU','Vectorized GPU (GTX 750 Ti)')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')

Results

enter image description here

Conclusions

Results show that the vectorized GPU code performs really well with higher datasize and goes from slower than both the vectorized CPU and original code to being twice as fast as the vectorized CPU code.

0人赞添加讨论(0) 举报

神经病院院长

3楼-- · 2019-01-18 18:16

If you have not done so, you should preallocate LInc.

LInc = zeros(n1,n2);

If you want to vectorize it, you don't need to use bsxfun to vectorize your code. I think you can do something like

x = 1:n1;
y = 1:n1;
num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
                    LInc(x, y) = L1(x, y) + (num/denom);

However, this code is confusing to me because as it is, you are overwriting the value of LInc several times. Without knowing what your goal is its hard for me to help more. The above code probably will not return the same values as your function.

0人赞添加讨论(0) 举报

Vectorizing nested loops in matlab using bsxfun an

CPU Code

GPU Code

Benchmarking

Results

Conclusions

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间