For loops seem to be extremely slow, so I was wondering whether the nested loops in the code below could be vectorized using bsxfun, and whether a GPU could be brought in too.
Code
%// Parameters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;
%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;
%// Nested Loops - I
for x = 1:n1
    for y = 1:n1
        num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
        LInc(x, y) = L1(x, y) + (num/denom);
        LInc(y, x) = LInc(x, y);
    end
end
%// Nested Loops - II
for x = 1:n1
    for y = 1:n2
        num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
        LInc(x, n1+y) = num/denom;
        LInc(n1+y, x) = LInc(x, n1+y);
    end
end
Edit 1: n and denom can be assumed to be constants too.
Here are vectorized CPU and GPU codes, and I am hoping that I am using at least good practices for the GPU code and the benchmarking later on.
CPU Code
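A minimal sketch of what a bsxfun-based CPU version of the two loop nests could look like. The sample values for L1, L2, n and denom are assumptions made only so the snippet runs on its own (the question does not show them), and the helper names c, a1, a2, V, A and B are hypothetical.

```matlab
%// Hypothetical sample inputs; the question does not show L1, L2, n or denom
i = 1;  j = 3;
n1 = 4; n2 = 5;
n = n1 + n2;                 %// assumed constant
denom = 7;                   %// assumed constant
L1 = rand(n1);  L2 = rand(n2);

c  = L1(i,i) + L2(j,j) + 1;  %// scalar shared by both loop nests
a1 = L1(1:n1, i);            %// n1-by-1 column used by both blocks
a2 = L2(1:n2, j);            %// n2-by-1 column used by block II

%// Block I: the value the loop body would assign at (x,y)
V = L1(1:n1,1:n1) + ((n2^2)*c - n2*n*bsxfun(@plus, a1, a1.')) / denom;
%// The loop also mirrors every entry, and the write made for x > y lands
%// last, so the result is the lower triangle of V reflected across the diagonal
A = tril(V) + tril(V,-1).';

%// Block II: off-diagonal n1-by-n2 block
B = (bsxfun(@plus, n1*n*a1, n2*n*a2.') - n1*n2*c) / denom;

%// Assemble; the lower-right n2-by-n2 block is never written by the loops
LInc = zeros(n1+n2);
LInc(1:n1, 1:n1)     = A;
LInc(1:n1, n1+1:end) = B;
LInc(n1+1:end, 1:n1) = B.';
```

If L1 is symmetric, V is already symmetric and the tril step can be dropped in favour of LInc(1:n1,1:n1) = V.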
GPU Code
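A hedged sketch of how the same computation might be moved to the GPU with gpuArray and gather from Parallel Computing Toolbox. The sample inputs are again assumptions, the g-prefixed names are hypothetical, and the useGPU fallback is only there so the snippet still runs on a machine without a supported device.

```matlab
%// Hypothetical sample inputs; not taken from the question
i = 1;  j = 3;
n1 = 4; n2 = 5;
n = n1 + n2;  denom = 7;       %// assumed constants
L1 = rand(n1);  L2 = rand(n2);

%// Use the GPU only if Parallel Computing Toolbox reports a device
useGPU = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;

%// Transfer only what the computation reads, and preallocate on the device
if useGPU
    gL1   = gpuArray(L1(1:n1, 1:n1));
    ga2   = gpuArray(L2(1:n2, j));
    gLInc = zeros(n1+n2, 'gpuArray');
else
    gL1   = L1(1:n1, 1:n1);
    ga2   = L2(1:n2, j);
    gLInc = zeros(n1+n2);
end
c   = L1(i,i) + L2(j,j) + 1;   %// scalar, computed on the host
ga1 = gL1(:, i);

%// Same arithmetic as the CPU version, now on device arrays when useGPU is true
V = gL1 + ((n2^2)*c - n2*n*bsxfun(@plus, ga1, ga1.')) / denom;
B = (bsxfun(@plus, n1*n*ga1, n2*n*ga2.') - n1*n2*c) / denom;

gLInc(1:n1, 1:n1)     = tril(V) + tril(V,-1).';
gLInc(1:n1, n1+1:end) = B;
gLInc(n1+1:end, 1:n1) = B.';

%// A single transfer back to the host at the very end
if useGPU
    LInc = gather(gLInc);
else
    LInc = gLInc;
end
```

Keeping the intermediate results on the device and calling gather once avoids paying the host-device transfer cost more than necessary.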
Benchmarking
GPU benchmarking tips were taken from Measure and Improve GPU Performance.
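A rough sketch of how such a benchmark could be set up with timeit on the CPU and gputimeit on the GPU. The operands a and b are hypothetical stand-ins sized like the question's off-diagonal block, and the timed expression is only the bsxfun expansion, not the full code.

```matlab
%// Hypothetical operands sized like the question's off-diagonal block
n1 = 1500;  n2 = 1500;
a = rand(n1, 1);  b = rand(n2, 1);

%// CPU: timeit calls the handle several times and returns a typical time
f_cpu = @() bsxfun(@plus, a, b.');
t_cpu = timeit(f_cpu);
fprintf('CPU: %.6f s\n', t_cpu);

%// GPU: gputimeit also makes sure the device has finished before it stops timing
if exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0
    ga = gpuArray(a);  gb = gpuArray(b);   %// transfers kept outside the timed region
    f_gpu = @() bsxfun(@plus, ga, gb.');
    t_gpu = gputimeit(f_gpu);
    fprintf('GPU: %.6f s, speedup %.2fx\n', t_gpu, t_cpu / t_gpu);
end
```

Plain tic/toc around GPU code can stop the clock before the device has actually finished, which is why a GPU-aware timer is used for that half.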
Results
Conclusions
The results show that the vectorized GPU code scales well with data size: it goes from being slower than both the vectorized CPU code and the original loops to being twice as fast as the vectorized CPU code at larger sizes.
If you have not done so, you should preallocate LInc.
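For reference, a small sketch comparing explicit preallocation with zeros against the last-element assignment used in the question. LInc2 is a hypothetical name used only for the comparison.

```matlab
n1 = 1500;  n2 = 1500;

%// Explicit preallocation: always yields a fresh all-zero matrix
LInc = zeros(n1 + n2);

%// The question's idiom creates the matrix by writing its last element.
%// It only gives a clean zero matrix when the variable does not exist yet.
clear LInc2
LInc2(n1 + n2, n1 + n2) = 0;

sameResult = isequal(LInc, LInc2);   %// compares the two preallocated matrices
```

The zeros form is a bit safer in scripts that get re-run, since a leftover LInc from an earlier run would keep its old values under the indexed-assignment idiom.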
If you want to vectorize it, you don't need bsxfun. I think you can do something like
However, this code is confusing to me because, as written, you are overwriting values of LInc several times. Without knowing what your goal is, it's hard for me to help more. The above code probably will not return the same values as your function.
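On the overwriting point, here is a hedged sketch for the first loop nest only. It expands the column with repmat instead of bsxfun and keeps whichever write the loop performs last, then compares against the loop as posted. The sample inputs are assumptions, and LLoop, LVec, col and maxDiff are hypothetical names.

```matlab
%// Hypothetical sample inputs; not taken from the question
i = 1;  j = 3;
n1 = 4; n2 = 5;
n = n1 + n2;  denom = 7;        %// assumed constants
L1 = rand(n1);  L2 = rand(n2);

%// First loop nest, as posted (writing into LLoop instead of LInc)
LLoop = zeros(n1);
for x = 1:n1
    for y = 1:n1
        num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
        LLoop(x, y) = L1(x, y) + (num/denom);
        LLoop(y, x) = LLoop(x, y);
    end
end

%// Same block without bsxfun: expand the column explicitly with repmat
col = repmat(L1(1:n1, i), 1, n1);            %// col(x,y) = L1(x,i)
num = (n2^2) * (L1(i,i) + L2(j,j) + 1) - n2 * n * (col + col.');
V   = L1(1:n1, 1:n1) + num / denom;

%// Each pair (x,y) is written twice; the iteration with x > y runs last,
%// so the lower triangle of V wins and is mirrored to the upper triangle
LVec = tril(V) + tril(V, -1).';

maxDiff = max(abs(LVec(:) - LLoop(:)));      %// largest gap between loop and vectorized result
```

The num term is symmetric in x and y, so the overwriting only matters through L1(x,y); when L1 is symmetric the two writes agree and the order is irrelevant.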