I was writing a program in MATLAB and, as generally recommended, tried to vectorize everything. But the program ended up quite slow, and I found that in one place the code is significantly faster when using loops (example below).
I would like to know whether I misinterpreted something or did something wrong, because performance matters in this case and I don't want to keep guessing whether vectorization or loops will be faster.
% data initialization
k = 8;
n = 2^k+1;
h = 1/(n-1);
cw = 0.1;
iter = 10000;
uloc = zeros(n);
fploc = uloc;
uloc(2:end-1,2:end-1) = 1;
vloc = uloc;
ploc = ones(n);
uloc2 = zeros(n);
fploc2 = uloc2;
uloc2(2:end-1,2:end-1) = 1;
vloc2 = uloc2;
ploc2 = ones(n);
%%%%%%%%%%%%%%%%%%%%%%
% vectorized version %
%%%%%%%%%%%%%%%%%%%%%%
tic
for it=1:iter
    il = 2:4;
    jl = 2:4;
    fploc(il,jl) = h/6*(-uloc(il-1,jl-1) + uloc(il-1,jl)...
        -2*uloc(il,jl-1) + 2*uloc(il,jl+1)...
        -uloc(il+1,jl) + uloc(il+1,jl+1)...
        ...
        -vloc(il-1,jl-1) - 2*vloc(il-1,jl)...
        +vloc(il,jl-1) - vloc(il,jl+1)...
        +2*vloc(il+1,jl) + vloc(il+1,jl+1))...
        ...
        +cw*h^2*(-ploc(il-1,jl) - ploc(il,jl-1) + 4*ploc(il,jl)...
        -ploc(il+1,jl) - ploc(il,jl+1));
end
toc
%%%%%%%%%%%%%%%%%%%%%%
% loop version %
%%%%%%%%%%%%%%%%%%%%%%
tic
for it=1:iter
    for il=2:4
        for jl=2:4
            fploc2(il,jl) = h/6*(-uloc2(il-1,jl-1) + uloc2(il-1,jl)...
                -2*uloc2(il,jl-1) + 2*uloc2(il,jl+1)...
                -uloc2(il+1,jl) + uloc2(il+1,jl+1)...
                ...
                -vloc2(il-1,jl-1) - 2*vloc2(il-1,jl)...
                +vloc2(il,jl-1) - vloc2(il,jl+1)...
                +2*vloc2(il+1,jl) + vloc2(il+1,jl+1))...
                ...
                +cw*h^2*(-ploc2(il-1,jl) - ploc2(il,jl-1) + 4*ploc2(il,jl)...
                -ploc2(il+1,jl) - ploc2(il,jl+1));
        end
    end
end
toc
I didn't go through your code, but the JIT compiler in recent versions of MATLAB has improved to the point where the situation you're facing is quite common: loops can be faster than vectorized code. It is difficult to know in advance which will be faster, so the best approach is to write the code in the most natural fashion, profile it, and then, if there is a bottleneck, try switching from loops to vectorized code (or the other way around).
MATLAB's just-in-time (JIT) compiler has been improved significantly over the last couple of years. And even though you are right that one should generally vectorize code, in my experience this holds only for certain operations and functions, and also depends on how much data your functions are handling.
The best way to find out what works best is to profile your MATLAB code with and without vectorization.
A matrix of only a few elements is probably not a good test of vectorization efficiency; in the end, what works well depends on the application.
Also, vectorized code usually looks better (truer to the underlying model), but in many cases it does not, and it ends up hurting the implementation. What you did is great, as now you know what works best for you.
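As a concrete sketch of such profiling (shown here on just the cw*h^2 part of the question's update; the loop bounds and values are taken from the question, but the snippet is only illustrative):

```matlab
% Sketch: use the MATLAB profiler to see where the time actually goes.
n = 257; h = 1/(n-1); cw = 0.1;
ploc = ones(n); fploc = zeros(n);
profile on
for it = 1:10000
    il = 2:4; jl = 2:4;
    fploc(il,jl) = cw*h^2*(-ploc(il-1,jl) - ploc(il,jl-1) ...
        + 4*ploc(il,jl) - ploc(il+1,jl) - ploc(il,jl+1));
end
profile viewer   % opens a per-line timing report; 'profile off' to stop
```

Running both variants under the profiler shows per-line times, which is more informative than a single tic/toc pair.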
I would not call this vectorisation.
You appear to be doing some sort of filtering operation. A truly vectorised version of such a filter multiplies the data (reshaped into a vector) by a filter matrix, that is, one matrix that represents the whole for-loop.
The problem with these matrices is that they are so sparse (only a few nonzero elements around the diagonal) that it is hardly ever efficient to use them. You can use the sparse command, but even then the elegance of the notation probably does not justify the extra memory required.
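As a sketch of that idea (restricted, for brevity, to the 5-point part of the question's stencil acting on ploc; the kron construction is a standard way to assemble such a 2-D operator):

```matlab
% Sketch: represent the 5-point stencil as one sparse matrix A, so that
% the entire "for-loop" becomes a single matrix-vector multiply.
n  = 257;
e  = ones(n,1);
L1 = spdiags([-e 2*e -e], -1:1, n, n);         % 1-D second-difference matrix
A  = kron(speye(n), L1) + kron(L1, speye(n));  % 2-D 5-point operator, n^2 x n^2
p  = ones(n);                                  % example data
fp = reshape(A * p(:), n, n);                  % whole filter applied at once
```

As noted above, A has n^2 rows with only about five nonzeros each, so sparse storage is essential, and the extra memory and indexing overhead can still outweigh the notational elegance.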
Matlab used to be bad at for loops because even the loop counters were treated as complex matrices, so all the checks associated with such matrices were evaluated at every iteration. My guess is that inside your for loop, all those checks are still performed every time you apply the filter coefficients.
Perhaps the matlab functions filter and filter2 are useful here?
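For instance, the cw*h^2 part of the question's update is a plain 3x3 stencil, so it could be expressed with filter2 (a sketch only; the u and v terms would need their own kernels):

```matlab
% Sketch: the p-part of the update as a 3x3 correlation kernel,
% applied with filter2 to the whole interior in one call.
n    = 257;  h = 1/(n-1);  cw = 0.1;
K    = cw*h^2 * [ 0 -1  0 ;
                 -1  4 -1 ;
                  0 -1  0 ];
ploc = ones(n);
fp   = filter2(K, ploc, 'valid');  % (n-2)x(n-2) result on the interior
```

filter2 performs 2-D correlation, so the kernel entries map directly onto the signed neighbor coefficients in the question's expression.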
You may also want to read this post: Improving MATLAB Matrix Construction Code: Or, Code Vectorization for Beginners
One possible explanation is startup overhead. If a temporary matrix is created behind the scenes, be prepared for memory allocations. Also, I guess MATLAB cannot deduce that your matrix is small, so there will be loop overhead. Your vectorized version may therefore end up as code like
double *tmp = (double*)malloc(n * sizeof(double));  /* hidden temporary */
for (size_t k = 0; k < n; ++k)
{
    /* do stuff with elements */
}
free(tmp);
Compare this to a known number of operations:
double temp[2];
temp[0]=...;
temp[1]=...;
So the JIT-compiled loop may be faster when the malloc/loop-counter/free overhead is large compared to the workload of each computation.