In MATLAB, say I have a 10 x 100 matrix called M. What I would like to do is extract particular indices of this matrix and operate on them immediately, based on the row index, in a vectorized fashion.

For example, for the first row I want to compute sum(M(1, 1:1:100)). For row two, I want sum(M(2, 1:2:100)); for row three, sum(M(3, 1:3:100)); and so on. For row ten, I have of course sum(M(10, 1:10:100)).

I have this in a for loop, but I want to see if there is a way to extract this data without a for loop. Thank you.
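For readers outside MATLAB, the strided row sums being asked for can be sketched as a short Python/NumPy loop (a translation for illustration only; note that NumPy is 0-based, so row i uses step i+1):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((10, 100))   # stand-in for the 10 x 100 matrix M

m, n = M.shape
# Row i (0-based) sums every (i+1)-th column starting from the first,
# i.e. the analogue of MATLAB's sum(M(i+1, 1:(i+1):100)).
S = np.array([M[i, ::i + 1].sum() for i in range(m)])
```

This is the loopy baseline the question wants to eliminate.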
You can try this one-liner:
S=arrayfun(@(n) sum(M(n,1:n:100)), 1:10)
Alternatively, you can create a sparse matrix beforehand
A = sparse(100,10);
for n = 1:10
    A(1:n:100, n) = 1;
end
and find the sum by
S=diag(M*A);
For larger matrices this can be optimized further by building the transposed mask instead, A = sparse(10,100) with A(n, 1:n:100) = 1 in the loop, and summing element-wise:
S = sum(M.*A, 2);
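For intuition, the matrix-product trick behind S = diag(M*A) can be checked in a few lines of NumPy (a dense sketch for illustration; the MATLAB version uses a sparse A):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.random((10, 100))
m, n = M.shape

# Column i of A is the 0/1 indicator of the columns row i should sum.
# Entry (j, i) of M @ A is then dot(row j of M, indicator i), and the
# diagonal picks the matching row/indicator pairs.
A = np.zeros((n, m))
for i in range(m):
    A[::i + 1, i] = 1.0

S = np.diag(M @ A)

# Sanity check against the straightforward loop.
loop = np.array([M[i, ::i + 1].sum() for i in range(m)])
assert np.allclose(S, loop)
```

The off-diagonal entries of M @ A are computed and thrown away, which is why the element-wise sum(M.*A, 2) variant is cheaper for large matrices.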
My quick benchmarking:
M = rand(10,100);
sz = size(M);

%// plain double loop
tic
for k = 1:10000
    for n = 1:sz(1)
        B(n) = sum(M(n,1:n:end));
    end
end
toc

%// arrayfun
tic
for k = 1:10000
    B = arrayfun(@(n) sum(M(n,1:n:end)), 1:sz(1));
end
toc

%// sparse + matrix product, mask rebuilt every iteration
tic
for k = 1:10000
    A = sparse(sz(2), sz(1));
    for n = 1:sz(1)
        A(1:n:end, n) = 1;
    end
    B = diag(M*A);
end
toc

%// sparse + matrix product, mask built once
tic
A = sparse(sz(2), sz(1));
for n = 1:sz(1)
    A(1:n:end, n) = 1;
end
for k = 1:10000
    B = diag(M*A);
end
toc

%// sparse mask + element-wise product, mask built once
tic
A = sparse(sz(1), sz(2));
for n = 1:sz(1)
    A(n, 1:n:end) = 1;
end
for k = 1:10000
    B = sum(M.*A, 2);
end
toc
returns
Elapsed time is 0.552470 seconds.
Elapsed time is 2.409102 seconds.
Elapsed time is 0.638072 seconds.
Elapsed time is 0.052246 seconds.
Elapsed time is 0.061893 seconds.
For a 30-by-1000 matrix:
Elapsed time is 1.785664 seconds.
Elapsed time is 3.954034 seconds.
Elapsed time is 4.760436 seconds.
Elapsed time is 0.926118 seconds.
Elapsed time is 0.865330 seconds.
and for a 1000-by-100 matrix:
Elapsed time is 51.389322 seconds.
Elapsed time is 63.443414 seconds.
Elapsed time is 68.327187 seconds.
Elapsed time is 29.056304 seconds.
Elapsed time is 1.147215 seconds.
Proposed solution

I was finally able to crack this with a truly vectorized solution that builds a logical mask to select the elements of the input matrix that are to be summed up. The trick was bsxfun with the function handle @mod. The code is listed below -

[m,n] = size(M);
mask = bsxfun(@mod, 1:n, (1:m)') == 1; %// all the magic happens here: a logical
                                       %// mask with 1's at the places in the input
                                       %// matrix whose elements are to be summed
                                       %// and 0's elsewhere
mask(1,:) = 1; %// mod(x,1) is always 0, so fix row 1: all of its elements are summed
sumvals = sum(mask.*M, 2); %// finally get the sum values
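As a sanity check, the same mask can be reproduced with NumPy broadcasting (the analogue of bsxfun); this is a sketch for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.random((10, 100))
m, n = M.shape

# A 1-based column index j is selected in 1-based row i iff j % i == 1,
# i.e. j is in {1, 1+i, 1+2i, ...}. Broadcasting replaces bsxfun here.
cols = np.arange(1, n + 1)            # 1 .. n as a row
rows = np.arange(1, m + 1)[:, None]   # 1 .. m as a column
mask = (cols % rows) == 1

# x % 1 is always 0, so row 1 needs the same fix as in the MATLAB code.
mask[0, :] = True

sumvals = (mask * M).sum(axis=1)
```

Multiplying by the mask zeroes out the unselected entries, so a plain row sum gives the strided sums in one shot.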
Benchmarking

In this benchmarking section I am including four approaches: the one listed earlier in this post, its GPU-ported version, and the arrayfun and sparse based approaches listed in the other solution. Three sets of input data were used for the benchmarking:

Set 1: The number of columns is kept at 10x the number of rows, as in the question.

Set 2: The number of rows is extended to 10x the number of columns. This really tests the loopy codes, as the number of iterations is larger here.

Set 3: An extension of set 2 that increases the number of rows further, making it another good test of the truly vectorized approaches against the others.
The functions used for benchmarking are listed next -
function sumvals = sumrows_stepped_bsxfun(M)
%// ... same as the code posted earlier
return
function sumvals = sumrows_stepped_bsxfun_gpu(M)
    gM = gpuArray(M);
    [m,n] = size(gM);
    mask = bsxfun(@mod, gpuArray.colon(1,n), gpuArray.colon(1,m)') == 1;
    sumvals = gather(sum(mask.*gM, 2));
    sumvals(1) = sum(M(1,:)); %// fix row 1, where mod(x,1)==1 never holds
return
function S = sumrows_stepped_arrayfun(M)
    [m,n] = size(M);
    S = arrayfun(@(x) sum(M(x,1:x:n)), 1:m);
return

function B = sumrows_stepped_sparse(M)
    sz = size(M);
    A = sparse(sz(1), sz(2));
    for n = 1:sz(1)
        A(n, 1:n:end) = 1;
    end
    B = full(sum(M.*A, 2));
return
Please note that timeit was used for timing the CPU-based codes and gputimeit for the GPU-based ones.
System Configuration used for testing -
MATLAB Version: 8.3.0.532 (R2014a)
Operating System: Windows 7
RAM: 3GB
CPU Model: Intel® Pentium® Processor E5400 (2M Cache, 2.70 GHz)
GPU Model: GTX 750Ti 2GB
The benchmarking results thus obtained -
Conclusions

For data sizes with fewer rows than columns, the number of iterations is small, so the loopy codes seem to have the upper hand. As the number of rows increases, the benefit of truly vectorized approaches becomes clear. You will also notice that the CPU-based bsxfun approach does well for set 3 up until around the 12000 x 300 mark; beyond that, the huge logical mask it creates pushes the memory bandwidth requirement too high for its compute capability to keep up with. This makes sense, as vectorized operations by definition perform operations on many elements in one go, so memory bandwidth is essential. With a better machine and more RAM, that 12000 x 300 mark should extend further.

If you could extend the number of rows even more, the benefits of a vectorized solution would become clearer, as long as the memory bandwidth is kept in check.
Benchmarking Code
Finally here's the benchmarking code if anyone would like to test it out on their systems -
clear all; clc; close all

outputfile = 'results.xlsx';
delete(outputfile); %// remove the old file so fresh results can be written

base_datasize_array = 40:60:400;
methods = {'BSXFUN on GPU','BSXFUN on CPU','ARRAYFUN','SPARSE'};
num_approaches = numel(methods);
num_sets = 3;
timeall_all = zeros(num_approaches,numel(base_datasize_array),num_sets);
datasize_lbs = cell(numel(base_datasize_array),num_sets);

for set_id = 1:num_sets
    switch set_id
        case 1
            N1_arr = base_datasize_array*2;
            N2_arr = N1_arr*10;
        case 2
            N2_arr = base_datasize_array*2;
            N1_arr = N2_arr*10;
        case 3
            N2_arr = base_datasize_array;
            N1_arr = N2_arr*40;
    end

    timeall = zeros(num_approaches,numel(N1_arr));
    for iter = 1:numel(N1_arr)
        M = rand(N1_arr(iter),N2_arr(iter));

        f = @() sumrows_stepped_bsxfun_gpu(M);
        timeall(1,iter) = gputimeit(f); clear f

        f = @() sumrows_stepped_bsxfun(M);
        timeall(2,iter) = timeit(f); clear f

        f = @() sumrows_stepped_arrayfun(M);
        timeall(3,iter) = timeit(f); clear f

        f = @() sumrows_stepped_sparse(M);
        timeall(4,iter) = timeit(f); clear f
    end
    timeall_all(:,:,set_id) = timeall;

    wp = repmat({' '},numel(N1_arr),1);
    datasize_lbs(:,set_id) = strcat(cellstr(num2str(N1_arr.')),' x ',...
        wp,cellstr(num2str(N2_arr.')));
end

for set_id = 1:num_sets
    out_cellarr = cell(numel(methods)+1,numel(N1_arr)+1);
    out_cellarr(1,1) = {'Methods'};
    out_cellarr(2:end,1) = methods;
    out_cellarr(1,2:end) = datasize_lbs(:,set_id);
    out_cellarr(2:end,2:end) = cellfun(@(x) num2str(x),...
        num2cell(timeall_all(:,:,set_id)),'Uni',0);
    xlswrite(outputfile, out_cellarr, set_id);
end
Since there is an interesting performance effect in the sparse/matrix-multiplication approach, I will post some additional results:
M = rand(1000,100);
sz = size(M);

% PLAIN LOOP
tic
out1 = zeros(sz(1),1);
for k = 1:10000
    for n = 1:sz(1)
        out1(n) = sum(M(n,1:n:sz(2)));
    end
end
toc

% SPARSE MATRIXMULT
tic
A = sparse(sz(2),sz(1)); % note: sparse(sz) would create a 1-by-2 matrix
for n = 1:sz(1)
    A(1:n:sz(2),n) = 1;
end
for k = 1:10000
    out2 = diag(M*A);
end
toc

isequal(out1,out2) % ok
Plain loop: 11.441380 seconds.
Sparse/matrixmult: 27.503829 seconds.
As the number of rows of the matrix grows, the plain loop is more efficient.