Extract data from MATLAB matrix without for-loop

Posted 2019-05-18 10:32

Question:

In MATLAB, say I have a 10 x 100 matrix called M. I would like to extract particular indices of this matrix and operate on them immediately, based on the row index, in a vectorized fashion.

For example, for the first row I want to compute sum(M(1, 1:1:100)); for row two, sum(M(2, 1:2:100)); for row three, sum(M(3, 1:3:100)); and so on, up to sum(M(10, 1:10:100)) for row ten.

I have this in a for loop, but I want to see if there is a way to extract this data without a for loop. Thank you.
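For reference, the loop being replaced probably looks something like this (a sketch, assuming the 10-by-100 M described above):

```matlab
M = rand(10,100);              % example input
S = zeros(10,1);               % preallocate the result
for n = 1:10
    S(n) = sum(M(n,1:n:100));  % row n, every n-th column
end
```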

Answer 1:

You can try this one-liner:

S=arrayfun(@(n) sum(M(n,1:n:100)), 1:10)

Alternatively, you can build a sparse selection matrix beforehand:

A=sparse(100,10);
for n=1:10, 
   A(1:n:100, n)=1; 
end

and find the sum by

S=diag(M*A);

This can be optimized further for larger matrices by defining A=sparse(10,100) (the transpose of the above) and computing

S=sum(M.*A,2);
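Put together, this transposed variant reads as follows (the same fragments as above, assembled into one runnable sketch):

```matlab
M = rand(10,100);              % example input
A = sparse(10,100);
for n = 1:10
    A(n,1:n:100) = 1;          % ones at every n-th column of row n
end
S = full(sum(M.*A,2));         % elementwise mask, then row sums
```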

My quick benchmarking:

M=rand(10,100);
sz = size(M);
tic;
for k=1:10000,
    for n=1:sz(1),
        B(n)=sum(M(n,1:n:end));
    end
end
toc

tic;
for k=1:10000,
    B=arrayfun(@(n) sum(M(n,1:n:end)), 1:sz(1));
end
toc

tic;
for k=1:10000,
    A=sparse(sz(2), sz(1));
    for n=1:sz(1),
        A(1:n:end, n)=1;
    end
    B=diag(M*A);
end
toc

tic;
A=sparse(sz(2),sz(1));
for n=1:sz(1),
    A(1:n:end, n)=1;
end
for k=1:10000,
    B=diag(M*A);
end
toc

tic;
A=sparse(sz(1),sz(2));
for n=1:sz(1),
    A(n, 1:n:end)=1;
end
for k=1:10000,
    B=sum(M.*A,2);
end
toc

returns

Elapsed time is 0.552470 seconds.
Elapsed time is 2.409102 seconds.
Elapsed time is 0.638072 seconds.
Elapsed time is 0.052246 seconds.
Elapsed time is 0.061893 seconds.

for 30-by-1000 matrix

Elapsed time is 1.785664 seconds.
Elapsed time is 3.954034 seconds.
Elapsed time is 4.760436 seconds.
Elapsed time is 0.926118 seconds.
Elapsed time is 0.865330 seconds.

and for 1000-by-100 matrix

Elapsed time is 51.389322 seconds.
Elapsed time is 63.443414 seconds.
Elapsed time is 68.327187 seconds.
Elapsed time is 29.056304 seconds.
Elapsed time is 1.147215 seconds.


Answer 2:

Proposed solution

I was finally able to crack a truly vectorized solution that uses logical indexing to select the elements of the input matrix that are to be summed up. The magic is done by bsxfun with the function handle @mod. The code is listed below -

[m,n] = size(M);
mask = bsxfun(@mod,1:n,(1:m)')==1; % all of the magic happens here: this creates
                                   % a logical mask with 1s at the places in the
                                   % input matrix whose elements are to be summed,
                                   % and 0s elsewhere
mask(1,:) = 1; % mod(x,1) is always 0, so set the first row to all ones by hand
sumvals = sum(mask.*M,2); % finally get the sum values
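On MATLAB R2016b or newer, implicit expansion can replace the bsxfun call entirely; this equivalent sketch (not part of the original answer) assumes the same M:

```matlab
M = rand(10,100);              % example input
[m,n] = size(M);
mask = mod(1:n,(1:m)') == 1;   % implicit expansion builds the same m-by-n mask
mask(1,:) = 1;                 % mod(x,1) is 0, so fix up row 1
sumvals = sum(mask.*M,2);      % masked row sums
```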

Benchmarking

In this benchmarking section I include four approaches - the one listed earlier in this post, its GPU-ported version, and the arrayfun and sparse based approaches listed in the other solution.

Three sets of input data were used for the benchmarking -

  • Set 1: The number of columns is kept at 10x the number of rows in the input matrix, as used in the question.
  • Set 2: The number of rows is extended so that it is now 10x the number of columns. This really tests the loopy codes, as the number of iterations is larger in this case.
  • Set 3: An extension of Set 2 that increases the number of rows further, making it another good test of the truly vectorized approaches against the others.

The function codes used for benchmarking are listed next -

function sumvals = sumrows_stepped_bsxfun(M)
%// ... same as the code posted earlier
return

function sumvals = sumrows_stepped_bsxfun_gpu(M)
gM = gpuArray(M);
[m,n] = size(gM);
mask = bsxfun(@mod,gpuArray.colon(1,n),gpuArray.colon(1,m)')==1;
sumvals = gather(sum(mask.*gM,2));
sumvals(1) = sum(M(1,:));
return

function S = sumrows_stepped_arrayfun(M)
[m,n] = size(M);
S = arrayfun(@(x) sum(M(x,1:x:n)), 1:m);
return

function B = sumrows_stepped_sparse(M)
sz = size(M);
A=sparse(sz(1),sz(2));
for n=1:sz(1),
    A(n, 1:n:end)=1;
end
B=full(sum(M.*A,2));
return

Please note that timeit was used for timing CPU based codes and gputimeit for GPU based codes.

System Configuration used for testing -

MATLAB Version: 8.3.0.532 (R2014a)
Operating System: Windows 7
RAM: 3GB
CPU Model: Intel® Pentium® Processor E5400 (2M Cache, 2.70 GHz)
GPU Model: GTX 750Ti 2GB

The benchmarking results thus obtained - [benchmark result plots omitted]

Conclusions

  1. For data sizes where the number of rows is smaller than the number of columns, the number of iterations is small, and the loopy codes seem to have the upper hand.

  2. As we increase the number of rows, the benefit of the truly vectorized approaches becomes clear. You will also notice that the CPU bsxfun approach does well on Set 3 against the non-vectorized approaches up to around the 12000 x 300 mark. The reason is that bsxfun creates a huge logical mask, and at that point the memory bandwidth requirement becomes too high for bsxfun's compute capability to keep up with. This makes sense, as vectorized operations by definition act on many elements in one go, so memory bandwidth is essential. On a better machine with more RAM, that 12000 x 300 mark should extend further.

  3. If you could extend the number of rows further, the benefits of a vectorized solution would become even clearer, as long as memory bandwidth is kept in check.

Benchmarking Code

Finally here's the benchmarking code if anyone would like to test it out on their systems -

clear all; clc; close all

outputfile = 'results.xlsx';
delete(outputfile); %// remove the file so that new results can be written into it

base_datasize_array = 40:60:400;
methods = {'BSXFUN on GPU','BSXFUN on CPU','ARRAYFUN','SPARSE'};
num_approaches = numel(methods);
num_sets = 3;

timeall_all = zeros(num_approaches,numel(base_datasize_array),num_sets);
datasize_lbs = cell(numel(base_datasize_array),num_sets);
for set_id = 1:num_sets
    switch set_id
        case 1
            N1_arr = base_datasize_array*2;
            N2_arr = N1_arr*10;
        case 2
            N2_arr = base_datasize_array*2;
            N1_arr = N2_arr*10;
        case 3
            N2_arr = base_datasize_array;
            N1_arr = N2_arr*40;
    end

    timeall = zeros(num_approaches,numel(N1_arr));
    for iter = 1:numel(N1_arr)
        M = rand(N1_arr(iter),N2_arr(iter));

        f = @() sumrows_stepped_bsxfun_gpu(M);
        timeall(1,iter) = gputimeit(f); clear f

        f = @() sumrows_stepped_bsxfun(M);
        timeall(2,iter) = timeit(f); clear f

        f = @() sumrows_stepped_arrayfun(M);
        timeall(3,iter) = timeit(f); clear f

        f = @() sumrows_stepped_sparse(M);
        timeall(4,iter) = timeit(f); clear f

    end
    timeall_all(:,:,set_id) = timeall;

    wp = repmat({' '},numel(N1_arr),1);
    datasize_lbs(:,set_id) = strcat(cellstr(num2str(N1_arr.')),' x ',...
        wp,cellstr(num2str(N2_arr.')));
end

for set_id=1:num_sets
    out_cellarr = cell(numel(methods)+1,numel(N1_arr)+1);
    out_cellarr(1,1) = {'Methods'};
    out_cellarr(2:end,1) = methods;
    out_cellarr(1,2:end) = datasize_lbs(:,set_id);
    out_cellarr(2:end,2:end) = cellfun(@(x) num2str(x),...
        num2cell(timeall_all(:,:,set_id)),'Uni',0);
    xlswrite(outputfile, out_cellarr,set_id);
end


Answer 3:

Since there is an interesting performance effect in the sparse/matrix-multiplication approach, I will post some additional results:

M  = rand(1000,100);
sz = size(M);

% PLAIN LOOP
tic
out1 = zeros(sz(1),1);
for k = 1:10000
    for n = 1:sz(1)
        out1(n) = sum(M(n,1:n:100));
    end
end
toc

% SPARSE MATRIXMULT
tic
A = sparse(sz(2),sz(1)); % note: sparse(sz) would treat sz as data, not dimensions
for n = 1:sz(1)
    A(1:n:sz(2),n) = 1;
end
for k = 1:10000
    out2 = diag(M*A);
end
toc

isequal(out1,out2) % ok  

Plain loop:        11.441380 seconds.
Sparse/matrixmult: 27.503829 seconds.

As the matrix dimensions grow, the plain loop is the more efficient of the two.