Mean of columns with same label

2019-07-19 06:52发布

问题:

I have two vectors

data vector: A = [1 2 2 1 2 6; 2 3 2 3 3 5]
label vector: B = [1 2 1 2 3 NaN]

I want to take the mean of all columns that have the same label and output these as a matrix sorted by label number, ignoring NaNs. So, in this example I would want:

labelmean(A,B) = [1.5 1.5 2; 2 3 3]

This can be done with a for-loop like this.

function out = labelmean(data,label)
out=[];
for i=unique(label)
    if isnan(i); continue; end
    out = [out, mean(data(:,label==i),2)];
end 

However, I'm dealing with huge arrays containing many datapoints and labels. Additionally, this code snippet will be executed often. I'm wondering if there is a more efficient way to do this without looping over every individual label.

回答1:

Here's one approach:

  1. Get the indices of labels not containing NaNs.
  2. Create a sparse matrix of zeros and ones that multiplied by A would give the desired row sums.
  3. Divide that matrix by the sum of each column, so that the sums become averages.
  4. Apply matrix multiplication to get the result, and convert to a full matrix.

Code:

I = find(~isnan(B));                                 % step 1
t = sparse(I, B(I), 1, size(A,2), max(B(I)));        % step 2
t = bsxfun(@rdivide, t, sum(t,1));                   % step 3
result = full(A*t);                                  % step 4


回答2:

This would be a good case of using accumarray. Think of accumarray as a miniature MapReduce paradigm. There are keys and values and so the job of accumarray is to group all of the values that share the same key together and you do something with those values. In your case, the keys would be the elements in B but what the values are going to be are the row locations that you need for the corresponding values in B. Basically, for each value in B, the position in B tells you which row you need to access in A. Therefore, we simply need to grab all of the row locations that map to the same ID, access the rows of A, then find the mean over all rows. We need to be careful in that we ignore values that are NaN. We can filter these out before calling accumarray. The "something" that you do in accumarray traditionally should output a single number, but we are in fact outputting a column vector for each label. Therefore, a trick is to wrap the output into a cell array, then use cat combined with comma-separated lists to convert the output into a matrix.

As such, something like this should work:

% Sample data
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Find non-NaN locations
mask = ~isnan(B);

% Generate row locations that are not NaN as well as the labels
ind = 1 : numel(B);
Bf = B(mask).';
ind = ind(mask).';

% Find label-wise means
C = accumarray(Bf, ind, [], @(x) {mean(A(:,x), 2)});

% Convert to numeric matrix
out = cat(2, C{:});

If you don't like the use of a temporary variable for finding those non-NaN values, we can do this with less lines of code, but you will still need the vector of row indices to determine where we need to sample from:

% Sample data
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Solution
ind = 1 : numel(B);
C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
out = cat(2, C{:});

With your data, we get:

>> out

out =

    1.5000    1.5000    2.0000
    2.0000    3.0000    3.0000


回答3:

This answer is not a new method, but a benchmark of the given answers, because if you talk about performance, you always have to benchmark it.

clear all;
% I tried to make a real-life dataset (the original author may provide a
% better one)
A = [1:3e4; 1:10:3e5; 1:100:3e6]; % large dataset
B = repmat(1:1e3, 1, 3e1); % large number of labels

labelmean(A,B);
labelmeanLuisMendoA(A,B);
labelmeanLuisMendoB(A,B);
labelmeanRayryeng(A,B);

function out = labelmean(data,label)
    tic
    out=[];
    for i=unique(label)
        if isnan(i); continue; end
        out = [out, mean(data(:,label==i),2)];
    end
    toc
end

function out = labelmeanLuisMendoA(A,B)
    tic
    B2 = B(~isnan(B)); % remove NaN's
    t = full(sparse(1:numel(B2),B2,1,size(A,2),max(B2))); % template matrix
    out = A*t; % sum of columns that share a label
    out = bsxfun(@rdivide, out, sum(t,1)); % convert sum into mean
    toc
end

function out = labelmeanLuisMendoB(A,B)
    tic
    B2 = B(~isnan(B));                                   % step 1
    t = sparse(1:numel(B2), B2, 1, size(A,2), max(B2));  % step 2
    t = bsxfun(@rdivide, t, sum(t,1));                   % step 3
    out = full(A*t);                                  % step 4
    toc
end

function out = labelmeanRayryeng(A,B)
    tic
    ind = 1 : numel(B);
    C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
    out = cat(2, C{:});
    toc
end

The output is:

Elapsed time is 0.080415 seconds. % original
Elapsed time is 0.088427 seconds. % LuisMendo original answer
Elapsed time is 0.004223 seconds. % LuisMendo optimised version
Elapsed time is 0.037347 seconds. % rayryeng answer

For this dataset LuisMendo optimised version is the clear winner, whereas his first version was slower than the original one.

=> Don't forget to benchmark your performance!

EDIT: Test platform specifications

  • Matlab R2016b
  • Ubuntu 64-bit
  • 15.6 GiB RAM
  • Intel® Core™ i7-5600U CPU @ 2.60GHz × 4