Fastest way to find rows without NaNs in Matlab

2019-06-17 18:52发布

问题:

I would like to find the indexes of rows without any NaN in the fastest way possible since I need to do it thousands of times. So far I have tried the following two approaches:

find(~isnan(sum(data, 2)));
find(all(~isnan(data), 2));

Is there a clever way to speed this up or is this the best possible? The dimension of the data matrix is usually thousands by hundreds.

回答1:

Edit: matrix multiplication can be faster than sum, so the operation is almost twice faster for matrices above 500 x500 elements (in my Matlab 2012a machine). So my solution is:

find(~isnan(data*zeros(size(data,2),1)))

Out of the two methods you suggested (denoted f and g) in the question the first is faster (using timeit):

data=rand(4000);
nani=randi(numel(data),1,500);
data(nani)=NaN;
f= @() find(~isnan(sum(data, 2)));
g= @() find(all(~isnan(data), 2));
h= @() find(~isnan(data*zeros(size(data,2),1)));

timeit(f) 
ans =
     0.0263

timeit(g)
ans =
     0.1489

timeit(h)
ans =
     0.0146


回答2:

If the nan density is high enough, then a double loop will be the fastest method. This is because the search of a row can be discarded as soon as the first nan is found. For example, consider the following speed test:

%# Preallocate some parameters
T = 5000; %# Number of rows
N = 500; %# Number of columns
X = randi(5, T, N); %# Sample data matrix
M = 100; %# Number of simulation iterations
X(X == 1) = nan; %# Randomly set some elements of X to nan

%# Your first method
tic
for m = 1:M
    Soln1 = find(~isnan(sum(X, 2)));
end
toc

%# Your second method
tic
for m = 1:M
    Soln2 = find(all(~isnan(X), 2));
end
toc

%# A double loop
tic
for m = 1:M
    Soln3 = ones(T, 1);
    for t = 1:T
        for n = 1:N
            if isnan(X(t, n))
                Soln3(t) = 0;
                break
            end
        end
    end
    Soln3 = find(Soln3);
end
toc

The results are:

Elapsed time is 0.164880 seconds.
Elapsed time is 0.218950 seconds.
Elapsed time is 0.068168 seconds. %# The double loop method

Of course, the nan density is so high in this simulation that none of the rows are nan free. But you never said anything about the nan density of your matrix, so I figured I'd post this answer for general consumption and contemplation :-)



回答3:

Can you tell more about what you want to do with the indices

time = cputime;  
    A = rand(1000,100);              % Some matrix data
    for i = 1:100  
        A(randi(20,1,100)) = NaN;    % Randomly assigned NaN  
        B = isnan(A);                % B has 0 and 1  
        C = A(B == 0);               % C has all ~NaN elements
        ind(i,:) = find(B == 1);     % ind has all NaN indices
    end
    disp(cputime-time)

for 100 times in a loop, 0.1404 sec



回答4:

any() is faster than all() or sum(). try:

idx = find(~any(isnan(data), 2));

correction: it seems that sum() approach is faster:

idx = find(~isnan(sum(data, 2)));