I would like to find the indexes of rows without any NaN in the fastest way possible since I need to do it thousands of times. So far I have tried the following two approaches:
find(~isnan(sum(data, 2)));
find(all(~isnan(data), 2));
Is there a clever way to speed this up or is this the best possible? The dimension of the data matrix is usually thousands by hundreds.
Edit:
Matrix multiplication can be faster than sum, so this version is almost twice as fast for matrices larger than about 500 x 500 elements (on my machine, running MATLAB 2012a). So my solution is:
find(~isnan(data*zeros(size(data,2),1)))
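A small sketch of why the multiplication trick works, and where the shortcuts can diverge from the all(~isnan(...)) version. Under IEEE 754 arithmetic NaN*0 is NaN, but so is Inf*0, and Inf + (-Inf) is NaN as well. The matrix below is made up for illustration and the variable names are mine, not from the posts in this thread.

```matlab
% Made-up test matrix: row 2 has a NaN, row 3 has an Inf,
% row 4 has both Inf and -Inf but no NaN.
data = [1   2    3;
        4   NaN  6;
        7   Inf  9;
        Inf -Inf 1];

% Reference: tests every element directly, so Inf is not mistaken for NaN.
viaAll = find(all(~isnan(data), 2));     % rows 1, 3 and 4

% Sum shortcut: Inf + (-Inf) is NaN, so row 4 is dropped as well.
viaSum = find(~isnan(sum(data, 2)));     % rows 1 and 3

% Multiplication shortcut: relies on NaN*0 being NaN. Since Inf*0 is also
% NaN under IEEE rules, rows 3 and 4 would be expected to drop out too.
viaMul = find(~isnan(data * zeros(size(data, 2), 1)));

disp(viaAll.'); disp(viaSum.'); disp(viaMul.');
```

So the sum and multiplication shortcuts look safe only when data is known to contain no Inf values; if it might, the all/any form is the one that answers the question as asked.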
Of the two methods you suggested in the question (denoted f and g below), the first is faster (measured with timeit):
data=rand(4000);
nani=randi(numel(data),1,500);
data(nani)=NaN;
f= @() find(~isnan(sum(data, 2)));
g= @() find(all(~isnan(data), 2));
h= @() find(~isnan(data*zeros(size(data,2),1)));
timeit(f)
ans =
0.0263
timeit(g)
ans =
0.1489
timeit(h)
ans =
0.0146
If the NaN density is high enough, a double loop will be the fastest method, because the scan of a row can be abandoned as soon as the first NaN is found. For example, consider the following speed test:
%# Preallocate some parameters
T = 5000; %# Number of rows
N = 500; %# Number of columns
X = randi(5, T, N); %# Sample data matrix
M = 100; %# Number of simulation iterations
X(X == 1) = nan; %# Randomly set some elements of X to nan
%# Your first method
tic
for m = 1:M
    Soln1 = find(~isnan(sum(X, 2)));
end
toc
%# Your second method
tic
for m = 1:M
    Soln2 = find(all(~isnan(X), 2));
end
toc
%# A double loop
tic
for m = 1:M
    Soln3 = ones(T, 1);
    for t = 1:T
        for n = 1:N
            if isnan(X(t, n))
                Soln3(t) = 0;
                break
            end
        end
    end
    Soln3 = find(Soln3);
end
toc
The results are:
Elapsed time is 0.164880 seconds.
Elapsed time is 0.218950 seconds.
Elapsed time is 0.068168 seconds. %# The double loop method
Of course, the NaN density is so high in this simulation that none of the rows are NaN-free. But you never said anything about the NaN density of your matrix, so I figured I'd post this answer for general consumption and contemplation :-)
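To put a number on that: in the simulation above each element becomes NaN with probability 1/5 (the randi(5, ...) == 1 case), so, assuming the elements are independent, the chance that a row of N = 500 columns is NaN-free is (4/5)^500. A quick sketch of the arithmetic (the variable names p, pRowClean and expectedClean are mine):

```matlab
T = 5000;            % number of rows, as in the speed test
N = 500;             % number of columns, as in the speed test
p = 1/5;             % probability that randi(5) returns 1, i.e. the element becomes NaN

pRowClean     = (1 - p)^N;      % probability that a single row contains no NaN
expectedClean = T * pRowClean;  % expected number of NaN-free rows in the matrix

fprintf('P(row is NaN-free)     = %.3g\n', pRowClean);
fprintf('Expected NaN-free rows = %.3g\n', expectedClean);
```

pRowClean comes out on the order of 1e-49, which is why no row survives in that test. With a sparse NaN pattern the early exit would rarely trigger, so it is probably worth estimating your own density first, for example with nnz(isnan(X)) / numel(X), before choosing the loop.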
Can you say more about what you want to do with the indices?
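time = cputime;
A = rand(1000,100);               % Some matrix data
ind = cell(100,1);                % The NaN count can change between iterations, so use a cell array
for i = 1:100
    A(randi(20,1,100)) = NaN;     % Randomly assign NaN among the first 20 elements
    B = isnan(A);                 % B is a logical mask, true where A is NaN
    C = A(~B);                    % C has all non-NaN elements
    ind{i} = find(B);             % Linear indices of the NaN elements
end
disp(cputime-time)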
Running the loop 100 times takes 0.1404 s.
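If what you have are linear indices of the NaN elements, as find(isnan(A)) returns, the NaN-free rows asked about in the question can be recovered from them. A minimal sketch, using a made-up 6 x 4 matrix and variable names of my own choosing:

```matlab
A = rand(6, 4);              % made-up example data
A([2 9 17]) = NaN;           % linear indices 2, 9 and 17 fall in rows 2, 3 and 5

nanIdx = find(isnan(A));                      % linear indices of the NaN elements
[nanRows, ~] = ind2sub(size(A), nanIdx);      % row subscript of each NaN element
cleanRows = setdiff((1:size(A, 1))', nanRows); % rows that contain no NaN

disp(cleanRows.');
```

This is only meant to show the relationship between the two kinds of index; for speed, the direct row-wise tests elsewhere in this thread are likely the better choice.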
any() is faster than all() or sum(). Try:
idx = find(~any(isnan(data), 2));
Correction: it seems that the sum() approach is faster:
idx = find(~isnan(sum(data, 2)));
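Since the ranking seems to depend on the machine, the MATLAB version and the data, it may be easiest to time the candidates on your own matrices. A rough harness, assuming a 2000 x 200 test matrix with a few hundred NaNs; the sizes, repetition count and variable names are arbitrary choices of mine:

```matlab
data = rand(2000, 200);                       % made-up test data
data(randi(numel(data), 1, 200)) = NaN;       % scatter some NaNs at random positions

names      = {'any', 'all', 'sum'};
candidates = {@() find(~any(isnan(data), 2)), ...
              @() find(all(~isnan(data), 2)), ...
              @() find(~isnan(sum(data, 2)))};

reps    = 20;                                 % timed repetitions per candidate
elapsed = zeros(1, numel(candidates));
results = cell(1, numel(candidates));

for k = 1:numel(candidates)
    results{k} = candidates{k}();             % warm-up call, result kept for comparison
    tic
    for r = 1:reps
        tmp = candidates{k}();                % timed call, result discarded
    end
    elapsed(k) = toc / reps;                  % mean wall-clock time per call
    fprintf('%s: %.6f s per call\n', names{k}, elapsed(k));
end
```

All three should return the same indices here because rand never produces Inf; the timings themselves will vary from run to run, so treat small differences with caution.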