>>> import numpy as np
>>> from scipy import stats
>>> a = np.r_[1., 2., np.nan, 4., 5.]
>>> stats.nanmean(a)
2.9999999999999996
>>> np.nansum(a)/np.sum(~np.isnan(a))
3.0
I'm aware of the limitations of floating-point representation. I'm just curious why the clumsier expression seems to give a "better" result.
First of all, here is scipy.nanmean() so that we know what we're comparing to:
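(What follows is a simplified sketch of what stats.nanmean did at the time; input-checking details are omitted and exact code may vary between scipy versions, but the factor computation and the final division are the lines discussed in the answers below.)

def nanmean(x, axis=0):
    # Work on a copy so the caller's array is not modified.
    x = np.asarray(x).copy()
    Norig = x.shape[axis]
    # factor = fraction of entries along the axis that are not NaN.
    factor = 1.0 - np.sum(np.isnan(x), axis) * 1.0 / Norig
    # Zero out the NaNs, take an ordinary mean, then rescale by the factor.
    x[np.isnan(x)] = 0
    return np.mean(x, axis) / factor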
Mathematically, the two methods are equivalent. Numerically, they are different.
Your method involves a single division, and it so happens that:
- the numerator (1. + 2. + 4. + 5.) can be represented exactly as a float; and
- the denominator (4.) is a power of two.

This means that the result of the division is exact, 3.
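To see this concretely:

>>> 1. + 2. + 4. + 5.
12.0
>>> (1. + 2. + 4. + 5.) / 4.
3.0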
stats.nanmean() involves first computing the mean of [1., 2., 0., 4., 5.], and then adjusting it to account for NaNs. As it happens, this mean (2.4) cannot be represented exactly as a float, so from this point on the computation is inexact.
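Concretely, the intermediate values look like this (here the adjustment factor works out to 1 - 1/5):

>>> 12.0 / 5                        # mean of the zero-filled array
2.4
>>> (12.0 / 5) / (1.0 - 1.0 / 5)    # rescaled by the factor; off in the last bit
2.9999999999999996

which is exactly the value stats.nanmean() returned above.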
I haven't given it a lot of thought, but it may be possible to construct an example where the roles would be reversed, and stats.nanmean() would give a more accurate result than the other method.

What surprises me is that stats.nanmean() doesn't simply do something like:
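(presumably an expression along the lines of the question's own one-liner, applied along the axis, for instance:)

np.nansum(x, axis) / np.sum(~np.isnan(x), axis)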
This seems to me to be a superior approach to what it does currently.
The answer is in the code of stats.nanmean:
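(the relevant line, from the implementation sketched in the first answer:)

factor = 1.0 - np.sum(np.isnan(x), axis) * 1.0 / Norig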
I believe it has something to do with the 1.0 - np.sum, a subtraction of the sum.

As @eumiro mentioned, stats.nanmean calculates the mean in a roundabout way, different from the straightforward one-liner you used.
From the same reference code, np.sum(np.isnan(x), axis) returns a numpy.int32, which, when multiplied by 1.0, becomes a floating-point approximation rather than the exact integer it would otherwise have been, and that is where the difference in the result comes from.
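For instance, the per-NaN correction term that this multiplication and division produce (here 1 * 1.0 / 5) is already inexact:

>>> from decimal import Decimal
>>> Decimal(1 * 1.0 / 5)   # the float closest to 0.2, written out exactly
Decimal('0.200000000000000011102230246251565404236316680908203125')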