Is there an idiomatic way to compare two NumPy arrays that treats NaNs as equal to each other (but not equal to anything other than a NaN)?
For example, I want the following two arrays to compare equal:
np.array([1.0, np.NAN, 2.0])
np.array([1.0, np.NAN, 2.0])
and the following two arrays to compare unequal:
np.array([1.0, np.NAN, 2.0])
np.array([1.0, 0.0, 2.0])
I am looking for a method that would produce a scalar Boolean outcome.
The following would do it:
np.all((a == b) | (np.isnan(a) & np.isnan(b)))
but it's clunky and creates all those intermediate arrays.
Is there a way that's easier on the eye and makes better use of memory?
P.S. If it helps, the arrays are known to have the same shape and dtype.
If you really care about memory use (e.g. you have very large arrays), use numexpr. The following expression will work for you:
np.all(numexpr.evaluate('(a==b)|((a!=a)&(b!=b))'))
I've tested it on very large arrays (length 3e8), and on my machine the code performs the same as
np.all(a==b)
and uses the same amount of memory.
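For anyone who wants to try this, here is a minimal sketch that wraps the expression in a function. The name nan_equal is my own, and the plain-NumPy fallback for when numexpr isn't installed is an addition, not part of the answer above. The arrays are passed through numexpr's local_dict argument so the expression doesn't depend on what happens to be in the calling scope.

```python
import numpy as np

try:
    import numexpr  # optional third-party dependency
except ImportError:
    numexpr = None


def nan_equal(a, b):
    """Hypothetical helper: True if a and b match elementwise, NaN == NaN."""
    expr = "(a == b) | ((a != a) & (b != b))"  # x != x is True only for NaN
    if numexpr is not None:
        # numexpr evaluates the whole expression in chunks, so the
        # intermediate boolean arrays are never fully materialised.
        mask = numexpr.evaluate(expr, local_dict={"a": a, "b": b})
    else:
        # Fallback (assumption): same logic in plain NumPy, with temporaries.
        mask = (a == b) | ((a != a) & (b != b))
    return bool(np.all(mask))


x = np.array([1.0, np.nan, 2.0])
y = np.array([1.0, np.nan, 2.0])
z = np.array([1.0, 0.0, 2.0])
print(nan_equal(x, y), nan_equal(x, z))
```

Note that the result of evaluate is still one full-size boolean array, so this saves the temporaries but not the final mask.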
Disclaimer: I don't recommend this for regular use, and I wouldn't use it myself, but I could imagine rare circumstances under which it might be useful.
If the arrays have the same shape and dtype, you could consider using the low-level memoryview:
>>> import numpy as np
>>>
>>> a0 = np.array([1.0, np.NAN, 2.0])
>>> ac = a0 * (1+0j)
>>> b0 = np.array([1.0, np.NAN, 2.0])
>>> b1 = np.array([1.0, np.NAN, 2.0, np.NAN])
>>> c0 = np.array([1.0, 0.0, 2.0])
>>>
>>> memoryview(a0)
<memory at 0x85ba1bc>
>>> memoryview(a0) == memoryview(a0)
True
>>> memoryview(a0) == memoryview(ac) # equal but different dtype
False
>>> memoryview(a0) == memoryview(b0) # hooray!
True
>>> memoryview(a0) == memoryview(b1)
False
>>> memoryview(a0) == memoryview(c0)
False
But beware of subtle problems like this:
>>> zp = np.array([0.0])
>>> zm = -1*zp
>>> zp
array([ 0.])
>>> zm
array([-0.])
>>> zp == zm
array([ True], dtype=bool)
>>> memoryview(zp) == memoryview(zm)
False
which happens because the binary representations differ even though the values compare equal (they have to differ, of course: that's how NumPy knows to print the negative sign):
>>> memoryview(zp)[0]
'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> memoryview(zm)[0]
'\x00\x00\x00\x00\x00\x00\x00\x80'
On the bright side, it short-circuits the way you might hope it would:
In [47]: a0 = np.arange(10**7)*1.0
In [48]: a0[-1] = np.NAN
In [49]: b0 = np.arange(10**7)*1.0
In [50]: b0[-1] = np.NAN
In [51]: timeit memoryview(a0) == memoryview(b0)
10 loops, best of 3: 31.7 ms per loop
In [52]: c0 = np.arange(10**7)*1.0
In [53]: c0[0] = np.NAN
In [54]: d0 = np.arange(10**7)*1.0
In [55]: d0[0] = 0.0
In [56]: timeit memoryview(c0) == memoryview(d0)
100000 loops, best of 3: 2.51 us per loop
and for comparison:
In [57]: timeit np.all((a0 == b0) | (np.isnan(a0) & np.isnan(b0)))
1 loops, best of 3: 296 ms per loop
In [58]: timeit np.all((c0 == d0) | (np.isnan(c0) & np.isnan(d0)))
1 loops, best of 3: 284 ms per loop
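A caveat for current readers: the session above looks like it was recorded on Python 2 (the byte strings printed for memoryview(zp)[0] suggest so). As far as I know, since Python 3.3 memoryview equality compares the unpacked element values rather than the raw memory, so NaNs would again compare unequal there. A rough sketch of the same byte-level idea on Python 3 is to cast both views to unsigned bytes first. The helper name same_bytes is made up for illustration, and it assumes C-contiguous, non-empty arrays.

```python
import numpy as np


def same_bytes(a, b):
    """Hypothetical helper: True if dtype, shape, and raw bytes all match.

    Assumes both arrays are C-contiguous and non-empty; cast() raises
    otherwise.
    """
    if a.dtype != b.dtype or a.shape != b.shape:
        return False
    # cast("B") reinterprets the buffer as unsigned bytes without copying,
    # so the comparison looks at raw memory rather than float values.
    return memoryview(a).cast("B") == memoryview(b).cast("B")


a0 = np.array([1.0, np.nan, 2.0])
b0 = np.array([1.0, np.nan, 2.0])
c0 = np.array([1.0, 0.0, 2.0])
zp = np.array([0.0])
zm = -1 * zp  # negative zero: equal in value, different bit pattern

print(same_bytes(a0, b0), same_bytes(a0, c0), same_bytes(zp, zm))
```

The same caveats apply as before: 0.0 and -0.0 come out unequal, and two NaNs with different bit patterns would too.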
NumPy 1.10 added the equal_nan keyword to np.allclose (https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html).
So you can now do:
In [24]: np.allclose(np.array([1.0, np.NAN, 2.0]),
    ...:             np.array([1.0, np.NAN, 2.0]), equal_nan=True)
Out[24]: True
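One thing to keep in mind is that np.allclose compares within a tolerance (rtol and atol), so it is not an exact equality test. If exact equality matters, np.array_equal also accepts equal_nan (added in NumPy 1.19, as far as I recall). A small sketch of the difference, with made-up example arrays:

```python
import numpy as np

a = np.array([1.0, np.nan, 2.0])
b = np.array([1.0, np.nan, 2.0])
c = np.array([1.0, 0.0, 2.0])
d = np.array([1.0 + 1e-12, np.nan, 2.0])  # differs from a by a tiny amount

# Tolerance-based comparison: NaNs match, and so do tiny differences.
close_ab = np.allclose(a, b, equal_nan=True)
close_ac = np.allclose(a, c, equal_nan=True)
close_ad = np.allclose(a, d, equal_nan=True)

# Exact comparison: NaNs match, but any other difference is a mismatch.
exact_ab = np.array_equal(a, b, equal_nan=True)
exact_ad = np.array_equal(a, d, equal_nan=True)

print(close_ab, close_ac, close_ad, exact_ab, exact_ad)
```

Both calls return a single Boolean, which is what the question asks for.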
Not sure this is any better, but a thought...
import numpy

class FloatOrNaN(numpy.float64):
    # Two NaNs compare equal; anything else uses the normal float comparison.
    def __eq__(self, other):
        return (numpy.isnan(self) and numpy.isnan(other)) or super().__eq__(other)

a = [1., numpy.nan, 2.]
# dtype=object keeps the FloatOrNaN instances, so == calls our __eq__ per element
one = numpy.array([FloatOrNaN(val) for val in a], dtype=object)
two = numpy.array([FloatOrNaN(val) for val in a], dtype=object)
print(one == two)  # every element compares True, including the NaN
This pushes the ugliness into the dtype, at the expense of moving the inner workings from C to Python (Cython etc. would fix this). It does, however, greatly reduce memory costs.
Still kinda ugly though :(