
NaNs as key in dictionaries

Posted 2020-01-27 07:29

Question:

Can anyone explain the following behaviour to me?

>>> import numpy as np
>>> {np.nan: 5}[np.nan]
5
>>> {np.float64(np.nan): 5}[np.float64(np.nan)]
KeyError: nan

Why does it work in the first case, but not in the second? Additionally, I found that the following DOES work:

>>> a = np.float64(np.nan)
>>> {a: 5}[a]
5

Answer 1:

The problem here is that NaN is not equal to itself, as defined in the IEEE standard for floating point numbers:

>>> float("nan") == float("nan")
False

When a dictionary looks up a key, it roughly does this:

  1. Compute the hash of the key to be looked up.

  2. For each key in the dict with the same hash, check if it matches the key to be looked up. This check consists of

    a. Checking for object identity: If the key in the dictionary and the key to be looked up are the same object as indicated by the is operator, the key was found.

    b. If the first check failed, check for equality using the == operator, which calls the __eq__ method.
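
The two checks in step 2 can be sketched in plain Python. This is only a rough model of the matching rule, not CPython's actual implementation, and the helper name keys_match is made up for illustration:

```python
def keys_match(stored_key, lookup_key):
    """Rough model of how a dict decides that two same-hash keys match."""
    # a. Identity check: the very same object always matches.
    if stored_key is lookup_key:
        return True
    # b. Otherwise fall back to equality, which calls __eq__.
    return stored_key == lookup_key


nan1 = float("nan")
nan2 = float("nan")  # a second, distinct nan object

same_object = keys_match(nan1, nan1)        # identity check succeeds
different_objects = keys_match(nan1, nan2)  # identity fails, and nan != nan
print(same_object, different_objects)
```

A NaN therefore only matches itself when the lookup is done with the very same object that was stored.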

The first example succeeds because np.nan and np.nan are the same object, so it does not matter that they don't compare equal:

>>> np.nan is np.nan
True

In the second case, np.float64(np.nan) and np.float64(np.nan) are not the same object -- the two constructor calls create two distinct objects:

>>> np.float64(np.nan) is np.float64(np.nan)
False

Since the objects also do not compare equal, the dictionary concludes the key is not found and raises a KeyError.

You can even do this:

>>> a = float("nan")
>>> b = float("nan")
>>> {a: 1, b: 2}
{nan: 1, nan: 2}

In conclusion, it seems a saner idea to avoid NaN as a dictionary key.
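
If NaN keys cannot be avoided, one possible workaround is to map every NaN to a single shared object before it touches the dictionary, so that the identity check always succeeds. The following is a minimal sketch under that assumption; NAN_KEY and normalize_key are hypothetical names, not part of Python or NumPy:

```python
import math

# Hypothetical sentinel: one shared object that stands in for every NaN key.
NAN_KEY = float("nan")


def normalize_key(key):
    """Map any float NaN to the shared NAN_KEY; return other keys unchanged."""
    if isinstance(key, float) and math.isnan(key):
        return NAN_KEY
    return key


counts = {}
for value in [1.0, float("nan"), float("nan"), 2.0]:
    k = normalize_key(value)
    counts[k] = counts.get(k, 0) + 1

print(len(counts))                          # 3
print(counts[normalize_key(float("nan"))])  # 2
```

Both NaN values end up under the same key because every lookup goes through normalize_key first. Any code path that skips the helper would still create a separate, unreachable entry.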



Answer 2:

Please note this is not the case anymore in Python 3.6:

>>> d = float("nan") #object nan
>>> d
nan
>>> c = {"a": 3, d: 4}
>>> c["a"]
3
>>> c[d]
4

As I understand it:

c is a dictionary that maps "a" to 3 and nan to 4. The way Python 3.6 internally looks up a key in a dictionary has changed: it now compares the two pointers first, and if they point to the same object, the keys are considered equal. Otherwise it compares the hashes; if the hashes differ, the keys are treated as different. Only after that, if still necessary, are the keys compared "manually" with ==.

That means that although IEEE 754 specifies that NaN isn't equal to itself:

>>> d == d
False

When a key is looked up in a dictionary, the pointers are considered first, and because they point to the same nan object, the lookup returns 4.

Note also that:

>>> e = float("nan")
>>> e == d
False
>>> c[e]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan
>>> c[d]
4

So not every nan maps to 4, which means the IEEE 754 semantics are partly preserved. This was implemented because strictly respecting the rule that nan is never equal to itself would hurt efficiency far more than ignoring it: in previous versions you would be storing something in a dictionary that you could never access again.
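
When the original nan object is no longer at hand, the entry can still be reached by scanning the keys instead of looking one up directly. A small sketch, reusing the names c and d from the session above and assuming the keys of interest are plain floats:

```python
import math

d = float("nan")
c = {"a": 3, d: 4}

# A fresh nan is a different object, so a direct lookup with it fails.
fresh_nan_found = float("nan") in c

# Scanning the keys with math.isnan still finds the stored entry.
nan_values = [
    value
    for key, value in c.items()
    if isinstance(key, float) and math.isnan(key)
]
print(fresh_nan_found)  # False
print(nan_values)       # [4]
```

The scan is linear in the size of the dictionary, so it is a recovery technique rather than a replacement for a normal lookup.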



Tags: python numpy nan