Checking for NaN presence in a container

2019-01-23 16:12发布

站内文章 / Python

42 0

迷人小祖宗

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

NaN is handled perfectly when I check for its presence in a list or a set. But I don't understand how. [UPDATE: no it's not; it is reported as present if the identical instance of NaN is found; if only non-identical instances of NaN are found, it is reported as absent.]

I thought presence in a list is tested by equality, so I expected NaN to not be found since NaN != NaN.
hash(NaN) and hash(0) are both 0. How do dictionaries and sets tell NaN and 0 apart?
Is it safe to check for NaN presence in an arbitrary container using in operator? Or is it implementation dependent?

My question is about Python 3.2.1; but if there are any changes existing/planned in future versions, I'd like to know that too.

NaN = float('nan')
print(NaN != NaN) # True
print(NaN == NaN) # False

list_ = (1, 2, NaN)
print(NaN in list_) # True; works fine but how?

set_ = {1, 2, NaN}
print(NaN in set_) # True; hash(NaN) is some fixed integer, so no surprise here
print(hash(0)) # 0
print(hash(NaN)) # 0
set_ = {1, 2, 0}
print(NaN in set_) # False; works fine, but how?

Note that if I add an instance of a user-defined class to a list, and then check for containment, the instance's __eq__ method is called (if defined) - at least in CPython. That's why I assumed that list containment is tested using operator ==.

EDIT:

Per Roman's answer, it would seem that __contains__ for list, tuple, set, dict behaves in a very strange way:

def __contains__(self, x):
  for element in self:
    if x is element:
      return True
    if x == element:
      return True
  return False

I say 'strange' because I didn't see it explained in the documentation (maybe I missed it), and I think this is something that shouldn't be left as an implementation choice.

Of course, one NaN object may not be identical (in the sense of id) to another NaN object. (This not really surprising; Python doesn't guarantee such identity. In fact, I never saw CPython share an instance of NaN created in different places, even though it shares an instance of a small number or a short string.) This means that testing for NaN presence in a built-in container is undefined.

This is very dangerous, and very subtle. Someone might run the very code I showed above, and incorrectly conclude that it's safe to test for NaN membership using in.

I don't think there is a perfect workaround to this issue. One, very safe approach, is to ensure that NaN's are never added to built-in containers. (It's a pain to check for that all over the code...)

Another alternative is watch out for cases where in might have NaN on the left side, and in such cases, test for NaN membership separately, using math.isnan(). In addition, other operations (e.g., set intersection) need to also be avoided or rewritten.

回答1:

Question #1: why is NaN found in a container when it's an identical object.

From the documentation:

For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).

This is precisely what I observe with NaN, so everything is fine. Why this rule? I suspect it's because a dict/set wants to honestly report that it contains a certain object if that object is actually in it (even if __eq__() for whatever reason chooses to report that the object is not equal to itself).

Question #2: why is the hash value for NaN the same as for 0?

From the documentation:

Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. hash() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

Note that the requirement is only in one direction; objects that have the same hash do not have to be equal! At first I thought it's a typo, but then I realized that it's not. Hash collisions happen anyway, even with default __hash__() (see an excellent explanation here). The containers handle collisions without any problem. They do, of course, ultimately use the == operator to compare elements, hence they can easily end up with multiple values of NaN, as long as they are not identical! Try this:

>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> d = {}
>>> d[nan1] = 1
>>> d[nan2] = 2
>>> d[nan1]
1
>>> d[nan2]
2

So everything works as documented. But... it's very very dangerous! How many people knew that multiple values of NaN could live alongside each other in a dict? How many people would find this easy to debug?..

I would recommend to make NaN an instance of a subclass of float that doesn't support hashing and hence cannot be accidentally added to a set/dict. I'll submit this to python-ideas.

Finally, I found a mistake in the documentation here:

For user-defined classes which do not define __contains__() but do define __iter__(), x in y is true if some value z with x == z is produced while iterating over y. If an exception is raised during the iteration, it is as if in raised that exception.

Lastly, the old-style iteration protocol is tried: if a class defines __getitem__(), x in y is true if and only if there is a non-negative integer index i such that x == y[i], and all lower integer indices do not raise IndexError exception. (If any other exception is raised, it is as if in raised that exception).

You may notice that there is no mention of is here, unlike with built-in containers. I was surprised by this, so I tried:

>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> class Cont:
...   def __iter__(self):
...     yield nan1
...
>>> c = Cont()
>>> nan1 in c
True
>>> nan2 in c
False

As you can see, the identity is checked first, before == - consistent with the built-in containers. I'll submit a report to fix the docs.

回答2:

I can't repro you tuple/set cases using float('nan') instead of NaN.

So i assume that it worked only because id(NaN) == id(NaN), i.e. there is no interning for NaN objects:

>>> NaN = float('NaN')
>>> id(NaN)
34373956456
>>> id(float('NaN'))
34373956480

And

>>> NaN is NaN
True
>>> NaN is float('NaN')
False

I believe tuple/set lookups has some optimization related to comparison of the same objects.

Answering your question - it seam to be unsafe to relay on in operator while checking for presence of NaN. I'd recommend to use None, if possible.

Just a comment. __eq__ has nothing to do with is statement, and during lookups comparison of objects' ids seem to happen prior to any value comparisons:

>>> class A(object):
...     def __eq__(*args):
...             print '__eq__'
...
>>> A() == A()
__eq__          # as expected
>>> A() is A()
False           # `is` checks only ids
>>> A() in [A()]
__eq__          # as expected
False
>>> a = A()
>>> a in [a]
True            # surprise!