Indeterministic sets in Python 2 and 3

2020-05-07 06:56发布

Python 2

Sets are collections of unordered values. If I construct a set via a set literal, e.g.

s = {'a', 'b', 'c'}

and then print it, I get the elements in some scrambled order. However, it seems that in Python 2.7, the above example always results in the same ordering:

print(s)  # set(['a', 'c', 'b']) in Python 2.7

How does Python 2.7 decide on this ordering? Even the hashes of 'a', 'b' and 'c' are not in the order produced.

Python 3

In Python 3.x (including 3.6 where dict keys are ordered) the resulting order seems to be random, though always the same within a given Python process. That is, repeatedly re-building the set literal always lead to the same ordering, as long as I do not restart the Python interpreter.

To check the ordering across multiple Python processes, consider the bash code

(for _ in {1..50}; do python3 -c "s = {'a', 'b', 'c'}; print(s)"; done) | sort -u

This will (most often) show the 6 different ways the 3 elements can be arranged. Switching out python3 with python(2), we only see the ordering ['a', 'c', 'b']. What determines the ordering in Python 3?

I see that the hash value of objects are deterministic in Python 2 while random (though constant within a Python process) in Python 3. I'm sure this is key to the full explanation.

Edit

As deceze writes in his comment, I would like to know if Python explicitly does something just to achieve this randomization, or if it happens "for free".

1条回答
做个烂人
2楼-- · 2020-05-07 07:31

The reason for the difference in Python 3 (from Python 3.3 onwards) is that hash randomization is enabled by default, you could turn this off by setting the PYTHONHASHSEED environmental variable to a fixed value:

$ export PYTHONHASHSEED=0
$ (for _ in {1..50}; do python3  -c "s = {'a', 'b', 'c'}; print(s)"; done) | sort -u
{'a', 'b', 'c'}

Equally you can turn hash randomization on in Python 2 with the -R flag:

$ (for _ in {1..50}; do python2 -R -c "s = {'a', 'b', 'c'}; print(s)"; done) | sort -u
set(['a', 'b', 'c'])
set(['a', 'c', 'b'])
set(['b', 'c', 'a'])
set(['c', 'b', 'a'])

Note, you don't generally want to turn it off since having hash randomization enabled helps protect against certain denial-of-service attacks.

查看更多
登录 后发表回答