set issubset performance difference depending on t

2020-04-08 13:24发布

why this question ?

I was trying to answer this question: Check if All Values Exist as Keys in Dictionary with something better than a generator comprehension fed to all (python loops, even in comprehensions, slow down execution compared to implicit loops that some functions perform):

all(i in bar for i in foo)

where bar is a dictionary and foo is a list by using set.issubset (with a conversion to set of foo to be able to use foo.issubset(bar)), and didn't succeed to beat the times of the all solution (unless both containers are converted to sets).

my question:

From the documentation of set:

Note, the non-operator versions of union(), intersection(), difference(), and symmetric_difference(), issubset(), and issuperset() methods will accept any iterable as an argument. In contrast, their operator based counterparts require their arguments to be sets. This precludes error-prone constructions like set('abc') & 'cbs' in favor of the more readable set('abc').intersection('cbs').

Okay but the performance really depends on the type of argument, even if the complexity does not (The complextiy of Python issubset()):

import timeit
foo = {i for i in range(1, 10000, 2)}
bar = foo - {400}
n=10000
x = timeit.timeit(setup="foo = {str(i) for i in range(1, 10000, 2)};bar = foo - {'400'}",stmt="bar.issubset(foo)",number=n)
print("issubset(set)",x)
x = timeit.timeit(setup="foo = {str(i) for i in range(1, 10000, 2)};bar = foo - {'400'};foo=list(foo)",stmt="bar.issubset(foo)",number=n)
print("issubset(list)",x)
x = timeit.timeit(setup="foo = {str(i):i for i in range(1, 10000, 2)};bar = set(foo) - {'400'}",stmt="bar.issubset(foo)",number=n)
print("issubset(dict)",x)
x = timeit.timeit(setup="foo = {str(i):i for i in range(1, 10000, 2)}.keys();bar = set(foo) - {'400'}",stmt="bar.issubset(foo)",number=n)
print("issubset(dict_keys)",x)

my results (Python 3.4):

issubset(set) 1.6141405847648826
issubset(list) 3.698748032058883
issubset(dict) 3.6300025109004244
issubset(dict_keys) 4.224299651223102

So if a set is passed as the argument, the result is very fast.

Using a list is much slower. I figured out that it was because of the hash that must be done on the strings is costly. So I changed my test inputs with integers like this:

foo = {i for i in range(1, 10000, 2)}
bar = foo - {400}

and the results were globally faster but still a huge time difference:

issubset(set) 0.5981848205989139
issubset(list) 1.7991591232742143
issubset(dict) 1.889119736960271
issubset(dict_keys) 2.2531574114632678

I also tried to change dict by dict.keys() as in python 3 the keys is said to be (https://www.python.org/dev/peps/pep-3106/) "a set-like or unordered container object".

But in that case, the result is even worse than with dict or list.

So why does passing a set beats passing a list or a dict or a dict_keys object? I don't see anything mentionned in the documentation about this.

1条回答
地球回转人心会变
2楼-- · 2020-04-08 14:08

The set.issubset algorithm requires a set to work with (frozensets and subclasses count); if you pass it something else, it will make a set. It's basically all(elem in other for elem in self), and it needs to know that elem in other is efficient and means what it means for sets. The only way it knows how to guarantee that is to ensure other is a set. Making a set is expensive.

(I've glossed over some details. If you want to know exactly what's going on, particularly if you have a weird set subclass, read the source code in the link.)

查看更多
登录 后发表回答