why this question ?
I was trying to answer this question: Check if All Values Exist as Keys in Dictionary with something better than a generator comprehension fed to all
(python loops, even in comprehensions, slow down execution compared to implicit loops that some functions perform):
all(i in bar for i in foo)
where bar
is a dictionary and foo
is a list by using set.issubset
(with a conversion to set
of foo
to be able to use foo.issubset(bar)
), and didn't succeed to beat the times of the all
solution (unless both containers are converted to set
s).
my question:
From the documentation of set
:
Note, the non-operator versions of union(), intersection(), difference(), and symmetric_difference(), issubset(), and issuperset() methods will accept any iterable as an argument. In contrast, their operator based counterparts require their arguments to be sets. This precludes error-prone constructions like set('abc') & 'cbs' in favor of the more readable set('abc').intersection('cbs').
Okay but the performance really depends on the type of argument, even if the complexity does not (The complextiy of Python issubset()):
import timeit
foo = {i for i in range(1, 10000, 2)}
bar = foo - {400}
n=10000
x = timeit.timeit(setup="foo = {str(i) for i in range(1, 10000, 2)};bar = foo - {'400'}",stmt="bar.issubset(foo)",number=n)
print("issubset(set)",x)
x = timeit.timeit(setup="foo = {str(i) for i in range(1, 10000, 2)};bar = foo - {'400'};foo=list(foo)",stmt="bar.issubset(foo)",number=n)
print("issubset(list)",x)
x = timeit.timeit(setup="foo = {str(i):i for i in range(1, 10000, 2)};bar = set(foo) - {'400'}",stmt="bar.issubset(foo)",number=n)
print("issubset(dict)",x)
x = timeit.timeit(setup="foo = {str(i):i for i in range(1, 10000, 2)}.keys();bar = set(foo) - {'400'}",stmt="bar.issubset(foo)",number=n)
print("issubset(dict_keys)",x)
my results (Python 3.4):
issubset(set) 1.6141405847648826
issubset(list) 3.698748032058883
issubset(dict) 3.6300025109004244
issubset(dict_keys) 4.224299651223102
So if a set
is passed as the argument, the result is very fast.
Using a list
is much slower. I figured out that it was because of the hash that must be done on the strings is costly. So I changed my test inputs with integers like this:
foo = {i for i in range(1, 10000, 2)}
bar = foo - {400}
and the results were globally faster but still a huge time difference:
issubset(set) 0.5981848205989139
issubset(list) 1.7991591232742143
issubset(dict) 1.889119736960271
issubset(dict_keys) 2.2531574114632678
I also tried to change dict
by dict.keys()
as in python 3 the keys is said to be (https://www.python.org/dev/peps/pep-3106/) "a set-like or unordered container object".
But in that case, the result is even worse than with dict
or list
.
So why does passing a set
beats passing a list
or a dict
or a dict_keys
object? I don't see anything mentionned in the documentation about this.