I was wondering whether I should have my data structure as a set or a list. Mostly I will do set operations, but in the end I will need to sort it.
I wondered whether I should first make the set a list, and then use sorted(list(my_set))
, or just sort the set immediately sorted(my_set)
. Arguably, I might consider a general "to list" phase, since having an ordered iterable at that point in time might make sense anyway.
So I decided to test it, expecting a list to be quicker.
Benchmarker:
import time
def sorter(x):
t1 = time.time()
for i in range(1000000):
sorted(x)
return time.time() - t1
Data:
one = range(1000)
a1 = list(one)
b1 = set(one)
sorter(a1)
# time: 16.5 s
sorter(b1)
# time: 20.7 s
I then realized it might have to do with the fact that the elements are already in place, and remembered this amazing question & answer.
Then, I tried some random data:
two = numpy.random.randint(1, 1000, 1000)
a2 = list(two)
b2 = set(two)
With the results:
sorter(a2)
# time: 4min 49s
sorter(b2)
# time: 18.9 s
Huge difference, what is going on?
Bonus: It even appears at a timing of one minute, that sorted(set(a_list))
is impressively faster than sorted(a_list)
.
Indeed in the second case, there can be duplicates which would be filtered, and thus speed up sorting.
I extended your code a bit and hope that this will give you insight into what is happening:
this prints out:
This is because of the collision in the random number range
gives as output:
That
b2
would be faster because there are less elements is logical, but that this is even faster if you first make a list of the set must have some reason. That it again is slower if you shuffle that list is again logical and the shuffled result is rather close to the result for a2 when compensated for the length of the lists.So lets try to put something else in the list:
gives (I would have been rather surprised to have less than 1000 elements):
So it must have something to do with having numbers in the set:
gives you:
Instead of iterating , a set has a
pop()
function you can try and use:So lets arbitrarily retrieve elements from the set
b2
and see if there is something special:gives the same result:
So arbitrarily retrieving elements of set of number retrieves those numbers in order, independent off the ordering how these numbers were put in. As the iteration is how the list-making retrieves an element at a time to append to the list, the result of
list(b2)
is an ordered list and that gets sorted quite fast using the Timsort algorithm used in Python.