I have a python list where elements can repeat.
>>> a = [1,2,2,3,3,4,5,6]
I want to get the first n
unique elements from the list.
So, in this case, if i want the first 5 unique elements, they would be:
[1,2,3,4,5]
I have come up with a solution using generators:
def iterate(itr, upper=5):
count = 0
for index, element in enumerate(itr):
if index==0:
count += 1
yield element
elif element not in itr[:index] and count<upper:
count += 1
yield element
In use:
>>> i = iterate(a, 5)
>>> [e for e in i]
[1,2,3,4,5]
I have doubts on this being the most optimal solution. Is there an alternative strategy that i can implement to write it in a more pythonic and efficient way?
If your objects are hashable (
int
s are hashable) you can write utility function usingfromkeys
method ofcollections.OrderedDict
class (or starting from Python3.7 a plaindict
, since they became officially ordered) likeand then implementation of
iterate
can be simplified toor if you want always a
list
as an outputImprovements
As @Chris_Rands mentioned this solution walks through entire collection and we can improve this by writing
nub
utility in a form of generator like others already did:Using
set
withsorted+ key
You can adapt the popular
itertools
unique_everseen
recipe:Alternatively, as suggested by @Chris_Rands, you can use
itertools.islice
to extract a fixed number of values from a non-limited generator:Note the
unique_everseen
recipe is available in 3rd party libraries viamore_itertools.unique_everseen
ortoolz.unique
, so you could use:There are really amazing answers for this question, which are fast, compact and brilliant! The reason I am putting here this code is that I believe there are plenty of cases when you don't care about 1 microsecond time loose nor you want additional libraries in your code for one-time solving a simple task.
Why not use something like this?
I would use a
set
to remember what was seen and return from the generator when you haveseen
enough:Output:
According to PEP-479 you should
return
from generators, notraise StopIteration
- thanks to @khelwood & @iBug for that piece of comment - one never learns out.With 3.6 you get a deprecated warning, with 3.7 it gives RuntimeErrors: Transition Plan if still using
raise StopIteration
Your solution using
elif element not in itr[:index] and count<upper:
usesO(k)
lookups - withk
being the length of the slice - using a set reduces this toO(1)
lookups but uses more memory because the set has to be kept as well. It is a speed vs. memory tradeoff - what is better is application/data dependend.Consider
[1,2,3,4,4,4,4,5]
vs[1]*1000+[2]*1000+[3]*1000+[4]*1000+[5]*1000+[6]
:For 6 uniques (in longer list):
O(1)+O(2)+...+O(5001)
5001*O(1)
lookup + memory forset( {1,2,3,4,5,6})