How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.
To remove duplicates use `set(a)`. To print duplicates, something like the snippet below.
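A minimal sketch of the `collections.Counter` approach (the sample list `a` is an assumed illustration):

```python
import collections

a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

# keep every value that occurs more than once
print([item for item, count in collections.Counter(a).items() if count > 1])
# [1, 2, 5]
```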
Note that `Counter` is not particularly efficient (timings) and probably overkill here; a plain `set` will perform better. The code below computes a list of unique elements in the source order, first as an explicit loop and then, more concisely, as a comprehension.
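A sketch of that `set`-based de-duplication (variable names are illustrative):

```python
a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

# explicit loop: remember what has been seen, keep only first occurrences
seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

# the same thing as a one-liner; seen.add(x) returns None, so `not seen.add(x)`
# is always True and only serves to record x as seen
seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]
```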
I don't recommend the latter style, because it is not obvious what `not seen.add(x)` is doing (the set `add()` method always returns `None`, hence the need for `not`). To compute the list of duplicated elements without libraries, see the sketch below.
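One possible version, collecting every element that has already been seen:

```python
a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

seen = set()
dupes = []
for x in a:
    if x in seen:
        dupes.append(x)   # x occurred before, so record it as a duplicate
    else:
        seen.add(x)

print(dupes)  # [2, 1, 5, 5, 5]
```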
If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic-time solution (compare each with each), for example the sketch below.
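One way to do it, assuming a list of (unhashable) lists as input:

```python
a = [[1], [2], [3], [1], [5], [3]]

# quadratic time: for each element, scan the slice of the list before it
no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
dupes = [x for n, x in enumerate(a) if x in a[:n]]

print(no_dupes)  # [[1], [2], [3], [5]]
print(dupes)     # [[1], [3]]
```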
You don't need the count, just whether or not the item was seen before. Adapting that answer to this problem gives something like the sketch below.
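A possible sketch of the idea (the function name and method caching are illustrative):

```python
def list_duplicates(seq):
    seen = set()
    seen_add = seen.add          # cache the bound method for a small speed-up
    # seen_add(x) returns None (falsy), so the `or` branch only records new items;
    # anything already in `seen` ends up in `seen_twice`
    seen_twice = set(x for x in seq if x in seen or seen_add(x))
    return list(seen_twice)

print(list_duplicates([1, 2, 3, 2, 1, 5, 6, 5, 5, 5]))  # [1, 2, 5] (order not guaranteed)
```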
Just in case speed matters, here are some timings:
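A rough sketch of how such a `timeit` comparison might be set up (the approaches and input size used here are assumptions, not the original benchmark):

```python
import timeit

setup = """
import collections, random
l = [random.randrange(100) for _ in range(10000)]

def dupes_counter(seq):
    return [x for x, c in collections.Counter(seq).items() if c > 1]

def dupes_set(seq):
    seen = set()
    seen_add = seen.add
    return list(set(x for x in seq if x in seen or seen_add(x)))
"""

for stmt in ("dupes_counter(l)", "dupes_set(l)"):
    print(stmt, timeit.timeit(stmt, setup=setup, number=100))
```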
Here are the results: (well done @JohnLaRooy!)
Interestingly, besides the timings themselves, the ranking also changes slightly when PyPy is used. Most interestingly, the Counter-based approach benefits hugely from PyPy's optimizations, whereas the method-caching approach I suggested seems to gain almost nothing.
Apparently this effect is related to the "duplicatedness" of the input data. I set `l = [random.randrange(1000000) for i in xrange(10000)]` and got these results.

I came across this question whilst looking into something related, and I wonder why no-one offered a generator-based solution. Solving this problem would be something like the sketch below.
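A possible generator-based sketch (the function name is an assumption):

```python
def find_dupes(items):
    """Yield each duplicated value once, lazily, in order of first repetition."""
    seen = set()
    reported = set()
    for x in items:
        if x in seen and x not in reported:
            reported.add(x)
            yield x
        seen.add(x)

print(list(find_dupes([1, 2, 3, 2, 1, 5, 6, 5, 5, 5])))  # [2, 1, 5]
```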
I was concerned with scalability, so I tested several approaches, including naive ones that work well on small lists but scale horribly as lists get larger (note: it would have been better to use timeit, but this is illustrative).

I included @moooeeeep's approach for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is faster still for mostly sorted lists. I now also include the pandas approach from @firelynx -- slow, but not horribly so, and simple. Note: the sort/tee/zip approach is consistently fastest on my machine for large, mostly ordered lists, and moooeeeep's is fastest for shuffled lists, but your mileage may vary.
Advantages
Assumptions
Fastest solution, 1m entries:
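A sketch of what a sort/tee/zip generator of the kind described above might look like (function and variable names are assumptions):

```python
import itertools

def get_dupes(c):
    """sort/tee/zip: sort once, then compare each element with its right neighbour."""
    a, b = itertools.tee(sorted(c))
    next(b, None)                  # offset the second iterator by one position
    last = object()                # sentinel: nothing compares equal to it
    for x, y in zip(a, b):
        if x == y and x != last:   # a repeated value we have not reported yet
            yield x
            last = x

print(list(get_dupes([1, 2, 3, 2, 1, 5, 6, 5, 5, 5])))  # [1, 2, 5]
```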
Approaches tested
The results for the 'all dupes' test were consistent, finding the "first" duplicate and then "all" duplicates in this array:
When the lists are shuffled first, the price of the sort becomes apparent: the efficiency drops noticeably and the @moooeeeep approach dominates, with the set and dict approaches being similar but lesser performers:
Use the `sort()` function. Duplicates can then be identified by looping over the sorted list and checking `l1[i] == l1[i+1]`.
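A minimal sketch of that idea (the list name `l1` is an assumption):

```python
l1 = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
l1.sort()                      # equal values are now adjacent

duplicates = []
for i in range(len(l1) - 1):
    if l1[i] == l1[i + 1] and l1[i] not in duplicates:
        duplicates.append(l1[i])

print(duplicates)  # [1, 2, 5]
```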
When using toolz:
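A sketch assuming toolz's `frequencies` and `valfilter` helpers:

```python
from toolz import frequencies, valfilter

a = [1, 2, 2, 3, 4, 5, 4]

# frequencies(a) -> {value: count}; valfilter keeps the entries whose count is > 1
print(list(valfilter(lambda count: count > 1, frequencies(a)).keys()))
# [2, 4]
```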
Probably the simplest way, without converting the list to a set, would be something like the snippet below. This may be useful during an interview when they ask you not to use sets.
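A possible sketch using only list operations (names are illustrative):

```python
l1 = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

duplist = []
for value in l1:
    # list.count scans the whole list each time, so this is quadratic but set-free
    if l1.count(value) > 1 and value not in duplist:
        duplist.append(value)

print(duplist)  # [1, 2, 5]
```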
Or else, to get two separate lists of unique values and duplicate values:
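A possible sketch under the same no-sets constraint (names are illustrative):

```python
l1 = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

unique_values = []
duplicate_values = []
for value in l1:
    if value not in unique_values:
        unique_values.append(value)       # first time this value is met
    elif value not in duplicate_values:
        duplicate_values.append(value)    # second (or later) occurrence

print(unique_values)     # [1, 2, 3, 5, 6]
print(duplicate_values)  # [2, 1, 5]
```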