How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.
I am entering this discussion much later. Even so, I would like to tackle this problem with one-liners, because that's the charm of Python. If we just want to get the duplicates into a separate list (or any collection), I would suggest the following. Say we have a list containing duplicates, which we can call 'target'.
Now if we want to get the duplicates, we can use a one-liner as below. It puts each duplicated record as a key and its count as the value into the dictionary 'duplicates', which will look as shown below:
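(A minimal sketch of such a one-liner; the sample values in 'target' are assumed for illustration.)

    target = [1, 2, 3, 4, 4, 4, 3, 5, 6, 8, 4, 3]

    # build {duplicated value: occurrence count} in a single expression
    duplicates = dict(set((x, target.count(x))
                          for x in filter(lambda rec: target.count(rec) > 1, target)))
    print(duplicates)
    # {3: 3, 4: 4}  (key order may vary)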
If you just want all the records with duplicates in a single list, the code is again much shorter, and the output will be as shown below:
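(Again a sketch, using the same assumed 'target' list as above.)

    # unique duplicated values only, as a list
    print(list(set(filter(lambda rec: target.count(rec) > 1, target))))
    # [3, 4]  (set ordering, so the order may vary)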
This works in Python 2.7.x and later versions.
How about simply looping through each element in the list, checking its number of occurrences, and adding the elements that occur more than once to a set, which can then be printed to show the duplicates? Hope this helps someone out there.
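(A short sketch of that idea; the sample list is assumed for illustration.)

    some_list = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

    duplicates = set()
    for item in some_list:
        # count() scans the whole list, so this is O(n**2) but very simple
        if some_list.count(item) > 1:
            duplicates.add(item)

    print(duplicates)   # {1, 2, 5}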
There are a lot of answers up here, but I think this is a relatively readable and easy to understand approach:
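(The code for this answer is not shown above, so the following is a sketch of one readable way to do it: track what has been seen once, and collect anything seen again.)

    def list_duplicates(seq):
        seen = set()
        duplicates = set()
        for x in seq:
            if x in seen:
                duplicates.add(x)   # already seen at least once before
            else:
                seen.add(x)
        return list(duplicates)

    print(list_duplicates([1, 2, 3, 2, 1, 5, 6, 5, 5, 5]))   # e.g. [1, 2, 5]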
Notes:
You can use iteration_utilities.duplicates, or if you only want one of each duplicate, this can be combined with iteration_utilities.unique_everseen. It can also handle unhashable elements (however at the cost of performance); all three usages are sketched below.
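(A sketch of these calls; the sample input lists are assumed for illustration.)

    from iteration_utilities import duplicates, unique_everseen

    # every repeated occurrence is reported
    print(list(duplicates([1, 1, 2, 1, 2, 3, 4, 2])))
    # [1, 1, 2, 2]

    # only one entry per duplicated value
    print(list(unique_everseen(duplicates([1, 1, 2, 1, 2, 3, 4, 2]))))
    # [1, 2]

    # unhashable elements (e.g. lists) also work, just more slowly
    print(list(unique_everseen(duplicates([[1], [2], [1], [3], [1]]))))
    # [[1]]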
That's something that only a few of the other approaches here can handle.
Benchmarks
I did a quick benchmark containing most (but not all) of the approaches mentioned here.
The first benchmark included only a small range of list lengths because some approaches have O(n**2) behavior. In the graphs the y-axis represents the time, so a lower value means better. The plots are also log-log so the wide range of values can be visualized better.
Removing the O(n**2) approaches, I did another benchmark up to half a million elements in a list. As you can see, the iteration_utilities.duplicates approach is faster than any of the other approaches, and even chaining unique_everseen(duplicates(...)) was as fast as or faster than the other approaches. One additional interesting thing to note here is that the pandas approaches are very slow for small lists but can easily compete for longer lists.
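(For reference, a sketch of one pandas-based way to collect the duplicates; pandas is assumed to be installed, and this is not necessarily the exact variant that was benchmarked.)

    import pandas as pd

    l = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
    s = pd.Series(l)
    # duplicated() marks every occurrence after the first
    print(s[s.duplicated()].unique())
    # array([2, 1, 5])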
However, as these benchmarks show, most of the approaches perform roughly equally, so it doesn't matter much which one is used (except for the 3 that had O(n**2) runtime).
Benchmark 1
Benchmark 2
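(The original benchmark code is not reproduced above; the following is a minimal sketch of how such a timing comparison could be set up with timeit. The function names and list sizes are assumptions.)

    import random
    import timeit
    from collections import Counter
    from iteration_utilities import duplicates, unique_everseen

    def counter_based(l):
        return [item for item, count in Counter(l).items() if count > 1]

    def iteration_utilities_based(l):
        return list(unique_everseen(duplicates(l)))

    l = [random.randint(0, 1000) for _ in range(10_000)]
    for func in (counter_based, iteration_utilities_based):
        print(func.__name__, timeit.timeit(lambda: func(l), number=100))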
Disclaimer
1 This is from a third-party library I have written: iteration_utilities.
Here's a neat and concise solution -
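(A sketch of one such concise approach, using collections.Counter; the exact code this answer intended isn't shown above.)

    from collections import Counter

    l = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
    print([item for item, count in Counter(l).items() if count > 1])
    # [1, 2, 5]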