Is there a standard way to represent a "set" that can contain duplicate elements.
As I understand it, a set has exactly one or zero of an element. I want functionality to have any number.
I am currently using a dictionary with elements as keys, and quantity as values, but this seems wrong for many reasons.
Motivation: I believe there are many applications for such a collection. For example, a survey of favourite colours could be represented by: survey = ['blue', 'red', 'blue', 'green']
Here, I do not care about the order, but I do about quantities. I want to do things like:
survey.add('blue')
# would give survey == ['blue', 'red', 'blue', 'green', 'blue']
...and maybe even
survey.remove('blue')
# would give survey == ['blue', 'red', 'green']
Notes: Yes, set is not the correct term for this kind of collection. Is there a more correct one?
A list of course would work, but the collection required is unordered. Not to mention that the method naming for sets seems to me to be more appropriate.
Your approach with dict with element/count seems ok to me. You probably need some more functionality. Have a look at
collections.Counter
.element in list
andlist.count(element)
)counter.elements()
looks like a list with all duplicatesYou can use a plain
list
and uselist.count(element)
whenever you want to access the "number" of elements.If you need duplicates, use a list, and transform it to a set when you need operate as a set.
What you're looking for is indeed a multiset (or bag), a collection of not necessarily distinct elements (whereas a set does not contain duplicates).
There's an implementation for multisets here: https://github.com/mlenzen/collections-extended (Pypy's collections extended module).
The data structure for multisets is called
bag
. Abag
is a subclass of theSet
class fromcollections
module with an extra dictionary to keep track of the multiplicities of elements.A nice method for
bag
isnlargest
(similar toCounter
for lists), that returns the multiplicities of all elements blazingly fast since the number of occurrences of each element is kept up-to-date in the bag's dictionary:An alternative Python multiset implementation uses a sorted list data structure. There are a couple implementations on PyPI. One option is the sortedcontainers module which implements a SortedList data type that efficiently implements set-like methods like
add
,remove
, andcontains
. The sortedcontainers module is implemented in pure-Python, fast-as-C implementations (even faster), has 100% unit test coverage, and hours of stress testing.Installation is easy from PyPI:
If you can't
pip install
then simply pull the sortedlist.py file down from the open-source repository.Use it as you would a set:
The sortedcontainers module also maintains a performance comparison with other popular implementations.
You are looking for a multiset.
Python's closest datatype is
collections.Counter
:For an actual implementation of a multiset, use the
bag
class from the data-structures package on pypi. Note that this is for Python 3 only. If you need Python 2, here is a recipe for abag
written for Python 2.4.