Python “set” with duplicate/repeated elements

2019-03-14 07:33发布

Is there a standard way to represent a "set" that can contain duplicate elements.

As I understand it, a set has exactly one or zero of an element. I want functionality to have any number.

I am currently using a dictionary with elements as keys, and quantity as values, but this seems wrong for many reasons.

Motivation: I believe there are many applications for such a collection. For example, a survey of favourite colours could be represented by: survey = ['blue', 'red', 'blue', 'green']

Here, I do not care about the order, but I do about quantities. I want to do things like:

survey.add('blue')
# would give survey == ['blue', 'red', 'blue', 'green', 'blue']

...and maybe even

survey.remove('blue')
# would give survey == ['blue', 'red', 'green']

Notes: Yes, set is not the correct term for this kind of collection. Is there a more correct one?

A list of course would work, but the collection required is unordered. Not to mention that the method naming for sets seems to me to be more appropriate.

6条回答
萌系小妹纸
2楼-- · 2019-03-14 07:46

Your approach with dict with element/count seems ok to me. You probably need some more functionality. Have a look at collections.Counter.

  • O(1) test whether an element is present and current count retrieval (faster than with element in list and list.count(element))
  • counter.elements() looks like a list with all duplicates
  • easy manipulation union/difference with other Counters
查看更多
男人必须洒脱
3楼-- · 2019-03-14 07:46

You can use a plain list and use list.count(element) whenever you want to access the "number" of elements.

my_list = [1, 1, 2, 3, 3, 3]

my_list.count(1) # will return 2
查看更多
成全新的幸福
4楼-- · 2019-03-14 07:49

If you need duplicates, use a list, and transform it to a set when you need operate as a set.

查看更多
爷、活的狠高调
5楼-- · 2019-03-14 07:53

What you're looking for is indeed a multiset (or bag), a collection of not necessarily distinct elements (whereas a set does not contain duplicates).

There's an implementation for multisets here: https://github.com/mlenzen/collections-extended (Pypy's collections extended module).

The data structure for multisets is called bag. A bag is a subclass of the Set class from collections module with an extra dictionary to keep track of the multiplicities of elements.

class _basebag(Set):
    """
    Base class for bag and frozenbag.   Is not mutable and not hashable, so there's
    no reason to use this instead of either bag or frozenbag.
    """
    # Basic object methods

    def __init__(self, iterable=None):
        """Create a new basebag.

        If iterable isn't given, is None or is empty then the bag starts empty.
        Otherwise each element from iterable will be added to the bag
        however many times it appears.

        This runs in O(len(iterable))
        """
        self._dict = dict()
        self._size = 0
        if iterable:
            if isinstance(iterable, _basebag):
                for elem, count in iterable._dict.items():
                    self._inc(elem, count)
            else:
                for value in iterable:
                    self._inc(value)

A nice method for bag is nlargest (similar to Counter for lists), that returns the multiplicities of all elements blazingly fast since the number of occurrences of each element is kept up-to-date in the bag's dictionary:

>>> b=bag(random.choice(string.ascii_letters) for x in xrange(10))
>>> b.nlargest()
[('p', 2), ('A', 1), ('d', 1), ('m', 1), ('J', 1), ('M', 1), ('l', 1), ('n', 1), ('W', 1)]
>>> Counter(b)
Counter({'p': 2, 'A': 1, 'd': 1, 'm': 1, 'J': 1, 'M': 1, 'l': 1, 'n': 1, 'W': 1}) 
查看更多
Fickle 薄情
6楼-- · 2019-03-14 08:00

An alternative Python multiset implementation uses a sorted list data structure. There are a couple implementations on PyPI. One option is the sortedcontainers module which implements a SortedList data type that efficiently implements set-like methods like add, remove, and contains. The sortedcontainers module is implemented in pure-Python, fast-as-C implementations (even faster), has 100% unit test coverage, and hours of stress testing.

Installation is easy from PyPI:

pip install sortedcontainers

If you can't pip install then simply pull the sortedlist.py file down from the open-source repository.

Use it as you would a set:

from sortedcontainers import SortedList
survey = SortedList(['blue', 'red', 'blue', 'green']]
survey.add('blue')
print survey.count('blue') # "3"
survey.remove('blue')

The sortedcontainers module also maintains a performance comparison with other popular implementations.

查看更多
男人必须洒脱
7楼-- · 2019-03-14 08:03

You are looking for a multiset.

Python's closest datatype is collections.Counter:

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.

For an actual implementation of a multiset, use the bag class from the data-structures package on pypi. Note that this is for Python 3 only. If you need Python 2, here is a recipe for a bag written for Python 2.4.

查看更多
登录 后发表回答