Given a list of sets:
allsets = [set([1, 2, 4]), set([4, 5, 6]), set([4, 5, 7])]
What is a pythonic way to compute the corresponding list of sets of elements having no overlap with other sets?
only = [set([1, 2]), set([6]), set([7])]
Is there a way to do this with a list comprehension?
To avoid quadratic runtime, you'd want to make an initial pass to figure out which elements appear in more than one set:
import itertools
import collections
element_counts = collections.Counter(itertools.chain.from_iterable(allsets))
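For the example input, the counts come out as follows (ordering among tied counts may vary):
>>> element_counts
Counter({4: 3, 5: 2, 1: 1, 2: 1, 6: 1, 7: 1})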
Then you can simply make a list of sets retaining all elements that only appear once:
nondupes = [{elem for elem in original if element_counts[elem] == 1}
            for original in allsets]
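For the sample input this reproduces the result from the question (shown with Python 3's set repr):
>>> nondupes
[{1, 2}, {6}, {7}]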
Alternatively, instead of constructing nondupes from element_counts directly, we can make an additional pass to construct a set of all elements that appear in exactly one input. This requires an additional statement, but it allows us to take advantage of the & operator for set intersection to make the list comprehension shorter and more efficient:
element_counts = collections.Counter(itertools.chain.from_iterable(allsets))
all_uniques = {elem for elem, count in element_counts.items() if count == 1}
# ^ viewitems() in Python 2.7
nondupes = [original & all_uniques for original in allsets]
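With the example input, all_uniques collects exactly the elements that survive:
>>> all_uniques
{1, 2, 6, 7}
>>> nondupes
[{1, 2}, {6}, {7}]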
Timing seems to indicate that using an all_uniques set produces a substantial speedup for the overall duplicate-elimination process: up to about a 3.5x speedup on Python 3 for heavily duplicated input sets, though only about a 30% speedup on Python 2, where more of the runtime is dominated by constructing the Counter. This speedup is fairly substantial, though not nearly as important as avoiding quadratic runtime by using element_counts in the first place. If you're on Python 2 and this code is speed-critical, you'd want to use an ordinary dict or a collections.defaultdict instead of a Counter, as sketched below.
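A minimal counting sketch in that spirit, using collections.defaultdict with the same semantics as the Counter above:
import itertools
import collections

# Hand-rolled count; on Python 2 this avoids Counter's per-element overhead.
element_counts = collections.defaultdict(int)
for elem in itertools.chain.from_iterable(allsets):
    element_counts[elem] += 1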
Another way would be to construct a dupes set from element_counts and use original - dupes instead of original & all_uniques in the list comprehension, as suggested by munk (see the sketch below). Whether this performs better or worse than using an all_uniques set and & would depend on the degree of duplication in your input and which Python version you're on, but it doesn't seem to make much of a difference either way.
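A sketch of that variant, assuming element_counts has already been built as above:
# munk's variant: subtract the duplicated elements rather than
# intersecting with the unique ones.
dupes = {elem for elem, count in element_counts.items() if count > 1}
nondupes = [original - dupes for original in allsets]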
Yes, it can be done, but it's hardly Pythonic:
>>> [(i - set.union(*[j for j in allsets if j != i])) for i in allsets]
[set([1, 2]), set([6]), set([7])]
Some reference material on sets can be found in the Python documentation. The * operator is known as the unpacking operator.
A slightly different solution using Counter and comprehensions, to take advantage of the - operator for set difference:
from itertools import chain
from collections import Counter
allsets = [{1, 2, 4}, {4, 5, 6}, {4, 5, 7}]
element_counts = Counter(chain.from_iterable(allsets))
dupes = {key for key in element_counts
         if element_counts[key] > 1}
only = [s - dupes for s in allsets]
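For the sample allsets this gives:
>>> dupes
{4, 5}
>>> only
[{1, 2}, {6}, {7}]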
Another solution with itertools.chain:
>>> from itertools import chain
>>> [x - set(chain(*(y for y in allsets if y != x))) for x in allsets]
[set([1, 2]), set([6]), set([7])]
Also doable without the unpacking, using chain.from_iterable instead:
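A sketch of that variant (same result, just no * unpacking):
>>> [x - set(chain.from_iterable(y for y in allsets if y != x)) for x in allsets]
[set([1, 2]), set([6]), set([7])]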