I have a collection of unique sets (represented as bit masks) and would like to eliminate all elements that are proper subsets of another element. For example:
input = [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, {}]
output = [{1, 2, 3}, {2, 4}]
I have not been able to find a standard algorithm for this, or even a name for this problem, so I am calling it "maximal subsets" for lack of anything else. Here is an O(n^2) algorithm (in Python for concreteness), assuming is_subset_func is O(1):[1]
```python
def eliminate_subsets(a, cardinality_func, is_subset_func):
    out = []
    for element in sorted(a, reverse=True, key=cardinality_func):
        for existing in out:
            if is_subset_func(element, existing):
                break
        else:
            out.append(element)
    return out
```
Is there a more efficient algorithm, hopefully O(n log n) or better?
[1] For bit masks of constant size, as is true in my case, is_subset_func is just element & existing == element, which runs in constant time.
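For concreteness, here is the quadratic algorithm above specialized to plain integer bit masks, as a minimal runnable sketch (popcount plays the role of cardinality_func, and m & kept == m is is_subset_func):

```python
def eliminate_subsets(masks):
    """Keep only masks that are not subsets of another kept mask."""
    out = []
    # Largest first: a set can only be a subset of a set at least as big.
    for m in sorted(masks, reverse=True, key=lambda x: bin(x).count("1")):
        if not any(m & kept == m for kept in out):
            out.append(m)
    return out

# {1,2,3}, {1,2}, {2,3}, {2,4}, {} encoded with bit i standing for element i
print(eliminate_subsets([0b1110, 0b0110, 0b1100, 0b10100, 0b0]))
# -> [14, 20], i.e. {1, 2, 3} and {2, 4}
```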
Suppose you label all the input sets:

A = {1, 2, 3}
B = {1, 2}
C = {2, 3}
D = {2, 4}
E = {}

Now build intermediate sets, one per element in the universe, containing the labels of the sets where it appears:

I(1) = {A, B}
I(2) = {A, B, C, D}
I(3) = {A, C}
I(4) = {D}

Now for each input set compute the intersection of all the label sets of its elements. For A that is

I(1) intersect I(2) intersect I(3) = {A}   (*)

If the intersection contains some label other than the one for the set itself, then it's a subset of that set. Here there is no other label, so the answer is no: A is maximal. But for B the intersection is I(1) intersect I(2) = {A, B}, which contains A, so B is a proper subset of A and can be eliminated; likewise C. For D, I(2) intersect I(4) = {D}, so D is also maximal.
The cost of this method depends on the implementation of sets. Suppose bitmaps (as you hinted). Say there are n input sets of maximum size m, and |U| items in the universe. Then the intermediate set construction produces |U| sets of n bits each, so there is O(|U|n) time to initialize them. Setting the bits requires O(nm) time. Computing each intersection as at (*) above requires O(mn); O(mn^2) for all. Putting these together we have O(|U|n) + O(nm) + O(mn^2) = O(|U|n + mn^2). Using the same conventions, your "all pairs" algorithm is O(|U|^2 n^2). Since m <= |U|, this algorithm is asymptotically faster. It's likely to be faster in practice as well because there's no elaborate bookkeeping to add constant factors.
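As a minimal sketch, the label-intersection method might look as follows in Python, with dicts of label sets standing in for the bitmaps (names are illustrative; the empty set is skipped up front since it is a subset of everything, and inputs are assumed unique as in the question):

```python
def maximal_sets(sets):
    """Keep the sets whose label intersection contains only their own label."""
    labeled = list(enumerate(sets))
    # I[e] = labels of all input sets containing element e (the reverse map)
    I = {}
    for lab, s in labeled:
        for e in s:
            I.setdefault(e, set()).add(lab)
    out = []
    for lab, s in labeled:
        if not s:
            continue  # the empty set is a subset of everything
        inter = set.intersection(*(I[e] for e in s))
        if inter == {lab}:  # no other label survived: s is maximal
            out.append(s)
    return out

print(maximal_sets([{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, set()]))
# -> [{1, 2, 3}, {2, 4}]
```

Note the uniqueness assumption matters: two identical input sets would each see the other's label in the intersection and both be dropped.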
Addition: Online Version
The OP asked if there is an online version of this algorithm, i.e. one where the set of maximal sets can be maintained incrementally as input sets arrive one-by-one. The answer seems to be yes. The intermediate sets tell us quickly if a new set is a subset of one already seen. But how to tell quickly if it's a superset? And, if so, of which existing maximal sets? For in this case those maximal sets are no longer maximal and must be replaced by the new one.
The key is to note that A is a superset of B iff A' is a subset of B' (the tick ' denoting set complement).

Following this inspiration, we maintain the intermediate sets as before. When a new input set S arrives, do the same test as described above: let I(e) be the intermediate set for input element e. Then this test is

|I(e1) intersect I(e2) intersect ... intersect I(ek)| > 0, where S = {e1, e2, ..., ek}.

(In this case it's greater than zero rather than one as above because S is not yet in I.) If the test succeeds, then the new set is a (possibly improper) subset of an existing maximal set, so it can be discarded. Otherwise we must add S as a new maximal set, but before doing this, compute

Y = I'(f1) intersect I'(f2) intersect ... = (I(f1) union I(f2) union ...)', where S' = {f1, f2, ...},

where again the tick ' is set complement. The union form may be a bit faster to compute. Y contains the maximal sets that have been superseded by S. They must be removed from the maximal collection and from I. Finally, add S as a maximal set and update I with S
's elements.

Let's work through our example. When A arrives, we add it to I and have

I(1) = {A}, I(2) = {A}, I(3) = {A}

When B arrives, we find X = {A} intersect {A} = {A}, so throw B away and continue. The same happens for C. When D arrives we find X = {A} intersect {} = {}, so continue with Y = I'(1) intersect I'(3) = {} intersect {} = {}. This correctly tells us that maximal set A is not contained in D, so there is nothing to delete. But D must be added as a new maximal set, and I becomes

I(1) = {A}, I(2) = {A, D}, I(3) = {A}, I(4) = {D}

The arrival of E causes no change. Posit the arrival then of a new set F = {2, 3, 4, 5}. We find

X = I(2) intersect I(3) intersect I(4) intersect I(5) = {A, D} intersect {A} intersect {D} intersect {} = {},

so we cannot throw F away. Continue with

Y = I'(1) = {D}.

This tells us D is a subset of F, so it should be discarded while F is added, leaving

I(1) = {A}, I(2) = {A, F}, I(3) = {A, F}, I(4) = {F}, I(5) = {F}

The computation of the complements is both tricky and nice due to the algorithm's online nature. The universe for input complements need only include input elements seen so far. The universe for intermediate sets consists only of tags of sets in the current maximal collection. For many input streams the size of this set will stabilize or decrease over time.
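A sketch of this online version in Python (class and variable names are illustrative; the complements I' and S' are taken implicitly, by subtracting from the current label set and the universe seen so far, rather than being materialized):

```python
class MaximalSets:
    """Incrementally maintain the maximal sets and the reverse map I."""

    def __init__(self):
        self.maximal = {}      # label -> maximal set
        self.I = {}            # element -> labels of maximal sets containing it
        self.universe = set()  # input elements seen so far
        self.next_label = 0

    def add(self, s):
        s = set(s)
        self.universe |= s
        labels = set(self.maximal)
        # Subset test: X = intersection of I(e) over e in s.
        X = set(labels)
        for e in s:
            X &= self.I.get(e, set())
        if X:
            return  # s is a (possibly improper) subset of an existing maximal set
        # Superset test: Y = labels appearing in no I(e) for e outside s,
        # i.e. the union form (I(f1) union I(f2) union ...)' from the text.
        Y = set(labels)
        for e in self.universe - s:
            Y -= self.I.get(e, set())
        for lab in Y:  # these maximal sets are superseded by s
            for e in self.maximal.pop(lab):
                self.I[e].discard(lab)
        lab, self.next_label = self.next_label, self.next_label + 1
        self.maximal[lab] = s
        for e in s:
            self.I.setdefault(e, set()).add(lab)

ms = MaximalSets()
for s in [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, set(), {2, 3, 4, 5}]:
    ms.add(s)
print(sorted(map(sorted, ms.maximal.values())))
# -> [[1, 2, 3], [2, 3, 4, 5]]
```

Note that the empty set falls out of the subset test for free: with no elements to intersect over, X stays equal to the full label set, which is nonempty whenever any maximal set exists.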
I hope this is helpful.
Summary

The general principle at work here is a powerful idea that crops up often in algorithm design: the reverse map. Whenever you find yourself doing a linear search to find an item with a given attribute, consider building a map from the attribute back to the item. Often it is cheap to construct this map, and it strongly reduces search time. The premier example is a permutation map p[i] that tells you what position the i'th element will occupy after an array is permuted. If you need to find the item that ends up in a given location a, you must search p for a, a linear-time operation. On the other hand, an inverse map pi such that pi[p[i]] == i takes no longer to compute than does p (so its cost is "hidden"), but pi[a] produces the desired result in constant time.

Implementation by Original Poster
Off the top of my head, there is an O(D*N*log(N)) algorithm, where D is the number of unique numbers.
The recursive function "helper" works as follows. Its arguments are the sets and the domain (the number of unique numbers in the sets).

Base cases:

Iterative case:
Note that the runtime depends on the Set implementation used. If a doubly linked list is used to store each set, then:

Steps 1-5 and 7 take O(N). Step 6's union is O(N*log(N)) by sorting and then merging.

Therefore the overall algorithm is O(D*N*log(N)).
Here is Java code to perform the following:
*New years is disruptive
This problem has been studied in the literature. Given S_1, ..., S_k, which are subsets of {1, ..., n}, Yellin [1] gave an algorithm to find the maximal subsets of {S_1, ..., S_k} in time O(kdm), where d is the average size of the S_i and m is the cardinality of the maximal subset of {S_1, ..., S_k}. This was later improved for some range of parameters by Yellin and Jutla [2] to O((kd)^2/sqrt(log(kd))). It is believed that a truly sub-quadratic algorithm for this problem does not exist.
[1] Daniel M. Yellin: Algorithms for Subset Testing and Finding Maximal Sets. SODA 1992: 386-392.
[2] Daniel M. Yellin, Charanjit S. Jutla: Finding Extremal Sets in Less than Quadratic Time. Inf. Process. Lett. 48(1): 29-34 (1993).
Pre-process assumptions:

Approach #2 - Use a bucket approach

Same assumptions. Can uniqueness be assumed? (i.e., there is not {1,4,6}, {1,4,6}) Otherwise, you would need to check for distinctness at some point, probably once the buckets are created.

Semi-pseudocode:
Approach #1 (O(n(n+1)/2)) ... not efficient enough

Semi-pseudocode:
This takes the set of sets and iterates through each one. For each outer set, it iterates through every set in the collection again. As the nested iteration takes place, it checks whether the outer set is the same as the nested set (if so, no checking is done), whether the outer set's total is greater than the nested set's (if so, the outer set cannot be a proper subset), and whether the outer set has fewer items than the nested set.

Once those checks are complete, it begins with the first item of the outer set and compares it with the first item of the nested set. If they are not equal, it checks the next item of the nested set. If they are equal, it adds one to a counter and then compares the next item of the outer set with where it left off in the nested set.

If it reaches a point where the number of matched comparisons equals the number of items in the outer set, then the outer set has been found to be a proper subset of the nested set. It is flagged for exclusion, and the comparisons are halted.
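The walk described above can be sketched as follows, assuming each set is stored as a sorted list of distinct items (names are illustrative):

```python
def is_proper_subset(outer, inner):
    """Walk both sorted lists in step, resuming in inner where we left off."""
    if len(outer) >= len(inner):
        return False  # same set, or outer too big to be a proper subset
    matched = 0
    j = 0
    for x in outer:
        while j < len(inner) and inner[j] < x:
            j += 1  # advance inner until it catches up with x
        if j == len(inner) or inner[j] != x:
            return False  # x is missing from inner
        matched += 1
        j += 1
    # outer is a proper subset when every one of its items matched
    return matched == len(outer)

def eliminate_proper_subsets(sets):
    sets = [sorted(s) for s in sets]
    return [s for s in sets
            if not any(s is not t and is_proper_subset(s, t) for t in sets)]

print(eliminate_proper_subsets([{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, set()]))
# -> [[1, 2, 3], [2, 4]]
```

As in the answer's caveat, duplicates survive this check (neither copy is a *proper* subset of the other), so distinctness would need to be enforced separately.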