I have a collection of unique sets (represented as bit masks) and would like to eliminate all elements that are proper subsets of another element. For example:
input = [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, {}]
output = [{1, 2, 3}, {2, 4}]
I have not been able to find a standard algorithm for this, or even a name for this problem, so I am calling it "maximal subsets" for lack of anything else. Here is an O(n^2) algorithm (in Python for concreteness), assuming is_subset_func is O(1):[1]
```python
def eliminate_subsets(a, cardinality_func, is_subset_func):
    out = []
    for element in sorted(a, reverse=True, key=cardinality_func):
        for existing in out:
            if is_subset_func(element, existing):
                break
        else:
            out.append(element)
    return out
```
Is there a more efficient algorithm, hopefully O(n log n) or better?
[1] For bit masks of constant size, as is true in my case, is_subset_func is just element & existing == element, which runs in constant time.
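For concreteness, here is the quadratic algorithm above specialized to plain integer bit masks, as a minimal runnable sketch (popcount plays the role of cardinality_func, and m & kept == m is is_subset_func):

```python
def eliminate_subsets(masks):
    """Keep only masks that are not subsets of another kept mask."""
    out = []
    # Largest first: a set can only be a subset of a set at least as big.
    for m in sorted(masks, reverse=True, key=lambda x: bin(x).count("1")):
        if not any(m & kept == m for kept in out):
            out.append(m)
    return out

# {1,2,3}, {1,2}, {2,3}, {2,4}, {} encoded with bit i standing for element i
print(eliminate_subsets([0b1110, 0b0110, 0b1100, 0b10100, 0b0]))
# -> [14, 20], i.e. {1, 2, 3} and {2, 4}
```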
Suppose you label all the input sets:

A = {1, 2, 3}
B = {1, 2}
C = {2, 3}
D = {2, 4}
E = {}

Now build intermediate sets, one per element in the universe, containing the labels of the sets where it appears:

I(1) = {A, B}
I(2) = {A, B, C, D}
I(3) = {A, C}
I(4) = {D}

Now for each input set compute the intersection of all the label sets of its elements. For A that is

I(1) intersect I(2) intersect I(3) = {A}   (*)

If the intersection contains some label other than the one for the set itself, then it's a subset of that set. Here there is no other label, so the answer is no: A is maximal. But for B the intersection is I(1) intersect I(2) = {A, B}, which contains A, so B is a proper subset of A and can be eliminated; likewise C. For D, I(2) intersect I(4) = {D}, so D is also maximal.
The cost of this method depends on the implementation of sets. Suppose bitmaps (as you hinted). Say there are n input sets of maximum size m, and |U| items in the universe. Then the intermediate set construction produces |U| sets of n bits each, so there is O(|U|n) time to initialize them. Setting the bits requires O(nm) time. Computing each intersection as at (*) above requires O(mn); O(mn^2) for all. Putting these together we have O(|U|n) + O(nm) + O(mn^2) = O(|U|n + mn^2). Using the same conventions, your "all pairs" algorithm is O(|U|^2 n^2). Since m <= |U|, this algorithm is asymptotically faster. It's likely to be faster in practice as well because there's no elaborate bookkeeping to add constant factors.
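As a minimal sketch, the label-intersection method might look as follows in Python, with dicts of label sets standing in for the bitmaps (names are illustrative; the empty set is skipped up front since it is a subset of everything, and inputs are assumed unique as in the question):

```python
def maximal_sets(sets):
    """Keep the sets whose label intersection contains only their own label."""
    labeled = list(enumerate(sets))
    # I[e] = labels of all input sets containing element e (the reverse map)
    I = {}
    for lab, s in labeled:
        for e in s:
            I.setdefault(e, set()).add(lab)
    out = []
    for lab, s in labeled:
        if not s:
            continue  # the empty set is a subset of everything
        inter = set.intersection(*(I[e] for e in s))
        if inter == {lab}:  # no other label survived: s is maximal
            out.append(s)
    return out

print(maximal_sets([{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, set()]))
# -> [{1, 2, 3}, {2, 4}]
```

Note the uniqueness assumption matters: two identical input sets would each see the other's label in the intersection and both be dropped.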
Addition: Online Version
The OP asked if there is an online version of this algorithm, i.e. one where the set of maximal sets can be maintained incrementally as input sets arrive one-by-one. The answer seems to be yes. The intermediate sets tell us quickly if a new set is a subset of one already seen. But how to tell quickly if it's a superset? And, if so, of which existing maximal sets? For in this case those maximal sets are no longer maximal and must be replaced by the new one.
The key is to note that A is a superset of B iff A' is a subset of B' (the tick ' denoting set complement).

Following this inspiration, we maintain the intermediate sets as before. When a new input set S arrives, do the same test as described above: let I(e) be the intermediate set for input element e. Then this test is

|I(e1) intersect I(e2) intersect ... intersect I(ek)| > 0, where S = {e1, e2, ..., ek}.

(In this case it's greater than zero rather than one as above because S is not yet in I.) If the test succeeds, then the new set is a (possibly improper) subset of an existing maximal set, so it can be discarded. Otherwise we must add S as a new maximal set, but before doing this, compute

Y = I'(f1) intersect I'(f2) intersect ... = (I(f1) union I(f2) union ...)', where S' = {f1, f2, ...},

where again the tick ' is set complement. The union form may be a bit faster to compute. Y contains the maximal sets that have been superseded by S. They must be removed from the maximal collection and from I. Finally, add S as a maximal set and update I with S
's elements.

Let's work through our example. When A arrives, we add it to I and have

I(1) = {A}, I(2) = {A}, I(3) = {A}

When B arrives, we find X = {A} intersect {A} = {A}, so throw B away and continue. The same happens for C. When D arrives we find X = {A} intersect {} = {}, so continue with Y = I'(1) intersect I'(3) = {} intersect {} = {}. This correctly tells us that maximal set A is not contained in D, so there is nothing to delete. But D must be added as a new maximal set, and I becomes

I(1) = {A}, I(2) = {A, D}, I(3) = {A}, I(4) = {D}

The arrival of E causes no change. Posit the arrival then of a new set F = {2, 3, 4, 5}. We find

X = I(2) intersect I(3) intersect I(4) intersect I(5) = {A, D} intersect {A} intersect {D} intersect {} = {},

so we cannot throw F away. Continue with

Y = I'(1) = {D}.

This tells us D is a subset of F, so it should be discarded while F is added, leaving

I(1) = {A}, I(2) = {A, F}, I(3) = {A, F}, I(4) = {F}, I(5) = {F}

The computation of the complements is both tricky and nice due to the algorithm's online nature. The universe for input complements need only include input elements seen so far. The universe for intermediate sets consists only of tags of sets in the current maximal collection. For many input streams the size of this set will stabilize or decrease over time.
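A sketch of this online version in Python (class and variable names are illustrative; the complements I' and S' are taken implicitly, by subtracting from the current label set and the universe seen so far, rather than being materialized):

```python
class MaximalSets:
    """Incrementally maintain the maximal sets and the reverse map I."""

    def __init__(self):
        self.maximal = {}      # label -> maximal set
        self.I = {}            # element -> labels of maximal sets containing it
        self.universe = set()  # input elements seen so far
        self.next_label = 0

    def add(self, s):
        s = set(s)
        self.universe |= s
        labels = set(self.maximal)
        # Subset test: X = intersection of I(e) over e in s.
        X = set(labels)
        for e in s:
            X &= self.I.get(e, set())
        if X:
            return  # s is a (possibly improper) subset of an existing maximal set
        # Superset test: Y = labels appearing in no I(e) for e outside s,
        # i.e. the union form (I(f1) union I(f2) union ...)' from the text.
        Y = set(labels)
        for e in self.universe - s:
            Y -= self.I.get(e, set())
        for lab in Y:  # these maximal sets are superseded by s
            for e in self.maximal.pop(lab):
                self.I[e].discard(lab)
        lab, self.next_label = self.next_label, self.next_label + 1
        self.maximal[lab] = s
        for e in s:
            self.I.setdefault(e, set()).add(lab)

ms = MaximalSets()
for s in [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, set(), {2, 3, 4, 5}]:
    ms.add(s)
print(sorted(map(sorted, ms.maximal.values())))
# -> [[1, 2, 3], [2, 3, 4, 5]]
```

Note that the empty set falls out of the subset test for free: with no elements to intersect over, X stays equal to the full label set, which is nonempty whenever any maximal set exists.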
I hope this is helpful.
Summary

The general principle at work here is a powerful idea that crops up often in algorithm design: the reverse map. Whenever you find yourself doing a linear search to find an item with a given attribute, consider building a map from the attribute back to the item. Often it is cheap to construct this map, and it strongly reduces search time. The premier example is a permutation map p[i] that tells you what position the i'th element will occupy after an array is permuted. If you need to find the item that ends up in a given location a, you must search p for a, a linear-time operation. On the other hand, an inverse map pi such that pi[p[i]] == i takes no longer to compute than does p (so its cost is "hidden"), but pi[a] produces the desired result in constant time.

Implementation by Original Poster
Off the top of my head, there is an O(D*N*log(N)) algorithm, where D is the number of unique numbers.
The recursive function "helper" works as follows. Its arguments are the sets and the domain (the number of unique numbers in the sets).

Base cases:

Iterative case:
Note that the runtime depends on the Set implementation used. If a doubly linked list is used to store each set, then:

Steps 1-5 and 7 take O(N). Step 6's union is O(N*log(N)) by sorting and then merging.

Therefore the overall algorithm is O(D*N*log(N)).
Here is Java code to perform the following:
*New years is disruptive
This problem has been studied in the literature. Given S_1, ..., S_k, which are subsets of {1, ..., n}, Yellin [1] gave an algorithm to find the maximal subsets of {S_1, ..., S_k} in time O(kdm), where d is the average size of the S_i and m is the cardinality of the maximal subset of {S_1, ..., S_k}. This was later improved for some range of parameters by Yellin and Jutla [2] to O((kd)^2/sqrt(log(kd))). It is believed that a truly sub-quadratic algorithm for this problem does not exist.
[1] Daniel M. Yellin: Algorithms for Subset Testing and Finding Maximal Sets. SODA 1992: 386-392.
[2] Daniel M. Yellin, Charanjit S. Jutla: Finding Extremal Sets in Less than Quadratic Time. Inf. Process. Lett. 48(1): 29-34 (1993).
Pre-process assumptions:

Approach #2 - Use a bucket approach

Same assumptions. Can uniqueness be assumed? (i.e., there is not {1,4,6}, {1,4,6}) Otherwise, you would need to check for distinctness at some point, probably once the buckets are created.

Semi-pseudocode:
Approach #1 (O(n(n+1)/2)) ... not efficient enough

Semi-pseudocode:
This takes the set of sets and iterates through each one. For each outer set, it iterates through every set in the collection again. As the nested iteration takes place, it checks whether the outer set is the same as the nested set (if so, no checking is done), whether the outer set's total is greater than the nested set's (if so, the outer set cannot be a proper subset), and whether the outer set has fewer items than the nested set.

Once those checks are complete, it begins with the first item of the outer set and compares it with the first item of the nested set. If they are not equal, it checks the next item of the nested set. If they are equal, it adds one to a counter and then compares the next item of the outer set with where it left off in the nested set.

If it reaches a point where the number of matched comparisons equals the number of items in the outer set, then the outer set has been found to be a proper subset of the nested set. It is flagged for exclusion, and the comparisons are halted.
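The walk described above can be sketched as follows, assuming each set is stored as a sorted list of distinct items (names are illustrative):

```python
def is_proper_subset(outer, inner):
    """Walk both sorted lists in step, resuming in inner where we left off."""
    if len(outer) >= len(inner):
        return False  # same set, or outer too big to be a proper subset
    matched = 0
    j = 0
    for x in outer:
        while j < len(inner) and inner[j] < x:
            j += 1  # advance inner until it catches up with x
        if j == len(inner) or inner[j] != x:
            return False  # x is missing from inner
        matched += 1
        j += 1
    # outer is a proper subset when every one of its items matched
    return matched == len(outer)

def eliminate_proper_subsets(sets):
    sets = [sorted(s) for s in sets]
    return [s for s in sets
            if not any(s is not t and is_proper_subset(s, t) for t in sets)]

print(eliminate_proper_subsets([{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, set()]))
# -> [[1, 2, 3], [2, 4]]
```

As in the answer's caveat, duplicates survive this check (neither copy is a *proper* subset of the other), so distinctness would need to be enforced separately.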