I have a large set of sets, e.g. {{2,4,5}, {4,5}, ...}.

Given one of these sets, I would like to iterate through all the other sets which are strict subsets of it. That is, if I am interested in set A, e.g. {2,4,5}, I want to find all sets B where the relative complement B \ A = {} (the empty set) and B != A. Some possibilities would be {2,4} or {2,5}, but not {2,3}.

I could of course search linearly and check each time, but I am looking for an efficient data structure, both for the larger set and the subsets (if it matters). The number of subsets is typically in the tens of thousands, but if it makes a difference I would be interested in cases where it could be in the hundreds of millions. The size of the subsets is typically in the tens.
I am programming in C++.
Thanks
I would suggest storing all of the sets in a tree. Each node of the tree would represent all sets that contain a specified initial list of integers, and would store its child branches, the sets that end at that node, and (for the filtering trick noted at the end) the size of the smallest set at or below it.
Given this tree and a set, you can do a recursive backtracking search for all subsets of that set. In your search you start with the first element of the set, look for all stored sets that contain that element, then you search for all stored sets that don't contain that element.
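No code accompanies this description, so here is a minimal Python sketch of the idea; the names (Node, insert, find_strict_subsets) are illustrative rather than from the original:

```
class Node:
    """A trie over sorted element lists: the path from the root is the
    initial list of integers shared by every set stored below this node."""
    def __init__(self):
        self.children = {}    # next (larger) element -> child Node
        self.sets_here = []   # sets whose sorted element list ends here

def insert(root, s):
    node = root
    for x in sorted(s):
        node = node.children.setdefault(x, Node())
    node.sets_here.append(s)

def find_strict_subsets(node, query, out):
    """Backtracking search: a branch labelled x can only lead to subsets
    of `query` if x is in `query`, so every other branch is pruned."""
    out.extend(s for s in node.sets_here if s != query)
    for x, child in node.children.items():
        if x in query:
            find_strict_subsets(child, query, out)

root = Node()
for s in [{2, 4, 5}, {4, 5}, {2, 4}, {2, 3}]:
    insert(root, frozenset(s))

out = []
find_strict_subsets(root, frozenset({2, 4, 5}), out)
print(out)   # {2,4} and {4,5}; {2,3} is pruned, the query itself skipped
```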
Building this tree takes time and space at most O(n * m * k), where n is the number of subsets, m is the average number of elements per subset, and k is the size of the universe of elements that can be in the sets. With random sets of sets that are much smaller than the possible universe of subsets of your k elements, you won't construct most of the tree, and it will take O(n * m) to build your tree.

In theory traversing this tree could take time O(n). But in practice you'll trim branches of the tree fairly early, and won't traverse most of the other subsets. A back of the envelope calculation suggests that if you have n random sets out of a k element universe with n << 2^k, then a search of the tree is O(n^0.5 * k). (At each integer, half the time it is in the set you're searching for subsets of and you split your search into 2, and half the time it isn't in your set and you eliminate half of your space. After j integers you've got 2^(j/2) searches going over sets of sets of size 2^(-j) * n. Thus by the time you get the searches down to single other subsets to compare with, there are O(n^0.5) searches going. The final comparison of bitmaps is O(k).)
Note: I'm convinced by this back of the envelope calculation that the average performance is o(n^(0.5+epsilon)) for every epsilon > 0, but the convergence is very slow. More precisely I suspect that the arithmetic average of the performance is n^(0.5 + O(1/sqrt(log(n)))), but that sqrt(log(n)) piece takes a long time to converge.

Note that keeping, at each node, the number of additional elements in the smallest set at or below that node lets your search trivially filter out all sets that are too large to be subsets. Depending on your dataset, this may or may not lead to useful speedups.
The approach suggested by PengOne would work, but it is not very efficient. To see why it falls short, consider the following pathological example: suppose you have a universe U with n distinct elements, and let the collection of sets you are searching over consist of all subsets of U with exactly k elements. Then no set in the collection is strictly contained in another, so in the worst case you would have to search over all n choose k possible sets! In other words, using his proposed data structure is no better than a naive linear search in the worst case.
Clearly you can do much better than this, and the correct data structure to use would be a trie: http://en.wikipedia.org/wiki/Trie
To adapt a trie to work for sets instead of just strings, it is sufficient to fix an ordering on the elements of the universal set, then encode each of your subsets as a binary string of finite length, where the ith character is 0 or 1 depending on whether the set contains the ith element. Here is an implementation in Python:
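(What follows is a sketch standing in for the implementation, not the original code; only the name query_tree is taken from the discussion below, while build_trie and encode are illustrative.)

```
from collections import Counter

def build_trie(sets):
    # Fix an ordering of the universe by descending frequency -- the
    # reordering optimization discussed below.
    freq = Counter(x for s in sets for x in s)
    order = sorted(freq, key=lambda x: (-freq[x], x))
    index = {x: i for i, x in enumerate(order)}

    def encode(s):
        # Binary string: character i is '1' iff the set contains the
        # i-th element of the fixed ordering.
        bits = ['0'] * len(order)
        for x in s:
            bits[index[x]] = '1'
        return ''.join(bits)

    trie = {}
    for s in sets:
        node = trie
        for b in encode(s):
            node = node.setdefault(b, {})
        node.setdefault('sets', []).append(s)
    return trie, encode

def query_tree(node, bits, i=0):
    """Yield every stored set whose encoding is bitwise <= bits, i.e.
    every stored subset of the query set."""
    if i == len(bits):
        yield from node.get('sets', [])
        return
    if '0' in node:                       # omitting element i: always allowed
        yield from query_tree(node['0'], bits, i + 1)
    if bits[i] == '1' and '1' in node:    # including element i: only allowed
        yield from query_tree(node['1'], bits, i + 1)  # if the query has it
```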
Now here is an example usage:
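(Again a sketch standing in for the original example, exercising the code above.)

```
sets = [frozenset(s) for s in [{2, 4, 5}, {4, 5}, {2, 4}, {2, 3}]]
trie, encode = build_trie(sets)

query = frozenset({2, 4, 5})
strict = [s for s in query_tree(trie, encode(query)) if s != query]
print(strict)   # [frozenset({4, 5}), frozenset({2, 4})]; {2,3} is excluded
```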
Note that the amount of work performed by query_tree is proportional to the size of the subtree representing the set of all results it returns. Thus our goal is to compute the average size of such a subtree, and then, as a secondary goal, to minimize that quantity. One way to do this is to reorder the elements of the universal set in descending order of frequency, so that they are repeated as few times as possible in the lower levels of the tree. This optimization is also done in the above code. A secondary optimization is to cache the trees which have already been searched to avoid having to redo unnecessary work.
EDIT: Just after I got done typing this up, I saw btilly's answer, which comes to more or less the same conclusion about the problem (modulo some technical nitpicks, which I have moved into the comments on his post).
EDIT 2: Realized that this is really just a special case of a binary decision diagram. Don't really have enough energy to fix the write up right now, so will leave it as is. Perhaps fix it tomorrow. http://en.wikipedia.org/wiki/Binary_decision_diagram
Mathematically, you should construct the Hasse diagram for your sets, which will be the partially ordered set with vertices your sets and arrows given by containment. Essentially, you want to create a directed, acyclic graph with an arrow A --> B if A strictly contains B and there is no C such that A strictly contains C and C strictly contains B.

This is actually going to be a ranked poset, meaning that you can keep track of "levels" of the digraph based on the cardinality of the sets. This is sort of like creating a hash table to jump to the right set.
From A, just do a BFS down the graph to find all proper subsets of A.

How to implement this (in pseudocode):
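(A guess at the pseudocode's shape, based on the discussion that follows: ranks, covers, and the arrow-removal rule. All names are illustrative, and it assumes the simple case where no ranks are missing.)

```
from collections import defaultdict, deque

def add_set(covers, ranks, C):
    """Insert frozenset C into the diagram. covers[X] holds every Y with
    an arrow X --> Y; ranks[r] holds every stored set of cardinality r."""
    covers[C] = set()
    for B in ranks[len(C) - 1]:    # arrows C --> B, one rank down
        if B < C:                  # strict containment
            covers[C].add(B)
    for A in ranks[len(C) + 1]:    # arrows A --> C, one rank up
        if C < A:
            covers[A] -= covers[C] # drop arrows A --> B now implied via C
            covers[A].add(C)
    ranks[len(C)].add(C)

def proper_subsets(covers, A):
    """BFS down the diagram from A, collecting everything reachable."""
    seen, queue = set(), deque(covers.get(A, ()))
    while queue:
        B = queue.popleft()
        if B not in seen:
            seen.add(B)
            queue.extend(covers[B])
    return seen

covers, ranks = {}, defaultdict(set)
for s in sorted([{5}, {4, 5}, {2, 4}, {2, 4, 5}], key=len):
    add_set(covers, ranks, frozenset(s))
print(proper_subsets(covers, frozenset({2, 4, 5})))  # {5}, {4,5}, {2,4}
```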
To make this and all the subroutines fast, you can encode each set as a binary number where digit i is 1 if i is in C and 0 otherwise. This makes testing containment and determining rank trivial.
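For instance (a small sketch of the encoding, not from the original answer):

```
def encode(s):
    """Bitmask: bit i is set iff i is in the set."""
    mask = 0
    for i in s:
        mask |= 1 << i
    return mask

A = encode({2, 4, 5})               # 0b110100
B = encode({4, 5})                  # 0b110000

is_subset = (B & A) == B            # True: B is contained in A
is_strict = is_subset and B != A    # True: the containment is strict
rank_of_A = bin(A).count('1')       # 3, the cardinality of A
```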
The above method works if you have all possible subsets. Since you may be missing some, you'll have to check more things. For the pseudocode, you'll need to change rank(C)-1 to the largest integer l < rank(C) such that some element of the HasseDiagram has rank l, and similarly for rank(C)+1. Then, when you're adding the set C to the diagram:

If A covers C, then you only need to check lower ranked sets B that are also covered by A.

If C covers B, then you only need to check higher ranked sets A that also cover B.

(By "X covers Y" I mean there is an arrow X -> Y, not just a path.)

Furthermore, when you insert C between A and B using one of the above checks, you will need to remove the arrow A --> B when you add A --> C and C --> B.

Have a look at this python library that implements Hasse diagrams: python-lattice.
This is interesting. I like the Hasse diagram approach PengOne suggests, but I think you can build the Hasse diagram really quickly using a prime number trick. Let's say the union of all of the sets is the natural numbers 1 to N. Map each of these numbers to a corresponding prime, like 1 -> 2, 2 -> 3, 3 -> 5, and so on.
Next, calculate a 'score' for each set by multiplying the primes corresponding to the numbers in the set. For instance the set {1,2,3} would have the score 2*3*5 = 30. Now, for a set A to be a subset of another set B, score(A) must divide score(B); for a proper subset, score(A) must also differ from score(B). (The scores for {1,2}, {2,3} and {1,3} are 6, 15 and 10, each of which divides 30.) Use this score to build your Hasse diagram.
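A quick sketch of the scoring idea, hardcoding the first few primes for a universe of 1 to 10:

```
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # one prime per number 1..10

def score(s):
    """Product of the primes assigned to the set's elements."""
    result = 1
    for x in s:
        result *= PRIMES[x - 1]
    return result

A, B = {1, 2, 3}, {1, 3}
print(score(A))                        # 30 = 2 * 3 * 5
print(score(B))                        # 10 = 2 * 5
print(score(A) % score(B) == 0)        # True: B is a subset of A
print(score(A) % score({2, 4}) == 0)   # False: {2,4} is not a subset
```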
Edit: This seems like one of those nice theoretical solutions; it's probably not the way to go in practice. Bitsets, as suggested by yi_H, are just as good and do not suffer from big-integer troubles.