I'm looking for an algorithm to solve the following problem. I have a number of subsets (1-n) of a given set (a-h). I want to find the smallest collection of subsets that will allow me to construct, by combination, all of the given subsets. This collection can contain subsets that do not exist in 1-n yet.
a b c d e f g h
1 1
2 1 1
3 1 1 1
4 1 1
5 1 1
6 1 1 1 1
7 1 1 1 1
8 1 1 1
9 1 1 1
Below are two possible collections, the smaller of which contains seven subsets. I have denoted new subsets with an x.
1 1
x 1
x 1
x 1
x 1
x 1
x 1
x 1
1 1
x 1
x 1
x 1
x 1
x 1 1
x 1
I believe this must be a known problem, but I'm not very familiar with algorithms. Any help is very much appreciated, as is a suggestion for a better topic title.
Thanks!
Update
Graph coloring gets me a long way, thanks. However, in my case subsets are allowed to overlap. For example:
a b c d
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1
5 1 1 1 1
Graph coloring gives me this solution:
x 1 1
x 1
x 1
But this one is valid as well, and is smaller:
1 1 1 1
4 1 1
This problem is known as Set Basis, and it is NP-complete (Larry J. Stockmeyer: The set basis problem is NP-complete. Technical Report RC-5431, IBM, 1975). Its formulation as a graph problem is Bipartite Dimension. Since it is very hard to solve in general, it might be useful to look if there are any helpful properties of your data (e.g., are the sets small? Is the solution small? Can all sets occur?)
I cannot think of an easy ILP formulation. Instead, you could try to reduce the problem to Clique Cover, which is better studied, using either the reduction from Kou & Wong or the one from Nor et al. I have coauthored a paper discussing algorithms for Clique Cover, and written a Clique Cover solver with both an exact solver and two heuristics.
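For very small instances, the definition of Set Basis can also be checked directly by exhaustive search. The sketch below only illustrates the problem statement; it is not the Clique Cover reduction mentioned above. The function names `is_basis` and `smallest_basis` and the three-set example instance are made up for this illustration. It relies on one observation: a target can be built from a collection exactly when the union of all collection members contained in the target equals the target.

```python
from itertools import combinations

def is_basis(basis, targets):
    """True if every target equals the union of the basis sets it contains."""
    for t in targets:
        covered = frozenset().union(*(b for b in basis if b <= t))
        if covered != t:
            return False
    return True

def smallest_basis(targets):
    """Exhaustive search for a minimum basis. Exponential; tiny inputs only."""
    targets = [frozenset(t) for t in targets]
    # Only non-empty subsets of some target can ever be useful basis sets.
    candidates = set()
    for t in targets:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            candidates.update(frozenset(c) for c in combinations(items, k))
    candidates = sorted(candidates, key=lambda s: (len(s), sorted(s)))
    # The distinct targets themselves always form a basis, which bounds the size.
    upper = len(set(targets))
    for size in range(1, upper + 1):
        for combo in combinations(candidates, size):
            if is_basis(combo, targets):
                return list(combo)
    return list(set(targets))

# Hypothetical instance with overlapping sets (not taken from the question's tables).
example = ["abc", "cd", "abcd"]
print(len(smallest_basis(example)))
```

On this instance the search returns a two-set basis, {a, b, c} and {c, d}, whereas any basis made of pairwise disjoint sets needs three. The running time grows exponentially with the size of the sets, which is consistent with the problem being NP-complete.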
This problem was shown in one of the videos of Coursera's Discrete Optimization lectures. IIRC, it's called the set cover problem.
IIRC, it's NP-complete or NP-hard, so look into the typical algorithms (exact algorithms for small datasets, metaheuristics for medium/big datasets) and typical frameworks (OptaPlanner, ...)
For this variant of the Set Cover problem, here is an Integer Programming formulation approach, with row generation.
Let's denote the components a,b,c,d... by their Column number. a=1, b=2 etc.
The rows are 'subsets.' Let's say that the EXISTING subsets are S1,...Sm. (These are the ones that HAVE to be covered.)
Notation for NEW subsets
This is the step where we introduce NEW subsets.
Let's call the 'atomic' subsets a_x. All a subsets have only one component.
a1 is the subset {1,0,0,0}
a2 is the subset {0,1,0,0}
a3 is the subset {0,0,1,0}
...
Let bxy be subsets with two components.
So `b13` is the subset with components 1 and 3 present.
b13 = {1, 0, 1, 0}
b34 = {0, 0, 1, 1} etc.
cxyz are subsets with three components.
For example, c124 = { 1, 1, 0, 1} etc.
d subsets will have 4 components
e subsets will have 5 components
and so on.
Row Generation Step
Given an EXISTING set, we generate only the NEW a, b, c, ... subsets that are needed.
For example, let's take the subset S1 = {1, 0, 1, 1}
Meaningful sets needed that can help create S1 are
a1, a3, a4. (Note that a2 is not needed since component b is not a component in S1.)
b13, b14, b34.
c134
PREPROCESSING STEP: For each Sj in EXISTING SETS, generate new subsets using the procedure mentioned above. We create only as many ax, bxy, cxyz, dxyzw, ... as needed. This step is needed before the formulation step.
In the worst case, there are (2^num_components-1) subsets needed per Sj. But they are easy to generate.
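The generation step can be sketched in a few lines of Python. This is only an illustration of the naming scheme described above; the function name `candidate_subsets` is made up, and the single-digit naming (a1, b13, c134, ...) becomes ambiguous once there are ten or more columns.

```python
from itertools import combinations
from string import ascii_lowercase

def candidate_subsets(indicator):
    """Return {name: 0/1 vector} for every non-empty subset of one existing set."""
    n = len(indicator)
    # 1-based column numbers of the components present in the set.
    present = [i + 1 for i, bit in enumerate(indicator) if bit]
    result = {}
    for k in range(1, len(present) + 1):
        prefix = ascii_lowercase[k - 1]  # a = 1 component, b = 2, c = 3, ...
        for combo in combinations(present, k):
            name = prefix + "".join(str(col) for col in combo)
            result[name] = [1 if col in combo else 0 for col in range(1, n + 1)]
    return result

# S1 = {1, 0, 1, 1} from the example above.
cands = candidate_subsets([1, 0, 1, 1])
print(list(cands))  # ['a1', 'a3', 'a4', 'b13', 'b14', 'b34', 'c134']
```

For S1 this yields 2^3 - 1 = 7 candidates, matching the worst-case count, since S1 has three components.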
Example Problem
Now the formulation for the following problem:
a b c d
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1
5 1 1 1 1
We will have one constraint for each ROW. Each set has to be "covered"
For the problem above, here's the formulation
Formulation
Objective Minimize sum of all Subsets.
Min sum (a_x) + sum (b_xy) + sum (c_xyz) + sum (d_xyzw)
Subject to:
a1 + a2 + a3 + b12 + b13 + b23 + c123 >= 1 \\ Set 1 has to be formed
a1 + a2 + a3 + b12 + b13 + b23 + c123 >= 1 \\ Set 2 has to be formed
a1 + a2 + a3 + b12 + b13 + b23 + c123 >= 1 \\ Set 3 has to be formed
a3 + a4 + b34 >= 1 \\ Set 4 has to be formed
a1 + a2 + a3 + a4 + b12 + b13 + ..+ b34 + c123 + ...+ d1234 >= 1 \\ Set 5 has to be formed
a's, b's, c's, d's Binary
Upper bound: By exploiting the fact that you need at most j subsets (Number of existing Subsets) you can even add a cut. Objective function has to be j or lower.
Hope that helps.