There is a set S containing N integers, each with value 1 <= X <= 10^6. The problem is to partition the set S into k partitions. The value of a partition is the sum of the elements present in it. The partitioning should be done in such a way that the total value of the set S is fairly distributed amongst the k partitions. The mathematical meaning of "fair" also needs to be defined (e.g. the objective could be to minimize the standard deviation of the values of the partitions from the average value of the set S, which is sum(S)/k).
e.g. S = {10, 15, 12, 13, 30, 5}, k=3
A good partitioning would be {30}, {10, 15}, {12, 13, 5}
A bad partitioning would be {30, 5}, {10, 15}, {12, 13}
The first question is how to mathematically express the condition for one partitioning to be better than another. The second question is how to solve the problem. The problem is NP-hard. Are there any heuristics?
In the problem that I am trying to solve, N <= (k*logX)^2 and k varies from 2 to 7.
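To make the standard-deviation notion of fairness concrete, here is a minimal Python sketch (the function name `fairness` is mine) scoring the two example partitionings; the "good" one scores lower:

```python
from statistics import pstdev

def fairness(partitions):
    """Population standard deviation of the partition sums; 0 means perfectly fair."""
    return pstdev([sum(p) for p in partitions])

good = [[30], [10, 15], [12, 13, 5]]   # sums: 30, 25, 30
bad = [[30, 5], [10, 15], [12, 13]]    # sums: 35, 25, 25

print(round(fairness(good), 3))  # 2.357
print(round(fairness(bad), 3))   # 4.714
```

Lower is better, so under this metric the first partitioning is preferred, matching the intuition above.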
==================================================================================
Based on other related SO questions, there are two reasonable functions to evaluate a distribution:
a) Minimize the value of the partition with the maximum value.
On second thought, this is not a good metric. Consider a set {100, 40, 40} to be partitioned into three subsets. This metric does not distinguish between the following two distributions, even though one is clearly better than the other:
Distribution 1: {100}, {40}, {40} and Distribution 2: {100}, {40, 40}, {}
b) Minimize the maximum difference between the sums of any two subsets in the partition, i.e. minimize max|A - B|, where A and B are the sums of any two subsets.
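The {100, 40, 40} example shows the contrast between the two metrics; a minimal Python sketch (helper names are mine):

```python
def max_partition_sum(partitions):
    """Metric (a): the largest partition sum."""
    return max(sum(p) for p in partitions)

def max_pairwise_diff(partitions):
    """Metric (b): max |A - B| over the partition sums, i.e. max sum minus min sum."""
    sums = [sum(p) for p in partitions]
    return max(sums) - min(sums)

dist1 = [[100], [40], [40]]
dist2 = [[100], [40, 40], []]

print(max_partition_sum(dist1), max_partition_sum(dist2))  # 100 100 -- metric (a) cannot tell them apart
print(max_pairwise_diff(dist1), max_pairwise_diff(dist2))  # 60 100  -- metric (b) prefers dist1
```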
One heuristic would be to spread the larger weights among the bags as evenly as possible, leaving enough smaller weights that you're left with a subproblem with a large number of degrees of freedom. Repeat on sub-subproblems if necessary. This heuristic assumes your distribution is not too geometric, e.g. {1000} vs. {100, 10, 1}, and loosely assumes that your penalty function penalizes nil-assignments and very large outliers. For example:
Let the metric be to minimize max(sum(si) - sum(sj)), where si and sj are any two subsets in the resulting partition of the set S.
Let's say we have a distribution D and we need to add another element x to it. Add x to the subset s such that the metric above is minimized.
I could not prove any bounds, but intuition says this will give a good approximation to the optimal. Is anyone good at proving bounds?
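A Python sketch of this incremental rule (function and variable names are mine; processing larger weights first, as the heuristic above suggests): each element is tried in every subset and placed where max(sum(si)) - min(sum(sj)) ends up smallest.

```python
def spread(values, k):
    """Place each value into the subset that minimizes
    max(sum(si)) - min(sum(sj)) after the placement."""
    subsets = [[] for _ in range(k)]

    def metric(ss):
        sums = [sum(s) for s in ss]
        return max(sums) - min(sums)

    for x in sorted(values, reverse=True):  # larger weights first
        best = None
        for s in subsets:
            s.append(x)            # try placing x here
            score = metric(subsets)
            s.pop()                # undo the trial placement
            if best is None or score < best[0]:
                best = (score, s)
        best[1].append(x)          # commit the best placement
    return subsets

parts = spread([10, 15, 12, 13, 30, 5], 3)
print(sorted(sum(p) for p in parts))  # [25, 30, 30]
```

On the example from the question this recovers the "good" partitioning's sums of 30, 30 and 25.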
I think a good metric will be:

f(sol) = sigma(MAX - sum(si)) over all subsets si, where MAX = max{sum(si)}

the upside: a perfect distribution will always yield 0!
the downside: if there is no perfect solution, the best result will not yield 0.
a greedy heuristic for this problem will be:

iterate over the elements of S, largest first, adding each element to find_min()

where find_min() yields the subset s such that sum(s) <= sum(si) for each si.
This solution will yield f (the metric defined above) such that:

f(sol) <= (k-1)*max{S}
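This greedy can be sketched in Python (a minimal sketch; the names are mine, and the descending order is the usual longest-first convention — the bound below does not depend on it):

```python
def greedy_partition(values, k):
    """Greedy: always add the next value to the subset with the minimal sum."""
    subsets = [[] for _ in range(k)]

    def find_min():
        # the subset s with sum(s) <= sum(si) for every si
        return min(subsets, key=sum)

    for x in sorted(values, reverse=True):
        find_min().append(x)
    return subsets

def f(subsets):
    """Metric: sigma(MAX - sum(si)), where MAX = max{sum(si)}; 0 iff perfect."""
    sums = [sum(s) for s in subsets]
    return sum(max(sums) - t for t in sums)

S = [10, 15, 12, 13, 30, 5]
sol = greedy_partition(S, 3)
print(f(sol))                      # 5 (sums are 30, 30, 25)
assert f(sol) <= (3 - 1) * max(S)  # the (k-1)*max{S} bound
```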
(From here on is a proof of this bound.)

Claim: for each subset s in the temporary solution, MAX - sum(s) <= max{S}, where MAX = max{sum(si)}.

Proof, by induction: at each step, the claim holds for the temporary solution. In each step, let MAX be max{sum(si)} at the start of the iteration (before the add). The new element x is added to the minimal subset, so after the add either MAX is unchanged (and no deficit MAX - sum(si) grew), or the minimal subset becomes the new maximum, in which case for every other subset si: MAX - sum(si) = min{sum(sj)} + x - sum(si) <= x <= max{S}.

Because for each subset MAX - sum(si) <= max{S} (and obviously, for the max subset, MAX - sum(si) = 0), overall:

sigma(MAX - sum(si)) <= (k-1)*max{S}
as promised.

EDIT:
I had some spare time, so I programmed both heuristics (mine and the one suggested by @Akhil) and both metrics. First of all, the results are conclusive (according to Wilcoxon's paired test), but which heuristic is better depends on which metric you choose: surprisingly, the algorithm that tried to minimize f() (@Akhil's) scored lower for this same f, but higher for the second metric.
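A sketch of such a comparison harness (Python; random data and function names are my own — this is not the author's actual experiment, and collecting the two scores per instance is what a paired significance test would then be run on):

```python
import random

def greedy_min_sum(values, k):
    """The greedy from this answer: each value goes to the subset
    with the minimal current sum (find_min)."""
    sums = [0] * k
    for x in sorted(values, reverse=True):
        sums[sums.index(min(sums))] += x
    return sums

def greedy_min_spread(values, k):
    """@Akhil's rule: each value goes where max(sums) - min(sums)
    stays smallest after the placement."""
    sums = [0] * k
    for x in sorted(values, reverse=True):
        def spread_if(i):
            trial = sums[:]
            trial[i] += x
            return max(trial) - min(trial)
        sums[min(range(k), key=spread_if)] += x
    return sums

def f(sums):          # the metric from this answer
    return sum(max(sums) - t for t in sums)

def max_diff(sums):   # metric (b) above
    return max(sums) - min(sums)

random.seed(0)
for _ in range(200):
    vals = [random.randint(1, 10**6) for _ in range(20)]
    k = random.randint(2, 7)
    a = greedy_min_sum(vals, k)
    b = greedy_min_spread(vals, k)
    assert f(a) <= (k - 1) * max(vals)    # the proved bound holds
    assert sum(a) == sum(b) == sum(vals)  # both are valid partitions
```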