fair partitioning of set S into k partitions

Posted 2019-02-15 11:20

Question:

There is a set S containing N integers, each with value 1 <= X <= 10^6. The problem is to partition S into k partitions. The value of a partition is the sum of the elements in it. The partitioning should be done so that the total value of S is fairly distributed among the k partitions. The mathematical meaning of "fair" also needs to be defined (e.g. the objective could be to minimize the standard deviation of the partition values from the average value of S, which is sum(S)/k).

e.g. S = {10, 15, 12, 13, 30, 5}, k=3

A good partitioning would be {30}, {10, 15}, {12, 13, 5}

A bad partitioning would be {30, 5}, {10, 15}, {12, 13}

The first question is how to mathematically express the condition for one partitioning to be better than another. The second question is how to solve the problem. The problem is NP-hard. Are there any heuristics?

In the problem that I am trying to solve, N <= (k*log X)^2 and k varies from 2 to 7.

==================================================================================

Based on other related SO questions, there are two reasonable functions to evaluate a distribution:

a) Minimize the value of the partition with the maximum value.

On second thought, this is not a good metric. Consider a set {100, 40, 40} to be partitioned into three subsets. This metric does not distinguish between the following two distributions, even though one is clearly better than the other.

Distribution 1: {100}, {40}, {40} and Distribution 2: {100}, {40, 40}, {}

b) Minimize the maximum difference between the values of any two subsets in a given partitioning, i.e. minimize max|A-B| over all subset values A, B.
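To make the comparison concrete, here is a small sketch (the helper names metric_a and metric_b are just for illustration) that scores the two distributions of {100, 40, 40} above under both candidate metrics:

def metric_a(partition):
    # Metric (a): value of the subset with the maximum value.
    return max(sum(subset) for subset in partition)

def metric_b(partition):
    # Metric (b): largest difference between the values of any two subsets.
    values = [sum(subset) for subset in partition]
    return max(values) - min(values)

d1 = [[100], [40], [40]]
d2 = [[100], [40, 40], []]

print(metric_a(d1), metric_a(d2))  # 100 100 -> metric (a) cannot tell them apart
print(metric_b(d1), metric_b(d2))  # 60 100  -> metric (b) prefers d1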

Answer 1:

I think a good metric will be:

let the result set be s1, s2, ..., sk
let MAX be max{sum(si)} over all i
f({s1, ..., sk}) = Sigma over all i of (MAX - sum(si))

The upside: a perfect distribution always yields 0!
The downside: if there is no perfect solution, the best result will not yield 0.
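In code, this metric could read as follows (a small sketch; metric_f is just an illustrative name):

def metric_f(partition):
    # f = Sigma over i of (MAX - sum(si)), where MAX is the largest subset value.
    values = [sum(subset) for subset in partition]
    largest = max(values)
    return sum(largest - v for v in values)

print(metric_f([[30], [10, 15], [12, 13, 5]]))  # 5  (the "good" partitioning from the question)
print(metric_f([[30, 5], [10, 15], [12, 13]]))  # 20 (the "bad" partitioning from the question)
print(metric_f([[6], [3, 3], [4, 2]]))          # 0  (a perfect distribution)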

a greedy heuristic for this problem will be:

sorted<-sort(S) (let's say sorted[0] is the highest)
s1=s2=...=sk= {}
for each x in sorted:
   s <- find_min() (*)
   s.add(x)

where find_min() yields s such that sum(s) <= sum(si) for each si.
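Here is a direct Python translation of this greedy (a sketch; find_min() is realized with a min-heap keyed by the current subset sums):

import heapq

def greedy_partition(S, k):
    # Place each element, largest first, into the currently lightest subset.
    heap = [(0, i) for i in range(k)]         # (current sum, subset index); root = lightest
    heapq.heapify(heap)
    subsets = [[] for _ in range(k)]
    for x in sorted(S, reverse=True):         # sorted[0] is the highest
        current_sum, i = heapq.heappop(heap)  # find_min()
        subsets[i].append(x)                  # s.add(x)
        heapq.heappush(heap, (current_sum + x, i))
    return subsets

print(greedy_partition([10, 15, 12, 13, 30, 5], 3))
# [[30], [15, 10, 5], [13, 12]] -> sums 30, 30, 25, so f = 0 + 0 + 5 = 5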

This solution yields f (the metric defined above) such that f(sol) <= (k-1)*max{S} (a proof of this bound follows):


claim: for each subset s, MAX - sum(s) <= max{S}
proof - by induction: at each step, the claim holds for the partial solution.
In each step, let MAX be max{sum(si)} at the start of the iteration (before the add).

base: the set of subsets at start is {},{},.. MAX=sum(si)=0 for each si. 
step: assume the claim holds for iteration i; we'll show it also holds for iteration i+1:
let s be the set that x was added to, then MAX-sum(s) <= max{S} (induction assumption).
if sum(s) + x <= MAX: we are done, MAX was not changed.
else: we sorted the elements at the start, so x <= max{S}. Since s was chosen
   (sum(si) >= sum(s) for each si) and sum(s) + x > MAX, we get for each si:
   (sum(s) + x) - sum(si) <= x <= max{S}. Since sum(s) + x will be the MAX in the next
   iteration, we are done.

Because for each set MAX - sum(si) <= max{S} (and obviously, for the maximal set, MAX - sum(si) = 0), overall Sigma(MAX - sum(si)) <= (k-1)*max{S}, as promised.

EDIT:
I had some spare time, so I programmed both heuristics (mine and the one suggested by @Akhil) and both metrics. First of all, the results are conclusive (according to a Wilcoxon paired test), but which heuristic is better depends on which metric you choose: surprisingly, the algorithm that tries to minimize f() (@Akhil's) scored lower on this same f, but higher on the second metric.



Answer 2:

One heuristic would be to spread the larger weights among the bags as evenly as possible, leaving enough smaller weights that you're now left with a subproblem with a large number of degrees of freedom. Repeat into sub-subproblems if necessary. This heuristic assumes your distribution is not too geometric, e.g. {1000} and {100, 10, 1}, and slightly assumes that your penalty function will penalize nil-assignments or very large outliers.

For example:

from copy import deepcopy

def distributeFairly(numbers, bins):
    k = len(bins)

    def distributeFairlySubproblem(numbers, bins):
        n = len(numbers)
        if n == 0:
            return bins                           # base case: everything has been placed
        numElementsToDefer = min(n // 3, 20 * k)  # modify as appropriate, e.g. to avoid len(toPlace) < len(toDefer)
        if numElementsToDefer == 0:               # too few elements left to defer any further
            return shoveThemIn(numbers, deepcopy(bins))

        toDefer = numbers[-numElementsToDefer:]   # smallest remaining numbers, handled in the subproblem
        toPlace = numbers[:-numElementsToDefer]   # largest remaining numbers, placed now

        newBins = shoveThemIn(toPlace, deepcopy(bins))
        return distributeFairlySubproblem(toDefer, newBins)

    initialGuess = distributeFairlySubproblem(sorted(numbers, reverse=True),
                                              [[] for _ in range(k)])
    return anneal(initialGuess)
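shoveThemIn and anneal are deliberately left abstract above; the following minimal placeholders (my own assumption: drop each number into the currently lightest bin, and skip the refinement pass) make the sketch runnable end to end:

def shoveThemIn(numbers, bins):
    # Placeholder: greedily drop each number into the currently lightest bin.
    for x in numbers:
        min(bins, key=sum).append(x)
    return bins

def anneal(bins):
    # Placeholder for the final refinement pass (e.g. simulated annealing or
    # local swaps between bins); returning the guess unchanged keeps it runnable.
    return bins

print(distributeFairly([10, 15, 12, 13, 30, 5], [[], [], []]))
# e.g. [[30], [15, 10, 5], [13, 12]]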


Answer 3:

Let the metric be to minimize max(sum(si) - sum(sj)) where si and sj are any two subsets in the resulting partition of the set S.

Let's say we have a distribution D and we need to add another element x to D. Add it to the subset s such that the metric above is minimized.

I could not prove any bounds, but intuition says it will give a good approximation to the optimum. Anyone good at proving bounds?
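A small sketch of this incremental rule (the element order, largest first, and the function names are my own choices, not part of the answer):

def add_element(distribution, x):
    # Place x into the subset that minimizes max(sum(si)) - min(sum(sj)) afterwards.
    best_index, best_spread = None, None
    for i in range(len(distribution)):
        sums = [sum(s) for s in distribution]
        sums[i] += x
        spread = max(sums) - min(sums)
        if best_spread is None or spread < best_spread:
            best_index, best_spread = i, spread
    distribution[best_index].append(x)

def build_distribution(S, k):
    distribution = [[] for _ in range(k)]
    for x in sorted(S, reverse=True):
        add_element(distribution, x)
    return distribution

print(build_distribution([10, 15, 12, 13, 30, 5], 3))
# [[30], [15, 10, 5], [13, 12]] -> spread 30 - 25 = 5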