Algorithm to split an array into P subarrays of ba

Published 2019-01-17 11:30


祖国的老花朵


Question:

I have a big array of length N, let's say something like:

2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1

I need to split this array into P subarrays (in this example, P=4 would be reasonable), such that the sum of the elements in each subarray is as close as possible to sigma, being:

sigma=(sum of all elements in original array)/P

In this example, sigma=15.

For the sake of clarity, one possible result would be:

2 4 6    7 6 3 3   3 4 3 4    4 4 3 3 1
(sums: 12,19,14,15)

I have written a very naive algorithm based on how I would do the divisions by hand, but I don't know how to impose the condition that a division whose sums are (14,14,14,14,19) is worse than one whose sums are (15,14,16,14,16).

Thank you in advance.

Answer 1:

First, let’s formalize your optimization problem by specifying the input, output, and the measure for each possible solution (I hope this is in your interest):

Given an array A of positive integers and a positive integer P, separate the array A into P non-overlapping subarrays such that the difference between the sum of each subarray and the perfect sum of the subarrays (sum(A)/P) is minimal.

Input: Array A of positive integers; P is a positive integer.
Output: Array SA of P non-negative integers representing the length of each subarray of A where the sum of these subarray lengths is equal to the length of A.
Measure: abs(sum(sa)-sum(A)/P), summed over all P subarrays sa, is minimal, where the j-th subarray is sa = (A_i, …, A_{i+SA_j-1}) with i = SA_0 + … + SA_{j-1}, for j from 0 to P-1.

The input and output define the set of valid solutions. The measure lets us compare valid solutions with each other. Since we are looking for the solution that deviates least from the perfect one (a minimization problem), the measure should be as small as possible.

With this information, it is quite easy to implement the measure function (here in Python):

def measure(a, sa):
    # sa holds the length of each subarray; sigma is the ideal subarray sum.
    sigma = sum(a)/len(sa)
    diff = 0
    i = 0
    for j in range(0, len(sa)):
        diff += abs(sum(a[i:i+sa[j]])-sigma)
        i += sa[j]
    return diff

print(measure([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], [3,4,4,5])) # prints 8 (8.0 on Python 3)

Now finding an optimal solution is a little harder.

We can use backtracking to enumerate valid solutions and the measure function to rate them. We basically try all combinations of P non-negative integers that sum to length(A), which covers every valid solution. Although this guarantees that no valid solution is missed, it is essentially a brute-force approach, with the benefit that we can skip branches that cannot be better than the best solution found so far. E.g. in the example above, we wouldn’t need to test solutions with [9,…] (measure > 38) if we already have a solution with measure ≤ 38.

Following the pseudocode pattern from Wikipedia, our bt function looks as follows:

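def bt(c):
    global P, optimum, optimum_diff
    # Prune candidates that are already worse than the best solution so far.
    if reject(P,c):
        return
    # A complete candidate: report it and remember it if it is the new best.
    if accept(P,c):
        print("%r with %d" % (c, measure(P,c)))
        if measure(P,c) < optimum_diff:
            optimum = c
            optimum_diff = measure(P,c)
        return
    # Otherwise extend the candidate and walk through all its siblings.
    s = first(P,c)
    while s is not None:
        bt(list(s))
        s = next(P,s)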

The global variables P, optimum, and optimum_diff represent the problem instance holding the values for A, P, and sigma, as well as the optimal solution and its measure:

class MinimalSumOfSubArraySumsProblem:
    def __init__(self, a, p):
        self.a = a
        self.p = p
        self.sigma = sum(a)/p

Next we specify the reject and accept functions, which are quite straightforward:

def reject(P,c):
    return optimum_diff < measure(P,c)
def accept(P,c):
    return None not in c

This simply rejects any candidate whose measure is already worse than that of the best solution found so far, and accepts any complete (valid) solution.

The measure function is also slightly changed due to the fact that c can now contain None values:

def measure(P, c):
    diff = 0
    i = 0
    for j in range(0, P.p):
        # Stop at the first subarray length that has not been chosen yet.
        if c[j] is None:
            break
        diff += abs(sum(P.a[i:i+c[j]])-P.sigma)
        i += c[j]
    return diff

The remaining two functions, first and next, are a little more complicated:

def first(P,c):
    t = 0  # number of elements already covered by the chosen lengths
    is_complete = True
    for i in range(0, len(c)):
        if c[i] is None:
            if i+1 < len(c):
                c[i] = 0
            else:
                # Last subarray: it must take all remaining elements.
                c[i] = len(P.a) - t
            is_complete = False
            break
        else:
            t += c[i]
    if is_complete:
        return None
    return c

def next(P,s):
    t = 0
    for i in range(0, len(s)):
        t += s[i]
        # s[i] is the rightmost length that has been chosen so far.
        if i+1 >= len(s) or s[i+1] is None:
            if t+1 > len(P.a):
                return None
            else:
                s[i] += 1
            return s

Basically, first replaces the next None value in the list with 0 if it is not the last value in the list, or with the remainder if it is the last value (a small optimization, since only the remainder yields a valid solution); it returns None if the list contains no None value. next simply increments the rightmost integer by one, or returns None if the increment would exceed the total length.

Now all you need is to create a problem instance, initialize the global variables and call bt with the root:

P = MinimalSumOfSubArraySumsProblem([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], 4)
optimum = None
optimum_diff = float("inf")
bt([None]*P.p)
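
To cross-check what the backtracking search reports, a small exhaustive search can be handy. The following is a minimal Python 3 sketch and not part of the original answer: the names measure_lengths and brute_force are made up for illustration, and it assumes every subarray is non-empty (the backtracking code above also allows empty ones).

```python
from itertools import combinations

def measure_lengths(a, lengths):
    """Sum of |subarray sum - sigma| for the subarray lengths given."""
    sigma = sum(a) / len(lengths)
    diff, i = 0, 0
    for n in lengths:
        diff += abs(sum(a[i:i + n]) - sigma)
        i += n
    return diff

def brute_force(a, p):
    """Try every placement of p-1 cut points; return (best_lengths, best_measure)."""
    best, best_diff = None, float("inf")
    # Cut positions are strictly increasing indices in 1..len(a)-1,
    # so every subarray is non-empty (an assumption of this sketch).
    for cuts in combinations(range(1, len(a)), p - 1):
        bounds = (0,) + cuts + (len(a),)
        lengths = [bounds[k + 1] - bounds[k] for k in range(p)]
        d = measure_lengths(a, lengths)
        if d < best_diff:
            best, best_diff = lengths, d
    return best, best_diff

print(brute_force([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], 4))
```

For the example array this search finds the lengths [3, 3, 5, 5] (sums 12, 16, 17, 15) with a measure of 6.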

Answer 2:

Working code below (written in PHP). The code decides the number of parts by itself:

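// Greedy split: close the current part as soon as adding the next element
// would move its sum further away from the target.
$main = array(2,4,6,1,6,3,2,3,4,3,4,1,4,7,3,1,2,1,3,4,1,7,2,4,1,2,3,1,1,1,1,4,5,7,8,9,8,0);
$target = 15;      // desired sum per part (hard-coded, as in the original code)
$parts = array();  // finished parts
$p = array();      // part currently being filled
$n = count($main);
for ($i = 0; $i < $n; $i++) {
    $p[] = $main[$i];
    $isLast = ($i + 1 == $n);
    // Always close the part at the end of the array so that no elements are lost.
    if ($isLast || abs($target - array_sum($p)) < abs($target - (array_sum($p) + $main[$i + 1]))) {
        $parts[] = $p;
        $p = array();
    }
}
print_r($parts);
foreach ($parts as $part) {
    echo '<br>';
    echo array_sum($part);
}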

The code prints the parts, followed by the sum of each part.

Answer 3:

If I am not mistaken here, one more approach is dynamic programming.

You can define P[ pos, n ] as the smallest possible "penalty" accumulated up to position pos if n subarrays were created. Obviously there is some position pos' such that

P[pos', n-1] + penalty(pos', pos) = P[pos, n]

You can just minimize over pos' = 1..pos.

The naive implementation will run in O(N^2 * M), where N is the size of the original array and M is the number of divisions.
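
The recurrence above can be sketched in Python 3 as follows. This is an illustration under assumptions, not the answerer's code: the name dp_partition is made up, the penalty of a part is taken to be the absolute deviation of its sum from sigma, every part is required to be non-empty, and the table is called best to avoid a clash with P as the number of parts.

```python
def dp_partition(a, p):
    """Split a into p contiguous non-empty parts; return (lengths, total_penalty)."""
    n = len(a)
    sigma = sum(a) / p
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)

    def penalty(i, j):
        # Penalty of the part a[i:j]; absolute deviation is an assumed choice.
        return abs(prefix[j] - prefix[i] - sigma)

    inf = float("inf")
    # best[k][pos]: smallest penalty for splitting a[:pos] into k parts.
    best = [[inf] * (n + 1) for _ in range(p + 1)]
    back = [[0] * (n + 1) for _ in range(p + 1)]
    best[0][0] = 0
    for k in range(1, p + 1):
        for pos in range(k, n + 1):
            for prev in range(k - 1, pos):
                cand = best[k - 1][prev] + penalty(prev, pos)
                if cand < best[k][pos]:
                    best[k][pos] = cand
                    back[k][pos] = prev
    # Walk the back-pointers from the end to recover the part lengths.
    lengths, pos = [], n
    for k in range(p, 0, -1):
        prev = back[k][pos]
        lengths.append(pos - prev)
        pos = prev
    return lengths[::-1], best[p][n]

print(dp_partition([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], 4))
```

The three nested loops give the O(N^2 * M) running time mentioned above; the prefix sums make each penalty lookup constant time.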

Answer 4:

I'm wondering whether the following would work:

Go from the left, as soon as sum > sigma, branch into two, one including the value that pushes it over, and one that doesn't. Recursively process data to the right with rightSum = totalSum-leftSum and rightP = P-1.

So, at the start, sum = 60

2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1

Then for 2 4 6 7, sum = 19 > sigma, so split into:

2 4 6     7 6 3 3 3 4 3 4 4 4 3 3 1

2 4 6 7     6 3 3 3 4 3 4 4 4 3 3 1

Then we process 7 6 3 3 3 4 3 4 4 4 3 3 1 and 6 3 3 3 4 3 4 4 4 3 3 1 with P = 4-1 and sum = 60-12 and sum = 60-19 respectively.

This results in, I think, O(P*n).

It might be a problem when one or two values are by far the largest, but any value >= sigma can probably just be put in its own partition. Preprocessing the array to find these values (and reducing sum appropriately) might be the best idea.

If it works, it should hopefully minimise sum-of-squared-error (or close to that), which seems like the desired measure.
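
One way to read the branching idea is sketched below in Python 3. Several details are assumptions that the answer leaves open: the name branch_split is made up, the target sigma is kept fixed instead of being recomputed for the right-hand side, every part must be non-empty, and branches are compared by sum of squared errors.

```python
def branch_split(a, p):
    """Return (squared_error, parts) for a split of a into p contiguous parts.

    Assumes 1 <= p <= len(a) and positive values.
    """
    sigma = sum(a) / p  # assumed: fixed target, not recomputed per level

    def solve(start, parts_left):
        if parts_left == 1:
            rest = a[start:]
            return (sum(rest) - sigma) ** 2, [rest]
        # Keep at least one element for each of the remaining parts.
        last = len(a) - (parts_left - 1)
        running, end = 0, start
        while end < last and running + a[end] <= sigma:
            running += a[end]
            end += 1
        # end now marks the value that would push the sum over sigma
        # (or the point where elements must be reserved for later parts).
        cuts = set()
        if end > start:
            cuts.add(end)        # branch 1: leave that value out
        if end < last:
            cuts.add(end + 1)    # branch 2: include that value
        best = None
        for cut in sorted(cuts):
            part = a[start:cut]
            err, tail = solve(cut, parts_left - 1)
            err += (sum(part) - sigma) ** 2
            if best is None or err < best[0]:
                best = (err, [part] + tail)
        return best

    return solve(0, p)

err, parts = branch_split([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], 4)
print(err, [sum(part) for part in parts])
```

Written this way, without memoisation, the recursion explores up to 2^(P-1) combinations of cut points, since each of the first P-1 parts has at most two candidate end positions.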

Answer 5:

I propose an algorithm based on backtracking. The main function randomly selects an element from the original array and adds it to one of the partition arrays. After each addition it checks whether the result is better than the previous one. This is done with a function that calculates the deviation, evaluated each time a new element is added. I also thought it would be good to add a variable to the loops that forces the program to end when the desired solution cannot be reached. By desired solution I mean adding all elements while respecting the condition in the if statement.

sum=CalculateSum(vector)
Read P
sigma=sum/P
initialize P vectors, with names vector_partition[i], i=1..P
initialize list_vector, a list that points to these P vectors
initialize a difference_vector with dimension P
//that can easily be visualized as a vector of vectors
//construct a non-recursive backtracking algorithm
function Deviation(vector) //function to calculate the deviation of the elements of a vector
{
  dev=0
  for i=0 to Size(vector)-2 do
  dev+=|vector[i+1]-vector[i]|
  return dev 
}
iteration=0
//fix some maximum number of iterations for the while loop
Read max_iteration
//the higher the number of iterations, the more accurate
//the solution will be
while(!IsEmpty(vector))
{   
   for i=1 to Size(list_vector) do
   {
       if(IsEmpty(vector)) break from while loop
       initial_deviation=Deviation(list_vector[i])
       el=SelectElement(vector) //you can implement that function using a randomized   
                               //choice of element
       difference_vector[i]=|sigma-CalculateSum(list_vector[i])|
       PutOnBackVector(list_vector[i], el)
       if(initial_deviation>Deviation(difference_vector))
          ExtractFromBackVectorAndPutOnSecondVector(list_vector, vector)
    }
    iteration++
    //prevent entering an infinite loop
   if (iteration>max_iteration) break from while loop

}

You can change this by adding, in the first if, some code that increments the calculated deviation by an amount:

additional_amount=0
iteration=0
while
{
   ...
   if(initial_deviation>Deviation(difference_vector)+additional_amount)
      ExtractFromBackVectorAndPutOnSecondVector(list_vector, vector)
   if(iteration>max_iteration)
   {
      iteration=0
      additional_amount+=1/some_constant
   }
   iteration++
   //delete second if from first version
}

Answer 6:

You can use Max Flow algorithm.

Answer 7:

Your problem is very similar to, or the same as, the minimum makespan scheduling problem, depending on how you define your objective. In the case that you want to minimize the maximum |sum_i - sigma|, it is exactly that problem.

As referenced in the Wikipedia article, this problem is NP-complete for p > 2. Graham's list scheduling algorithm is optimal for p <= 3, and provides an approximation ratio of 2 - 1/p. You can check out the Wikipedia article for other algorithms and their approximation.

All the algorithms given on this page are either solving for a different objective, incorrect/suboptimal, or can be used to solve any problem in NP :)

Answer 8:

This is very similar to the one-dimensional bin packing problem, see http://www.cs.sunysb.edu/~algorith/files/bin-packing.shtml. In the associated book, The Algorithm Design Manual, Skiena suggests a first-fit decreasing approach. I.e. figure out your bin size (mean = sum / number of bins), and then allocate the largest remaining object to the first bin that has room for it. You either get to a point where you have to start over-filling a bin, or if you're lucky you get a perfect fit. As Skiena states, "First-fit decreasing has an intuitive appeal to it, for we pack the bulky objects first and hope that little objects can fill up the cracks."
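
A minimal Python 3 sketch of that first-fit decreasing idea is shown below. Like the bin packing view itself, it treats the values as a set and ignores their original order, so the groups are not contiguous subarrays. The name first_fit_decreasing is made up, and over-filling the least loaded bin when nothing fits is an assumed tie-break that the description leaves open.

```python
def first_fit_decreasing(items, bins):
    """Distribute items over a fixed number of bins; return (groups, loads)."""
    capacity = sum(items) / bins  # bin size = mean load per bin
    groups = [[] for _ in range(bins)]
    loads = [0] * bins
    for x in sorted(items, reverse=True):
        # First bin that still has room for x.
        target = next((k for k in range(bins) if loads[k] + x <= capacity), None)
        if target is None:
            # No bin has room: over-fill the least loaded bin (assumption).
            target = min(range(bins), key=loads.__getitem__)
        groups[target].append(x)
        loads[target] += x
    return groups, loads

groups, loads = first_fit_decreasing([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], 4)
print(groups, loads)
```

On the example array from the question this happens to produce a perfect fit, with every bin summing to 15.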

As a previous poster said, the problem looks like it's NP-complete, so you're not going to solve it perfectly in reasonable time, and you need to look for heuristics.

Answer 9:

I recently needed this and did it as follows:

1. Create an initial array of sub-arrays whose length is the requested sub-array count. Each sub-array should also have a sum property, i.e. [[sum:0],[sum:0]...[sum:0]].
2. Sort the main array in descending order.
3. Find the sub-array with the smallest sum, insert one item from the main array into it, and increment that sub-array's sum property by the inserted item's value.
4. Repeat item 3 until the end of the main array is reached.
5. Return the initial array.

This is the code in JS.

function groupTasks(tasks,groupCount){
  // one empty group per requested sub-array, each carrying a running sum property
  var initial = [...Array(groupCount)].map(sa => (sa = [], sa.sum = 0, sa));
  // sort descending (in place), then always add the next task to the group with the smallest sum
  return tasks.sort((a,b) => b-a)
              .reduce((groups,task) => { var group = groups.reduce((p,c) => p.sum < c.sum ? p : c);
                                         group.push(task);
                                         group.sum += task;
                                         return groups;
                                       },initial);
}

var tasks = [...Array(50)].map(_ => ~~(Math.random()*10)+1), // create an array of 50 random elements among 1 to 10
   result = groupTasks(tasks,7);                             // distribute them into 7 sub arrays with closest sums

console.log("input array:", JSON.stringify(tasks));
console.log(result.map(r=> [JSON.stringify(r),"sum: " + r.sum]));