Algorithm to generate k-element subsets in order of their sums

Posted 2019-05-14 15:15

Question:

If I have a large unsorted set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?

The reason I need to generate the subsets in this order is that I want to find the k-element subset with the smallest sum that satisfies a certain condition, so I would apply the condition to each k-element subset as it is generated.

Also, what would be the complexity of the algorithm?

There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e Generators). It covers generating subsets in order of their product, but it doesn't fit my needs because of the extremely large size n of my set.

I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.

Answer 1:

If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well (a Python sketch follows the steps below):

  1. Sort the n integers (for millions of integers, i.e. tens to hundreds of MB of RAM, this should not be a problem), and sum the k-1 smallest. Call this total the offset.
  2. Generate a random k-subset (say, by sampling k distinct indices at random) and check it for P-ness.
  3. On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
  4. Restrict your set of n integers to those less than or equal to this bound.
  5. Repeat (goto 2) until no matches are found within some fixed number of iterations.

Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
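Below is a minimal Python sketch of these steps, assuming a user-supplied condition() predicate for P and a hypothetical max_misses cutoff for step 5 (neither is part of the original answer):

import bisect
import random

def smallest_satisfying_subset(nums, k, condition, max_misses=100_000):
    nums = sorted(nums)                      # step 1: O(n log n)
    offset = sum(nums[:k - 1])               # sum of the k-1 smallest
    hi = len(nums)                           # candidate pool is nums[:hi]
    best = None
    misses = 0
    while misses < max_misses and hi >= k:
        subset = random.sample(nums[:hi], k) # step 2: random k-subset
        if not condition(subset):
            misses += 1
            continue
        total = sum(subset)                  # step 3: note the sum-total
        if best is None or total < sum(best):
            best = subset
        # No k-subset with an equal or smaller total can contain an
        # element larger than total - offset, so shrink the pool (step 4).
        hi = bisect.bisect_right(nums, total - offset, 0, hi)
        misses = 0                           # fresh run of iterations (step 5)
    return best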

Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.



Answer 2:

Even if only 1 in 1,000 of the k-element sets meets your condition, that's still far too many combinations to test. The runtime scales with C(n, k) (n choose k), where n is the size of your unsorted list; the answer by Andrew Mao links to this value. Here C(2^20, 5) ≈ 10^28, and 10^28/1,000 is still 10^25. Even at 1,000 tests per second, that's 10^22 seconds ≈ 3×10^14 years.
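A quick back-of-the-envelope check of those figures in Python (the 1-in-1,000 pass rate and 1,000 tests/second are the answer's assumed numbers):

import math

n, k = 2**20, 5
total = math.comb(n, k)                  # ≈ 1.06e28 subsets overall
passing = total / 1_000                  # assumed 1-in-1,000 pass rate
seconds = passing / 1_000                # assumed 1,000 tests per second
years = seconds / (3600 * 24 * 365)
print(f"{total:.2e} subsets -> {years:.1e} years")   # ~3e14 years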

If you are allowed to, eliminate duplicate numbers from your large set first. Each duplicate you remove drastically reduces the number of evaluations you need to perform. Sort the list, then kill the dupes.
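In Python, both steps collapse to one line (assuming exact duplicates really can be discarded without affecting your condition):

arr = sorted(set(arr))   # dedupe, then sort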

Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a genetic algorithm and running a bunch of instances overnight (or for as long as you have time). This will yield a very good answer in much less time than the age of the universe.
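As an illustration, here is a minimal mutation-only evolutionary search in Python, a simplified stand-in for a full GA; the population size, generation count, penalty constant, and condition() predicate are all hypothetical choices, not part of the answer:

import random

def genetic_search(nums, k, condition, pop_size=200, generations=500):
    n = len(nums)

    def fitness(ind):
        subset = [nums[i] for i in ind]
        # smaller sums are fitter; failing the condition incurs a big penalty
        return -sum(subset) - (0 if condition(subset) else 10**12)

    def mutate(ind):
        new = set(ind)
        new.discard(random.choice(tuple(new)))   # drop one index...
        while len(new) < k:
            new.add(random.randrange(n))         # ...and refill to size k
        return frozenset(new)

    pop = [frozenset(random.sample(range(n), k)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)      # fittest first
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    best = max(pop, key=fitness)
    subset = [nums[i] for i in best]
    return subset if condition(subset) else None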



Answer 3:

Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant fraction of the C(2^20, 5) ≈ 10^28 subsets before you find one that satisfies your condition. On a modern 100k MIPS CPU, assuming a single instruction could build a set and evaluate the condition, going through the entire collection would still take over 3 billion years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.

Even if the number of integers is smaller, this seems a rather brute-force way to solve the problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program (MIP), in which case solving the following could be much faster than brute-force enumeration. Assuming your integers are w_i, for i from 1 to N:

min   sum_i w_i * x_i
s.t.  sum_i x_i = k
      (your condition, as constraints on the w_i * x_i)
      x_i binary
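As a sketch of that formulation, here is how it might look with the PuLP modeling library in Python; the condition constraint is left as a placeholder since the question never states it:

import pulp

def min_sum_k_subset(w, k):
    prob = pulp.LpProblem("min_sum_k_subset", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(w))]
    prob += pulp.lpSum(w[i] * x[i] for i in range(len(w)))  # objective
    prob += pulp.lpSum(x) == k                              # choose exactly k
    # prob += ...  your condition, expressed as linear constraints
    prob.solve()
    return [w[i] for i in range(len(w)) if x[i].value() == 1]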

If it turns out that the linear programming relaxation of your MIP is tight, then you are in luck and have a very efficient way to solve the problem, even for 2^20 integers (example: the max-flow/min-cut problem). You can also use column generation, since you may have far too many variables to solve for at the same time.

If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.



Answer 4:

Here's an approximate way to do what you're saying.

First, sort the list. Then consider a length-5 index vector v into the sorted list whose maximum index is some number m, and another index vector v' with max index m' > m. The smallest sum attainable by such vectors v' is always at least the smallest sum attainable by the vectors v.

So, here's how you can loop through the elements with approximately increasing sum:

from itertools import combinations

k = 5                       # subset size; condition() is the user's test
arr.sort()
best = None
for i in range(k - 1, len(arr)):           # i = the largest index in use
    found = False
    # only subsets whose maximum index is exactly i are new at this step
    for v in combinations(range(i), k - 1):
        s = [arr[j] for j in v] + [arr[i]]
        if condition(s):                   # keep the best satisfying set
            found = True
            if best is None or sum(s) < sum(best):
                best = s
    if found:
        break

Basically, this means that you no longer need to check 5-element combinations of (1, ..., n+1) once you find a satisfying assignment within (1, ..., n): any assignment with max index n+1 must include the (n+1)-th smallest element, so the smallest attainable sum only grows as the max index grows, and you can stop after that step. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, which is what makes this approximate; at least you can stop checking once you find a satisfying set at some n.



Answer 5:

This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you can partition the candidates smartly, so that passing candidates are equally likely in each node's share, you can probably get great throughput.

A complete sort may not really be needed, as the map stage can take care of it. Each node can then verify the condition against its k-tuples and write results to a file that is aggregated/reduced later.
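As a toy stand-in for a real MapReduce cluster, here is a single-machine sketch using Python's multiprocessing; condition() is a placeholder, the lexicographic chunking is one arbitrary partitioning scheme, and materializing the task list is only feasible for modest n:

from itertools import combinations, islice
from math import comb
from multiprocessing import Pool

def condition(subset):                        # placeholder for the real test
    return sum(subset) % 7 == 0

def check_chunk(args):
    # map stage: scan one slice of the combination stream
    arr, k, start, stop = args
    best = None
    for v in islice(combinations(range(len(arr)), k), start, stop):
        s = [arr[i] for i in v]
        if condition(s) and (best is None or sum(s) < sum(best)):
            best = s
    return best

def map_reduce(arr, k, workers=8, chunk=100_000):
    total = comb(len(arr), k)                 # only feasible for modest n
    tasks = [(arr, k, s, min(s + chunk, total))
             for s in range(0, total, chunk)]
    with Pool(workers) as pool:
        hits = [r for r in pool.imap_unordered(check_chunk, tasks) if r]
    return min(hits, key=sum, default=None)   # reduce stage

if __name__ == "__main__":
    print(map_reduce(list(range(30)), 5, workers=4))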

If you know the probability of occurrence and don't need all of the results, look into probabilistic algorithms that converge to an answer.