If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
The reason I need to generate these subsets in this fashion is that I want to find the k-element subset with the smallest sum that satisfies a certain condition, so I would apply the condition to each k-element subset as it is generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e. Generators), about generating subsets in order of their product, but it doesn't fit my needs due to the extremely large size of the set (n = 2^20).
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.
Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to positions in the sorted list, where the maximum index is some number m, and some other index vector v', with some max index m' > m. The smallest sum over all such vectors v' is always greater than the smallest sum over all vectors v. So, here's how you can loop through the elements with approximately increasing sum: step the maximum index m from 5 up to n, at each step enumerate all 5-combinations whose largest position is m, and stop once a step yields a satisfying set.
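A minimal sketch of that loop in Python (the predicate `satisfies` is just a placeholder for your condition):

```python
from itertools import combinations

def first_match(xs, k=5, satisfies=lambda s: True):
    """Scan k-subsets stage by stage, where stage m fixes the largest
    sorted index; finish the stage, then stop at the first match."""
    xs = sorted(xs)
    best = None
    for m in range(k - 1, len(xs)):            # m = index of the largest element
        for rest in combinations(range(m), k - 1):
            subset = [xs[i] for i in rest] + [xs[m]]
            if satisfies(subset) and (best is None or sum(subset) < sum(best)):
                best = subset
        if best is not None:
            return best     # stop: later stages only have larger *minimum* sums
    return None
```

As noted below, this ordering is only approximate: within a stage the sums are not monotone, so the set returned is the best found up to that stage, not necessarily the global optimum.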
Basically, this means that you no longer need to check 5-element combinations of (1, ..., m+1) if you find a satisfying assignment in (1, ..., m), since any satisfying assignment with max index m+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., m) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some m.
This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you know of any way of partitioning the data smartly, so that passing candidates are equally present in each node, then you can probably get great throughput.
A complete sort may not really be needed, as the map stage can take care of it. Each node can then verify the condition against its k-tuples and output results into a file that can be aggregated/reduced later.
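As a rough illustration of that map/verify split, here is a sketch using Python's multiprocessing as a stand-in for a real map-reduce cluster (the `condition` predicate is hypothetical):

```python
from itertools import combinations, islice
from multiprocessing import Pool

def condition(subset):
    return sum(subset) % 7 == 0       # hypothetical predicate; replace with yours

def check_chunk(chunk):
    """Map step: each worker verifies the condition on its share of k-tuples."""
    return [s for s in chunk if condition(s)]

def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

if __name__ == "__main__":
    xs = [3, 1, 4, 1, 5, 9, 2, 6]
    with Pool() as pool:
        parts = pool.imap_unordered(check_chunk, chunked(combinations(xs, 5), 1000))
        hits = [s for part in parts for s in part]   # reduce step: aggregate hits
    print(min(hits, key=sum) if hits else None)
```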
If you know the probability of occurrence and don't need all of the results, try looking at probabilistic algorithms that converge to an answer.
If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:

1. Sort the n integers (for millions of integers, i.e. tens to hundreds of MB of RAM, this should not be a problem), and sum the k-1 smallest. Call this total offset.
2. Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
3. On a match, note the subset's sum-total and subtract offset from it to find an upper bound on the largest element of any k-subset of equivalent sum-total.
4. Restrict your set of n integers to those less than or equal to this bound, and repeat from step 2.

Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
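A Python sketch of steps 1-4, assuming your test is exposed as a predicate `has_P` (a hypothetical name):

```python
import bisect
import random

def probabilistic_search(xs, k, has_P, max_misses=100_000):
    xs = sorted(xs)                    # step 1: O(n log n)
    offset = sum(xs[:k - 1])           # sum of the k-1 smallest
    best, misses = None, 0
    while misses < max_misses and len(xs) >= k:
        subset = random.sample(xs, k)  # step 2: random k-subset
        if has_P(subset):
            total = sum(subset)
            if best is None or total < sum(best):
                best = subset
            bound = total - offset     # step 3: bound on the largest usable element
            xs = xs[:bisect.bisect_right(xs, bound)]   # step 4: binary search, cut
            misses = 0
        else:
            misses += 1
    return best
```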
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.

Even if only 1 in 1000 of the k-sized sets meets your condition, that's still far too many combinations to test. I believe the runtime scales with C(n, k) (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28 / 1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds, on the order of 10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
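In Python, for example, both steps fit in one line:

```python
xs = sorted(set(xs))   # drop duplicates, then sort
```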
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a genetic algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer in much less time than the duration of the universe.
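A bare-bones, mutation-only genetic algorithm for this problem might look like the following sketch (minimizing the sum, with a heavy penalty for subsets that fail a hypothetical `condition` predicate):

```python
import random

def ga_search(xs, k, condition, pop_size=200, generations=1000):
    """Evolve size-k index sets toward small sums that satisfy `condition`."""
    n = len(xs)

    def fitness(ind):
        subset = [xs[i] for i in ind]
        return sum(subset) + (0 if condition(subset) else sum(map(abs, xs)))

    def mutate(ind):
        out = set(ind)
        out.discard(random.choice(list(out)))   # drop one index...
        while len(out) < k:
            out.add(random.randrange(n))        # ...and draw replacements
        return frozenset(out)

    pop = [frozenset(random.sample(range(n), k)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[:pop_size // 2]         # truncation selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    best = min(pop, key=fitness)
    subset = [xs[i] for i in best]
    return subset if condition(subset) else None
```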
Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant fraction of the C(2^20, 5) ≈ 10^28 subsets before you find one that satisfies your condition. On a modern 100k MIPS CPU, even assuming a single instruction could build a set and evaluate the condition, going through that entire collection would take about 10^17 seconds, i.e. over 3 billion years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve the problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program (MIP), in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N, with a 0/1 variable x_i marking whether element i is chosen:

minimize    sum_i w_i * x_i
subject to  sum_i x_i = k
            (your condition, expressed as constraints on the x_i)
            x_i in {0, 1},  i = 1, ..., N

If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers. (Example: the max-flow/min-cut problem.) Also, you can use the approach of column generation to find a solution, since you may have a very large number of variables that cannot all be handled at the same time.
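For instance, with PuLP in Python (the "must include at least one index from `special`" constraint is only a placeholder for your real condition):

```python
import pulp

def solve_mip(w, k, special):
    """Minimize the sum of a k-subset of w, subject to a placeholder constraint."""
    n = len(w)
    prob = pulp.LpProblem("min_sum_k_subset", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x{i}", cat=pulp.LpBinary) for i in range(n)]
    prob += pulp.lpSum(w[i] * x[i] for i in range(n))   # objective: subset sum
    prob += pulp.lpSum(x) == k                          # exactly k elements
    prob += pulp.lpSum(x[i] for i in special) >= 1      # placeholder condition
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [w[i] for i in range(n) if x[i].value() == 1]
```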
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.