Computing the subset giving the minimum standard d

2019-06-27 19:17发布

问题:

Let us have a vector of size N. For example:

x = rand(N,1)

I want to compute the minimum standard deviation of a subset of length K in the vector.

When N and K are small, it is easy to find the best subset since I can use nchoosek(N,K) to enumerate all possible subsets. But when the values of N and K are larger than let's say N=50 and K=25, nchoosek fails to compute the combinations since the size of the possible subsets are huge.

I wonder if there is a better algorithm to compute the subset that gives the smallest standard deviation in an array efficiently. For example via dynamic programming. Any ideas?

Update:

I've implemented it in a loop aftter the answers and compared to the combinatorial solution. The results are always the same but the speed gain are unprecedented.

n = 20;
k = 10;
x = rand(n,1);
C = nchoosek(x, k);

tic
mins = realmax;
for i = 1:size(C,1)
    s = std(C(i,:));
    if s < mins
        mins = s;
        bestC = C(i,:);
    end
end
toc

tic
[x2, j] = sort(x);
mins2 = realmax;
for i = 1:(n-k+1)
    s = std(x2(i:(i+k-1)));
    if s < mins2
        mins2 = s;
        idx = j((i:(i+k-1)));
    end
end
toc

if mins == mins2
    'Equal'
end

gives

Elapsed time is 7.786579 seconds.
Elapsed time is 0.002068 seconds.

ans =

Equal

回答1:

Sort the array and then compute this in one pass with rolling window of length K.

I'm sure this will give you the right answer, will think if I can prove it.

Hand-wavey argument, with a possible gap in logic in the "extend this" part:

Consider an element x from your list. Let's try and find the minimum standard deviation of a set of size 2 containing this element. We'll get this by choosing x and the closest element to x. Extend this to k elements and we'll get a set that is a contiguous part of the sorted list that contains x. To pick the smallest subset of k elements (i.e for any x) we therefore just have to iterate over the sorted list as described before.