Weighted sampling without replacement

2019-02-20 11:31发布

问题:

I have a population p of indices and corresponding weights in vector w. I want to get k samples from this population without replacement where the selection is done proportional to the weights in random.

I know that randsample can be used for selection with replacement by saying

J = randsample(p,k,true,w)

but when I call it with parameter false instead of true, I get

??? Error using ==> randsample at 184
Weighted sampling without replacement is not supported.

I wrote my own function as discussed in here:

p = 1:n;
J = zeros(1,k);
for i = 1:k
    J(i) = randsample(p,1,true,w);
    w(p == J(i)) = 0;
end

But since it has k iterations in the loop, I seek for a shorter/faster way to do this. Do you have any suggestions?

EDIT: I want to randomly select k unique columns of a matrix proportional to some weighting criteria. That is why I use sampling without replacement.

回答1:

I don't think it is possible to avoid some sort of loop, since sampling without replacement means that the samples are no longer independent. Besides, what does the weighting actually mean when sampling without replacement?

In any case, for relatively small sample sizes I don't think you will notice any problem with performance. All the solutions I can think of basically do what you have done, but possibly expand out what is going on in randsample.



回答2:

I think you should keep using the for, but I suggest to reduce the corresponding weight by one.

w(p == J(i)) = w(p == J(i)) -1;


回答3:

This still shows up in search results, so I wanted to add the datasample function as an option. The following code will provide a weighted sample of 5 units from fromVector according the corresponding vector myWeights.

mySample = datasample(fromVector, 5, 'Replace', false, 'Weights', myWeights)


回答4:

An alternative to the for loop approach of petrichor that performs well if the number of samples is much smaller than the number of elements is to compute a weighted random sample with replacement and then remove duplicates. Of course, this is a very bad idea if the number of samples k is near the number of elements n, as this will require many iterations, but by avoiding for loops, the wall clock performance is often better. Your mileage may vary.

function I=randsample_noreplace(n,k,w)
I = sort(randsample(n, k, true, w));
while 1
    Idup = find( I(2:end)-I(1:end-1) ==0);
    if length(Idup) == 0
            break
    else
            I(Idup)=randsample(n, length(Idup), true, w);
            I = sort(I);
    end
end


回答5:

If you want to select a large fraction of the columns (i.e., k is not very much smaller than n), or if the weights are very skewed, you can use this refinement of Jeff's solution, which ensures that each call to randsample produces samples distinct from the previous ones.

Moreover, it returns the samples in the order in which true sampling without replacement would return them, rather than sorted.

function I=randsample_noreplace(n,k,w)
I = randsample(n, k, true, w);
while 1
    [II, idx] = sort(I);
    Idup = [false, diff(II)==0];
    if ~any(Idup)
        break
    else
        w(I) = 0;            %% Don't replace samples
        Idup (idx) = Idup;   %% find duplicates in original list
        I = [I(~Idup),  (randsample(n, sum(Idup), true, w))];
    end
end

When selecting 29 out of 30 values with uniform weights (the case that gives least benefit), it takes 3 or 4 iterations, compared with 26 without the additional line. If the weights are chosen uniformly, it still takes 3 to 5 iterations compared with around 80 without the additional line.

Also, the number of iterations is bounded by k, however skewed the distribution is.