How can I randomly pick N numbers from a vector a, with a weight assigned to each number?
Let's say:
    a = 1:3;              % possible numbers
    weight = [0.3 0.1 0.2]; % corresponding weights
In this case, the probability of picking 1 should be 3 times higher than the probability of picking 2. The sum of all weights can be anything.
TL;DR
For maximum performance, if you only need a single sample, use the cumsum-based approach, and if you need multiple samples, use histc (both sketched below). Avoid randsample.
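A minimal sketch of the two approaches, assuming a, w (the weights), and N as defined in the question:

    % single sample: find the first cumulative-weight interval the draw falls in
    cw = cumsum(w);
    R = a(find(rand*cw(end) < cw, 1, 'first'));

    % multiple samples: bin N uniform draws against the cumulative-weight edges
    [~, idx] = histc(rand(1, N), cumsum([0, w./sum(w)]));
    R = a(idx);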
Generating multiple samples upfront is three orders of magnitude faster than generating individual values.

Performance metrics
Since this showed up near the top of my Google search, I just wanted to add some performance metrics to show that the right solution will depend very much on the value of N and the requirements of the application, and that changing the design of the application can dramatically increase performance.
For large N, or indeed N > 1, histc is fastest.

However, in the case where it is not possible to generate all N values up front, perhaps because the weights are updated on each iteration, i.e. N=1, the custom cumsum approach (based on the bsxfun version) is fastest.

In any case, randsample certainly looks like a bad choice all round. It also goes to show that if an algorithm can be arranged to generate all random variables upfront then it will perform much better (note that three orders of magnitude fewer values are generated in the N=1 case in a similar execution time).

Code is available here.
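For a rough sense of the comparison, here is a minimal timing sketch; the problem sizes are illustrative assumptions, and the actual benchmark code is linked above:

    a = 1:10000;  w = rand(1, 10000);  N = 1000000;
    edges = cumsum([0, w./sum(w)]);

    tic;                                 % all N samples generated upfront (histc)
    [~, idx] = histc(rand(1, N), edges);
    R = a(idx);
    toc

    tic;                                 % one sample per iteration (N=1 case)
    cw = cumsum(w);
    for k = 1:1000                       % three orders of magnitude fewer draws
        r = a(find(rand*cw(end) < cw, 1, 'first'));
    end
    toc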
Amro gives a nice answer (that I rated up), but it will be highly memory-intensive if you wish to generate many numbers from a large set. This is because the bsxfun operation can generate a huge array, which is then summed. For example, suppose I had a set of 10000 values to sample from, all with different weights, and wanted to generate 1000000 numbers from that sample.
This will take some work to do, since it will internally generate a 10000x1000000 array, with 10^10 elements in it. It will be a logical array, but even so, 10 gigabytes of RAM must be allocated.
A better solution is to use histc. Thus...
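A sketch of that approach, using the a, w, and N from the question (variable names are assumptions):

    % bin N uniform draws into the cumulative-weight edges; the tilde (~)
    % discards the counts (first output) and keeps only the bin indices
    [~, idx] = histc(rand(1, N), cumsum([0; w(:)./sum(w)]));
    R = a(idx);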
However, for a large problem of the size I suggested above, it is fast.
Admittedly, my version takes 2 lines to write. The indexing operation must happen on a second line, since it uses the second output of histc. Also note that I've used the ability of the new MATLAB release, with the tilde (~) operator as the first output argument of histc. This causes that first output to be immediately dumped in the bit bucket.
randsample is included in the Statistics Toolbox.
Otherwise you can use some kind of roulette-wheel selection process. See this similar question (although not MATLAB-specific). Here's my one-line implementation:
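A sketch of such a one-liner, following the bsxfun/cumsum approach referenced elsewhere in this thread (setup variables follow the question):

    a = 1:3;             % possible numbers
    w = [0.3 0.1 0.2];   % corresponding weights
    N = 10;              % how many numbers to generate

    % count how many cumulative-weight edges each uniform draw exceeds;
    % that count plus one is the index of the interval the draw falls into
    R = a(sum(bsxfun(@ge, rand(N,1), cumsum(w./sum(w))), 2) + 1);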
Explanation:
Consider the interval [0,1]. We assign to each element in the list (1:3) a sub-interval of length proportionate to the weight of each element; therefore 1 gets an interval of length 0.3/(0.3+0.1+0.2), and the same goes for the others.
Now if we generate a random number with uniform distribution over [0,1], then any number in [0,1] has an equal probability of being picked, thus the sub-intervals' lengths determine the probability of the random number falling in each interval.
This matches what I'm doing above: pick a number X~U[0,1] (more like N numbers), then find which interval it falls into in a vectorized way.

You can check the results of the two techniques above by generating a large enough sequence, N=1000:
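A sketch of such a check (using histc for the frequency count is an assumption; any frequency count works):

    N = 1000;
    R1 = a(sum(bsxfun(@ge, rand(N,1), cumsum(w./sum(w))), 2) + 1);
    R2 = randsample(a, N, true, w);   % requires Statistics Toolbox
    histc(R1, a) ./ N                 % empirical frequencies of 1, 2, 3
    histc(R2, a) ./ N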
which more or less matches the normalized weights w./sum(w) = [0.5 0.16667 0.33333].
Amro has a really nice answer for this topic. However, one might want a super-fast implementation to sample from huge PDFs, where the domain might contain several thousand values. For such scenarios, it might be costly to use bsxfun and cumsum very frequently. Motivated by Gnovice's answer, it would make sense to implement the roulette-wheel algorithm with a run-length-encoding scheme. I performed a benchmark with Amro's solution and the new code:
Here are some evaluations of the code for very large data, where the domain of the PDF contains a huge number of values.
The idea is to create a run-length-encoded table where frequent values of the PDF are replicated more than non-frequent values. At the end of the day, we sample an index into the weighted sample table using a uniform distribution, and use the corresponding value.
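A minimal sketch of this scheme; the table resolution res is an assumed parameter that trades memory for accuracy:

    % build the run-length-expanded sample table once
    res = 1e4;                           % table resolution (assumed parameter)
    counts = round(w./sum(w) * res);     % slots per value, proportional to weight
    tbl = repelem(a, counts);            % frequent values occupy more slots

    % each sample is then just a uniform index into the table
    R = tbl(randi(numel(tbl), N, 1));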
It is memory-intensive, but with this approach it is even possible to scale up to PDF lengths in the hundreds of thousands, and access stays super-fast.