Weighted random selection from array

Published 2020-01-25 04:41

姐就是有狂的资本

Question:

I would like to randomly select one element from an array, but each element has a known probability of selection.

All chances together (within the array) sum to 1.

What algorithm would you suggest as the fastest and most suitable for huge calculations?

Example:

id => chance
array[
    0 => 0.8
    1 => 0.2
]

For this pseudocode, repeated calls to the algorithm should statistically return id 0 four times for every one time they return id 1.

Answer 1:

Compute the discrete cumulative distribution function (CDF) of your list, or in simple terms the array of cumulative sums of the weights. Then generate a random number in the range between 0 and the sum of all weights (1 in your case), do a binary search to find the first entry of the CDF array that is greater than this random number, and take the element corresponding to that entry. This is your weighted random selection. Building the array takes O(n) time once; each selection then takes O(log n).
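A minimal Ruby sketch of the cumulative-sum and binary-search approach described in this answer. The class name CumulativeSampler and its method names are hypothetical, chosen only for illustration; the search relies on Array#bsearch_index (Ruby 2.3+). The sample method accepts an optional number so the lookup can be checked deterministically.

```ruby
# Hypothetical helper: builds the cumulative sums once, then samples by binary search.
class CumulativeSampler
  def initialize(weights)
    raise ArgumentError, 'weights must be non-empty' if weights.empty?
    raise ArgumentError, 'weights must be non-negative' if weights.any? { |w| w < 0 }

    total = 0.0
    # @cdf[i] holds weights[0] + ... + weights[i]
    @cdf = weights.map { |w| total += w }
    @total = total
    raise ArgumentError, 'weights must not all be zero' unless @total > 0
  end

  # Returns the index of the selected element.
  # r defaults to a uniform random number in [0, total).
  def sample(r = rand * @total)
    # Find-minimum binary search: first index whose cumulative sum exceeds r.
    idx = @cdf.bsearch_index { |c| c > r }
    # Fall back to the last index if rounding pushed r up to the total.
    idx || @cdf.size - 1
  end
end

sampler = CumulativeSampler.new([0.8, 0.2])
picks = Array.new(10) { sampler.sample }   # mostly 0, occasionally 1
```

With the weights from the question, any r below 0.8 maps to index 0 and any r from 0.8 up to 1 maps to index 1.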

Answer 2:

The algorithm is straightforward:

rand_num = rand(0, 1)
for each element in array
     if (rand_num < element.probability)
          select element and break
     rand_num = rand_num - element.probability

Answer 3:

I have found this article to be the most useful for understanding this problem fully. This Stack Overflow question may also be what you're looking for.

I believe the optimal solution is to use the Alias Method (wikipedia). It requires O(n) time to initialize, O(1) time to make a selection, and O(n) memory.

Here is the algorithm for generating the result of rolling a weighted n-sided die (from here it is trivial to select an element from a length-n array), as taken from this article. The author assumes you have functions for rolling a fair die (floor(random() * n)) and flipping a biased coin (random() < p).

Algorithm: Vose's Alias Method

Initialization:

1. Create arrays Alias and Prob, each of size n.
2. Create two worklists, Small and Large.
3. Multiply each probability by n.
4. For each scaled probability p_i:
   a. If p_i < 1, add i to Small.
   b. Otherwise (p_i ≥ 1), add i to Large.
5. While Small and Large are not empty (Large might be emptied first):
   a. Remove the first element from Small; call it l.
   b. Remove the first element from Large; call it g.
   c. Set Prob[l] = p_l.
   d. Set Alias[l] = g.
   e. Set p_g := (p_g + p_l) − 1. (This is a more numerically stable option.)
   f. If p_g < 1, add g to Small.
   g. Otherwise (p_g ≥ 1), add g to Large.
6. While Large is not empty:
   a. Remove the first element from Large; call it g.
   b. Set Prob[g] = 1.
7. While Small is not empty (this is only possible due to numerical instability):
   a. Remove the first element from Small; call it l.
   b. Set Prob[l] = 1.

Generation:

1. Generate a fair die roll from an n-sided die; call the side i.
2. Flip a biased coin that comes up heads with probability Prob[i].
3. If the coin comes up "heads," return i.
4. Otherwise, return Alias[i].
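The steps above can be sketched in Ruby roughly as follows. This is an illustrative sketch, not the article's own code: the class name AliasSampler, the attribute names prob and alias_table, and the optional rng parameter are assumptions made for this example, and the input is assumed to be an array of probabilities that sum to 1.

```ruby
# Hypothetical class: a sketch of Vose's alias method as listed above.
class AliasSampler
  attr_reader :prob, :alias_table

  def initialize(probabilities)
    n = probabilities.size
    raise ArgumentError, 'need at least one probability' if n.zero?

    @prob = Array.new(n, 0.0)
    @alias_table = Array.new(n, 0)
    scaled = probabilities.map { |p| p * n }   # multiply each probability by n

    small = []
    large = []
    scaled.each_with_index { |p, i| (p < 1.0 ? small : large) << i }

    until small.empty? || large.empty?
      l = small.shift
      g = large.shift
      @prob[l] = scaled[l]
      @alias_table[l] = g
      scaled[g] = (scaled[g] + scaled[l]) - 1.0   # numerically stable form
      (scaled[g] < 1.0 ? small : large) << g
    end

    large.each { |g| @prob[g] = 1.0 }
    small.each { |l| @prob[l] = 1.0 }   # only reached through rounding error
  end

  # rng can be the Random class or a seeded Random instance.
  def sample(rng = Random)
    i = rng.rand(@prob.size)                      # fair n-sided die
    rng.rand < @prob[i] ? i : @alias_table[i]     # biased coin
  end
end

sampler = AliasSampler.new([0.8, 0.2])
picks = Array.new(10) { sampler.sample }
```

Each call to sample does a constant amount of work regardless of n, which is the O(1) selection cost the answer refers to.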

Answer 4:

An example in Ruby:

# Each element is associated with its probability
a = {1 => 0.25, 2 => 0.5, 3 => 0.2, 4 => 0.05}

# At some point, convert to cumulative probabilities
acc = 0
a.each { |e, w| a[e] = acc += w }

# To select an element, pick a random number between 0 and 1 and find the
# first cumulative probability that is greater than the random number
r = rand
selected = a.find { |e, w| w > r }

p selected[0]

Answer 5:

This can be done in O(1) expected time per sample as follows.

Compute the CDF F(i) for each element i as the sum of the probabilities of all elements with index less than or equal to i, taking F(0) = 0.

Define the range r(i) of an element i to be the interval [F(i - 1), F(i)].

For each interval [(i - 1)/n, i/n], create a bucket consisting of the list of the elements whose range overlaps the interval. This takes O(n) time in total for the full array as long as you are reasonably careful.

When you randomly sample the array, you simply compute which bucket the random number is in, and compare it with the range of each element in that bucket's list until you find the range that contains it.

The cost of a sample is O(the expected length of a randomly chosen list). There are n buckets and, when the intervals are treated as half-open, at most 2n - 1 overlaps between element ranges and buckets in total, so that expected length is less than 2.
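A rough Ruby sketch of this bucket idea. The class name BucketSampler is hypothetical. As a simplifying assumption, each bucket stores only the index of the first element whose range overlaps it rather than an explicit list: the overlapping elements are always contiguous, so scanning forward from that index visits the same candidates. Indices here are 0-based, and the input is assumed to be probabilities that sum to 1.

```ruby
# Hypothetical class: bucketed lookup over the cumulative sums.
class BucketSampler
  def initialize(probabilities)
    @n = probabilities.size
    raise ArgumentError, 'need at least one probability' if @n.zero?

    total = 0.0
    @cdf = probabilities.map { |p| total += p }   # @cdf[i] plays the role of F(i)

    # @start[b] = first element whose range overlaps bucket [b/n, (b+1)/n).
    # The index i only moves forward, so this loop is O(n) overall.
    @start = Array.new(@n)
    i = 0
    @n.times do |b|
      lo = b.to_f / @n
      i += 1 while i < @n - 1 && @cdf[i] <= lo
      @start[b] = i
    end
  end

  # Returns the index of the selected element; r defaults to a uniform
  # random number in [0, 1).
  def sample(r = rand)
    b = [(r * @n).floor, @n - 1].min
    i = @start[b]
    # Guard against rounding in the bucket computation, then scan forward.
    i -= 1 while i > 0 && @cdf[i - 1] > r
    i += 1 while i < @n - 1 && @cdf[i] <= r
    i
  end
end

sampler = BucketSampler.new([0.8, 0.2])
picks = Array.new(10) { sampler.sample }
```

The forward scan inside sample is where the "compare with each element of the list" step happens; on average it inspects fewer than two entries.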

Answer 6:

Another Ruby example:

def weighted_rand(weights = {})
  # Compare with a small tolerance: exact float equality can fail for valid inputs
  raise 'Probabilities must sum up to 1' unless (weights.values.sum - 1.0).abs < 1e-9
  raise 'Probabilities must not be negative' unless weights.values.all? { |p| p >= 0 }
  # Do more sanity checks depending on the amount of trust in the software component using this method
  # E.g. don't allow duplicates, don't allow non-numeric values, etc.

  # Ignore elements with probability 0
  weights = weights.reject { |k, v| v == 0.0 }   # e.g. => {"a"=>0.4, "b"=>0.4, "c"=>0.2}

  # Accumulate probabilities and map them to a value
  u = 0.0
  ranges = weights.map { |v, p| [u += p, v] }   # e.g. => [[0.4, "a"], [0.8, "b"], [1.0, "c"]]

  # Generate a (pseudo-)random floating point number between 0.0(included) and 1.0(excluded)
  u = rand   # e.g. => 0.4651073966724186

  # Find the first value that has an accumulated probability greater than the random number u
  # (fall back to the last value if rounding leaves the final sum just below 1)
  (ranges.find { |p, v| p > u } || ranges.last).last   # e.g. => "b"
end

How to use:

weights = {'a' => 0.4, 'b' => 0.4, 'c' => 0.2, 'd' => 0.0}

weighted_rand weights

What to expect roughly:

sample = 1000.times.map{ weighted_rand weights }
sample.count('a') # 396
sample.count('b') # 406
sample.count('c') # 198
sample.count('d') # 0

Answer 7:

This is the PHP code I used in production:

/**
 * @return \App\Models\CdnServer
*/
protected function selectWeightedServer(Collection $servers)
{
    if ($servers->count() == 1) {
        return $servers->first();
    }

    $totalWeight = 0;

    foreach ($servers as $server) {
        $totalWeight += $server->getWeight();
    }

    // Select a random server using weighted choice.
    // mt_rand() takes integer bounds, so the weights must be positive integers.
    $randWeight = mt_rand(1, $totalWeight);
    $accWeight = 0;

    foreach ($servers as $server) {
        $accWeight += $server->getWeight();

        if ($accWeight >= $randWeight) {
            return $server;
        }
    }
}

Answer 8:

Ruby solution using the pickup gem:

require 'pickup'

chances = {0=>80, 1=>20}
picker = Pickup.new(chances)

Example:

5.times.collect {
  picker.pick(5)
}

gave output:

[[0, 0, 0, 0, 0], 
 [0, 0, 0, 0, 0], 
 [0, 0, 0, 1, 1], 
 [0, 0, 0, 0, 0], 
 [0, 0, 0, 0, 1]]

Answer 9:

If the array is small, I would expand it to a length of, in this case, five, assign the ids in proportion to their chances, and then pick a uniformly random index:

array[
    0 => 0
    1 => 0
    2 => 0
    3 => 0
    4 => 1
]

Answer 10:

The trick could be to sample an auxiliary array in which each element is repeated in proportion to its probability.

Given the elements and their associated probabilities:

h = {1 => 0.5, 2 => 0.3, 3 => 0.05, 4 => 0.05 }

# round rather than truncate, so that products such as 100 * 0.29 are not cut down to 28
auxiliary_array = h.inject([]){|memo,(k,v)| memo += Array.new((100*v).round,k) }

ruby-1.9.3-p194 > auxiliary_array 
 => [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,                                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4] 

auxiliary_array.sample

If you want to be as generic as possible, you need to calculate the multiplier based on the maximum number of fractional digits, and use it in place of 100:

m = 10**h.values.collect{|e| e.to_s.split(".").last.size }.max

Answer 11:

I would imagine that numbers greater than or equal to 0.8 but less than 1.0 select the third element.

In other terms:

x is a random number between 0 and 1

if 0.0 <= x < 0.2 : Item 1

if 0.2 <= x < 0.8 : Item 2

if 0.8 <= x < 1.0 : Item 3

Answer 12:

I am going to improve on masciugo's answer (https://stackoverflow.com/users/626341/masciugo).

Basically you make one big array where the number of times an element shows up is proportional to the weight.

It has some drawbacks.

The weights might not be integers. Imagine element 1 has a probability of pi and element 2 has a probability of 1-pi. How do you divide that? Or imagine that there are hundreds of such elements.
The array created can be very big. If the least common multiple is 1 million, we need an array of 1 million elements to pick from.

To counter that, this is what you do.

Create such an array, but insert each element only randomly. The probability that an element is inserted is proportional to its weight.

Then select a random element from that array as usual.

So if there are 3 elements with various weights, you simply pick an element from an array of 1-3 elements.

Problems may arise if the constructed array is empty, that is, if it just happens that no element makes it into the array because every dice roll fails.

In which case, I propose that the probability an element is inserted is p(inserted)=wi/wmax.

That way, one element, namely the one that has the highest probability, will always be inserted. The other elements will be inserted according to their relative probability.

Say we have 2 objects.

Element 1 has weight 0.20. Element 2 has weight 0.40, which is the highest.

In the array, element 2 will show up all the time. Element 1 will show up half the time.

So element 2 will be picked more often than element 1, and in general elements with higher weights are picked more often. Note that the ratio is only approximately proportional to the weights: in this example element 1 is picked 0.5 × 0.5 = 25% of the time and element 2 the remaining 75%, that is 3 times as often rather than 2. The probabilities still sum to 1 because the array always has at least 1 element.
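A closely related technique that uses the same wi/wmax acceptance probability is rejection sampling. It is a different scheme from the array construction described in this answer, shown here only as a sketch: pick a uniformly random candidate, accept it with probability wi/wmax, and retry otherwise. The selection frequencies are then exactly proportional to the weights, and no auxiliary array is needed. The function name rejection_sample and the rng parameter are hypothetical.

```ruby
# Hypothetical helper: rejection sampling over an array of non-negative weights.
# rng can be the Random class or a seeded Random instance.
def rejection_sample(weights, rng = Random)
  w_max = weights.max
  raise ArgumentError, 'need at least one positive weight' if w_max.nil? || w_max <= 0

  loop do
    i = rng.rand(weights.size)                  # uniform candidate index
    return i if rng.rand * w_max < weights[i]   # accept with probability w_i / w_max
  end
end

picks = Array.new(10) { rejection_sample([0.2, 0.4]) }
```

The expected number of retries depends on how uneven the weights are: it is n * wmax divided by the sum of the weights, so this works best when no single weight dominates.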