Efficient algorithm to randomly select items with

Given an array of n word-frequency pairs:

[ (w₀, f₀), (w₁, f₁), ..., (w_n-1, f_n-1) ]

where w_i is a word, f_i is an integer frequency, and the sum of the frequencies ∑f_i = m,

I want to use a pseudo-random number generator (pRNG) to select p words w_j₀, w_j₁, ..., w_{j_p-1} such that the probability of selecting any word is proportional to its frequency:

P(w_i = w_{j_k}) = P(i = j_k) = f_i / m

(Note, this is selection with replacement, so the same word could be chosen every time).

I've come up with three algorithms so far:

1. Create an array of size m, and populate it so the first f₀ entries are w₀, the next f₁ entries are w₁, and so on, so the last f_n-1 entries are w_n-1.
```
[ w₀, ..., w₀, w₁,..., w₁, ..., w_n-1, ..., w_n-1 ]
```
```
Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
This takes O(n + m + p) work, which isn't great, since m can be much much larger than n.
2. Step through the input array once, computing
```
m_i = ∑_{h≤i} f_h = m_{i-1} + f_i
```
and after computing m_i, use the pRNG to generate a number x_k in the range 0...m_i-1 for each k in 0...p-1 and select w_i for w_{j_k} (possibly replacing the current value of w_{j_k}) if x_k < f_i.
This requires O(n + np) work.
3. Compute m_i as in algorithm 2, and generate the following array of n word-frequency-partial-sum triples:
```
[ (w₀, f₀, m₀), (w₁, f₁, m₁), ..., (w_n-1, f_n-1, m_n-1) ]
```
and then, for each k in 0...p-1, use the pRNG to generate a number x_k in the range 0...m-1 then do binary search on the array of triples to find the i s.t. m_i-f_i ≤ x_k < m_i, and select w_i for w_{j_k}.
This requires O(n + p log n) work.
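
For concreteness, algorithm 3 could be sketched in Ruby roughly as follows. This is only an illustrative sketch: the method name `sample_by_binary_search` and the `rng` parameter are made up here, it stores just the partial sums rather than full triples, and it assumes m ≥ 1 and Ruby 2.3+ for `Array#bsearch_index`.

```ruby
# Sketch of algorithm 3: partial sums plus binary search.
# `pairs` is an array of [word, frequency] pairs (hypothetical input shape);
# `rng` is any Random instance, passed in so runs can be made repeatable.
def sample_by_binary_search(pairs, p, rng = Random.new)
  # partial_sums[i] = f_0 + ... + f_i, i.e. m_i in the text above
  running = 0
  partial_sums = pairs.map { |_word, freq| running += freq }
  m = running

  Array.new(p) do
    x = rng.rand(m) # uniform integer in 0...m
    # first index i with x < m_i, which is the i with m_i - f_i <= x < m_i
    i = partial_sums.bsearch_index { |m_i| x < m_i }
    pairs[i][0]
  end
end

# Example call with made-up data: draw 5 words.
sample = sample_by_binary_search([["a", 1], ["b", 2], ["c", 7]], 5)
```

Building the partial sums is the O(n) part; each of the p draws then costs one O(log n) binary search.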

My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?

Tags: algorithm random big-o

3 Answers

小情绪 Triste *

#2 · 2019-02-08 22:33

Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:

- There are n partitions, all of the same width r s.t. nr = m.
- Each partition contains one or two words in some ratio (which is stored with the partition).
- For each word w_i, f_i = ∑_{partitions t s.t. w_i ∈ t} r × ratio(t, w_i).

Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.

The reason that such a partitioning exists is that there exists a word w_i s.t. f_i < r, if and only if there exists a word w_i' s.t. f_i' > r, since r is the average of the frequencies.

Given such a pair w_i and w_i' we can replace them with a pseudo-word w'_i of frequency f'_i = r (that represents w_i with probability f_i/r and w_i' with probability 1 - f_i/r) and a new word w'_i' of adjusted frequency f'_i' = f_i' - (r - f_i) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.
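
As a rough illustration of this pairing argument, here is a small Ruby sketch that builds the partitions for some made-up sample frequencies and recovers each word's frequency from them. The helper names `build_partitions` and `recovered_frequencies` are invented for this example. It favours clarity over speed (it scans linearly for each pair, so it is not an O(n) construction), uses `Rational` so the arithmetic stays exact, and assumes Ruby 2.4+ for `Array#sum`.

```ruby
# Repeatedly pair a word at or below the average r with one above it.
# Returns [partitions, r]; each partition is [word, other_word, ratio],
# where ratio is the share of the partition that belongs to `word`.
def build_partitions(pairs)
  n = pairs.size
  m = pairs.sum { |_word, freq| freq }
  r = Rational(m, n) # width of every partition

  remaining = pairs.map { |word, freq| [word, Rational(freq)] }
  partitions = []
  until remaining.empty?
    # some remaining word is always <= r, because r is the average
    i = remaining.index { |_w, f| f <= r }
    word, freq = remaining.delete_at(i)
    if freq == r
      partitions << [word, nil, Rational(1)] # one-word partition
    else
      # top the partition up from a word above the average
      j = remaining.index { |_w, f| f > r }
      remaining[j][1] -= r - freq # adjusted frequency f_i' - (r - f_i)
      partitions << [word, remaining[j][0], freq / r]
    end
  end
  [partitions, r]
end

# Sum r * ratio over the partitions each word appears in.
def recovered_frequencies(partitions, r)
  totals = Hash.new(0)
  partitions.each do |word, other, ratio|
    totals[word] += r * ratio
    totals[other] += r * (1 - ratio) if other
  end
  totals
end

sample_pairs = [["Hero", 80], ["Monkey", 4], ["Shoe", 13], ["Highway", 3]]
sample_partitions, sample_r = build_partitions(sample_pairs)
sample_totals = recovered_frequencies(sample_partitions, sample_r)
```

Here `sample_totals` should give back the original frequency of every word, which is the third property in the list above.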

To construct this partition in O(n) time:

- go through the list of the words once, constructing two lists:
  - one of words with frequency ≤ r
  - one of words with frequency > r
- then pull a word from the first list:
  - if its frequency = r, then make it into a one-element partition
  - otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.
- repeat the previous step until the first list is empty.

This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can scale all the frequencies by a factor of n, so f'_i = nf_i, which updates m' = mn and sets r' = m when q = n.

In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

In Ruby:

```ruby
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (_word, freq)| sum + freq }

  # Find the words with frequency lesser and greater than average.
  # Each frequency is multiplied by n so it stays integral when
  # subdivided; the scaled average is then m.
  lessers, greaters = input.map do |word, freq|
    [word, freq * n]
  end.partition do |_word, adj_freq|
    adj_freq <= m
  end

  partitions = Array.new(n) do
    word, adj_freq = lessers.shift

    other_word = nil
    if adj_freq < m
      # use part of another word's frequency to pad
      # out the partition
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [other_word, other_adj_freq]
    end

    [word, other_word, adj_freq]
  end

  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[rand(n)]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
```


劫难

#3 · 2019-02-08 22:46

This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.

Look at Roulette Selection in Genetic Algorithms


霸刀☆藐视天下

#4 · 2019-02-08 22:47

You could create the target array, then loop through the words, determining for each one the probability that it should be picked, and replace entries in the array according to a random number.

For the first word the probability would be f₀/m₀ (where m_n=f₀+..+f_n), i.e. 100%, so all positions in the target array would be filled with w₀.

For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words according to the frequency.
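
A short derivation of why this works, writing m_i for the partial sum f₀ + … + f_i and m for the total: a slot ends up holding w_i exactly when w_i is written into it and none of the later words overwrites it. The product telescopes, assuming f_i > 0 so that every m_h involved is positive:

```latex
P(\text{slot holds } w_i)
  = \frac{f_i}{m_i} \prod_{h=i+1}^{n-1} \left( 1 - \frac{f_h}{m_h} \right)
  = \frac{f_i}{m_i} \prod_{h=i+1}^{n-1} \frac{m_{h-1}}{m_h}
  = \frac{f_i}{m_i} \cdot \frac{m_i}{m_{n-1}}
  = \frac{f_i}{m}
```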

Example code in C#:

```csharp
using System;

public class WordFrequency {

    public string Word { get; private set; }
    public int Frequency { get; private set; }

    public WordFrequency(string word, int frequency) {
        Word = word;
        Frequency = frequency;
    }

}

public static class Program {

    public static void Main() {
        WordFrequency[] words = new WordFrequency[] {
            new WordFrequency("Hero", 80),
            new WordFrequency("Monkey", 4),
            new WordFrequency("Shoe", 13),
            new WordFrequency("Highway", 3),
        };

        int p = 7;                        // number of words to select
        string[] result = new string[p];
        int sum = 0;                      // running partial sum of the frequencies
        Random rnd = new Random();
        foreach (WordFrequency wf in words) {
            sum += wf.Frequency;
            for (int i = 0; i < p; i++) {
                // overwrite slot i with probability wf.Frequency / sum
                if (rnd.Next(sum) < wf.Frequency) {
                    result[i] = wf.Word;
                }
            }
        }

        Console.WriteLine(string.Join(", ", result));
    }

}
```
