Efficient algorithm to randomly select items with weighted probabilities

Posted 2019-02-08 22:12

Given an array of n word-frequency pairs:

[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]

where wi is a word, fi is an integer frequency, and the sum of the frequencies is ∑fi = m,

I want to use a pseudo-random number generator (pRNG) to select p words wj0, wj1, ..., wjp-1 such that the probability of selecting any word is proportional to its frequency:

P(wi = wjk) = P(i = jk) = fi / m

(Note, this is selection with replacement, so the same word could be chosen every time).

I've come up with three algorithms so far:

  1. Create an array of size m, and populate it so the first f0 entries are w0, the next f1 entries are w1, and so on, so the last fn-1 entries are wn-1.

    [ w0, ..., w0, w1, ..., w1, ..., wn-1, ..., wn-1 ]
    Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
    This takes O(n + m + p) work, which isn't great, since m can be much larger than n.

  2. Step through the input array once, computing

    mi = ∑h≤i fh = mi-1 + fi
    and after computing mi, use the pRNG to generate a number xk in the range 0...mi-1 for each k in 0...p-1 and select wi for wjk (possibly replacing the current value of wjk) if xk < fi.
    This requires O(n + np) work.

  3. Compute mi as in algorithm 2, and generate the following array of n word-frequency-partial-sum triples:
    [ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ]
    and then, for each k in 0...p-1, use the pRNG to generate a number xk in the range 0...m-1 then do binary search on the array of triples to find the i s.t. mi-fi ≤ xk < mi, and select wi for wjk.
    This requires O(n + p log n) work.
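For concreteness, the three algorithms above can be sketched in Ruby as follows (function names are mine; `pairs` is the array of [word, frequency] pairs and `p` is the number of draws):

```ruby
# Algorithm 1: expand every word to `freq` copies, then index at random.
# O(n + m) to build the table, O(p) to sample; memory is O(m).
def sample_expanded(pairs, p)
  expanded = pairs.flat_map { |word, freq| [word] * freq }
  Array.new(p) { expanded[rand(expanded.size)] }
end

# Algorithm 2: one pass over the input; each slot k keeps word wi with
# probability fi/mi at step i, which telescopes to fi/m. O(n + np).
def sample_one_pass(pairs, p)
  selected = Array.new(p)
  running = 0
  pairs.each do |word, freq|
    running += freq
    p.times { |k| selected[k] = word if rand(running) < freq }
  end
  selected
end

# Algorithm 3: prefix sums + binary search per draw. O(n + p log n).
def sample_prefix_sums(pairs, p)
  total = 0
  prefix = pairs.map { |_, freq| total += freq }  # prefix[i] = mi
  words = pairs.map { |word, _| word }
  Array.new(p) do
    x = rand(total)
    words[prefix.bsearch_index { |mi| x < mi }]  # smallest i with x < mi
  end
end
```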

My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?

3 Answers
小情绪 Triste *
Answer 2 · 2019-02-08 22:33

Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:

  • There are n partitions, all of the same width r s.t. nr = m.
  • Each partition contains two words in some ratio (which is stored with the partition).
  • For each word wi, fi = ∑ (over partitions t s.t. wi ∈ t) r × ratio(t, wi).

Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.

The reason that such a partitioning exists is that there exists a word wi s.t. fi < r, if and only if there exists a word wi' s.t. fi' > r, since r is the average of the frequencies.

Given such a pair wi and wi' we can replace them with a pseudo-word w'i of frequency f'i = r (that represents wi with probability fi/r and wi' with probability 1 - fi/r) and a new word w'i' of adjusted frequency f'i' = fi' - (r - fi) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.

To construct this partition in O(n) time,

  • go through the list of the words once, constructing two lists:
    • one of words with frequency ≤ r
    • one of words with frequency > r
  • then pull a word from the first list
    • if its frequency = r, then make it into a one element partition
    • otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.

This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'i = nfi, which updates m' = mn and sets r' = m when q = n.

In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

In ruby:

def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum,(word,freq)| sum + freq }

  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word,freq| 
                        # pad the frequency so we can keep it integral
                        # when subdivided
                        [ word, freq*n ] 
                      end.partition do |word,adj_freq| 
                        adj_freq <= m 
                      end

  partitions = Array.new(n) do
    word, adj_freq = lessers.shift

    other_word = if adj_freq < m
                   # use part of another word's frequency to pad
                   # out the partition
                   other_word, other_adj_freq = greaters.shift
                   other_adj_freq -= (m - adj_freq)
                   (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
                   other_word
                 end

    [ word, other_word , adj_freq ]
  end

  (0...p).map do 
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
劫难
Answer 3 · 2019-02-08 22:46

This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.

Look at Roulette Selection in Genetic Algorithms
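A single roulette-wheel spin is just a linear walk over the cumulative frequencies — essentially the question's algorithm 3 without the precomputed table. A minimal sketch (not from the linked article):

```ruby
# One roulette-wheel spin: land a pointer uniformly in 0...m, then walk
# the wheel's sectors until the pointer falls inside one. O(n) per spin.
def roulette_spin(pairs, m)
  x = rand(m)
  pairs.each do |word, freq|
    return word if x < freq
    x -= freq
  end
end
```

Each word "owns" a sector of the wheel whose width is its frequency, so it is selected with probability freq/m per spin.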

霸刀☆藐视天下
Answer 4 · 2019-02-08 22:47

You could create the target array, then loop through the words, determining the probability that each should be picked, and replace the words in the array according to a random number.

For the first word the probability would be f0/m0 (where mi = f0 + ... + fi), i.e. 100%, so all positions in the target array would be filled with w0.

For the following words the probability falls, and when you reach the last word the target array is filled with words randomly picked according to the frequencies.

Example code in C#:

public class WordFrequency {

    public string Word { get; private set; }
    public int Frequency { get; private set; }

    public WordFrequency(string word, int frequency) {
        Word = word;
        Frequency = frequency;
    }

}

WordFrequency[] words = new WordFrequency[] {
    new WordFrequency("Hero", 80),
    new WordFrequency("Monkey", 4),
    new WordFrequency("Shoe", 13),
    new WordFrequency("Highway", 3),
};

int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();
foreach (WordFrequency wf in words) {
    sum += wf.Frequency;
    for (int i = 0; i < p; i++) {
        if (rnd.Next(sum) < wf.Frequency) {
            result[i] = wf.Word;
        }
    }
}