Given an array of n
word-frequency pairs:
[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
where wi
is a word, fi
is an integer frequencey, and the sum of the frequencies ∑fi = m
,
I want to use a pseudo-random number generator (pRNG) to select p
words wj0, wj1, ..., wjp-1
such that
the probability of selecting any word is proportional to its frequency:
P(wi = wjk) = P(i = jk) = fi / m
(Note, this is selection with replacement, so the same word could be chosen every time).
I've come up with three algorithms so far:
Create an array of size
m
, and populate it so the firstf0
entries arew0
, the nextf1
entries arew1
, and so on, so the lastfp-1
entries arewp-1
.[ w0, ..., w0, w1,..., w1, ..., wp-1, ..., wp-1 ]
Then use the pRNG to selectp
indices in the range0...m-1
, and report the words stored at those indices.
This takesO(n + m + p)
work, which isn't great, sincem
can be much much larger than n.Step through the input array once, computing
mi = ∑h≤ifh = mi-1 + fi
and after computingmi
, use the pRNG to generate a numberxk
in the range0...mi-1
for eachk
in0...p-1
and selectwi
forwjk
(possibly replacing the current value ofwjk
) ifxk < fi
.
This requiresO(n + np)
work.- Compute
mi
as in algorithm 2, and generate the following array on n word-frequency-partial-sum triples:[ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ]
and then, for each k in0...p-1
, use the pRNG to generate a numberxk
in the range0...m-1
then do binary search on the array of triples to find thei
s.t.mi-fi ≤ xk < mi
, and selectwi
forwjk
.
This requiresO(n + p log n)
work.
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:
n
partitions, all of the same widthr
s.t.nr = m
.wi
,fi = ∑partitions t s.t wi ∈ t r × ratio(t,wi)
Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from
0...n-1
at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means thep
selections can be done inO(p)
work, given such a partition.The reason that such a partitioning exists is that there exists a word
wi
s.t.fi < r
, if and only if there exists a wordwi'
s.t.fi' > r
, since r is the average of the frequencies.Given such a pair
wi
andwi'
we can replace them with a pseudo-wordw'i
of frequencyf'i = r
(that representswi
with probabilityfi/r
andwi'
with probability1 - fi/r
) and a new wordw'i'
of adjusted frequencyf'i' = fi' - (r - fi)
respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.To construct this partition in
O(n)
time,This actually still works if the number of partitions
q > n
(you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factorq
ofm
s.t.q > n
, you can pad all the frequencies by a factor ofn
, sof'i = nfi
, which updatesm' = mn
and setsr' = m
whenq = n
.In any case, this algorithm only takes
O(n + p)
work, which I have to think is optimal.In ruby:
This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.
Look at Roulette Selection in Genetic Algorithms
You could create the target array, then loop through the words determining the probability that it should be picked, and replace the words in the array according to a random number.
For the first word the probability would be f0/m0 (where mn=f0+..+fn), i.e. 100%, so all positions in the target array would be filled with w0.
For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words accoding to the frequency.
Example code in C#: