I needed to write a weighted version of random.choice (each element in the list has a different probability for being selected). This is what I came up with:
```python
import random

def weightedChoice(choices):
    """Like random.choice, but each element can have a different chance of
    being selected.

    choices can be any iterable containing iterables with two items each.
    Technically, they can have more than two items, the rest will just be
    ignored. The first item is the thing being chosen, the second item is
    its weight. The weights can be any numeric values; what matters is the
    relative differences between them.
    """
    space = {}
    current = 0
    for choice, weight in choices:
        if weight > 0:
            space[current] = choice
            current += weight
    rand = random.uniform(0, current)
    # list(...) is needed on Python 3, where keys() returns a view
    for key in sorted(list(space.keys()) + [current]):
        if rand < key:
            return choice
        choice = space[key]
    return None
```
This function seems overly complex to me, and ugly. I'm hoping everyone here can offer some suggestions on improving it or alternate ways of doing this. Efficiency isn't as important to me as code cleanliness and readability.
I'd require that the weights sum to 1, but this works anyway.
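For comparison, the same behavior can be sketched much more compactly by precomputing cumulative weights and bisecting into them. This is one possible cleanup, not the only reasonable one; the name `weighted_choice` is mine:

```python
import bisect
import itertools
import random

def weighted_choice(choices):
    """Like random.choice, but choices is an iterable of (item, weight) pairs."""
    # Drop non-positive weights, then build the running totals once.
    items, weights = zip(*((c, w) for c, w in choices if w > 0))
    cum_weights = list(itertools.accumulate(weights))
    # Pick a point in [0, total] and find which segment it falls into.
    rand = random.uniform(0, cum_weights[-1])
    return items[bisect.bisect_left(cum_weights, rand)]
```

Using `bisect_left` on the cumulative totals keeps the lookup O(log n) and sidesteps the second loop entirely.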
As of Python v3.6, `random.choices` can be used to return a `list` of elements of a specified size from the given population, with optional weights.

- *population*: `list` containing unique observations. (If empty, raises `IndexError`.)
- *weights*: more precisely, relative weights required to make selections.
- *cum_weights*: cumulative weights required to make selections.
- *k*: size (`len`) of the `list` to be output. (Defaults to 1.)

A few caveats:

1) It makes use of weighted sampling *with replacement*, so drawn items can be drawn again. The values in the weights sequence in themselves do not matter; only their relative ratios do.

Unlike `np.random.choice`, which can only take probabilities as weights and which must ensure that the individual probabilities sum to 1, there are no such restrictions here. As long as the weights belong to numeric types (`int`/`float`/`Fraction`, but not `Decimal`), they will work.

2) If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. Specifying both weights and cum_weights raises a `TypeError`.

3) cum_weights are typically the result of the `itertools.accumulate` function, which is really handy in such situations. So, supplying either `weights=[12, 12, 4]` or `cum_weights=[12, 24, 28]` for our contrived case produces the same outcome, and the latter seems to be faster / more efficient.

A general solution:
It depends on how many times you want to sample the distribution.

Suppose you want to sample the distribution K times. Then the time complexity of using `np.random.choice()` each time is `O(K(n + log(n)))`, where `n` is the number of items in the distribution.

In my case, I needed to sample the same distribution on the order of 10^3 times, with n on the order of 10^6. I used the code below, which precomputes the cumulative distribution and samples it in `O(log(n))` per draw. The overall time complexity is `O(n + K*log(n))`.
I'm probably too late to contribute anything useful, but here's a simple, short, and very efficient snippet:

There's no need to sort your probabilities or create a vector with your CMF, and it terminates once it finds its choice. Memory: O(1), time: O(N), with an average running time of ~N/2.

If you have weights, simply add one line: