I have a bunch of keys that each have an unlikeliness variable. I want to randomly choose one of these keys, yet I want it to be more unlikely for unlikely (key, values) to be chosen than a less unlikely (a more likely) object. I am wondering if you would have any suggestions, preferably an existing python module that I could use, else I will need to make it myself.
I have checked out the random module; it does not seem to provide this.
I have to make such choices many millions of times for 1000 different sets of objects each containing 2,455 objects. Each set will exchange objects among each other so the random chooser needs to be dynamic. With 1000 sets of 2,433 objects, that is 2,433 million objects; low memory consumption is crucial. And since these choices are not the bulk of the algorithm, I need this process to be quite fast; CPU-time is limited.
Thx
Update:
Ok, I tried to consider your suggestions wisely, but time is so limited...
I looked at the binary search tree approach and it seems too risky (complex and complicated). The other suggestions all resemble the ActiveState recipe. I took it and modified it a little in the hope of making more efficient:
def windex(dict, sum, max):
'''an attempt to make a random.choose() function that makes
weighted choices accepts a dictionary with the item_key and
certainty_value as a pair like:
>>> x = [('one', 20), ('two', 2), ('three', 50)], the
maximum certainty value (max) and the sum of all certainties.'''
n = random.uniform(0, 1)
sum = max*len(list)-sum
for key, certainty in dict.iteritems():
weight = float(max-certainty)/sum
if n < weight:
break
n = n - weight
return key
I am hoping to get an efficiency gain from dynamically maintaining the sum of certainties and the maximum certainty. Any further suggestions are welcome. You guys saves me so much time and effort, while increasing my effectiveness, it is crazy. Thx! Thx! Thx!
Update2:
I decided to make it more efficient by letting it choose more choices at once. This will result in an acceptable loss in precision in my algo for it is dynamic in nature. Anyway, here is what I have now:
def weightedChoices(dict, sum, max, choices=10):
'''an attempt to make a random.choose() function that makes
weighted choices accepts a dictionary with the item_key and
certainty_value as a pair like:
>>> x = [('one', 20), ('two', 2), ('three', 50)], the
maximum certainty value (max) and the sum of all certainties.'''
list = [random.uniform(0, 1) for i in range(choices)]
(n, list) = relavate(list.sort())
keys = []
sum = max*len(list)-sum
for key, certainty in dict.iteritems():
weight = float(max-certainty)/sum
if n < weight:
keys.append(key)
if list: (n, list) = relavate(list)
else: break
n = n - weight
return keys
def relavate(list):
min = list[0]
new = [l - min for l in list[1:]]
return (min, new)
I haven't tried it out yet. If you have any comments/suggestions, please do not hesitate. Thx!
Update3:
I have been working all day on a task-tailored version of Rex Logan's answer. Instead of a 2 arrays of objects and weights, it is actually a special dictionary class; which makes things quite complex since Rex's code generates a random index... I also coded a test case that kind of resembles what will happen in my algo (but I can't really know until I try!). The basic principle is: the more a key is randomly generated often, the more unlikely it will be generated again:
import random, time
import psyco
psyco.full()
class ProbDict():
"""
Modified version of Rex Logans RandomObject class. The more a key is randomly
chosen, the more unlikely it will further be randomly chosen.
"""
def __init__(self,keys_weights_values={}):
self._kw=keys_weights_values
self._keys=self._kw.keys()
self._len=len(self._keys)
self._findSeniors()
self._effort = 0.15
self._fails = 0
def __iter__(self):
return self.next()
def __getitem__(self, key):
return self._kw[key]
def __setitem__(self, key, value):
self.append(key, value)
def __len__(self):
return self._len
def next(self):
key=self._key()
while key:
yield key
key = self._key()
def __contains__(self, key):
return key in self._kw
def items(self):
return self._kw.items()
def pop(self, key):
try:
(w, value) = self._kw.pop(key)
self._len -=1
if w == self._seniorW:
self._seniors -= 1
if not self._seniors:
#costly but unlikely:
self._findSeniors()
return [w, value]
except KeyError:
return None
def popitem(self):
return self.pop(self._key())
def values(self):
values = []
for key in self._keys:
try:
values.append(self._kw[key][1])
except KeyError:
pass
return values
def weights(self):
weights = []
for key in self._keys:
try:
weights.append(self._kw[key][0])
except KeyError:
pass
return weights
def keys(self, imperfect=False):
if imperfect: return self._keys
return self._kw.keys()
def append(self, key, value=None):
if key not in self._kw:
self._len +=1
self._kw[key] = [0, value]
self._keys.append(key)
else:
self._kw[key][1]=value
def _key(self):
for i in range(int(self._effort*self._len)):
ri=random.randint(0,self._len-1) #choose a random object
rx=random.uniform(0,self._seniorW)
rkey = self._keys[ri]
try:
w = self._kw[rkey][0]
if rx >= w: # test to see if that is the value we want
w += 1
self._warnSeniors(w)
self._kw[rkey][0] = w
return rkey
except KeyError:
self._keys.pop(ri)
# if you do not find one after 100 tries then just get a random one
self._fails += 1 #for confirming effectiveness only
for key in self._keys:
if key in self._kw:
w = self._kw[key][0] + 1
self._warnSeniors(w)
self._kw[key][0] = w
return key
return None
def _findSeniors(self):
'''this function finds the seniors, counts them and assess their age. It
is costly but unlikely.'''
seniorW = 0
seniors = 0
for w in self._kw.itervalues():
if w >= seniorW:
if w == seniorW:
seniors += 1
else:
seniorsW = w
seniors = 1
self._seniors = seniors
self._seniorW = seniorW
def _warnSeniors(self, w):
#a weight can only be incremented...good
if w >= self._seniorW:
if w == self._seniorW:
self._seniors+=1
else:
self._seniors = 1
self._seniorW = w
def test():
#test code
iterations = 200000
size = 2500
nextkey = size
pd = ProbDict(dict([(i,[0,i]) for i in xrange(size)]))
start = time.clock()
for i in xrange(iterations):
key=pd._key()
w=pd[key][0]
if random.randint(0,1+pd._seniorW-w):
#the heavier the object, the more unlikely it will be removed
pd.pop(key)
probAppend = float(500+(size-len(pd)))/1000
if random.uniform(0,1) < probAppend:
nextkey+=1
pd.append(nextkey)
print (time.clock()-start)*1000/iterations, "msecs / iteration with", pd._fails, "failures /", iterations, "iterations"
weights = pd.weights()
weights.sort()
print "avg weight:", float(sum(weights))/pd._len, max(weights), pd._seniorW, pd._seniors, len(pd), len(weights)
print weights
test()
Any comments are still welcome. @Darius: your binary trees are too complex and complicated for me; and I do not think its leafs can be removed efficiently... Thx all
This activestate recipe gives an easy-to-follow approach, specifically the version in the comments that doesn't require you to pre-normalize your weights:
This will be slow if you have a large list of items. A binary search would probably be better in that case... but would also be more complicated to write, for little gain if you have a small sample size. Here's an example of the binary search approach in python if you want to follow that route.
(I'd recommend doing some quick performance testing of both methods on your dataset. The performance of different approaches to this sort of algorithm is often a bit unintuitive.)
Edit: I took my own advice, since I was curious, and did a few tests.
I compared four approaches:
The weighted_choice function above.
A binary-search choice function like so:
A compiling version of 1:
A compiling version of 2:
I then built a big list of choices like so:
And an excessively simple profiling function:
The results:
(Seconds taken for 1,000 calls to the function.)
The "compiled" results include the average time taken to compile the choice function once. (I timed 1,000 compiles, then divided that time by 1,000, and added the result to the choice function time.)
So: if you have a list of items+weights which change very rarely, the binary compiled method is by far the fastest.
About 3 years later...
If you use numpy, perhaps the simplest option is to use
np.random.choice
, which takes a list of possible values, and an optional sequence of probabilities associated with each value:A very easy and simple way of doing this is to set weights for each of the values, and it wouldn't require much memory.
You could probably use a hash/dictionary to do this.
What you'll want to do is to have the random number, x, multiplied and summed over the entire set of things you want selected, and divide that result over the number of objects in your set.
Pseudo-code:
EDIT: I just thought of how slow my code would be with very large sets (it's O(n)). The following pseudo-code is O(log(n)), and is basically using a binary search.
There are implementations of binary search in Python all over the 'net, so no need repeating here.
Here is a classic way to do it, in pseudocode, where random.random() gives you a random float from 0 to 1.
For an example: imagine you have two objects, one with weight 2, another with weight 4. You generate a number from 0 to 6. If
choice
is between 0 and 2, which will happen with 2/6 = 1/3 probability, then it will get subtracted by 2 and the first object is chosen. If choice is between 2 and 6, which will happen with 4/6 = 2/3 probability, then the first subtraction will still have choice being > 0, and the second subtraction will make the 2nd object get chosen.I suggest you port this PHP implementation of weighted random to Python. In particular, the binary-search-based second algorithm helps address your speed concerns.
Here's a better answer for a special probability distribution, the one Rex Logan's answer seems to be geared at. The distribution is like this: each object has an integer weight between 0 and 100, and its probability is in proportion to its weight. Since that's the currently accepted answer, I guess this is worth thinking about.
So keep an array of 101 bins. Each bin holds a list of all of the objects with its particular weight. Each bin also knows the total weight of all its objects.
To sample: pick a bin at random in proportion to its total weight. (Use one of the standard recipes for this -- linear or binary search.) Then pick an object from the bin uniformly at random.
To transfer an object: remove it from its bin, put it in its bin in the target, and update both bins' weights. (If you're using binary search for sampling, you must also update the running sums that uses. This is still reasonably fast since there aren't many bins.)