I have a bunch of keys that each have an unlikeliness variable. I want to randomly choose one of these keys, yet I want it to be more unlikely for unlikely (key, values) to be chosen than a less unlikely (a more likely) object. I am wondering if you would have any suggestions, preferably an existing python module that I could use, else I will need to make it myself.
I have checked out the random module; it does not seem to provide this.
I have to make such choices many millions of times for 1000 different sets of objects each containing 2,455 objects. Each set will exchange objects among each other so the random chooser needs to be dynamic. With 1000 sets of 2,433 objects, that is 2,433 million objects; low memory consumption is crucial. And since these choices are not the bulk of the algorithm, I need this process to be quite fast; CPU-time is limited.
Thx
Update:
Ok, I tried to consider your suggestions wisely, but time is so limited...
I looked at the binary search tree approach and it seems too risky (complex and complicated). The other suggestions all resemble the ActiveState recipe. I took it and modified it a little in the hope of making more efficient:
def windex(dict, sum, max):
'''an attempt to make a random.choose() function that makes
weighted choices accepts a dictionary with the item_key and
certainty_value as a pair like:
>>> x = [('one', 20), ('two', 2), ('three', 50)], the
maximum certainty value (max) and the sum of all certainties.'''
n = random.uniform(0, 1)
sum = max*len(list)-sum
for key, certainty in dict.iteritems():
weight = float(max-certainty)/sum
if n < weight:
break
n = n - weight
return key
I am hoping to get an efficiency gain from dynamically maintaining the sum of certainties and the maximum certainty. Any further suggestions are welcome. You guys saves me so much time and effort, while increasing my effectiveness, it is crazy. Thx! Thx! Thx!
Update2:
I decided to make it more efficient by letting it choose more choices at once. This will result in an acceptable loss in precision in my algo for it is dynamic in nature. Anyway, here is what I have now:
def weightedChoices(dict, sum, max, choices=10):
'''an attempt to make a random.choose() function that makes
weighted choices accepts a dictionary with the item_key and
certainty_value as a pair like:
>>> x = [('one', 20), ('two', 2), ('three', 50)], the
maximum certainty value (max) and the sum of all certainties.'''
list = [random.uniform(0, 1) for i in range(choices)]
(n, list) = relavate(list.sort())
keys = []
sum = max*len(list)-sum
for key, certainty in dict.iteritems():
weight = float(max-certainty)/sum
if n < weight:
keys.append(key)
if list: (n, list) = relavate(list)
else: break
n = n - weight
return keys
def relavate(list):
min = list[0]
new = [l - min for l in list[1:]]
return (min, new)
I haven't tried it out yet. If you have any comments/suggestions, please do not hesitate. Thx!
Update3:
I have been working all day on a task-tailored version of Rex Logan's answer. Instead of a 2 arrays of objects and weights, it is actually a special dictionary class; which makes things quite complex since Rex's code generates a random index... I also coded a test case that kind of resembles what will happen in my algo (but I can't really know until I try!). The basic principle is: the more a key is randomly generated often, the more unlikely it will be generated again:
import random, time
import psyco
psyco.full()
class ProbDict():
"""
Modified version of Rex Logans RandomObject class. The more a key is randomly
chosen, the more unlikely it will further be randomly chosen.
"""
def __init__(self,keys_weights_values={}):
self._kw=keys_weights_values
self._keys=self._kw.keys()
self._len=len(self._keys)
self._findSeniors()
self._effort = 0.15
self._fails = 0
def __iter__(self):
return self.next()
def __getitem__(self, key):
return self._kw[key]
def __setitem__(self, key, value):
self.append(key, value)
def __len__(self):
return self._len
def next(self):
key=self._key()
while key:
yield key
key = self._key()
def __contains__(self, key):
return key in self._kw
def items(self):
return self._kw.items()
def pop(self, key):
try:
(w, value) = self._kw.pop(key)
self._len -=1
if w == self._seniorW:
self._seniors -= 1
if not self._seniors:
#costly but unlikely:
self._findSeniors()
return [w, value]
except KeyError:
return None
def popitem(self):
return self.pop(self._key())
def values(self):
values = []
for key in self._keys:
try:
values.append(self._kw[key][1])
except KeyError:
pass
return values
def weights(self):
weights = []
for key in self._keys:
try:
weights.append(self._kw[key][0])
except KeyError:
pass
return weights
def keys(self, imperfect=False):
if imperfect: return self._keys
return self._kw.keys()
def append(self, key, value=None):
if key not in self._kw:
self._len +=1
self._kw[key] = [0, value]
self._keys.append(key)
else:
self._kw[key][1]=value
def _key(self):
for i in range(int(self._effort*self._len)):
ri=random.randint(0,self._len-1) #choose a random object
rx=random.uniform(0,self._seniorW)
rkey = self._keys[ri]
try:
w = self._kw[rkey][0]
if rx >= w: # test to see if that is the value we want
w += 1
self._warnSeniors(w)
self._kw[rkey][0] = w
return rkey
except KeyError:
self._keys.pop(ri)
# if you do not find one after 100 tries then just get a random one
self._fails += 1 #for confirming effectiveness only
for key in self._keys:
if key in self._kw:
w = self._kw[key][0] + 1
self._warnSeniors(w)
self._kw[key][0] = w
return key
return None
def _findSeniors(self):
'''this function finds the seniors, counts them and assess their age. It
is costly but unlikely.'''
seniorW = 0
seniors = 0
for w in self._kw.itervalues():
if w >= seniorW:
if w == seniorW:
seniors += 1
else:
seniorsW = w
seniors = 1
self._seniors = seniors
self._seniorW = seniorW
def _warnSeniors(self, w):
#a weight can only be incremented...good
if w >= self._seniorW:
if w == self._seniorW:
self._seniors+=1
else:
self._seniors = 1
self._seniorW = w
def test():
#test code
iterations = 200000
size = 2500
nextkey = size
pd = ProbDict(dict([(i,[0,i]) for i in xrange(size)]))
start = time.clock()
for i in xrange(iterations):
key=pd._key()
w=pd[key][0]
if random.randint(0,1+pd._seniorW-w):
#the heavier the object, the more unlikely it will be removed
pd.pop(key)
probAppend = float(500+(size-len(pd)))/1000
if random.uniform(0,1) < probAppend:
nextkey+=1
pd.append(nextkey)
print (time.clock()-start)*1000/iterations, "msecs / iteration with", pd._fails, "failures /", iterations, "iterations"
weights = pd.weights()
weights.sort()
print "avg weight:", float(sum(weights))/pd._len, max(weights), pd._seniorW, pd._seniors, len(pd), len(weights)
print weights
test()
Any comments are still welcome. @Darius: your binary trees are too complex and complicated for me; and I do not think its leafs can be removed efficiently... Thx all
I was needed in faster functions, for non very large numbers. So here it is, in Visual C++:
The simplest thing to do is to use random.choice (which uses a uniform distribution) and vary the frequency of occurrence on the object in the source collection.
... vs:
So your objects could have a base occurrence rate (n) and between 1 and n objects are added to the source collection as a function of the conviction rate. This method is really simple; however, it can have significant overhead if the number of distinct objects is large or the conviction rate needs to be very fine grained.
Alternatively, if you generate more that one random number using a uniform distribution and sum them, numbers occurring near the mean are more probable that those occurring near the extremes (think of rolling two dice and the probability of getting 7 versus 12 or 2). You can then order the objects by conviction rate and generate a number using multiple die rolls which you use to calculate and index into the objects. Use numbers near the mean to index low conviction objects and numbers near the extremes to index high conviction items. You can vary the precise probability that a given object will be selected by changing the "number of sides" and number of your "dice" (it may be simpler to put the objects into buckets and use dice with a small number of sides rather than trying to associate each object with a specific result):
You want to give each object a weight. The bigger the weight the more likely it will happen. More precisely probx =weight/sum_all_weights.
Then generate a random number in the range 0 to sum_all_weights and map it to each object.
This code allows you to generate a random index and it is mapped when the object is created for speed. If all of your sets of objects have the same distribution then you can get by with only one RandomIndex object.
(A year later) Walker's alias method for random objects with different probablities is very fast and very simple
In comments on the original post, Nicholas Leonard suggests that both the exchanging and the sampling need to be fast. Here's an idea for that case; I haven't tried it.
If only sampling had to be fast, we could use an array of the values together with the running sum of their probabilities, and do a binary search on the running sum (with key being a uniform random number) -- an O(log(n)) operation. But an exchange would require updating all of the running-sum values appearing after the entries exchanged -- an O(n) operation. (Could you choose to exchange only items near the end of their lists? I'll assume not.)
So let's aim for O(log(n)) in both operations. Instead of an array, keep a binary tree for each set to sample from. A leaf holds the sample value and its (unnormalized) probability. A branch node holds the total probability of its children.
To sample, generate a uniform random number
x
between 0 and the total probability of the root, and descend the tree. At each branch, choose the left child if the left child has total probability<= x
. Else subtract the left child's probability fromx
and go right. Return the leaf value you reach.To exchange, remove the leaf from its tree and adjust the branches that lead down to it (decreasing their total probability, and cutting out any single-child branch nodes). Insert the leaf into the destination tree: you have a choice of where to put it, so keep it balanced. Picking a random child at each level is probably good enough -- that's where I'd start. Increase each parent node's probability, back up to the root.
Now both sampling and exchange are O(log(n)) on average. (If you need guaranteed balance, a simple way is to add another field to the branch nodes holding the count of leaves in the whole subtree. When adding a leaf, at each level pick the child with fewer leaves. This leaves the possibility of a tree getting unbalanced solely by deletions; this can't be a problem if there's reasonably even traffic between the sets, but if it is, then choose rotations during deletion using the leaf-count information on each node in your traversal.)
Update: On request, here's a basic implementation. Haven't tuned it at all. Usage:
Code:
Update 2: In answering another problem, Jason Orendorff points out that the binary trees can be kept perfectly balanced by representing them in an array just like the classical heap structure. (This saves the space spent on pointers, too.) See my comments to that answer for how to adapt his code to this problem.
I would use this recipe . You will need to add a weight to your objects, but that is just a simple ratio and put them in a list of tuples (object, conviction/(sum of convictions)). This should be easy to do using a list comprehension.