I have a long Python generator that I want to "thin out" by randomly selecting a subset of values. Unfortunately, random.sample() will not work with arbitrary iterables. Apparently, it needs something that supports the len() operation (and perhaps non-sequential access to the sequence, but that's not clear). And I don't want to build an enormous list just so I can thin it out.
As a matter of fact, it is possible to sample from a sequence uniformly in one pass, without knowing its length: there's a nice algorithm in Programming Perl that does just that (edit: "reservoir sampling", thanks @user2357112!). But does anyone know of a standard Python module that provides this functionality?
Demo of the problem (Python 3):
>>> import itertools, random
>>> random.sample(iter("abcd"), 2)
Traceback (most recent call last):
  ...
TypeError: Population must be a sequence or set. For dicts, use list(d).
On Python 2, the error is more transparent:
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    random.sample(iter("abcd"), 2)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/random.py", line 321, in sample
    n = len(population)
TypeError: object of type 'iterator' has no len()
If there's no alternative to random.sample(), I'd try my luck with wrapping the generator in an object that provides a __len__ method (I can find out the length in advance). So I'll accept an answer that shows how to do that cleanly.
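For reference, a minimal sketch of such a wrapper (the class name LenIterable is my own invention). One caveat: in CPython, random.sample() also indexes into the population, so providing __len__ alone does not satisfy it; the wrapper is still useful for any code that only needs len() and iteration.

```python
class LenIterable:
    """Wrap an iterator together with its known length.

    Hypothetical helper: len() and iteration work, but note that
    random.sample() also requires indexed access in CPython, so this
    alone will not make sample() accept the object.
    """
    def __init__(self, iterator, length):
        self._iterator = iterator
        self._length = length

    def __iter__(self):
        return self._iterator

    def __len__(self):
        return self._length

wrapped = LenIterable(iter("abcd"), 4)
print(len(wrapped))   # 4
print(list(wrapped))  # ['a', 'b', 'c', 'd']
```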
Use the itertools.compress() function with a random selector function:
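A sketch of the compress() approach (the helper name thin_out and the default probability are my own choices): compress() takes a data iterable and a selector iterable, and yields the data items whose selector is truthy, so an infinite stream of random booleans works as the selector.

```python
import itertools
import random

def thin_out(iterable, prob=0.5):
    """Keep each element independently with probability `prob`.

    The selector side is an infinite stream of random booleans;
    compress() stops as soon as the data iterable is exhausted.
    """
    selectors = (random.random() < prob for _ in itertools.count())
    return itertools.compress(iterable, selectors)

random.seed(1)
print(list(thin_out(iter(range(20)), 0.3)))
```

Note this thins by a fixed fraction; it does not produce a sample of a guaranteed size.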
Use Algorithm R (https://en.wikipedia.org/wiki/Reservoir_sampling), which runs in O(n), to select k random elements from iterable:
Example:
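The original code did not survive above, so here is a sketch of a reservoir_sample() along the lines of Algorithm R (one pass, uniform, no len() needed):

```python
import itertools
import random

def reservoir_sample(iterable, k):
    """Select k items uniformly from iterable in one pass (Algorithm R)."""
    it = iter(iterable)
    # Fill the reservoir with the first k items.
    reservoir = list(itertools.islice(it, k))
    if len(reservoir) < k:
        raise ValueError("Sample larger than population")
    # Item number i (1-based) replaces a random slot with probability k/i.
    for i, item in enumerate(it, start=k + 1):
        j = random.randrange(i)  # 0 <= j < i
        if j < k:
            reservoir[j] = item
    return reservoir

random.seed(0)
print(reservoir_sample(iter("abcdefgh"), 3))
```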
The reservoir_sample() code is from this answer.
One possible method is to build a generator around the iterator to select random elements:
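Such a generator might look like the following sketch (the name random_thin and the keep-probability parameter are assumptions of mine):

```python
import random

def random_thin(iterator, keep_prob):
    """Yield each element of `iterator` with probability `keep_prob`.

    The number of elements yielded is binomially distributed, which is
    why the size of the final result cannot be guaranteed.
    """
    for item in iterator:
        if random.random() < keep_prob:
            yield item

random.seed(2)
print(list(random_thin(iter(range(30)), 0.2)))
```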
This method would be useful when you don't know the length in advance and materializing the whole iterator would be prohibitively large. Note that guaranteeing the size of the final list is problematic.
Sample runs will vary, since each element is kept or dropped independently at random.
Since you know the length of the data returned by your iterable, you can use xrange() to quickly generate indices into it. Then you can just run the iterable until you've grabbed all of the data:
In the alternative, here is an implementation of reservoir sampling using "Algorithm R":
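Sketches of both approaches (xrange() is Python 2; range() plays the same role in Python 3 and is used below; the function names are mine):

```python
import random

def sample_by_index(iterable, k, length):
    """Known-length approach: pick k random indices in range(length),
    then walk the iterable once, keeping the items at those positions."""
    indices = set(random.sample(range(length), k))
    return [item for i, item in enumerate(iterable) if i in indices]

def algorithm_r(iterable, k):
    """One-pass 'Algorithm R' reservoir sampling; no length needed.
    Element number i (0-based) replaces a random slot with probability
    k / (i + 1), which keeps the sample uniform."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # inclusive upper bound
            if j < k:
                reservoir[j] = item
    return reservoir

random.seed(0)
print(sample_by_index(iter("abcdefgh"), 3, 8))
print(algorithm_r(iter("abcdefgh"), 3))
```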
Note that Algorithm R doesn't provide a random order for the results. In the example given, 'b' will never precede 'a' in the results.
If you needed a subset of the original iterator with fixed frequency (i.e., if the generator generates 10000 numbers, you want "statistically" 100 of them, and if it generates 1000000 numbers, you want 10000 of them; always 1%), you could wrap the iterator in a construct yielding the inner loop's results with a probability of 1%.
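Such a fixed-frequency filter is only a few lines; a sketch (the name percentage_filter is mine):

```python
import random

def percentage_filter(iterator, fraction=0.01):
    """Yield roughly `fraction` of the elements of `iterator`.

    Each element passes independently, so the output size is only
    statistically close to fraction * total, never exact.
    """
    for item in iterator:
        if random.random() < fraction:
            yield item

random.seed(3)
kept = sum(1 for _ in percentage_filter(iter(range(100000)), 0.01))
print(kept)  # roughly 1% of 100000, i.e. about 1000
```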
So I guess you want instead a fixed number of samples from a source of unknown cardinality, as in the Perl algorithm you mention.
You can wrap the iterator in a construct holding a small memory of its own for the purpose of keeping track of the reservoir, and cycling it with decreasing probability.
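A sketch of that wrapping construct (the names are my own; element number n displaces a random reservoir slot with probability size/(n+1), the decreasing probability that keeps the final sample uniform):

```python
import random

def reservoir_wrap(iterator, size):
    """Hold a fixed-size reservoir over a stream of unknown length.

    Element n (0-based) replaces a uniformly chosen slot with
    probability size / (n + 1), so every element of the stream ends
    up in the final sample with equal probability.
    """
    reservoir = []
    for n, item in enumerate(iterator):
        if n < size:
            reservoir.append(item)
        elif random.random() < size / (n + 1):
            reservoir[random.randrange(size)] = item
    return reservoir

random.seed(4)
print(reservoir_wrap(iter(range(1000)), 5))
```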
So, feeding such a construct a generator of random three-column rows might print out a small, fixed-size sample of those rows.
I have tried generating one million rounds as above and comparing the distributions of the three columns with this filter (I expected a Gaussian distribution), and while not (yet) truly Gaussian, it looks good enough to me.