I feel like this one should be easy, but after numerous searches and attempts I can't figure out an answer. Basically I have a very large number of items that I want to sample in a random order without replacement. In this case they are cells in a 2D array. The solution I would use for a smaller array doesn't translate, because it requires shuffling an in-memory array. If the number I had to sample were small, I could also just randomly sample items and keep a list of the values I'd tried. Unfortunately I will often have to sample a very large proportion of all the cells, sometimes all of them.
What I'd like to create is an iterator, using some combination of itertools, numpy, and/or random, that yields the next random cell (its x and y indices). Another possible solution would be an iterator that yields the next random number (without replacement) between 0 and (x_count * y_count), which I could then map back to a cell location. Neither seems easily accomplished.
Thanks for any suggestions!
Here's my current solution.
import numpy as np
import itertools as itr
import random as rdm
#works great
x_count = 10
y_count = 5
#good luck!
#x_count = 10000
#y_count = 20000
x_indices = np.arange(x_count)
y_indices = np.arange(y_count)
cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)
for i in range(25):
    print(list_cell_indices[i])
So based on the current responses, and my attempt to translate the Perl (which I know nothing about), my understanding is that the best I can do is the following:
import numpy as np
import itertools as itr
import random as rdm
x_count = 10000
y_count = 5000
sample_count = 10000
keep_probability = 0.01
tried_cells = set()
kept_cells = set()
while len(kept_cells) < sample_count:
    # randrange excludes the upper bound; randint(0, x_count) would be off by one
    x = rdm.randrange(x_count)
    y = rdm.randrange(y_count)
    if (x, y) not in tried_cells:
        tried_cells.add((x, y))
        if rdm.random() < keep_probability:
            kept_cells.add((x, y))
print("worked")
In most cases the processing time and memory use aren't that bad. Maybe I could compare keep_probability and sample_count against the total cell count and throw an error for the difficult cases.
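A minimal sketch of such a guard (the threshold here is my own rough heuristic, not something from the responses):
if sample_count > keep_probability * x_count * y_count:
    # on average fewer than sample_count cells will ever pass the test,
    # so the while loop above would be unlikely to terminate
    raise ValueError("sample_count is too large for this keep_probability")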
I believe that there is simply no way to sample a sequence without replacement, without using a significant amount of auxiliary storage, for sample sizes close to R * C. And while there are clever ways to reduce the amount of storage for small sample sizes, if you expect to be sampling more than about a third of your dataset, you're better off just creating a separate list. random.sample is a natural choice for this purpose; frankly, I would just pass a flattened version of your 2-d numpy array straight to it. (Unless you need the indices too, in which case randomly sampling ints and translating them into coordinates, a la hexparrot's solution, is a reasonable way to go.)
If you look at the implementation of random.sample, you'll see that for smaller sample sizes it already does roughly what the Perl code above does -- it tracks previously selected items in a set and discards selections that are already in the set. For larger sample sizes, it creates a copy of the input -- which is more memory efficient than a set for larger values, because sets take up more space than lists per stored item -- and does a slightly modified Fisher-Yates shuffle, stopping when it has sample_size items (i.e. it doesn't shuffle the whole list, so it's more efficient than shuffling the whole thing yourself). Basically, my bet is that you won't do better than random.sample by rolling your own, unless you code something up in C.
However -- I did find this, which you might find interesting: numpy.random.choice.
This appears to do random sampling with or without replacement at C speed. The catch? It's new with NumPy 1.7!
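To make the flattened-index idea concrete, here is a minimal sketch (the variable names are mine, and it assumes the sample itself fits in memory):
import random
import numpy as np
x_count, y_count = 10000, 20000
sample_size = 1000
# sample_size distinct flat indices in [0, x_count * y_count), unflattened via divmod
cells = [divmod(i, y_count) for i in random.sample(xrange(x_count * y_count), sample_size)]
# the same thing at C speed with NumPy >= 1.7
flat = np.random.choice(x_count * y_count, size=sample_size, replace=False)
xs, ys = flat // y_count, flat % y_count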
You're already using O(N = R*C) memory, so you might as well use O(N) memory for your iterator. Copy all the elements and randomly sort them, as you would do for the one-dimensional case. This is only a reasonable thing to do if you are going to visit each element, which you say is your case.
(For the record: otherwise memory isn't such a bad issue, because you only need to "remember" where you were previously. So you can keep a list of the indices you've already visited; but that is bad if you ever do plan to visit each element, because rejection sampling can take very long with a growing blacklist. You can also implement it with a shrinking whitelist, which is equivalent to the first solution.)
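A minimal sketch of that idea, assuming NumPy is available (the function name is mine):
import numpy as np
def random_cell_iter(x_count, y_count):
    # O(N) memory: one shuffled array of all N = x_count * y_count flat indices
    for flat in np.random.permutation(x_count * y_count):
        yield divmod(flat, y_count)   # (x, y), never repeated
for x, y in random_cell_iter(10, 5):
    print((x, y))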
You've said that your test is likely to fail for most of the cells in your grid. If that's the case, randomly sampling cells may not be a good idea, since you'll have a hard time keeping track of which cells you've already checked without using a lot of memory.
Instead, you might be better off applying your test to the whole grid and selecting a random element from those that pass it.
This function returns a random element that passes the test (or None if they all fail). It uses very little memory.
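The function itself isn't reproduced here; a sketch consistent with that description, making a single pass with a size-one reservoir, might look like this (the name random_passing_cell is mine):
import random
def random_passing_cell(grid, test):
    result = None
    passed = 0
    for x in xrange(len(grid)):
        for y in xrange(len(grid[x])):
            if test(grid[x][y]):
                passed += 1
                # each passing cell is kept with probability 1/passed, so every
                # passing cell is equally likely to be the final result
                if random.randrange(passed) == 0:
                    result = (x, y)
    return result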
You'd call it with something like:
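# grid and the test below are hypothetical stand-ins for your data and your check
cell = random_passing_cell(grid, lambda value: value > threshold)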
If you're using Python 3, use range instead of xrange.
In Perl you could do something like this:
No duplicates, pretty random, not too hard on memory.
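The idea -- drawing random cells and rejecting duplicates by checking a hash of already-seen picks -- looks roughly like this in Python (a sketch of the idea, not the original Perl):
import random
seen = set()   # plays the role of the Perl hash of already-seen picks
def next_random_cell(x_count, y_count):
    while True:
        cell = (random.randrange(x_count), random.randrange(y_count))
        if cell not in seen:
            seen.add(cell)
            return cell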
How about this approach? I first create the x*y array and reshape it to 2-D. Then, knowing each cell can be uniquely identified by a single integer, I take a sample from 0 to (x*y).
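Reconstructing from the description (this is a sketch; get_random_item is the generator referred to below):
import numpy
from random import sample
x_count = 10000
y_count = 20000
# simulates the existing 2-D array built via product: each cell holds its flat index
large_table = numpy.arange(x_count * y_count).reshape(x_count, y_count)
def get_random_item(sample_size):
    # sample() draws flat indices without replacement; divmod maps them back to (x, y)
    for i in sample(xrange(x_count * y_count), sample_size):
        yield divmod(i, y_count)
for x, y in get_random_item(5):
    print('%12i   x: %5i   y: %5i' % (large_table[x][y], x, y))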
This returns the 2-dim coordinates of each sampled cell (the arange/reshape at the top is only there to simulate the existing 2-D array you created via product); the coordinates can be translated into your cell contents simply via array[x][y].
sample() states it is 'Used for random sampling without replacement', and this approach adheres to the advice found on the Python random page: 'This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).'
I note that while I use get_random_item() as a generator, the underlying sample() still produces a full list, so the memory use is still y*x + sample_size; but it runs quite swiftly all the same.