I have a list of sets, and I wish to sample n different samples each containing an item from each set. What I do not want is to have it in order, so, for example, I will get all the samples necessarily with the same item from the first set. I also don't want to create all the Cartesian products as that might not be possible in terms of efficiency... Any idea of how to do it? Or even something to approximate this behaviour?
Example that does not work:
(prod for i, prod in zip(range(n), itertools.product(*list_of_sets)))
All the above solutions waste a lot of resources for filtering repeated results when it comes to the end of the iteration. That's why I have thought of a method that has (almost) linear speed from start until the very end.
The idea is: Give (only in your head) each result of the standard order cartesian product an index. That would be for example for
A
xB
xC
with2000
x1
x2
=4000
elements:So there are still some questions open:
2000*1*2=4000
and every number below that will be a valid index.n
, just userandom.sample(xrange(numer_of_indices), n)
. But if you don't know the sample size yet (more general case), you have to generate indices on the fly to not waste memory. In that case, you can just generateindex = random.randint(0, k - 1)
withk = numer_of_indices
to get the first index andk = number_of_indices - n
for then
th result. Just check my code below (be aware, that I use a one sided linked list there to store the done indices. It makes insert operations O(1) operations and we need a lot of insertions here).i
. Theni % 2000
will be the index ofA
for the result. Nowi // 2000
can be treated recursively as the index for the cartesian product of the remaining factors.So this is the code I came up with:
You can use
sample
from therandom
lib:for example:
A possible output will be:
EDIT:
If we want to avoid repetitions we can use a
while
loop and collect the results to aset
. In addition you can check thatn
is valid and return the Cartesian product for invalidn
values:Matmarbon's answer is valid, this is a complete version with an example and some modifies for easy understanding and easy use:
As I want no repetition, and sometimes it is not possible the code is not that short. But as @andreyF said,
random.sample
does the work. Perhaps there is also a better way that avoids resampling with repetition until enough non repetitive ones exist, this is the best I have so far.Note that the code assumes a list of frozen sets, a conversion of
random.sample(cluster, 1)[0]
should be done otherwise.The following generator function generates non-repetitive samples. It will only work performantly if the number of samples generated is much smaller than the number of possible samples. It also requires the elements of the sets to be hashable: