This question already has an answer here:
The question gives all necessary data: what is an efficient algorithm to generate a sequence of K non-repeating integers within a given interval [0,N-1]. The trivial algorithm (generating random numbers and, before adding them to the sequence, looking them up to see if they were already there) is very expensive if K is large and near enough to N.
The algorithm provided in Efficiently selecting a set of random elements from a linked list seems more complicated than necessary, and requires some implementation. I've just found another algorithm that seems to do the job fine, as long as you know all the relevant parameters, in a single pass.
Step 1: Generate your list of integers.
Step 2: Perform Knuth Shuffle.
Note that you don't need to shuffle the entire list, since the Knuth Shuffle algorithm allows you to apply only n shuffles, where n is the number of elements to return. Generating the list will still take time proportional to the size of the list, but you can reuse your existing list for any future shuffling needs (assuming the size stays the same) with no need to preshuffle the partially shuffled list before restarting the shuffling algorithm.
The basic algorithm for Knuth Shuffle is that you start with a list of integers. Then, you swap the first integer with any number in the list and return the current (new) first integer. Then, you swap the second integer with any number in the list (except the first) and return the current (new) second integer. Then...etc...
This is an absurdly simple algorithm, but be careful that you include the current item in the list when performing the swap or you will break the algorithm.
The following code (in C, unknown origin) seems to solve the problem extremely well:
Does anyone know where I can find more gems like this one?
This Ruby code showcases the Reservoir Sampling, Algorithm R method. In each cycle, I select
n=5
unique random integers from[0,N=10)
range:output:
all integer between 0-9 were chosen with nearly the same probability.
It's essentially Knuth's algorithm applied to arbitrary sequences (indeed, that answer has a LISP version of this). The algorithm is O(N) in time and can be O(1) in memory if the sequence is streamed into it as shown in @MichaelCramer's answer.
The random module from Python library makes it extremely easy and effective:
sample
function returns a list of K unique elements chosen from the given sequence.xrange
is a "list emulator", i.e. it behaves like a list of consecutive numbers without creating it in memory, which makes it super-fast for tasks like this one.My solution is C++ oriented, but I'm sure it could be translated to other languages since it's pretty simple.
This solution only involves two loop iterations, and no hash table lookups or anything of the sort. So in actual code:
The Reservoir Sampling version is pretty simple:
That's $N randomly selected rows from STDIN. Replace the <>/$_ stuff with something else if you're not using rows from a file, but it's a pretty straightforward algorithm.