How do you efficiently generate a list of K non-re

Posted 2019-01-02 23:56

The question gives all the necessary data: what is an efficient algorithm to generate a sequence of K non-repeating integers within a given interval [0,N-1]? The trivial algorithm (generating random numbers and, before adding each one to the sequence, checking whether it is already there) is very expensive if K is large and close to N.
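For reference, the trivial algorithm described above might look like the following sketch. The function name `sample_by_rejection`, the `rng` parameter, and the returned attempt counter are illustrative assumptions, and a set is used for the lookup instead of scanning the sequence. Even with O(1) lookups, the number of wasted draws grows as K approaches N (for K = N it is roughly N ln N draws on average).

```python
import random

def sample_by_rejection(n, k, rng=random):
    """Return (values, attempts): k distinct integers from [0, n-1], retrying on duplicates."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    seen = set()      # average O(1) membership test
    result = []       # keeps the order in which values were drawn
    attempts = 0      # total draws, including rejected duplicates
    while len(result) < k:
        attempts += 1
        x = rng.randrange(n)   # uniform integer in [0, n-1]
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result, attempts
```

Calling `sample_by_rejection(1000, 1000)` and looking at the second return value gives a feel for how many draws are thrown away when K equals N.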

The algorithm provided in Efficiently selecting a set of random elements from a linked list seems more complicated than necessary, and requires some implementation. I've just found another algorithm that seems to do the job fine, as long as you know all the relevant parameters, in a single pass.

13 answers
仙女界的扛把子
#2 · 2019-01-03 00:11

Step 1: Generate your list of integers.
Step 2: Perform Knuth Shuffle.

Note that you don't need to shuffle the entire list, since the Knuth Shuffle algorithm lets you stop after n swaps, where n is the number of elements to return. Generating the list will still take time proportional to the size of the list, but you can reuse your existing list for any future shuffling needs (assuming the size stays the same): there is no need to restore the partially shuffled list to its original order before restarting the shuffling algorithm.

The basic algorithm for Knuth Shuffle is that you start with a list of integers. Then, you swap the first integer with any number in the list and return the current (new) first integer. Then, you swap the second integer with any number in the list (except the first) and return the current (new) second integer. Then...etc...

This is an absurdly simple algorithm, but be careful that you include the current item in the list when performing the swap or you will break the algorithm.
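A minimal Python sketch of the partial shuffle described above. The name `partial_shuffle` and the `rng` parameter are illustrative assumptions, not part of the original answer. Only the first k positions are shuffled, and the swap index is allowed to equal the current index.

```python
import random

def partial_shuffle(pool, k, rng=random):
    """Shuffle the first k positions of pool in place and return them as a new list."""
    n = len(pool)
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= len(pool)")
    for i in range(k):
        # j is drawn from [i, n-1]; j == i is allowed, so the current
        # item is itself a candidate for the swap
        j = rng.randrange(i, n)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]

pool = list(range(10))             # built once, O(N)
first = partial_shuffle(pool, 3)   # 3 distinct values from 0..9
second = partial_shuffle(pool, 3)  # reuses the already-permuted pool
```

The second call works on the pool as the first call left it; since the pool is still a permutation of the same values, no reset is needed.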

Melony?
#3 · 2019-01-03 00:13

The following code (in C, unknown origin) seems to solve the problem extremely well:

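 #include <stdlib.h>

 /* Assumed helper (not defined in the original post): returns an integer
    in [lo, hi). Note that rand() % range has a slight modulo bias. */
 static int random_in_between(int lo, int hi) {
    return lo + rand() % (hi - lo);
 }

 /* generate n sorted, non-duplicate integers in [0, max[; caller frees */
 int *generate(int n, int max) {
    int i, m, a;
    int *g = (int *)calloc(n, sizeof(int));
    if ( ! g) return 0;

    m = 0;
    for (i=0; i<max && m<n; i++) {
        /* select i with probability (n - m) / (max - i) */
        a = random_in_between(0, max - i);
        if (a < n - m) {
            g[m] = i;
            m ++;
        }
    }
    return g;
 }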

Does anyone know where I can find more gems like this one?
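To make the idea behind the C code easier to follow, here is a hedged Python sketch of the same single-pass selection. The name `selection_sample` and the `rng` parameter are assumptions for illustration. Each candidate i is accepted with probability (still needed) / (candidates left), which guarantees exactly n results and yields them in increasing order.

```python
import random

def selection_sample(n, max_value, rng=random):
    """Return n sorted, distinct integers from [0, max_value)."""
    if not 0 <= n <= max_value:
        raise ValueError("need 0 <= n <= max_value")
    chosen = []
    for i in range(max_value):
        needed = n - len(chosen)
        if needed == 0:
            break  # everything has been selected; stop early
        # Accept i with probability needed / (max_value - i).
        # When needed == max_value - i, every remaining i is accepted.
        if rng.randrange(max_value - i) < needed:
            chosen.append(i)
    return chosen
```

For example, `selection_sample(5, 10)` returns five distinct values from 0..9, already sorted. If a random order is required, the result would need to be shuffled afterwards.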

萌系小妹纸
#4 · 2019-01-03 00:13

This Ruby code showcases selection sampling (Knuth's Algorithm S). In each cycle, I select n=5 unique random integers from the range [0, N=10) and count how often each integer is picked:

t=0
m=0
N=10
n=5
s=0
distrib=Array.new(N,0)
for i in 1..500000 do
 t=0
 m=0
 s=0
 while m<n do

  u=rand()
  if (N-t)*u>=n-m then
   t=t+1
  else 
   distrib[s]+=1
   m=m+1
   t=t+1
  end #if
  s=s+1
 end #while
 if (i % 100000)==0 then puts i.to_s + ". cycle..." end
end #for
puts "--------------"
puts distrib

output:

100000. cycle...
200000. cycle...
300000. cycle...
400000. cycle...
500000. cycle...
--------------
250272
249924
249628
249894
250193
250202
249647
249606
250600
250034

All integers between 0 and 9 were chosen with nearly the same probability.

It's essentially Knuth's algorithm applied to arbitrary sequences (indeed, that answer has a LISP version of this). The algorithm is O(N) in time and can be O(1) in memory if the sequence is streamed into it as shown in @MichaelCramer's answer.

疯言疯语
#5 · 2019-01-03 00:16

The random module from the Python standard library makes it extremely easy and effective:

from random import sample
print(sample(range(N), K))

The sample function returns a list of K unique elements chosen from the given sequence.
range behaves like a list of consecutive numbers without creating it in memory (in Python 2, use xrange for this), which makes it very fast for tasks like this one.

闹够了就滚
#6 · 2019-01-03 00:17

My solution is C++ oriented, but I'm sure it could be translated to other languages since it's pretty simple.

  • First, generate a list with K elements, going from 0 to K-1
  • Then, as long as the list isn't empty, generate a random index between 0 and the size of the list (exclusive)
  • Take the element at that index, push it into another vector, and remove it from the original list

This solution involves only two loops, and no hash table lookups or anything of the sort. So in actual code:

#include <cstdlib>
#include <vector>

// Returns the integers 0 .. K-1 in random order.
// (The wrapper function and its name are illustrative.)
std::vector<int> random_order(int K) {
    std::vector<int> sorted_list;
    std::vector<int> random_list;

    for(int i = 0; i < K; ++i) {
        sorted_list.push_back(i);
    }

    // Move one randomly chosen element at a time until none remain.
    // Erasing the last remaining element is safe, so no special case is
    // needed. Each erase() shifts the tail, so this is O(K^2) overall.
    while(!sorted_list.empty()) {
        int rand_index = rand() % sorted_list.size();
        random_list.push_back(sorted_list.at(rand_index));
        sorted_list.erase(sorted_list.begin() + rand_index);
    }

    return random_list;
}
甜甜的少女心
#7 · 2019-01-03 00:19

The Reservoir Sampling version is pretty simple:

my $N = 20;
my $k;
my @r;

while(<>) {
  if(++$k <= $N) {
    push @r, $_;
  } elsif(rand(1) <= ($N/$k)) {
    $r[rand(@r)] = $_;
  }
}

print @r;

That's $N randomly selected rows from STDIN. Replace the <>/$_ stuff with something else if you're not using rows from a file, but it's a pretty straightforward algorithm.
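For comparison, a rough Python sketch of the same reservoir idea, written for any iterable instead of STDIN. The name `reservoir_sample` and the `rng` parameter are assumptions for illustration. It draws one random index per item instead of the two random numbers used in the Perl version, which should give the same distribution.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if seen <= k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Keep the new item with probability k / seen by drawing a
            # slot in [0, seen) and replacing only if it falls inside
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Applied to the original question, `reservoir_sample(range(N), K)` would give K distinct integers from [0, N-1] in O(N) time and O(K) memory.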
