How do you efficiently generate a list of K non-re

2019-01-02 23:56发布

This question already has an answer here:

The question gives all necessary data: what is an efficient algorithm to generate a sequence of K non-repeating integers within a given interval [0,N-1]. The trivial algorithm (generating random numbers and, before adding them to the sequence, looking them up to see if they were already there) is very expensive if K is large and near enough to N.

The algorithm provided in Efficiently selecting a set of random elements from a linked list seems more complicated than necessary, and requires some implementation. I've just found another algorithm that seems to do the job fine, as long as you know all the relevant parameters, in a single pass.

13条回答
一纸荒年 Trace。
2楼-- · 2019-01-02 23:59

Here's a way to do it in O(N) without extra storage. I'm pretty sure this is not a purely random distribution, but it's probably close enough for many uses.

/* generate N sorted, non-duplicate integers in [0, max[  in O(N))*/
 int *generate(int n, int max) {
    float step,a,v=0;
    int i;    
    int *g = (int *)calloc(n, sizeof(int));
    if ( ! g) return 0;

    for (i=0; i<n; i++) {
        step = (max-v)/(float)(n-i);
        v+ = floating_pt_random_in_between(0.0, step*2.0);
        if ((int)v == g[i-1]){
          v=(int)v+1;             //avoid collisions
        }
        g[i]=v;
    }
    while (g[i]>max) {
      g[i]=max;                   //fix up overflow
      max=g[i--]-1;
    }
    return g;
 }
查看更多
成全新的幸福
3楼-- · 2019-01-03 00:01

In The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition, Knuth describes the following selection sampling algorithm:

Algorithm S (Selection sampling technique). To select n records at random from a set of N, where 0 < n ≤ N.

S1. [Initialize.] Set t ← 0, m ← 0. (During this algorithm, m represents the number of records selected so far, and t is the total number of input records that we have dealt with.)

S2. [Generate U.] Generate a random number U, uniformly distributed between zero and one.

S3. [Test.] If (N – t)U ≥ n – m, go to step S5.

S4. [Select.] Select the next record for the sample, and increase m and t by 1. If m < n, go to step S2; otherwise the sample is complete and the algorithm terminates.

S5. [Skip.] Skip the next record (do not include it in the sample), increase t by 1, and go back to step S2.

An implementation may be easier to follow than the description. Here is a Common Lisp implementation that select n random members from a list:

(defun sample-list (n list &optional (length (length list)) result)
  (cond ((= length 0) result)
        ((< (* length (random 1.0)) n)
         (sample-list (1- n) (cdr list) (1- length)
                      (cons (car list) result)))
        (t (sample-list n (cdr list) (1- length) result))))

And here is an implementation that does not use recursion, and which works with all kinds of sequences:

(defun sample (n sequence)
  (let ((length (length sequence))
        (result (subseq sequence 0 n)))
    (loop
       with m = 0
       for i from 0 and u = (random 1.0)
       do (when (< (* (- length i) u) 
                   (- n m))
            (setf (elt result m) (elt sequence i))
            (incf m))
       until (= m n))
    result))
查看更多
对你真心纯属浪费
4楼-- · 2019-01-03 00:02

It is actually possible to do this in space proportional to the number of elements selected, rather than the size of the set you're selecting from, regardless of what proportion of the total set you're selecting. You do this by generating a random permutation, then selecting from it like this:

Pick a block cipher, such as TEA or XTEA. Use XOR folding to reduce the block size to the smallest power of two larger than the set you're selecting from. Use the random seed as the key to the cipher. To generate an element n in the permutation, encrypt n with the cipher. If the output number is not in your set, encrypt that. Repeat until the number is inside the set. On average you will have to do less than two encryptions per generated number. This has the added benefit that if your seed is cryptographically secure, so is your entire permutation.

I wrote about this in much more detail here.

查看更多
我命由我不由天
5楼-- · 2019-01-03 00:02

Speed up the trivial algorithm by storing the K numbers in a hashing store. Knowing K before you start takes away all the inefficiency of inserting into a hash map, and you still get the benefit of fast look-up.

查看更多
Luminary・发光体
6楼-- · 2019-01-03 00:05

If the list is sorted, for example, if you want to extract K elements out of N, but you do not care about their relative order, an efficient algorithm is proposed in the paper An Efficient Algorithm for Sequential Random Sampling (Jeffrey Scott Vitter, ACM Transactions on Mathematical Software, Vol. 13, No. 1, March 1987, Pages 56-67.).

edited to add the code in c++ using boost. I've just typed it and there might be many errors. The random numbers come from the boost library, with a stupid seed, so don't do anything serious with this.

/* Sampling according to [Vitter87].
 * 
 * Bibliography
 * [Vitter 87]
 *   Jeffrey Scott Vitter, 
 *   An Efficient Algorithm for Sequential Random Sampling
 *   ACM Transactions on MAthematical Software, 13 (1), 58 (1987).
 */

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <string>
#include <iostream>

#include <iomanip>

#include <boost/random/linear_congruential.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_real.hpp>

using namespace std;

// This is a typedef for a random number generator.
// Try boost::mt19937 or boost::ecuyer1988 instead of boost::minstd_rand
typedef boost::minstd_rand base_generator_type;

    // Define a random number generator and initialize it with a reproducible
    // seed.
    // (The seed is unsigned, otherwise the wrong overload may be selected
    // when using mt19937 as the base_generator_type.)
    base_generator_type generator(0xBB84u);
    //TODO : change the seed above !
    // Defines the suitable uniform ditribution.
    boost::uniform_real<> uni_dist(0,1);
    boost::variate_generator<base_generator_type&, boost::uniform_real<> > uni(generator, uni_dist);



void SequentialSamplesMethodA(int K, int N) 
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method A.
    {
    int top=N-K, S, curr=0, currsample=-1;
    double Nreal=N, quot=1., V;

    while (K>=2)
        {
        V=uni();
        S=0;
        quot=top/Nreal;
        while (quot > V)
            {
            S++; top--; Nreal--;
            quot *= top/Nreal;
            }
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        Nreal--; K--;curr++;
        }
    // special case K=1 to avoid overflow
    S=floor(round(Nreal)*uni());
    currsample+=1+S;
    cout << curr << " : " << currsample << "\n";
    }

void SequentialSamplesMethodD(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method D. 
    {
    const int negalphainv=-13; //between -20 and -7 according to [Vitter87]
    //optimized for an implementation in 1987 !!!
    int curr=0, currsample=0;
    int threshold=-negalphainv*K;
    double Kreal=K, Kinv=1./Kreal, Nreal=N;
    double Vprime=exp(log(uni())*Kinv);
    int qu1=N+1-K; double qu1real=qu1;
    double Kmin1inv, X, U, negSreal, y1, y2, top, bottom;
    int S, limit;
    while ((K>1)&&(threshold<N))
        {
        Kmin1inv=1./(Kreal-1.);
        while(1)
            {//Step D2: generate X and U
            while(1)
                {
                X=Nreal*(1-Vprime);
                S=floor(X);
                if (S<qu1) {break;}
                Vprime=exp(log(uni())*Kinv);
                }
            U=uni();
            negSreal=-S;
            //step D3: Accept ?
            y1=exp(log(U*Nreal/qu1real)*Kmin1inv);
            Vprime=y1*(1. - X/Nreal)*(qu1real/(negSreal+qu1real));
            if (Vprime <=1.) {break;} //Accept ! Test [Vitter87](2.8) is true
            //step D4 Accept ?
            y2=0; top=Nreal-1.;
            if (K-1 > S)
                {bottom=Nreal-Kreal; limit=N-S;}
            else {bottom=Nreal+negSreal-1.; limit=qu1;}
            for(int t=N-1;t>=limit;t--)
                {y2*=top/bottom;top--; bottom--;}
            if (Nreal/(Nreal-X)>=y1*exp(log(y2)*Kmin1inv))
                {//Accept !
                Vprime=exp(log(uni())*Kmin1inv);
                break;
                }
            Vprime=exp(log(uni())*Kmin1inv);
            }
        // Step D5: Select the (S+1)th record
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        curr++;
        N-=S+1; Nreal+=negSreal-1.;
        K-=1; Kreal-=1; Kinv=Kmin1inv;
        qu1-=S; qu1real+=negSreal;
        threshold+=negalphainv;
        }
    if (K>1) {SequentialSamplesMethodA(K, N);}
    else {
        S=floor(N*Vprime);
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        }
    }


int main(void)
    {
    int Ntest=10000000, Ktest=Ntest/100;
    SequentialSamplesMethodD(Ktest,Ntest);
    return 0;
    }

$ time ./sampling|tail

gives the following ouptut on my laptop

99990 : 9998882
99991 : 9998885
99992 : 9999021
99993 : 9999058
99994 : 9999339
99995 : 9999359
99996 : 9999411
99997 : 9999427
99998 : 9999584
99999 : 9999745

real    0m0.075s
user    0m0.060s
sys 0m0.000s
查看更多
Luminary・发光体
7楼-- · 2019-01-03 00:11

Generate an array 0...N-1 filled a[i] = i.

Then shuffle the first K items.

Shuffling:

  • Start J = N-1
  • Pick a random number 0...J (say, R)
  • swap a[R] with a[J]
    • since R can be equal to J, the element may be swapped with itself
  • subtract 1 from J and repeat.

Finally, take K last elements.

This essentially picks a random element from the list, moves it out, then picks a random element from the remaining list, and so on.

Works in O(K) and O(N) time, requires O(N) storage.

The shuffling part is called Fisher-Yates shuffle or Knuth's shuffle, described in the 2nd volume of The Art of Computer Programming.

查看更多
登录 后发表回答