Generating m distinct random numbers in the range

2019-01-06 12:52发布

问题:

I have two methods of generating m distinct random numbers in the range [0..n-1]

Method 1:

//C++-ish pseudocode
int result[m];
for(i = 0; i < m; ++i)
{
   int r;
   do
   {
      r = rand()%n;
   }while(r is found in result array at indices from 0 to i)
   result[i] = r;   
}

Method 2:

//C++-ish pseudocode
int arr[n];
for(int i = 0; i < n; ++i)
    arr[i] = i;
random_shuffle(arr, arr+n);
result = first m elements in arr;

The first method is more efficient when n is much larger than m, whereas the second is more efficient otherwise. But "much larger" isn't that strict a notion, is it? :)

Question: What formula of n and m should I use to determine whether method1 or method2 will be more efficient? (in terms of mathematical expectation of the running time)

回答1:

Pure mathematics:
Let's calculate the quantity of rand() function calls in both cases and compare the results:

Case 1: let's see the mathematical expectation of calls on step i = k, when you already have k numbers chosen. The probability to get a number with one rand() call is equal to p = (n-k)/n. We need to know the mathematical expectation of such calls quantity which leads to obtaining a number we don't have yet.

The probability to get it using 1 call is p. Using 2 calls - q * p, where q = 1 - p. In general case, the probability to get it exactly after n calls is (q^(n-1))*p. Thus, the mathematical expectation is
Sum[ n * q^(n-1) * p ], n = 1 --> INF. This sum is equal to 1/p (proved by wolfram alpha).

So, on the step i = k you will perform 1/p = n/(n-k) calls of the rand() function.

Now let's sum it overall:

Sum[ n/(n - k) ], k = 0 --> m - 1 = n * T - the number of rand calls in method 1.
Here T = Sum[ 1/(n - k) ], k = 0 --> m - 1

Case 2:

Here rand() is called inside random_shuffle n - 1 times (in most implementations).

Now, to choose the method, we have to compare these two values: n * T ? n - 1.
So, to choose the appropriate method, calculate T as described above. If T < (n - 1)/n it's better to use the first method. Use the second method otherwise.



回答2:

Check the Wikipedia description of the original Fisher-Yates algorithm. It advocates using essentially your method 1 for up to n/2, and your method 2 for the remainder.



回答3:

Personally, I would use Method 1, and then if M > N/2, choose N-M values, and then invert the array (return the numbers that were not picked). So for example, if N is 1000 and you want 950 of them, chose 50 values using Method 1, and then return the other 950.

Edit: Though, if consistent performance is your goal, I would use a modified method 2, which doesn't do the full shuffle, but only shuffles the first M elements of your N length array.

int arr[n];
for(int i = 0; i < n; ++i)
    arr[i] = i;

for (int i =0; i < m; ++i) {
   int j = rand(n-i); // Pick random number from 0 <= r < n-i.  Pick favorite method
   // j == 0 means don't swap, otherwise swap with the element j away
   if (j != 0) { 
      std::swap(arr[i], arr[i+j]);
   }
}
result = first m elements in arr;


回答4:

Here's an algorithm that will work in O(n) memory and O(n) time (where n is the number of returned results, not the size of the set you're selecting from) for any result set. It's in Python for convenience because it uses a hashtable:

def random_elements(num_elements, set_size):
    state = {}
    for i in range(num_elements):
        # Swap state[i] with a random element
        swap_with = random.randint(i, set_size - 1)
        state[i], state[swap_with] = state.get(swap_with, swap_with), state.get(i, i)
    return [state[i] for i in range(num_elements) # effectively state[:num_elements] if it were a list/array.

This is just a partial fisher-yates shuffle, with the array being shuffled implemented as a sparse hashtable - any element that is not present is equal to its index. We shuffle the first num_elements indices, and return those values. In the case that set_size = 1, this is equivalent to picking a random number in the range, and in the case that num_elements = set_size, this is equivalent to a standard fisher-yates shuffle.

It's trivial to observe that this is O(n) time, and because each iteration of the loop initializes at most two new indices in the hashtable, it's O(n) space, too.



回答5:

What about a third method?

int result[m];
for(i = 0; i < m; ++i)
{
   int r;
   r = rand()%(n-i);
   r += (number of items in result <= r)
   result[i] = r;   
}

Edit it should be <=. and it would actually additional logic to avoid collisions.

This is better, an example using the Modern Method from Fisher-Yates

//C++-ish pseudocode
int arr[n];
for(int i = 0; i < n; ++i)
    arr[i] = i;

for(i = 0; i < m; ++i)
    swap(arr, n-i, rand()%(n-i) );

result = last m elements in arr;


回答6:

Talking about mathematical expectation, it's pretty useless but I will post it anyway :D

Shuffle is simple O(m).

Now the other algorithm is a bit more complex. The number of steps needed to generate the next number is the expected value of the number of trials, and the probability of the trial length is a geomtric distribution. So...

p=1          E[X1]=1            = 1           = 1
p=1-1/n      E[x2]=1/(1-1/n)    = 1 + 1/(n-1) = 1 + 1/(n-1) 
p=1-2/n      E[x3]=1/(1-1/n)    = 1 + 2/(n-2) = 1 + 1/(n-2) + 1/(n-2)
p=1-3/n      E[X4]=1/(1-2/n)    = 1 + 3/(n-3) = 1 + 1/(n-3) + 1/(n-3) + 1(n-3)
....
p=1-(m-1)/n) E[Xm]=1/(1-(m-1)/n))

Note that the sum can be split up into a triangle shape, see right hand side.

Let's use the formula for the harmonic series: H_n = Sum k=0->n (1/k) = approx ln(k)

Sum(E[Xk]) = m + ln(n-1)-ln(n-m-1) + ln(n-2)-ln(n-m-1) + ... = m + ln(n-1) + ln(n-2) + ... - (m-1)*ln(n-m-1) ..

And there is some forumla for the sum of harmonic series, if you are still interested I will look it up...

Update: actually it's quite nice formula (thanks to the brilliant Concrete Mathematics book)

Sum(H_k) k=0->n = n*H_n - n

So the expected number of steps:

Sum(E[Xk]) = m + (n-1)*ln(n-1) - (n-1) - (n-m-1)*ln(n-m-1) - (n-m-1)) - (m-1)*ln(n-m-1).

Note: I haven't verified it.



回答7:

This is a bit of a long shot, but it could work, depending on your system.

  1. Start with some reasonable ratio, like 0.5.
  2. When a request comes in, process it with whichever method you get from the current value of the threshold ratio.
  3. Record the time it takes and when you have "empty" time, perform the same task with the other method.
  4. If the alternative solution is much faster than the original one, adjust the threshold up or down.

The obvious flaw in this method is that on highly variable load systems your "offline" test won't be too reliable.



回答8:

There was suggested Fisher-Yates shuffle. Don't know if the next code generates equally distributed integers, but it is at least compact and one-pass:

std::random_device rd;
std::mt19937 g(rd());
for (size_type i = 1; i < std::size(v); ++i) {
    v[i] = std::exchange(v[g() % i], i);
}


回答9:

What about using set instead of array, i think it is much easier than array

set<int> Numbers;
while (Numbers.size() < m) {
   Numbers.insert(rand() % n);
}


回答10:

Quite possibly it would be more simple to start it in debug mode (and keep one method as a note) for a couple of times to get an average, then use the other method to get an average from that



回答11:

I don't advise this method but it works

#include <iostream>
#include <random>
#include <ctime>

using namespace std;

int randArray[26];
int index = 0;

bool unique(int rand) {

    for (int i = 0; i < index; i++)
        if (rand == randArray[i])
            return false;
    index++;
    return true;
}


int main()
{
    srand(time(NULL));

    for (int i = 1; i < 26; i++)
        randArray[i] = -1;

    for (int i = 0; i < 26; i++) {

        randArray[i] = rand() % 26;

        while (!unique(randArray[i])) {
            randArray[i] = rand() % 26;
        }
    }

    for (int i = 0; i < 26; i++) {
        cout << randArray[i] << " ";
    }

    cout << "\n" << index << endl;


    return 0;
}