Which of these algorithms is better in performance?

Posted 2019-05-11 15:15

Question:

1. Take an array of n elements: { 1, 2, 3, ..., n }. Shuffle the array using any of the standard array-shuffling algorithms. The first N elements of the shuffled array are what you are looking for.

2. Simply call Random.Next() in a loop, checking a Dictionary to see whether each value already exists, until we have N numbers.

Note that N << n (N is much smaller than n).
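For concreteness, here is a minimal C# sketch of both options (my own rendering, not code from the question), assuming the range is { 0, ..., n-1 } and substituting a HashSet for the Dictionary in option 2:

using System;
using System.Collections.Generic;
using System.Linq;

class TwoOptions
{
    // Option 1: shuffle the whole pool, then take the first N elements.
    public static int[] ShuffleAndTake(int n, int N, Random rng)
    {
        var pool = Enumerable.Range(0, n).ToArray();
        for (int i = n - 1; i > 0; i--)            // standard Fisher-Yates shuffle
        {
            int j = rng.Next(i + 1);               // j is uniform over [0, i]
            (pool[i], pool[j]) = (pool[j], pool[i]);
        }
        return pool.Take(N).ToArray();
    }

    // Option 2: draw random values, rejecting duplicates, until N are found.
    public static int[] RejectDuplicates(int n, int N, Random rng)
    {
        var seen = new HashSet<int>();
        while (seen.Count < N)
            seen.Add(rng.Next(n));                 // Add is a no-op on duplicates
        return seen.ToArray();
    }
}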

Answer 1:

Partial Fisher-Yates, with some tweaks*:

Algorithm to generate 1000 distinct integers in the range [0,8000]?

&

Algorithm to select a single, random combination of values?

* The main takeaway here is that memory usage is reduced: it is now proportional at most to the number of items selected, not to the number of items selectable. This can provide major savings when N << n (as you have mentioned). (Space usage is capped at O(n/2) no matter how close N is to n.)

Running time is O(N).

Apart from this, it is a fairly usual partial Fisher-Yates, aka Knuth shuffle.
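As a rough illustration of the memory tweak (a sketch under my own assumptions, not the code from the linked answers): the swaps of a partial Fisher-Yates can be recorded in a Dictionary, so only the positions actually touched are stored and space stays proportional to N rather than n:

using System;
using System.Collections.Generic;

class SparseFisherYates
{
    // Returns N distinct values from { 0, ..., n-1 } in O(N) time.
    // The Dictionary records only positions whose contents have moved,
    // so space is proportional to N, not n.
    public static int[] Sample(int n, int N, Random rng)
    {
        var moved = new Dictionary<int, int>();    // position -> value parked there
        var result = new int[N];
        for (int i = 0; i < N; i++)
        {
            int j = i + rng.Next(n - i);           // uniform over the tail [i, n)
            result[i] = moved.TryGetValue(j, out int vj) ? vj : j;
            moved[j] = moved.TryGetValue(i, out int vi) ? vi : i;   // the "swap"
        }
        return result;
    }
}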



Answer 2:

Neither of those is best. You need a Fisher-Yates shuffle. The problem with the full-shuffle solution is that you do a lot of unnecessary work up front. The problem with the second solution is that the chance of hitting a duplicate grows as you go, so you end up throwing away a lot of values.

For a very efficient solution that gives you a subset of your values with zero possibility of duplicates (and no unnecessary up-front shuffling), Fisher-Yates is the way to go.

dim n[N]                  // gives n[0] through n[N-1]
for each i in 0..N-1:
    n[i] = i              // initialise them to their indexes
nsize = N                 // starting pool size
do N times:
    i = rnd(nsize)        // gives a number between 0 and nsize-1
    print n[i]
    nsize = nsize - 1     // these two lines effectively remove the used number
    n[i] = n[nsize]

By simply selecting a random number from the pool, replacing it with the number at the top of the pool, then shrinking the pool, you get a shuffle without having to worry about a large number of swaps up front. (To draw only a subset of a larger pool, initialise n[] with all the candidate values and stop the loop after however many picks you need.)

This is important when the pool is large, since it doesn't introduce an unnecessary startup delay.

For example, examine the following bench-check, choosing 10-from-10:

<------ n[] ------>
0 1 2 3 4 5 6 7 8 9  nsize  rnd(nsize)  output
-------------------  -----  ----------  ------
0 1 2 3 4 5 6 7 8 9     10           4       4
0 1 2 3 9 5 6 7 8        9           7       7
0 1 2 3 9 5 6 8          8           2       2
0 1 8 3 9 5 6            7           6       6
0 1 8 3 9 5              6           0       0
5 1 8 3 9                5           2       8
5 1 9 3                  4           1       1
5 3 9                    3           0       5
9 3                      2           1       3
9                        1           0       9

You can see the pool reducing as you go and, because you're always replacing the used one with an unused one, you'll never have a repeat.
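For reference, this is roughly how that pseudocode might look in C# (a direct translation of the sketch above, not the answer author's own code):

using System;

class PoolShuffle
{
    // Prints the pool { 0, ..., N-1 } in random order with no repeats.
    static void Main()
    {
        const int N = 10;
        var rng = new Random();

        var n = new int[N];               // gives n[0] through n[N-1]
        for (int i = 0; i < N; i++)
            n[i] = i;                     // initialise them to their indexes

        int nsize = N;                    // starting pool size
        for (int k = 0; k < N; k++)       // do N times
        {
            int i = rng.Next(nsize);      // a number between 0 and nsize-1
            Console.WriteLine(n[i]);
            nsize = nsize - 1;            // these two lines effectively
            n[i] = n[nsize];              //   remove the used number
        }
    }
}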



Answer 3:

It entirely depends on the two values (n and N).

For a large n and a small N (e.g. pick two distinct random values between zero and a million), option two will be better. The expected time is tricky to calculate, and well beyond my Sunday-evening abilities, but roughly: you iterate N times in the outer loop; for each draw you test whether you've already returned that value (O(1) expected with a hash-based Dictionary, O(m) if you scan a plain list of the m values picked so far); and you may have to draw again whenever you hit a used value. There is no hard upper bound on the running time, only a probabilistic expectation that is complex to compute.
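(A back-of-envelope calculation of my own, not from the answer: the expected number of draws for option two is n/n + n/(n-1) + ... + n/(n-N+1) = n * (H(n) - H(n-N)), where H(k) is the k-th harmonic number. When N << n every term is close to 1, so the expected cost stays close to N; as N approaches n the final terms blow up, which is exactly the duplicate-throwing problem described above.)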

When n and N are close to each other (e.g. pick 99 random values between one and a hundred inclusive), a tweak to option one (only picking as many values as you need rather than shuffling the whole array) will be better. With a Fisher-Yates shuffle you'll get O(n).

In practice:

  • If I knew I'd have a large n and a small N, I'd pick the second option
  • If I knew n would be "not hugely larger" than N, I'd probably pick the first option
  • If it were too close to call but I knew the values ahead of time, I'd run it several times each way and benchmark the two (see the sketch after this list)
  • If I didn't know the values ahead of time at all, I'd run the two options several times on lots of different (n, N) pairs to get a better feel for where the balance lies.
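A minimal benchmarking harness might look like this (it assumes the hypothetical ShuffleAndTake and RejectDuplicates methods sketched under the question; Stopwatch is the standard System.Diagnostics timer):

using System;
using System.Diagnostics;

class Bench
{
    // Times both approaches on a couple of contrasting (n, N) pairs.
    static void Main()
    {
        var rng = new Random(42);                      // fixed seed for repeatability
        foreach (var (n, N) in new[] { (1_000_000, 2), (100, 99) })
        {
            var sw = Stopwatch.StartNew();
            TwoOptions.ShuffleAndTake(n, N, rng);      // hypothetical, from the question sketch
            Console.WriteLine($"shuffle:   n={n}, N={N}, {sw.Elapsed}");

            sw.Restart();
            TwoOptions.RejectDuplicates(n, N, rng);    // hypothetical, from the question sketch
            Console.WriteLine($"rejection: n={n}, N={N}, {sw.Elapsed}");
        }
    }
}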


Answer 4:

At the extreme, option 2 could in principle never finish, since each number it generates could already be in the list. More practically, once N is not much smaller than n you will typically iterate over far more than N numbers before collecting N distinct ones.

Option 1 is guaranteed to complete in finite time.