可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I need to generate a certain number of random numbers starting from a sequence of integers and I use the following code: result<-sample(x=c(2:50), size=10e6, replace=T). I find that increasing the length of the result vector (up to a length of 10^6), the distribution of random numbers is not random if the length of the vector x is an odd number. When plotting the histogram of result I usually get that the 1st number of the sequence (in the example the '2') has a column (and so a number of elements) that is always higher than the other columns. If x=c(1:50), and so the length of x is an even number, the behaviour of the random generator seems to be ok. Is there any issue about random number generators in R about this strange result? I use R 3.0.1 under Ubuntu 13.10.

回答1:

As I mentioned in my comment above, this has absolutely nothing to do with random number generators.

Consider:

set.seed(123)
result <- sample(x=c(2:50), size=10e4, replace=TRUE)
x <- hist(result)

Something looks wrong, eh? But look closer:

> x$breaks
 [1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
> x$counts
 [1] 6132 3971 4179 4115 4108 4002 4145 4073 4192 4117 4123 4099 4054 4013 4067 4055 4073 4082 4095
[20] 4088 4044 4050 4027 4096

versus...

> table(result)
result
   2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21 
1979 2100 2053 1978 1993 2152 2027 2058 2057 2074 2034 1991 2011 2075 2070 2067 2006 2047 2145 2019 
  22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41 
2098 2060 2063 2099 2000 2016 2038 1990 2023 1976 2091 2060 1995 2061 2012 2003 2079 2008 2087 2036 
  42   43   44   45   46   47   48   49   50 
2052 1989 2055 2044 2006 2001 2026 2062 2034

Note that the first bin from hist appears to include all 2, 3 and 4 values. This is because the default binning strategy employed by hist adds some "fuzziness" to the bin boundaries, which result in the first two break point being slightly less than 2.0 and slightly more than 4.0. Combine that with the intervals being right closed, and you get the resulting histogram.

Compare with:

hist(result,breaks = 1:50)

回答2:

This may be an off topic tech-talk, but if you're dissatisfied with the default (Mersenne Twister) (pseudo)random number generator, try switching to another one, see ?RNGkind. However, it is known that MT19937 passes many very rigid tests (like Marsaglia's Diehard test battery or TestU1), so there's most probably (EDIT: now: surely) something wrong with your code.

Anyway, If your code had been ok (EDIT: again, we know it's not the case), what you would have obtained, is a nice procedure for testing the randomness of a RNG (in which MT does not perform well). Anyway, this may (at least theoretically) happen to be true, there's no perfect generator.

A nice introduction to the topic of random number generation and RNG testing is: Random Number Generation and Monte Carlo Methods by James E. Gentle.

回答3:

>table(result)
result
     2      3      4      5      6      7      8      9     10     11     12     13     14 
203862 203602 204693 204089 203715 203070 204382 204380 204600 203483 204448 204833 203852 
    15     16     17     18     19     20     21     22     23     24     25     26     27 
204510 203639 203328 204017 204385 204699 203519 204518 203278 203941 203994 204531 204685 
    28     29     30     31     32     33     34     35     36     37     38     39     40 
203378 203993 204128 203932 203961 204176 204684 204833 204499 203654 202945 204347 204354 
    41     42     43     44     45     46     47     48     49     50 
204368 204763 203901 204382 203921 204350 203581 203322 204334 204141

using table command you can see that there are of comparable frequency.

>a <- hist(result) 
>a

$breaks
[1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50

$counts
 [1] 612157 407804 407452 408980 407931 408685 408149 407345 409084 408037 407219 408525
[13] 408063 408121 407893 408860 409332 406599 408701 409131 408283 408271 406903 408475

$density
 [1] 0.03060785 0.02039020 0.02037260 0.02044900 0.02039655 0.02043425 0.02040745 0.02036725
 [9] 0.02045420 0.02040185 0.02036095 0.02042625 0.02040315 0.02040605 0.02039465 0.02044300
[17] 0.02046660 0.02032995 0.02043505 0.02045655 0.02041415 0.02041355 0.02034515 0.02042375

$mids
 [1]  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49

$xname
[1] "result"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Notice that in breaks the values are 2 then 4 which means it includes frequencies of 2,3, and 4 in the first bin. the next bins all have only two values while the first one has 3 hence you see the spike in histogram plot.