I need to generate a certain number of random numbers starting from a sequence of integers and I use the following code: result<-sample(x=c(2:50), size=10e6, replace=T)
. I find that increasing the length of the result vector (up to a length of 10^6), the distribution of random numbers is not random if the length of the vector x
is an odd number. When plotting the histogram of result
I usually get that the 1st number of the sequence (in the example the '2') has a column (and so a number of elements) that is always higher than the other columns. If x=c(1:50)
, and so the length of x
is an even number, the behaviour of the random generator seems to be ok. Is there any issue about random number generators in R about this strange result? I use R 3.0.1 under Ubuntu 13.10.
问题:
回答1:
As I mentioned in my comment above, this has absolutely nothing to do with random number generators.
Consider:
set.seed(123)
result <- sample(x=c(2:50), size=10e4, replace=TRUE)
x <- hist(result)
Something looks wrong, eh? But look closer:
> x$breaks
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
> x$counts
[1] 6132 3971 4179 4115 4108 4002 4145 4073 4192 4117 4123 4099 4054 4013 4067 4055 4073 4082 4095
[20] 4088 4044 4050 4027 4096
versus...
> table(result)
result
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
1979 2100 2053 1978 1993 2152 2027 2058 2057 2074 2034 1991 2011 2075 2070 2067 2006 2047 2145 2019
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
2098 2060 2063 2099 2000 2016 2038 1990 2023 1976 2091 2060 1995 2061 2012 2003 2079 2008 2087 2036
42 43 44 45 46 47 48 49 50
2052 1989 2055 2044 2006 2001 2026 2062 2034
Note that the first bin from hist
appears to include all 2, 3 and 4 values. This is because the default binning strategy employed by hist
adds some "fuzziness" to the bin boundaries, which result in the first two break point being slightly less than 2.0 and slightly more than 4.0. Combine that with the intervals being right closed, and you get the resulting histogram.
Compare with:
hist(result,breaks = 1:50)
回答2:
This may be an off topic tech-talk, but if you're dissatisfied with the default (Mersenne Twister) (pseudo)random number generator, try switching to another one, see ?RNGkind
. However, it is known that MT19937 passes many very rigid tests (like Marsaglia's Diehard test battery or TestU1), so there's most probably (EDIT: now: surely) something wrong with your code.
Anyway, If your code had been ok (EDIT: again, we know it's not the case), what you would have obtained, is a nice procedure for testing the randomness of a RNG (in which MT does not perform well). Anyway, this may (at least theoretically) happen to be true, there's no perfect generator.
A nice introduction to the topic of random number generation and RNG testing is: Random Number Generation and Monte Carlo Methods by James E. Gentle.
回答3:
>table(result)
result
2 3 4 5 6 7 8 9 10 11 12 13 14
203862 203602 204693 204089 203715 203070 204382 204380 204600 203483 204448 204833 203852
15 16 17 18 19 20 21 22 23 24 25 26 27
204510 203639 203328 204017 204385 204699 203519 204518 203278 203941 203994 204531 204685
28 29 30 31 32 33 34 35 36 37 38 39 40
203378 203993 204128 203932 203961 204176 204684 204833 204499 203654 202945 204347 204354
41 42 43 44 45 46 47 48 49 50
204368 204763 203901 204382 203921 204350 203581 203322 204334 204141
using table command you can see that there are of comparable frequency.
>a <- hist(result)
>a
$breaks
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
$counts
[1] 612157 407804 407452 408980 407931 408685 408149 407345 409084 408037 407219 408525
[13] 408063 408121 407893 408860 409332 406599 408701 409131 408283 408271 406903 408475
$density
[1] 0.03060785 0.02039020 0.02037260 0.02044900 0.02039655 0.02043425 0.02040745 0.02036725
[9] 0.02045420 0.02040185 0.02036095 0.02042625 0.02040315 0.02040605 0.02039465 0.02044300
[17] 0.02046660 0.02032995 0.02043505 0.02045655 0.02041415 0.02041355 0.02034515 0.02042375
$mids
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
$xname
[1] "result"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Notice that in breaks the values are 2 then 4 which means it includes frequencies of 2,3, and 4 in the first bin. the next bins all have only two values while the first one has 3 hence you see the spike in histogram plot.