I have found what I would consider erratic behavior (but for which I hope there is a simple explanation) in R
's use of seeds in conjunction with rbinom()
when prob=0.5
is used. General idea: To me, if I set the seed, run rbinom()
once (i.e. conduct a single random process), despite what value prob
is set to, the random
seed should change by one increment. Then, if I again set the seed to the same value, and run another random process (such as rbinom()
again, but maybe with a different value of prob
), the seed should again change to the same value as it did for the previous single random process.
I have found R
does exactly this as long as I'm using rbinom()
with any prob!=0.5
. Here is an example:
Compare seed vector, .Random.seed
, for two probabilities other than 0.5:
set.seed(234908)
x <- rbinom(n=1,size=60,prob=0.4)
temp1 <- .Random.seed
set.seed(234908)
x <- rbinom(n=1,size=60,prob=0.3)
temp2 <- .Random.seed
any(temp1!=temp2)
> [1] FALSE
Compare seed vector, .Random.seed
, for prob=0.5 vs. prob!=0.5:
set.seed(234908)
x <- rbinom(n=1,size=60,prob=0.5)
temp1 <- .Random.seed
set.seed(234908)
x <- rbinom(n=1,size=60,prob=0.3)
temp2 <- .Random.seed
any(temp1!=temp2)
> [1] TRUE
temp1==temp2
> [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE
> [8] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
...
I have found this for all comparisions of prob=0.5
against all other probabilities
in the set {0.1, 0.2, ..., 0.9}. Similarly, if I compare any values of prob
from
{0.1, 0.2, ..., 0.9} other than 0.5, the .Random.seed
vector is always element-by-element equal. These facts also hold true for either odd or even size
within rbinom()
.
To make it even more strange (I apologize that this is a little convoluted - it's relevant to the way my function is written), when I use probabilities saved as elements in a vector, I have same problem if 0.5 is first element, but not second. Here is the example for this case:
First case: 0.5 is the first probability referenced in the vector
set.seed(234908)
MNAR <- c(0.5,0.3)
x <- rbinom(n=1,size=60,prob=MNAR[1])
y <- rbinom(n=1,size=50,prob=MNAR[2])
temp1 <- .Random.seed
set.seed(234908)
MNAR <- c(0.1,0.3)
x <- rbinom(n=1,size=60,prob=MNAR[1])
y <- rbinom(n=1,size=50,prob=MNAR[2])
temp2 <- .Random.seed
any(temp1!=temp2)
> [1] TRUE
any(temp1!=temp2)
> [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE
> [8] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Second case: 0.5 is the second probability referenced in the vector
set.seed(234908)
MNAR <- c(0.3,0.5)
x <- rbinom(n=1,size=60,prob=MNAR[1])
y <- rbinom(n=1,size=50,prob=MNAR[2])
temp1 <- .Random.seed
set.seed(234908)
MNAR <- c(0.1,0.3)
x <- rbinom(n=1,size=60,prob=MNAR[1])
y <- rbinom(n=1,size=50,prob=MNAR[2])
temp2 <- .Random.seed
any(temp1!=temp2)
> [1] FALSE
Again, I find that despite the values used for prob
and size
, this pattern holds. Can anyone explain this mystery to me? It's causing quite a problem because results that should be the same are coming up different because the seed is for some reason used/calculated differently when prob=0.5
but in no other instance.
So let's turn our comments into an answer. Thanks to Ben Bolker for putting us on the right track with a link to the code: https://svn.r-project.org/R/trunk/src/nmath/rbinom.c and the suggestion to track down where unif_rand()
is called.
A quick scan and it seems that the code is broken into two sections, delimited by the comments:
/*-------------------------- np = n*p >= 30 : ------------------- */
and
/*---------------------- np = n*p < 30 : ------------------------- */
Inside each of these, the number of calls to unif_rand
is not the same (two versus one.)
So for a given size
(n
), your random seed may end up in a different state depending on the value of prob
(p
): whether size * prob >= 30
or not.
With that in mind, all the results you got with your examples should now make sense:
# these end up in the same state
rbinom(n=1,size=60,prob=0.4) # => np < 30
rbinom(n=1,size=60,prob=0.3) # => np < 30
# these don't
rbinom(n=1,size=60,prob=0.5) # => np >= 30
rbinom(n=1,size=60,prob=0.3) # => np < 30
# these don't
{rbinom(n=1,size=60,prob=0.5) # np >= 30
rbinom(n=1,size=50,prob=0.3)} # np < 30
{rbinom(n=1,size=60,prob=0.1) # np < 30
rbinom(n=1,size=50,prob=0.3)} # np < 30
# these do
{rbinom(n=1,size=60,prob=0.3) # np < 30
rbinom(n=1,size=50,prob=0.5)} # np < 30
{rbinom(n=1,size=60,prob=0.1) # np < 30
rbinom(n=1,size=50,prob=0.3)} # np < 30
I'm going to take a contrarian position on this question and claim that the expectations are not appropriate and are not supported by the documentation. The documentation does not make any claim about what side effects (specifically on .Random.seed
) can be expected by calling rbinom
, or how those side effects may or may not be the same in various cases.
rbinom
has three parameters: n
, size
, and prob
. Your expectation is that, for a random seed set before calling rbinom
, .Random.seed
will be the same after calling rbinom
for a given n
and any values of size
and prob
(or maybe any finite values of size
and prob
). You certainly realize that it would be different for different values of n
. rbinom
doesn't guarantee that or imply that.
Without knowing the internals of the function, this can't be known; as the other answer showed, the algorithm is different based on the product of size
and prob
. And the internals may change so these specific details may change.
At least, in this case, the resulting .Random.seed
will be the same after every call of rbinom
which has the same n
, size
and prob
. I can construct a pathological function for which this is not even true:
seedtweak <- function() {
if(floor(as.POSIXlt(Sys.time())$sec * 10) %% 2) {
runif(1)
}
invisible(NULL)
}
Basically, this function looks a whether the tenths of the second of the time is odd or even to decided whether or not to draw a random number. Run this function and .Random.seed
may or may not change:
rs <- replicate(10, {
set.seed(123)
seedtweak()
.Random.seed
})
all(apply(rs, 1, function(x) Reduce(`==`, x)))
The best you can (should?) hope for is that a given set of code with all the inputs/parameters the same (including the seed) will always give identical results. Expecting identical results when only most (or only some) of the parameters are the same is not realistic unless all the functions called make those guarantees.