I want to generate two data series of size 100 in R, one of which is going to be remission time, tr, from Exp(mean=1) distribution and the other one is going to be survival time, t, from Exp(mean=2.5) distribution. I want them to be negatively correlated (say, the correlation is -0.5). But at the same time I want that R avoids the values of t[i] that are less than tr[i] for data point i, because survival times should be greater than remission times. I have been able to produce some correlation between the two variables (although the correlation is not exactly reproduced) using the following codes:
rho <- -0.5
mu <- rep(0,2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2)*(1 - rho)
library(MASS)
rawvars <- mvrnorm(100, mu=mu, Sigma=Sigma)
pvars <- pnorm(rawvars)
tr<-rep(0,100)
for(i in 1:100){
tr[i] <- qexp(pvars[,1][i], 1/1)
}
t<-rep(0,100)
for(i in 1:100){
repeat {
t[i] <- qexp(pvars[,2][i], 1/2)
if (t[i]>tr[i]) break
}
}
cor(tr,t)
sum(tr>t) # shows number of invalid cases
But I don't understand how I should efficiently induce the condition so that R only generates values of t that are greater than corresponding tr. Moreover, is there a better way (faster way) to do the whole thing in R? Thanks in advanced for your response.
The issue here is that qexp
is the quantile function and will return the same value for the same probability pvars[,2][i]
. As a result, your code can easily go into an infinite loop when any one of the pvars[i,]
is such that t[i]<=tr[i]
. To avoid that, you must regenerate your rawvars
for each t[i], tr[i]
pair that fails your condition. In addition, looping over pvars
is not necessary since qexp
and operator >
are all vectorized. The following code does what you want:
rho <- -0.5
mu <- rep(0,2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2)*(1 - rho)
library(MASS)
set.seed(1) ## so that results are repeatable
compute.tr.t <- function(n, paccept) {
n <- round(n / paccept)
rawvars <- mvrnorm(n, mu=mu, Sigma=Sigma)
pvars <- pnorm(rawvars)
tr <- qexp(pvars[,1], 1/1)
t <- qexp(pvars[,2], 1/2)
keep <- which(t > tr)
return(data.frame(t=t[keep],tr=tr[keep]))
}
n <- 10000 ## generating 10000 instead of 100, this can now be large
paccept <- 1
res <- data.frame()
while (n > 0) {
new.res <- compute.tr.t(n, paccept)
res <- rbind(res, new.res)
paccept <- nrow(new.res) / n
n <- n - nrow(res)
}
Notes:
The function compute.tr.t
borrows a technique from rejection sampling here. Its input arguments are the requested number of samples that we want and the expected probability of acceptance. With this:
- It generates
n = n / paccept
exponential variates for both tr
and t
as you do to account for the probability of acceptance
- It only keeps those satisfying the condition
t > tr
.
What compute.tr.t
returns may be less than the requested n
samples. We can then use this information to compute how many more samples we need and what the updated expected probability of acceptance is.
We generate the samples satisfying our condition in a while
loop. In this loop:
- We call
compute.tr.t
with a requested number of samples to generate and the expected acceptance rate. Initially, these will be set to how many total samples we want and 1
, respectively.
- The result of
compute.tr.t
are then appended to the result data frame res
.
- Updating the probability of accept is simply the ratio of how many samples were returned over how many were requested.
- Updating the requested number of samples is simply how many more we need from the total number we want.
- We stop when the next requested number of samples is less than or equal to
0
(i.e., we have enough samples).
The resulting data frame may contain more than the total number of samples we want.
Running this code, we get:
print(cor(res$tr,res$t))
[1] -0.09128498
print(sum(res$tr>res$t)) # shows number of invalid cases
##[1] 0
We note that the anti correlation is significantly weaker than expected. This is due to your condition. If we remove this condition by modifying compute.tr.t
as:
compute.tr.t <- function(n, paccept) {
n <- round(n / paccept)
rawvars <- mvrnorm(n, mu=mu, Sigma=Sigma)
pvars <- pnorm(rawvars)
tr <- qexp(pvars[,1], 1/1)
t <- qexp(pvars[,2], 1/2)
return(data.frame(t=t,tr=tr))
}
Then we get:
print(cor(res$tr,res$t))
##[1] -0.3814602
print(sum(res$tr>res$t)) # shows number of invalid cases
##[1] 3676
The correlation is now much more reasonable, but the number of invalid cases is significant.