I am generating sample data for running a simulation where I need to take care of variance across the sample. I have written the code but I am not getting the variance as expected. Need some help on how to get this right. Also any suggestions on optimizing the code is most welcome!
So to start with I generate a sample data using below code -
library("data.table")
set.seed(1200)
N_Blocks = 100 #My actual data has this around 1500 which take time for below for loop so restricted this to 100
cyc=200
City <- vector()
selected <- vector()
Census <- vector()
City <- sample(paste("City", formatC(1, width=nchar(cyc), flag="0"), sep=""),N_Blocks,rep=T)
selected <- sample(0:1,N_Blocks,rep = T)
Census <- sample(0:200,N_Blocks,rep = T)
df1 <- data.frame(City,selected,Census)
str(df1)
Now I need to repeat this data for 60 Months(5 years) and 200 sets, where variance across the months would be as below -
City001 - City050 - Variance of +- 5%
City051 - City100 - Variance of +- 10%
City101 - City150 - Variance of +- 15%
City151 - City200 - Variance of +- 20%
My database is big and I wanted to do using data.table, but since I was not able to, I have written a for loop as below -
df1 <- as.data.table(df1, row.names = NULL)
datalist <- list()
varlow <- 0.95
varhigh <- 1.05
sets=1
cyc=200
mov1 =13
M=72
seedno=1200
for (itr in 1:cyc){
vec0 <- NULL
vec0 <- as.vector(df1$Census)
df1a <- df1
set.seed(seedno) ## seed for reproducability
for (m in mov1:M) {
#set.seed(seedno) ## seed for reproducability
for (l in 1:N_Blocks) {
vec0[l] <- ifelse(vec0[l]==0 , sample(0:3, 1, rep=T),
sample(floor(vec0[l]*runif(1,varlow,1)):ceiling(vec0[l]*runif(1,1,varhigh)),1,rep=T))
}
df1a <- cbind(df1a, data.table(xx=vec0))
names(df1a)[names(df1a)=="xx"] <- paste0("M",m)
df1a$varlow <- varlow
df1a$varhigh <- varhigh
df1a$set <- sets
df1a$City <- sample(paste("City", formatC(itr, width=nchar(cyc), flag="0"), sep=""),N_Blocks,rep=T)
}
datalist[[itr]] <- df1a
if(itr==50){
varlow=0.90
varhigh=1.10
sets=2
}
if(itr==100){
varlow=0.85
varhigh=1.15
sets=3
}
if(itr==150){
varlow=0.80
varhigh=1.20
sets=4
}
}
df1_f <- NULL
df1_f = do.call(rbind, datalist)
This code generates the data, 200 sets of the same 100 records. However the variance across months is not +-5%,+-10%,+-15%,+-20% as per the sets.
If I check for the growth for each of the sets using below code, I see that the growth is not as expected, i.e the variance is not increasing.....
report1 <- df1_f[,.(M24=sum(M24),
M36=sum(M36),
M48=sum(M48),
M60=sum(M60),
M72=sum(M72)),by=set]
growth is from -2.1% to 1.8%, where as we have given the variance to go up to 20%.
Note - the values in the df1$Census needs to vary by +- 5% etc. I am storing this value in vec0 and using in the for loop.
I think I am missing something basic, how can I get the desired sample data with such variance for each set?
Thank you!!