-->

Code optimization + Generating sample data based o

2019-06-14 01:27发布

问题:

I am generating sample data for running a simulation where I need to take care of variance across the sample. I have written the code but I am not getting the variance as expected. Need some help on how to get this right. Also any suggestions on optimizing the code is most welcome!

So to start with I generate a sample data using below code -

library("data.table")
set.seed(1200)
N_Blocks = 100 #My actual data has this around 1500 which take time for below for loop so restricted this to 100
cyc=200
City <- vector()
selected <- vector()
Census <- vector()

  City <- sample(paste("City", formatC(1, width=nchar(cyc), flag="0"), sep=""),N_Blocks,rep=T)
  selected <- sample(0:1,N_Blocks,rep = T)
  Census <- sample(0:200,N_Blocks,rep = T)


df1 <- data.frame(City,selected,Census)
str(df1)

Now I need to repeat this data for 60 Months(5 years) and 200 sets, where variance across the months would be as below -

City001 - City050 - Variance of +- 5%

City051 - City100 - Variance of +- 10%

City101 - City150 - Variance of +- 15%

City151 - City200 - Variance of +- 20%

My database is big and I wanted to do using data.table, but since I was not able to, I have written a for loop as below -

df1  <- as.data.table(df1, row.names = NULL)

datalist <- list()

varlow <- 0.95
varhigh <- 1.05
sets=1
cyc=200
mov1 =13
M=72
seedno=1200

for (itr in 1:cyc){
  vec0 <- NULL
  vec0 <- as.vector(df1$Census)
  df1a <- df1

  set.seed(seedno)  ## seed for reproducability 
  for (m in mov1:M) {
    #set.seed(seedno)  ## seed for reproducability 
    for (l in 1:N_Blocks)  {

      vec0[l] <- ifelse(vec0[l]==0 , sample(0:3, 1, rep=T), 
                        sample(floor(vec0[l]*runif(1,varlow,1)):ceiling(vec0[l]*runif(1,1,varhigh)),1,rep=T))

    }

    df1a <- cbind(df1a, data.table(xx=vec0))
    names(df1a)[names(df1a)=="xx"]  <- paste0("M",m)
    df1a$varlow <- varlow
    df1a$varhigh <- varhigh
    df1a$set <- sets
    df1a$City <- sample(paste("City", formatC(itr, width=nchar(cyc), flag="0"), sep=""),N_Blocks,rep=T)


  }

  datalist[[itr]] <- df1a

  if(itr==50){
    varlow=0.90
    varhigh=1.10
    sets=2
  } 

  if(itr==100){
    varlow=0.85
    varhigh=1.15
    sets=3
  }

  if(itr==150){
    varlow=0.80
    varhigh=1.20
    sets=4
  }
}

df1_f <- NULL
df1_f = do.call(rbind, datalist)

This code generates the data, 200 sets of the same 100 records. However the variance across months is not +-5%,+-10%,+-15%,+-20% as per the sets.

If I check for the growth for each of the sets using below code, I see that the growth is not as expected, i.e the variance is not increasing.....

report1 <- df1_f[,.(M24=sum(M24),
                    M36=sum(M36),
                    M48=sum(M48),
                    M60=sum(M60),
                    M72=sum(M72)),by=set]

growth is from -2.1% to 1.8%, where as we have given the variance to go up to 20%.

Note - the values in the df1$Census needs to vary by +- 5% etc. I am storing this value in vec0 and using in the for loop.

I think I am missing something basic, how can I get the desired sample data with such variance for each set?

Thank you!!