How to apply a regression in a for loop for all th

Posted 2019-09-08 12:27

Question:

This is a long question, I know, but bear with me.

I have a dataset in this form:

    head(TRAINSET)
         X1        X2        X3      X4      X5    X6    X7     X8    X9     X10     X11    X12     X13        X14        Y
1 -2.973012 -2.956570 -2.386837 -0.5861751 4e-04 0.44 0.0728 0.0307 0.0354  0.0078  0.0047 0.0100 -0.0022   0.0038 -0.005200012
2 -2.937649 -2.958624 -2.373960 -0.5636891 5e-04 0.44 0.0718 0.0323 0.0351  0.0075  0.0028 0.0095 -0.0019   0.0000  0.042085781
3 -2.984238 -2.937649 -2.428712 -0.5555258 2e-04 0.43 0.0728 0.0329 0.0347  0.0088  0.0018 0.0092 -0.0019  -0.0076  0.004577122
4 -2.976535 -2.970053 -2.443424 -0.5331107 9e-04 0.47 0.0588 0.0320 0.0331  0.0253  0.0011 0.0092 -0.0170  -0.0076  0.010515970
5 -2.979631 -2.962549 -2.468805 -0.5108256 6e-04 0.46 0.0613 0.0339 0.0333 -0.0005 -0.0006 0.0090  0.0060  -0.0058  0.058487141
6 -3.030536 -2.979631 -2.528079 -0.5024574 3e-04 0.43 0.0562 0.0333 0.0327  0.0109 -0.0006 0.0093 -0.0120   0.0000 -0.022896759

This is my Train set, which has 300 rows. The remaining 700 rows are the Test set. What I am trying to accomplish is:

  1. For each column, fit a linear model of the form Y ~ X1.
  2. Use that model to predict the value of Y from the first X1 value of the Test set.
  3. After that, take the first row of the Test set and rbind it to the Train set (the Train set now has 301 rows).
  4. Refit the model and predict the value of Y from the second X1 value of the Test set.
  5. Repeat for the remaining rows of the Test set.
  6. Apply the same procedure to all the remaining variables of the dataset (X2,...,X14).

I have managed to produce accurate results with code that I wrote for one variable at a time:

fittedvaluess<-NULL   #empty object to fill
for(i in 1:nrow(TESTSET)){      #begin iteration over the rows of the Test set
  TRAINSET<-rbind(TRAINSET,TESTSET[i,]) #add the current row to the Train set
  LM<-lm(Y~X1,TRAINSET)               #fit the ever-growing OLS
  predictd<-predict(LM,TESTSET[i+1,],type = "response") #get the predicted value for the next row
  fittedvaluess<-cbind(fittedvaluess,predictd) #collect the predicted values
  print(cbind(i,length(TRAINSET$LHS),length(TRAINSET$DP),nrow(TRAINSET))) #to make sure it works
}

However, I want to automate this so that it repeats over the columns. I wrote this:

data<-TRAINSET #copy, because otherwise I had to rebuild the Train set every time
fittedvaluesss<-NULL
for(i in 1:nrow(TESTSET)){         #begin iteration over the rows of the Test set
  data<-rbind(data,TESTSET[i,])    #rbind the row to the Train set copy called data
  for(j in 1:ncol(TESTSET)){       #iterate over the columns
    LM<-lm(data$LHS~data[,j],data)  #fit OLS
    predictd<-predict(LM,TESTSET[i+1,j],type = "response") #get the predicted value
    fittedvaluesss<-cbind(fittedvaluesss,predictd) #collect the predicted value
    print(c(i,j)) #make sure it works
  }
}

The results are unfortunately wrong: fittedvaluesss is a huge matrix:

 dim(fittedvaluesss)
[1] 2306 3167 #Stopped around the middle of its run

That doesn't make any sense. I even ran it for

i in 1:3
and
j in 1:3 

and the matrix was still far too large. I also tried iterating over the columns first and then over the rows, with exactly the same wrong results. For some reason, each call to predict() was returning at least 362 values instead of one. I am really stuck on this problem.

Any help is highly welcome.

EDIT 1: This is also known as a RECURSIVE FORECASTING methodology in Finance. It is a method to forecast future values from a model fitted on your current dataset.
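The procedure described in the numbered steps of the question can be sketched on simulated data as follows. This is only an illustration under assumptions: the data frame `sim`, the helper `recursive_forecast()` and the sizes (30 training rows, 10 test rows, 2 predictors) are hypothetical stand-ins, not the original TRAINSET and TESTSET. It follows the numbered steps (fit on the current window, then predict the next unseen test row), which is not necessarily the same indexing as the loops posted above.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): 2 predictors and a response Y
n <- 40
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$Y <- 0.5 * sim$X1 - 0.2 * sim$X2 + rnorm(n, sd = 0.1)
train <- sim[1:30, ]
test  <- sim[31:40, ]

# One-step-ahead forecasts of Y from a single predictor, expanding window
recursive_forecast <- function(train, test, xname) {
  form <- reformulate(xname, response = "Y")   # builds Y ~ X1, Y ~ X2, ...
  preds <- numeric(nrow(test))
  for (i in seq_len(nrow(test))) {
    # training rows plus the test rows already observed (rows 1..i-1)
    window <- rbind(train, test[seq_len(i - 1), , drop = FALSE])
    fit <- lm(form, data = window)
    # predict Y for test row i from its own predictor value
    preds[i] <- predict(fit, newdata = test[i, , drop = FALSE])
  }
  preds
}

predictors <- setdiff(names(train), "Y")
forecasts <- sapply(predictors, function(x) recursive_forecast(train, test, x))
dim(forecasts)  # one row per test row, one column per predictor
```

Because the formula uses bare column names together with `data =`, `predict()` can match `newdata` to the model and returns exactly one value per call.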

Answer 1:

Consider reversing your looping logic, with columns in the outer loop and rows in the inner loop. Additionally, try nested apply functions, which return structures better aligned to your needs than the for loop does. Specifically, the inner vapply() returns a numeric vector of predicted values over all Test set rows for each iterated column. The outer sapply() then binds each returned vector as a column of a matrix.

Ultimately, fittedvaluess is a matrix with dimensions nrow(TESTSET) x (ncol(TESTSET) - 1). The outer loop leaves out the last column since you do not regress Y on Y.

fittedvaluess <- sapply(1:(ncol(TESTSET)-1), function(c){

  col <- names(TESTSET)[[c]]                     # RETRIEVE COLUMN NAME FOR LM FORMULA

  predictvals <- vapply(1:nrow(TESTSET), function(r){      
    TRAINSET <- rbind(TRAINSET, TESTSET[1:r,])   # BINDING ROWS ON AND PRIOR TO CURRENT ROW
    LM <- lm(paste0("Y~", col), TRAINSET)        # CONCATENATED STRING FORMULA
    predictd <- predict(LM, TESTSET[r+1,], type="response")
  }, numeric(1))

})

Why sapply and vapply?

Both sapply() and vapply() are close relatives of lapply(). sapply() (simplified lapply) can return a vector or a matrix, and falls back to a list when the results cannot be simplified. vapply() (verified lapply) makes you declare the type and length of each returned value in advance, so it requires a third argument, FUN.VALUE, specifying those criteria. Here, we choose a numeric vector of length one: numeric(1). Because of this pre-specification, vapply() is safer and can run faster than sapply() in some cases. Had we chosen only the general lapply(), we would need to cast and convert its list output to get a matrix. In a way, we could have used nested vapply() calls!
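A small sketch of the difference in return shapes, using toy inputs that are not part of the original data (all object names here are illustrative):

```r
squares_list <- lapply(1:3, function(i) i^2)              # always a list
squares_vec  <- sapply(1:3, function(i) i^2)              # simplified to a numeric vector
squares_chk  <- vapply(1:3, function(i) i^2, numeric(1))  # same values, type and length checked

# sapply() quietly returns a list when the results have unequal lengths
ragged <- sapply(1:3, function(i) seq_len(i))

# vapply() stops with an error instead, because each result must have length 1
strict <- tryCatch(
  vapply(1:3, function(i) seq_len(i), numeric(1)),
  error = function(e) "error"
)

# Nested calls: the inner vapply() gives one vector per outer index,
# and the outer sapply() binds those vectors as columns of a matrix
m <- sapply(1:2, function(j) vapply(1:3, function(r) r * j, numeric(1)))
dim(m)  # 3 rows (inner index), 2 columns (outer index)
```

The strictness is the point of vapply() here: if predict() ever returned more than one value per row, the call would fail immediately rather than produce an oddly shaped result.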



Answer 2:

I solved it with the code below, which is a minor variation of my original code, except that I didn't use predict():

#EXPAND IT INTO DOING SO IN ALL COLUMNS
data<-TRAINSET
fittedvaluesss<-NULL
for(i in 1:nrow(TESTSET)){ #go each row
  data<-rbind(data,TESTSET[i,]) #update the dataset
  for(j in 1:ncol(TESTSET)){ #repeat the following for each column
    LM<-lm(data$LHS~data[,j])   #OLS reg
    predictd<-coef(LM)[1]+coef(LM)[2]*TESTSET[i+1,j] #Simply apply the formula yourself A+Bx for each new iteration
    #predict(LM,TESTSET[i+1,j],type = "response")
    print(length(predictd)) #makes sure it is ONE value
    fittedvaluesss<-c(fittedvaluesss,predictd)
    print(c(i,j))
  }
}
matrixa<-matrix(fittedvaluesss,15,648) #put the values in a matrix: Note that the Ypreds are in every row
matrixa<-t(matrixa) #transpose in order to have each Ypred from a var in a column

The reason this works is that, in my initial code, predict() did not return a single value per iteration: it returned a small matrix of size 361x15 on every call. So I dropped predict() and applied the coefficients myself. This seems to return the correct forecasts.
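A likely explanation for the oversized output of predict(), sketched here on simulated data (the objects `d` and `new_row` are hypothetical): when the formula is written as `d$Y ~ d$X1`, the model terms are literally `d$Y` and `d$X1`, so predict() cannot find them in `newdata`. It then evaluates them from the original data, warns, and returns one value per training row. Writing the formula with bare column names and `data =` avoids this.

```r
set.seed(42)
d <- data.frame(X1 = rnorm(20))
d$Y <- 1 + 2 * d$X1 + rnorm(20, sd = 0.1)
new_row <- data.frame(X1 = 0.5)

# Formula written with d$...: newdata cannot be matched to the model terms
fit_dollar <- lm(d$Y ~ d$X1)
p_dollar <- suppressWarnings(predict(fit_dollar, newdata = new_row))
length(p_dollar)  # 20, one value per training row

# Formula written with bare column names plus data =
fit_named <- lm(Y ~ X1, data = d)
p_named <- predict(fit_named, newdata = new_row)
length(p_named)   # 1

# Applying the coefficients by hand gives the same single prediction
manual <- coef(fit_named)[1] + coef(fit_named)[2] * new_row$X1
all.equal(unname(p_named), unname(manual))
```

Under this reading, the manual A+Bx formula works because it never relies on predict() matching column names.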