This is a long question, I know, but bear with me.
I have a dataset in this form:
head(TRAINSET)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 Y
1 -2.973012 -2.956570 -2.386837 -0.5861751 4e-04 0.44 0.0728 0.0307 0.0354 0.0078 0.0047 0.0100 -0.0022 0.0038 -0.005200012
2 -2.937649 -2.958624 -2.373960 -0.5636891 5e-04 0.44 0.0718 0.0323 0.0351 0.0075 0.0028 0.0095 -0.0019 0.0000 0.042085781
3 -2.984238 -2.937649 -2.428712 -0.5555258 2e-04 0.43 0.0728 0.0329 0.0347 0.0088 0.0018 0.0092 -0.0019 -0.0076 0.004577122
4 -2.976535 -2.970053 -2.443424 -0.5331107 9e-04 0.47 0.0588 0.0320 0.0331 0.0253 0.0011 0.0092 -0.0170 -0.0076 0.010515970
5 -2.979631 -2.962549 -2.468805 -0.5108256 6e-04 0.46 0.0613 0.0339 0.0333 -0.0005 -0.0006 0.0090 0.0060 -0.0058 0.058487141
6 -3.030536 -2.979631 -2.528079 -0.5024574 3e-04 0.43 0.0562 0.0333 0.0327 0.0109 -0.0006 0.0093 -0.0120 0.0000 -0.022896759
This is my training set, and it has 300 rows. The remaining 700 rows are the test set. What I am trying to accomplish is:
- For each predictor column, fit a simple linear model, starting with Y ~ X1.
- Use the fitted model to predict Y from the first row of X1 in the test set.
- After that, take the first row of the test set and rbind it to the training set (the training set now has 301 rows).
- Refit the model on the enlarged training set and predict Y using the 2nd row of X1 from the test set.
- Repeat until all 700 rows of the test set have been used.
- Apply the same procedure to the remaining variables of the dataset (X2, ..., X14).
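To make the intended procedure concrete, here is a minimal sketch of the steps above on synthetic data. The data, the helper name `recursive_predict`, and the use of `reformulate` to build each formula are my own illustrative assumptions, not code I have run on the real TRAINSET/TESTSET:

```r
# Synthetic stand-ins for TRAINSET / TESTSET (assumption: response column is "Y")
set.seed(42)
make_data <- function(n) {
  d <- as.data.frame(matrix(rnorm(n * 3), ncol = 3))
  names(d) <- c("X1", "X2", "X3")
  d$Y <- 0.5 * d$X1 + rnorm(n, sd = 0.1)
  d
}
train <- make_data(30)
test  <- make_data(10)

# Hypothetical helper: expanding-window, one-step-ahead predictions
# for a single predictor column named `xname`.
recursive_predict <- function(xname, train, test, yname = "Y") {
  f <- reformulate(xname, response = yname)   # builds e.g. Y ~ X1
  preds <- numeric(nrow(test))
  for (i in seq_len(nrow(test))) {
    fit <- lm(f, data = train)                              # fit on rows seen so far
    preds[i] <- predict(fit, newdata = test[i, , drop = FALSE])
    train <- rbind(train, test[i, , drop = FALSE])          # then absorb that test row
  }
  preds
}

# One column of predictions per predictor, one row per test row
predictors <- setdiff(names(train), "Y")
fitted_all <- sapply(predictors, recursive_predict, train = train, test = test)
dim(fitted_all)
```

The result I am after has this shape: a matrix with as many rows as the test set and one column per predictor.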
I have managed to produce accurate results with code that I wrote for one variable at a time:
fittedvaluess<-NULL #empty object to fill
for(i in 1:nrow(TESTSET)){ #begin iteration over the rows of the test set
  TRAINSET<-rbind(TRAINSET,TESTSET[i,]) #add the row to the training set
  LM<-lm(Y~X1,TRAINSET) #fit the ever-growing OLS
  predictd<-predict(LM,TESTSET[i+1,],type = "response") #get the predicted value
  fittedvaluess<-cbind(fittedvaluess,predictd) #collect the predicted values
  print(cbind(i,length(TRAINSET$LHS),length(TRAINSET$DP),nrow(TRAINSET))) #to make sure it works
}
However, I want to automate this so that it repeats over all the columns. This is my attempt:
data<-TRAINSET #work on a copy, because otherwise I had to rebuild the training set every time
fittedvaluesss<-NULL
for(i in 1:nrow(TESTSET)){ #begin iteration over the rows of the test set
  data<-rbind(data,TESTSET[i,]) #rbind the row to the training set copy called data
  for(j in 1:ncol(TESTSET)){ #iterate over the columns
    LM<-lm(data$LHS~data[,j],data) #fit OLS
    predictd<-predict(LM,TESTSET[i+1,j],type = "response") #get the predicted value
    fittedvaluesss<-cbind(fittedvaluesss,predictd) #collect the predicted values
    print(c(i,j)) #make sure it works
  }
}
Unfortunately the results are wrong: fittedvaluesss ends up as a huge matrix:
dim(fittedvaluesss)
[1] 2306 3167 #Stopped around the middle of its run
This doesn't make any sense. I have even run it for i in 1:3 and j in 1:3, and the matrix was still far too large. I have also tried iterating over the columns first and then over the rows, with exactly the same wrong results. For some reason, in each run I was getting at least 362 values from the predict function, where I expected a single value. I am really stuck on this problem.
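Here is a reduced reproduction of the symptom on made-up data, for anyone who wants to try it. The objects `d` and `new` are invented for this example. The second model, written with bare column names, is included only for comparison; whether that difference is what goes wrong in my loop is an assumption I have not confirmed:

```r
set.seed(1)
d   <- data.frame(X1 = rnorm(10), LHS = rnorm(10))  # made-up training data
new <- data.frame(X1 = 0.5)                          # a single made-up test row

# Formula written the way my loop writes it: the terms refer to `d` directly.
# predict() warns here about a row-count mismatch, so the warning is suppressed.
fit_dollar <- lm(d$LHS ~ d[, 1], data = d)
p_dollar   <- suppressWarnings(predict(fit_dollar, newdata = new))

# Formula written with bare column names, for comparison
fit_named <- lm(LHS ~ X1, data = d)
p_named   <- predict(fit_named, newdata = new)

length(p_dollar)  # as many values as rows in d
length(p_named)   # a single value
```

In this toy case the first call gives back one value per training row even though `new` has a single row, which looks like what I see in the loop.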
Any help is highly welcome.
EDIT 1: In finance this is also known as recursive forecasting. It is a method of forecasting future values from a model fitted to the data available so far.