First off, I am pretty new to this, so my method or thinking may be wrong. I have imported an xlsx data set into a data frame using R and RStudio. I want to loop through the column names to get all of the variables with exactly "10" in them, in order to run a simple linear regression with each one. So here's my code:
indx <- grepl('_10_', colnames(data)) #logical vector, TRUE for every column name containing '_10_'
col10 <- names(data[indx]) #this gives me the names of the columns I want
Here is the for loop I have which returns an error:
temp <- c()
for(i in 1:length(col10)){
temp = col10[[i]]
lm.test <- lm(Total_Transactions ~ temp[[i]], data = data)
print(temp) #actually prints out the right column names
i + 1
}
Is it even possible to run a loop to place those variables in the linear regression model? The error I am getting is: "Error in model.frame.default(formula = Total_Transactions ~ temp[[i]], : variable lengths differ (found for 'temp[[i]]')". If anyone could point me in the right direction I would be very grateful. Thanks.
Ok, I'll post an answer. I will use the dataset mtcars as an example. I believe it will work with your dataset.
First, I create a store, lm.test, an object of class list. In your code you are assigning the output of lm(.) to the same variable every time through the loop, so in the end you would only have the last one; all the others would have been overwritten by the newer ones.
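Then, inside the loop, I use the function reformulate to put together the regression formula. In your formula, temp[[i]] is evaluated as a character string of length one, not as the column it names, which is why lm complains that the variable lengths differ. There are other ways of building the formula but this one is simple.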
# Use just some columns
data <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt")]
col10 <- names(data)[-1]
lm.test <- vector("list", length(col10))
for(i in seq_along(col10)){
lm.test[[i]] <- lm(reformulate(col10[i], "mpg"), data = data)
}
lm.test
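To connect this back to the question, here is a minimal sketch on simulated data. The data frame `toy` and the column names `Visits_10_A`, `Visits_10_B` and `Visits_20_A` are made up for illustration; only `Total_Transactions` and the `'_10_'` pattern come from the question. It shows the formula that reformulate builds from a column name held in a string.

```r
set.seed(1)

# Hypothetical stand-in for the imported spreadsheet; these column names are invented
toy <- data.frame(
  Total_Transactions = rnorm(20),
  Visits_10_A = rnorm(20),
  Visits_10_B = rnorm(20),
  Visits_20_A = rnorm(20)
)

# Names of the columns that contain "_10_"
col10 <- grep("_10_", names(toy), value = TRUE)

# reformulate() turns a character string into a proper formula
f <- reformulate(col10[1], response = "Total_Transactions")
print(f)

# One regression per selected column, stored in a list named by predictor
fits <- lapply(col10, function(v) {
  lm(reformulate(v, response = "Total_Transactions"), data = toy)
})
names(fits) <- col10
```

Naming the list elements after the predictors makes it easier to look up a particular model later, for example `fits[["Visits_10_A"]]`.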
Now you can use the results list for all sorts of things. I suggest you start using lapply and friends for that.
For instance, to extract the coefficients:
cfs <- lapply(lm.test, coef)
In order to get the summaries:
smry <- lapply(lm.test, summary)
It becomes very simple once you're familiar with the *apply functions.
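As one possible use of the results list, the sketch below collects the slope and R-squared of each model into a single data frame. It rebuilds the mtcars models with lapply so that it runs on its own; the name `results` and its column names are just illustrative choices.

```r
# Same subset of mtcars as above
data <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt")]
col10 <- names(data)[-1]

# Fit one simple regression per predictor
lm.test <- lapply(col10, function(v) lm(reformulate(v, "mpg"), data = data))
names(lm.test) <- col10

# The second coefficient is the slope of the single predictor
slopes <- sapply(lm.test, function(m) coef(m)[[2]])
r2 <- sapply(lm.test, function(m) summary(m)$r.squared)

results <- data.frame(predictor = col10, slope = slopes,
                      r.squared = r2, row.names = NULL)
results
```

sapply is used here instead of lapply because each model contributes a single number, so the output simplifies to a plain numeric vector.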
You can create a temporary subset in which you select only the columns used in your regression. This way, you won't need to inject the temporary name in the formula.
Sticking to your code, this should do the trick.
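lm.test <- vector("list", length(col10)) # one slot per model, so nothing is overwritten
for(i in seq_along(col10)){
  # keep only the response and the current predictor
  tempSubset <- data[, c("Total_Transactions", col10[i])]
  # "." stands for every column of tempSubset other than the response
  lm.test[[i]] <- lm(Total_Transactions ~ ., data = tempSubset)
}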
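As a quick check, the subset approach and the reformulate approach should give the same fit. This sketch uses mtcars rather than the question's data, with `mpg` playing the role of `Total_Transactions` and `wt` as an arbitrary predictor.

```r
data <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt")]

# Subset approach: "." picks up the only remaining column, wt
tempSubset <- data[, c("mpg", "wt")]
fit_dot <- lm(mpg ~ ., data = tempSubset)

# Formula built from the column name
fit_ref <- lm(reformulate("wt", "mpg"), data = data)

# Both models have the same coefficients
same <- isTRUE(all.equal(coef(fit_dot), coef(fit_ref)))
print(same)
```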