I am writing loops or functions in R, and I still haven't really understood how to do that. Currently, I need to write a loop/function (not sure which one would be better) to create several linear regression models within the same data frame.
I have data like this:
dataset <- read.table(text =
"ID A_2 B_2 C_2 A_1 B_1 C_1 AGE
M1 10 6 6 8 8 9 25
M2 50 69 54 67 22 44 16
M3 5 80 44 78 5 55 18
M4 60 70 52 89 3 56 28
M5 60 5 34 90 80 56 34
M6 55 55 67 60 100 77 54", header = TRUE, stringsAsFactors = FALSE)
I am building models like this:
model1 <- lm(A_2~A_1+age, data=dataset)
model2 <- lm(B_2~B_1+age, data=dataset)
model3 <- lm(C_2~C_1+age, data=dataset)
I need to write a loop which:
- takes
variable _2
(the dependent variable) and variable _1
(independent variable) and covariates like age
...
- creates the
lm
models, and stores outputs (i.e, T-value, p-value, confidence intervals etc) in a data.frame that I can then print.
Dep_va Ind_var Convarites Pvalue "upper.cI" "low.cI"
A_2 A_1 age
B_2 B_1 age
C_2 C_1 age
D_2 D_1 age
Here is a base R
approach to the problem with lapply
loops.
First if you want to automatically extract the variable names ending in _2
which should be all of your dependent variables you could implement the following code:
dep_vars<-grep("_2$",colnames(dataset),value = T) #This selects all variables ending in `_2` which should all be dependent variables.
reg_vars<-gsub("_2$","",dep_vars) #This removes the `_2` from the dependent variables which should give you the common stem which can be used to select both dependent and independent variables from your data frame.
Then you can use this in your lapply
loop for creating your formulas:
full_results <- lapply(reg_vars, function(i) summary(lm(paste0("log(",i,"_2)~",i,"_1+AGE"),data=dataset)))
Now you will have a list of linear regression summaries where you can extract the info you want. I think this is what you want for the output but please clarify if not:
print_results<-lapply(full_results,function(i) data.frame(
Dep_va = names(attributes(i[["terms"]])$dataClasses)[1],
Ind_var = names(attributes(i[["terms"]])$dataClasses)[2],
Covariates = names(attributes(i[["terms"]])$dataClasses)[3],
Pvalue = i[["coefficients"]][2,4],
upper.cI = i[["coefficients"]][2,1]+1.96*i[["coefficients"]][2,2],
low.cI = i[["coefficients"]][2,1]-1.96*i[["coefficients"]][2,2]))
That code will give you a list of data frames and if you want to combine it into one data.frame
:
final_results<-do.call("rbind",print_results)
Output Results:
Dep_va Ind_var Covariates Pvalue upper.cI low.cI
1 A_2 A_1 AGE 0.25753805 1.113214 -0.1877324
2 B_2 B_1 AGE 0.68452053 1.211355 -1.9292236
3 C_2 C_1 AGE 0.04827506 1.497688 0.3661343
Hope that helps! Let me know if you were looking for different output results.
Here's a tidyverse approach:
library(tidyverse)
dataset %>%
gather(col, val, -ID, -AGE) %>%
separate(col, c("name", "num")) %>%
spread(num, val) %>%
group_by(name) %>%
group_map(~lm(`2` ~ `1` + AGE, data = .x) %>% broom::tidy())
# A tibble: 9 x 6
# Groups: name [3]
name term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A (Intercept) -15.0 31.5 -0.477 0.666
2 A `1` 0.463 0.332 1.39 0.258
3 A AGE 0.851 0.731 1.16 0.329
4 B (Intercept) 49.1 52.5 0.935 0.419
5 B `1` -0.359 0.801 -0.448 0.685
6 B AGE 0.391 2.47 0.159 0.884
7 C (Intercept) 5.42 13.9 0.390 0.723
8 C `1` 0.932 0.289 3.23 0.0483
9 C AGE -0.299 0.470 -0.635 0.570
Explanation:
- Move data to long/tidy format with
gather()
separate
the variable category (in the example data, A
, B
, etc)
- Create a separate column for the IV and DV with
spread()
- Use
group_by()
and group_map()
to apply lm()
to each variable category.
Given your particular question, where you have variable names with common bases that are labelled with _2 and _1 to designate the dependent and independent variables respectively, you could solve your issue as follows:
var.names <- names(dataset[!names(dataset) %in% c("ID","AGE")])
names.list <- strsplit(var.names,split = "_")
list.of.models <- list()
for (i in 1:length(names.list)) {
DV <- grep(names.list[[i]][1], names(dataset))[1]
IV <- grep(names.list[[i]][1], names(dataset))[2]
list.of.models[[i]] <- lm(dataset[,DV] ~ dataset[,IV] + AGE, data = dataset)
}
summary(list.of.models[[1]])
Call:
lm(formula = dataset[, DV] ~ dataset[, IV] + AGE, data = dataset)
Residuals:
1 2 3 4 5 6
0.07496 20.42938 -31.36213 10.04093 4.47412 -3.65725
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.0439 31.5482 -0.477 0.666
dataset[, IV] 0.4627 0.3319 1.394 0.258
AGE 0.8507 0.7314 1.163 0.329
Residual standard error: 22.62 on 3 degrees of freedom
Multiple R-squared: 0.5276, Adjusted R-squared: 0.2127
F-statistic: 1.676 on 2 and 3 DF, p-value: 0.3246