Consider the following example data. Each dataset contains a number of samples, each with one observation and one estimate:
library(tidyverse)
library(broom)
data = read.table(text = '
dataset sample_id observation estimate
A A1 4.8 4.7
A A2 4.3 4.5
A A3 3.1 2.9
A A4 2.1 2
A A5 1.1 1
B B1 4.5 4.3
B B2 3.9 4.1
B B3 2.9 3
B B4 1.8 2
B B5 1 1.2
', header = TRUE)
I want to calculate a linear model per dataset to remove any linear bias between observation and estimate, and get the fitted values next to the original ones:
data %>%
  group_by(dataset) %>%
  do(lm(observation ~ estimate, data = .) %>% augment)
What this does, however, is remove the sample_id column, which I need to keep in order to continue calculations with this dataset based on that unique ID:
# A tibble: 10 x 10
# Groups: dataset [2]
dataset observation estimate .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 4.80 4.70 4.68 0.107 0.115 0.478 0.152 0.491 1.04
2 A 4.30 4.50 4.49 0.0996 -0.193 0.416 0.0609 0.957 -1.64
3 A 3.10 2.90 2.97 0.0693 0.135 0.201 0.156 0.120 0.976
4 A 2.10 2.00 2.11 0.0849 -0.00583 0.303 0.189 0.000444 -0.0452
5 A 1.10 1.00 1.15 0.120 -0.0508 0.602 0.180 0.206 -0.521
6 B 4.50 4.30 4.31 0.109 0.191 0.468 0.0597 1.20 1.65
7 B 3.90 4.10 4.09 0.100 -0.193 0.396 0.0844 0.798 -1.56
8 B 2.90 3.00 2.91 0.0713 -0.00630 0.201 0.195 0.000247 -0.0443
9 B 1.80 2.00 1.83 0.0898 -0.0275 0.319 0.193 0.0103 -0.210
10 B 1.00 1.20 0.964 0.125 0.0355 0.616 0.191 0.104 0.360
How can I keep the additional column from my original dataset?
I have seen this answer, which uses nest to collapse the data first, but I still only get the model parameters using this approach. I guess I could extract the parameters per dataset:
linear_adj_model <- function(df) lm(observation ~ estimate, data = df)  # same model as above

data %>%
  group_by(dataset) %>%
  nest() %>%
  mutate(
    mod = map(data, linear_adj_model),
    pars = map(mod, tidy)
  ) %>%
  unnest(pars) %>%
  select(dataset, term, estimate) %>%
  spread(term, estimate)
… which gives me this:
# A tibble: 2 x 3
dataset `(Intercept)` estimate
* <fct> <dbl> <dbl>
1 A 0.196 0.955
2 B -0.330 1.08
… and then left-join that with the original data, and then mutate each estimate to get the linearly-adjusted one, but that seems much too complicated.
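Roughly what those remaining steps would look like (a sketch, assuming the coefficient table above is saved as coefs; intercept, slope, and estimate_adj are just names chosen to avoid clashing with the original estimate column):

coefs %>%
  rename(intercept = `(Intercept)`, slope = estimate) %>%
  left_join(data, ., by = "dataset") %>%
  mutate(estimate_adj = intercept + slope * estimate)   # linearly-adjusted estimate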
Another ugly hack I've found is to add the column as a dummy variable to the model:
data %>%
  group_by(dataset) %>%
  do(lm(observation ~ estimate + 0 * sample_id, data = .) %>% augment)
Is there an easier (tidy) solution that does not involve manually specifying the variables I want to keep?
This is essentially the same as Markus's answer but perhaps a little cleaner.
You could use broom::augment_columns instead of augment. The two arguments of the function we need are x -- "a model" -- and data -- "original data onto which columns should be added". The idea is to split the data by dataset, fit a model to each component of the list, and then use map2 to iterate in parallel over the models and the (complete) data used for model building, i.e. the outcome of split(data, f = data$dataset).
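A sketch of that pipeline (assuming a final bind_rows() to stitch the per-dataset results back together):

data %>%
  split(f = data$dataset) %>%                       # one data frame per dataset
  map(~ lm(observation ~ estimate, data = .x)) %>%  # one model per dataset
  map2(split(data, f = data$dataset),               # model + its original data, in parallel
       augment_columns) %>%
  bind_rows() %>%
  select(-.rownames)                                # drop the extra .rownames column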
augment_columns adds a .rownames column, hence the select in the last line.

Edit: The same solution, but hopefully easier to read.
The first code block, rewritten as a function that has four arguments: df, split_var, dependend_var, and explanatory_var.
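A sketch of what such a wrapper could look like (the name augment_by_group and the use of reformulate() to build the formula are just one way to do it):

augment_by_group <- function(df, split_var, dependend_var, explanatory_var) {
  fml    <- reformulate(explanatory_var, response = dependend_var)  # e.g. observation ~ estimate
  parts  <- split(df, f = df[[split_var]])
  models <- map(parts, ~ lm(fml, data = .x))
  map2(models, parts, augment_columns) %>%
    bind_rows() %>%
    select(-.rownames)
}

augment_by_group(data, "dataset", "observation", "estimate")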
This approach also relies on the use of a dummy variable in the model, but it does it in a column-agnostic way. By defining a new dummy column, row, you can left_join() your original data back onto the results of augment(), restoring an arbitrary number of columns without manually specifying them.
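A sketch of this approach (reusing the 0 * dummy trick from the question, with a generated row number as the dummy; data_keyed is just a helper name):

data_keyed <- data %>% mutate(row = row_number())   # unique id per row

data_keyed %>%
  group_by(dataset) %>%
  do(lm(observation ~ estimate + 0 * row, data = .) %>% augment()) %>%
  ungroup() %>%
  select(-observation, -estimate) %>%               # avoid .x/.y duplicates after the join
  left_join(data_keyed, by = c("dataset", "row")) %>%
  select(-row)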
I find this a bit more readable than the other solutions, but it is still a bit hackish. Getting rid of the duplicate columns from the left_join is a bit tedious. You presumably don't want columns like observation.x and estimate.y in your output, which you will have unless you do the select(-observation, -estimate) part.

How about this:
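Perhaps something along these lines (a sketch; bind_cols() relies on augment() keeping each group's row order):

data %>%
  group_by(dataset) %>%
  do(bind_cols(., lm(observation ~ estimate, data = .) %>%
                    augment() %>%
                    select(starts_with("."))))      # keep only the new .fitted, .resid, ... columns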
Too ugly still?