Within-group differences from a group member

Posted 2020-03-30 03:50

Question:

I have measurements for different treatments of an experiment that ran over several rounds, like so:

set.seed(1)
df <- data.frame(treatment = rep(c('baseline', 'treatment 1', 'treatment 2'), 
                                 times=5),
                 round = rep(1:5, each=3),
                 measurement1 = rep(1:5, each=3) + rnorm(15),
                 measurement2 = rep(1:5, each=3) + rnorm(15))

df

#      treatment round measurement1 measurement2
# 1     baseline     1    0.3735462    0.9550664
# 2  treatment 1     1    1.1836433    0.9838097
# 3  treatment 2     1    0.1643714    1.9438362
# 4     baseline     2    3.5952808    2.8212212
# 5  treatment 1     2    2.3295078    2.5939013
# 6  treatment 2     2    1.1795316    2.9189774
# 7     baseline     3    3.4874291    3.7821363
# 8  treatment 1     3    3.7383247    3.0745650
# 9  treatment 2     3    3.5757814    1.0106483
# 10    baseline     4    3.6946116    4.6198257
# 11 treatment 1     4    5.5117812    3.9438713
# 12 treatment 2     4    4.3898432    3.8442045
# 13    baseline     5    4.3787594    3.5292476
# 14 treatment 1     5    2.7853001    4.5218499
# 15 treatment 2     5    6.1249309    5.4179416

What I would like is a data.frame that contains the differences in the two measurements between each of the treatments and the baseline for each round. That is, grouped by round, I would like the respective measurement in the baseline treatment subtracted from each of the two measurements.

I'd prefer a dplyr solution if one exists but will accept anything that borders on elegant.

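Answer 1:

You can use mutate_each for that (mutate_each and funs are deprecated in current dplyr releases in favour of across, but the idea is the same):

library(dplyr)

df %>%
  group_by(round) %>%
  # subtract the baseline value of the same round from every measurement column
  mutate_each(funs(. - .[treatment == "baseline"]), -treatment) %>%
  filter(treatment != "baseline")

which gives the following (the output was produced from a different random draw than the data printed in the question, so the numbers differ):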

Source: local data frame [10 x 4]
Groups: round [5]

    treatment round measurement1 measurement2
       (fctr) (int)        (dbl)        (dbl)
1  treatment1     1     1.558820   -0.6584485
2  treatment2     1    -0.068677    1.3364462
3  treatment1     2     1.769312   -0.2732490
4  treatment2     2     0.801357   -1.4852449
5  treatment1     3    -1.064394   -1.1513703
6  treatment2     3     2.433222   -0.7939903
7  treatment1     4     0.448744    0.1394982
8  treatment2     4    -1.066922   -1.1410085
9  treatment1     5     1.182761   -0.8311095
10 treatment2     5     0.138005    0.2622119
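
For readers on a recent dplyr, the same idea can be sketched with across, which replaced mutate_each and funs. This is a minimal sketch rather than part of the original answer; it assumes dplyr 1.0.0 or later and reuses the data from the question, and the name res is only illustrative.

```r
library(dplyr)

# Same data as in the question
set.seed(1)
df <- data.frame(treatment = rep(c('baseline', 'treatment 1', 'treatment 2'),
                                 times = 5),
                 round = rep(1:5, each = 3),
                 measurement1 = rep(1:5, each = 3) + rnorm(15),
                 measurement2 = rep(1:5, each = 3) + rnorm(15))

res <- df %>%
  group_by(round) %>%
  # within each round, subtract that round's baseline value from every
  # column whose name starts with "measurement"
  mutate(across(starts_with("measurement"),
                ~ .x - .x[treatment == "baseline"])) %>%
  filter(treatment != "baseline") %>%
  ungroup()

res
```

The result has one row per non-baseline treatment and round, with the measurement columns overwritten by their differences from the baseline.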

If you want to add the differences to your data frame as new columns (just as @akrun did in his dplyr / tidyr alternative), you could also do:

df %>%
  group_by(round) %>%
  mutate(diff1 = measurement1 - measurement1[treatment == "baseline"],
         diff2 = measurement2 - measurement2[treatment == "baseline"]) %>%
  filter(treatment != "baseline")

which gives (again for the answerer's own random draw, not the data printed in the question):

Source: local data table [10 x 6]

    treatment round measurement1 measurement2     diff1      diff2
       (fctr) (int)        (dbl)        (dbl)     (dbl)      (dbl)
1  treatment1     1     2.630392    -0.104258  1.558820 -0.6584485
2  treatment2     1     1.002895     1.890637 -0.068677  1.3364462
3  treatment1     2     3.822473     3.147443  1.769312 -0.2732490
4  treatment2     2     2.854518     1.935447  0.801357 -1.4852449
5  treatment1     3     1.520553     3.291122 -1.064394 -1.1513703
6  treatment2     3     5.018169     3.648502  2.433222 -0.7939903
7  treatment1     4     4.956380     4.544908  0.448744  0.1394982
8  treatment2     4     3.440714     3.264401 -1.066922 -1.1410085
9  treatment1     5     4.672056     5.082310  1.182761 -0.8311095
10 treatment2     5     3.627300     6.175631  0.138005  0.2622119
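
The subsetting approach assumes that every round contains exactly one baseline row. If that might not hold, one possible variant is to join the baseline values back on by round, so that a round with no baseline yields NA instead of an error. The sketch below is an illustration under that assumption, not part of the original answer; the names baseline, base1, base2 and res are hypothetical.

```r
library(dplyr)

# Same data as in the question
set.seed(1)
df <- data.frame(treatment = rep(c('baseline', 'treatment 1', 'treatment 2'),
                                 times = 5),
                 round = rep(1:5, each = 3),
                 measurement1 = rep(1:5, each = 3) + rnorm(15),
                 measurement2 = rep(1:5, each = 3) + rnorm(15))

# One row per round holding the baseline measurements
baseline <- df %>%
  filter(treatment == "baseline") %>%
  select(round, base1 = measurement1, base2 = measurement2)

res <- df %>%
  filter(treatment != "baseline") %>%
  # attach the matching baseline values; rounds without a baseline get NA
  left_join(baseline, by = "round") %>%
  mutate(diff1 = measurement1 - base1,
         diff2 = measurement2 - base2) %>%
  select(-base1, -base2)

res
```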


Answer 2:

We can use data.table:

library(data.table)
# after ordering, the baseline row comes first within each round
setDT(df)[order(round, treatment), tail(.SD, 2) - head(.SD, 1)[rep(1, 2)],
          round, .SDcols = 3:4]

Or another option with data.table is:

setDT(df)[, lapply(.SD[, grep("^measurement", names(.SD)), with = FALSE],
                   function(x) x[treatment != "baseline"] -
                     x[treatment == "baseline"]), round]

Or using dplyr/tidyr:

library(dplyr)
library(tidyr)
gather(df, var, val, measurement1:measurement2) %>%
  spread(treatment, val) %>%
  mutate(diff1 = `treatment 1` - baseline,
         diff2 = `treatment 2` - baseline)
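
In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The reshaping step could be sketched roughly as follows; this assumes tidyr 1.0.0 or later, reuses the data from the question, and is an illustration rather than the answerer's own code. The name res is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
set.seed(1)
df <- data.frame(treatment = rep(c('baseline', 'treatment 1', 'treatment 2'),
                                 times = 5),
                 round = rep(1:5, each = 3),
                 measurement1 = rep(1:5, each = 3) + rnorm(15),
                 measurement2 = rep(1:5, each = 3) + rnorm(15))

res <- df %>%
  # long format: one row per round, treatment and measurement variable
  pivot_longer(starts_with("measurement"),
               names_to = "var", values_to = "val") %>%
  # wide format: one column per treatment, one row per round and variable
  pivot_wider(names_from = treatment, values_from = val) %>%
  mutate(diff1 = `treatment 1` - baseline,
         diff2 = `treatment 2` - baseline)

res
```

Here each row is one round and one measurement variable, so diff1 and diff2 are the differences of treatment 1 and treatment 2 from the baseline for that variable.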


Tags: r dplyr