Comparing Groups in data.table Columns

2019-08-10 03:15发布

问题:

I have a dataset that I need to both split by one variable (Day) and then compare between groups of another variable (Group), performing per-group statistics (e.g. mean) and also tests.

Here's an example of what I devised:

require(data.table)
data = data.table(Day = rep(1:10, each = 10), 
                  Group = rep(1:2, times = 50),
                  V = rnorm(100))

data[, .(g1_mean = mean(.SD[Group == 1]$V),
         g2_mean = mean(.SD[Group == 2]$V),
         p.value = t.test(V ~ Group, .SD, alternative = "two.sided")$p.value),
     by = list(Day)]

Which produces:

    Day      g1_mean     g2_mean   p.value
 1:   1  0.883406048  0.67177271 0.6674138
 2:   2  0.007544956 -0.55609722 0.3948459
 3:   3  0.409248637  0.28717183 0.8753213
 4:   4 -0.540075365  0.23181458 0.1785854
 5:   5 -0.632543900 -1.09965990 0.6457325
 6:   6 -0.083221671 -0.96286343 0.2011136
 7:   7 -0.044674252 -0.27666473 0.7079499
 8:   8  0.260795244 -0.15159164 0.4663712
 9:   9 -0.134164758  0.01136245 0.7992453
10:  10  0.496144329  0.76168408 0.1821123

I'm hoping that there's a less roundabout manner of arriving at this result.

回答1:

A possible compact alternative which can also apply more functions to each group:

DTnew <- dcast(DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day],
               Day + pval ~ paste0("g",Group), fun = list(mean,sd), value.var = "V")

which gives:

> DTnew
    Day      pval   V_mean_g1    V_mean_g2   V_sd_g1   V_sd_g2
 1:   1 0.4763594 -0.11630634  0.178240714 0.7462975 0.4516087
 2:   2 0.5715001 -0.29689807  0.082970631 1.3614177 0.2745783
 3:   3 0.2295251 -0.48792449 -0.031328749 0.3723247 0.6703694
 4:   4 0.5565573  0.33982242  0.080169698 0.5635136 0.7560959
 5:   5 0.5498684 -0.07554433  0.308661427 0.9343230 1.0100788
 6:   6 0.4814518  0.57694034  0.885968245 0.6457926 0.6773873
 7:   7 0.8053066  0.29845913  0.116217727 0.9541060 1.2782210
 8:   8 0.3549573  0.14827289 -0.319017581 0.5328734 0.9036501
 9:   9 0.7290625 -0.21589411 -0.005785092 0.9639758 0.8859461
10:  10 0.9899833  0.84034529  0.850429982 0.6645952 1.5809149

A decomposition of the code:

  • First, a pval variable is added to the dataset with DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day]
  • Because DT is updated in place and by reference by the previous step, the dcast function can be applied to that directly.
  • In the casting formula, you specify the variables that need to stay in the current form on the RHS and the variable that needs to be spread over columns on the LHS.
  • With the fun argument you can specify which aggregation function has to be used on the value.var (here V). If multiple aggregation functions are needed, you can specify them in a list (e.g. list(mean,sd)). This can be any type of function. So, also cumstom made functions can be used.

If you want to remove the V_ from the column names, you can do:

names(DTnew) <- gsub("V_","",names(DTnew))

NOTE: I renamed the data.table to DT as it is often not wise to name your dataset after a function (check ?data)



回答2:

While not a one-liner, you might consider doing your two processes separate and then merging the results. This prevents you from having to hardcode the group-names.

First, we calculate the means:

my_means <- dcast(data[,mean(V), by = .(Day, Group)],
                  Day~ paste0("Mean_Group", Group),value.var="V1")

Or in the less-convoluted way @Akrun mentioned in the comments, with some added formatting.

my_means <- dcast(Day~paste0("Mean_Group", Group), data=data,
              fun.agg=mean, value.var="V")

Then the t-tests:

t_tests <- data[,.(p_value=t.test(V~Group)$p.value), by = Day]

And then merge:

output <- merge(my_means, t_tests)