I have a dataset that I need to both split by one variable (Day
) and then compare between groups of another variable (Group
), performing per-group statistics (e.g. mean
) and also tests.
Here's an example of what I devised:
require(data.table)
data = data.table(Day = rep(1:10, each = 10),
Group = rep(1:2, times = 50),
V = rnorm(100))
data[, .(g1_mean = mean(.SD[Group == 1]$V),
g2_mean = mean(.SD[Group == 2]$V),
p.value = t.test(V ~ Group, .SD, alternative = "two.sided")$p.value),
by = list(Day)]
Which produces:
Day g1_mean g2_mean p.value
1: 1 0.883406048 0.67177271 0.6674138
2: 2 0.007544956 -0.55609722 0.3948459
3: 3 0.409248637 0.28717183 0.8753213
4: 4 -0.540075365 0.23181458 0.1785854
5: 5 -0.632543900 -1.09965990 0.6457325
6: 6 -0.083221671 -0.96286343 0.2011136
7: 7 -0.044674252 -0.27666473 0.7079499
8: 8 0.260795244 -0.15159164 0.4663712
9: 9 -0.134164758 0.01136245 0.7992453
10: 10 0.496144329 0.76168408 0.1821123
I'm hoping that there's a less roundabout manner of arriving at this result.
A possible compact alternative which can also apply more functions to each group:
DTnew <- dcast(DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day],
Day + pval ~ paste0("g",Group), fun = list(mean,sd), value.var = "V")
which gives:
> DTnew
Day pval V_mean_g1 V_mean_g2 V_sd_g1 V_sd_g2
1: 1 0.4763594 -0.11630634 0.178240714 0.7462975 0.4516087
2: 2 0.5715001 -0.29689807 0.082970631 1.3614177 0.2745783
3: 3 0.2295251 -0.48792449 -0.031328749 0.3723247 0.6703694
4: 4 0.5565573 0.33982242 0.080169698 0.5635136 0.7560959
5: 5 0.5498684 -0.07554433 0.308661427 0.9343230 1.0100788
6: 6 0.4814518 0.57694034 0.885968245 0.6457926 0.6773873
7: 7 0.8053066 0.29845913 0.116217727 0.9541060 1.2782210
8: 8 0.3549573 0.14827289 -0.319017581 0.5328734 0.9036501
9: 9 0.7290625 -0.21589411 -0.005785092 0.9639758 0.8859461
10: 10 0.9899833 0.84034529 0.850429982 0.6645952 1.5809149
A decomposition of the code:
- First, a
pval
variable is added to the dataset with DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day]
- Because
DT
is updated in place and by reference by the previous step, the dcast
function can be applied to that directly.
- In the casting formula, you specify the variables that need to stay in the current form on the RHS and the variable that needs to be spread over columns on the LHS.
- With the
fun
argument you can specify which aggregation function has to be used on the value.var
(here V
). If multiple aggregation functions are needed, you can specify them in a list (e.g. list(mean,sd)
). This can be any type of function. So, also cumstom made functions can be used.
If you want to remove the V_
from the column names, you can do:
names(DTnew) <- gsub("V_","",names(DTnew))
NOTE: I renamed the data.table
to DT
as it is often not wise to name your dataset after a function (check ?data
)
While not a one-liner, you might consider doing your two processes separate and then merging the results. This prevents you from having to hardcode the group-names.
First, we calculate the means:
my_means <- dcast(data[,mean(V), by = .(Day, Group)],
Day~ paste0("Mean_Group", Group),value.var="V1")
Or in the less-convoluted way @Akrun mentioned in the comments, with some added formatting.
my_means <- dcast(Day~paste0("Mean_Group", Group), data=data,
fun.agg=mean, value.var="V")
Then the t-tests:
t_tests <- data[,.(p_value=t.test(V~Group)$p.value), by = Day]
And then merge:
output <- merge(my_means, t_tests)