data.table: lapply a function with multicolumn out

2019-05-16 09:12发布

问题:

I'm using a function smean.cl.normal from Hmisc package that returns a vector with 3 values: the mean and the lower and upper CI. When I use it on a data.table with 2 groups, I obtain 2 columns and 6 rows. Is there a way to obtain the result with two rows corresponding to 2 groups and separate columns for each of function's outputs, i.e. the mean and CIs?

require(Hmisc)
require(data.table)

dt = data.table(x = rnorm(100),
                gr = rep(c('A', 'B'), each = 50))

dt[, lapply(.SD, smean.cl.normal), by = gr, .SDcols = "x"]

The output:

   gr           x
1:  A -0.07916335
2:  A -0.33656667
3:  A  0.17823998
4:  B -0.02745333
5:  B -0.32950607
6:  B  0.27459941

The desired output:

   gr        Mean         Lower         Upper
1:  A -0.07916335   -0.33656667    0.17823998
2:  B -0.02745333   -0.32950607    0.27459941

回答1:

The j argument in DT[i,j,by] expects a list, so use as.list:

dt[, 
  Reduce(c, lapply(.SD, function(x) as.list(smean.cl.normal(x))))
, by = gr, .SDcols = "x"]

#    gr       Mean      Lower     Upper
# 1:  A  0.1032966 -0.1899466 0.3965398
# 2:  B -0.1437617 -0.4261330 0.1386096

c(L1, L2, L3) is how lists are combined, so Reduce(c, List_o_Lists) does the trick in case your .SDcols contains more than just x. I guess do.call(c, List_o_Lists) should also work.


Comments

This is quite inefficient for a couple of reasons. Turn on verbose=TRUE to see that data.table doesn't like getting named lists in j:

The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

Also, you are missing out on group-optimized versions of mean and other functions that can probably be used to build your result. This may not be a big deal for your use-case, though.


When you're applying this to only a single value column, just:

dt[, as.list(smean.cl.normal(x)), by = gr]

suffices.