Fast melted data.table operations

I am looking for patterns for manipulating data.table objects whose structure resembles that of dataframes created with melt from the reshape2 package. I am dealing with data tables with millions of rows. Performance is critical.

The generalized form of the question is whether there is a way to perform grouping based on a subset of values in a column and have the result of the grouping operation create one or more new columns.

A specific form of the question could be how to use data.table to accomplish the equivalent of what dcast does in the following:

library(data.table)
library(reshape2)
library(plyr)  # provides the . function used in subset= below
input <- data.table(
  id=c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable=c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
dcast(input,
  id ~ variable, sum,
  subset=.(variable %in% c('x', 'y')))

the output of which is

  id  x  y
1  1  1  5
2  2  4 11
3  3 15  9

Tags: r dataframe data.table aggregation reshape2

3 Answers

淡お忘

#2 · 2019-02-13 05:36

I'm not sure if this is the best way, but you can try:

input[, list(x = sum(value[variable == "x"]), 
             y = sum(value[variable == "y"])), by = "id"]
#    id  x  y
# 1:  1  1  5
# 2:  2  4 11
# 3:  3 15  9
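
If there are more than a couple of values to pull out, writing one sum() per column by hand gets tedious. A minimal sketch of the same idea driven by a character vector follows; the names `vars` and `wide` are made up for illustration and are not part of the original answer. I have not benchmarked it on millions of rows.

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# Hypothetical helper vector: the values of `variable` that become columns
vars <- c("x", "y")

# For each id, build one summed column per entry in vars;
# setNames() makes the list names (and so the column names) x and y
wide <- input[, lapply(setNames(vars, vars),
                       function(v) sum(value[variable == v])),
              by = id]
print(wide)
```

Each group still scans `variable` once per requested value, so this trades some speed for not having to type the columns out.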


我欲成王，谁敢阻挡

#3 · 2019-02-13 05:39

> setkey(input, "id")
> input[ , list(sum(value)), by=id]
   id V1
1:  1  6
2:  2 15
3:  3 34

> input[ variable %in% c("x", "y"), list(sum(value)), by=id]
   id V1
1:  1  6
2:  2 15
3:  3 24

The last one, grouped by both id and variable, gives the sums you want in long format:

> input[ variable %in% c("x", "y"), list(sum(value)), by=list(id, variable)]
   id variable V1
1:  1        x  1
2:  1        y  5
3:  2        x  4
4:  2        y 11
5:  3        x 15
6:  3        y  9
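
To get from that long result to the wide layout in the question, one option is data.table's own dcast method. This is a sketch assuming a reasonably recent data.table (1.9.6 or later, where `dcast` dispatches on data.table objects); the column name `total` and the object names are illustrative only.

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# Step 1: aggregate in long format, keeping only the values of interest
long <- input[variable %in% c("x", "y"),
              list(total = sum(value)),
              by = list(id, variable)]

# Step 2: reshape the (small) aggregated table to wide
wide <- dcast(long, id ~ variable, value.var = "total")

# Or aggregate and reshape in one call
wide2 <- dcast(input[variable %in% c("x", "y")],
               id ~ variable, value.var = "value", fun.aggregate = sum)
print(wide)
```

The two-step form only reshapes the aggregated rows, which should stay cheap even when `input` is large.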


放我归山

#4 · 2019-02-13 06:01

Quick untested answer: seems like you're looking for by-without-by, a.k.a. grouping-by-i :

setkey(input,variable)
input[c("x","y"),sum(value)]

This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above is the same result but much faster than :

input[,sum(value),keyby=variable][c("x","y")]

The latter evaluates j for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.
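
A note for anyone running this on a current data.table: since v1.9.4 the implicit by-without-by has to be requested explicitly with `by = .EACHI`. Without it, j is evaluated once over all matched rows. A small sketch of the difference (object names are illustrative):

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

setkey(input, variable)

# One result row per row of i: the old by-without-by behaviour
per_group <- input[c("x", "y"), sum(value), by = .EACHI]

# Without by = .EACHI: a single sum over every row matching "x" or "y"
overall <- input[c("x", "y"), sum(value)]

print(per_group)
print(overall)
```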

The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.

The first setkey(input,variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting the columns needed :

DT = setkey(input[,c("variable","value"),with=FALSE], variable)
DT[c("x","y"),sum(value)]

In future when secondary keys are implemented that would be easier :

set2key(input,variable)              # add a secondary key 
input[c("x","y"),sum(value),key=2]   # syntax speculative
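
The speculative syntax above never shipped in that form. As far as I know, current data.table releases cover the same ground with `setindex()` and the `on=` argument, which join on a column without reordering the table. A hedged sketch, assuming data.table 1.9.6 or later (the name `total` is illustrative):

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# Optional: build a secondary index; rows keep their original order
setindex(input, variable)

# Join on `variable` via on=, aggregating once per row of i
res <- input[.(variable = c("x", "y")),
             .(total = sum(value)),
             on = "variable", by = .EACHI]
print(res)
```

Because `on=` names the join column directly, no `setkey()` call is needed and the other columns of `input` are never physically re-sorted.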

To group by id as well :

setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']

and including id in the key might be worth setkey's cost depending on your data :

setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']

If you combine a by-without-by with by, as above, then i operates just like a subset; j is only run for each row of i when by is missing (hence the name by-without-by). So you need to include variable again in the by, as shown above.

Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, iiuc) :

input[c("x","y"),sum(value),by=id]
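
To make the difference between the two groupings concrete, here is a small side-by-side sketch on the question's data. The object names `by_both` and `by_id` are mine, for illustration only.

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

setkey(input, variable)

# i acts as a subset; one row per (variable, id) pair
by_both <- input[c("x", "y"), sum(value), by = list(variable, id)]

# i acts as a subset; "x" and "y" rows are pooled within each id
by_id <- input[c("x", "y"), sum(value), by = id]

print(by_both)
print(by_id)
```

The first form has six rows and is the one that reshapes into the x and y columns from the question; the second has three rows and drops the "other" value from id 3 only.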

