data.table and stratified means

I've got some code that generates stratified weighted means, and I'm certain it worked a few months ago, but I'm not sure what the current problem is. (I apologize; this must be very basic stuff.)

dp=
structure(list(seqn = c(1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L, 10L, 
11L, 12L, 13L, 3L, 4L, 9L, 10L, 11L, 14L, 8L, 11L, 12L, 10L, 
5L, 13L, 2L, 14L, 3L, 9L, 6L, 7L), sex = c(2L, 1L, 2L, 2L, 1L, 
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), bmi = c(22.8935608711259, 
27.0944623781918, 40.4637162938634, 23.7649712675423, 15.3193372705538, 
31.1280302540991, 21.4866354393239, 20.3200254374398, 32.331092513536, 
25.3679771839413, 33.9400508162971, 14.7048592172926, 25.5243757788688, 
23.4331882363495, 27.6428134168995, 29.3923629426172, 24.9547209666314, 
17.0522203606383, 15.51, 22, 30.62, 30.94, 29.1, 25.57, 24.9, 
27.33, 17.63, 18.48, 22.56, 29.39), tc = c(273L, 181L, 150L, 
201L, 142L, 165L, 235L, 219L, 298L, 222L, 143L, 134L, 268L, 160L, 
236L, 225L, 260L, 140L, 162L, 132L, 156L, 140L, 279L, 314L, 215L, 
174L, 129L, 148L, 153L, 245L), swt = c(1645, 3318, 2280, 1574, 
4062, 1627, 14604, 24675, 975, 975, 2697, 1559, 1737.58, 1730.23, 
19521.36, 28080.57, 1248.43, 13745.77, 5251.76464426326, 6497.194885522, 
15915.7023420765, 3740.96809540218, 16574.177622509, 307.32513798849, 
4720.89748295751, 3247.78896499604, 7698.70949077031, 1262.6450411464, 
6609.43340735515, 4254.23723479882)), .Names = c("seqn", "sex", 
"bmi", "tc", "swt"), row.names = c(20560L, 20561L, 20562L, 20563L, 
20565L, 20566L, 20567L, 20568L, 20569L, 20570L, 20571L, 20572L, 
61335L, 61336L, 61338L, 61339L, 61340L, 61341L, 95465L, 96890L, 
104613L, 105988L, 107581L, 112267L, 113403L, 114292L, 119979L, 
120271L, 125939L, 135699L), class = "data.frame")

library(data.table)
dt=data.table(dp, key='sex')

sapply(dp,function(x)weighted.mean(x,dp$swt))  # this works for the overall weighted mean
dt[,lapply(.SD, mean, na.rm=TRUE), .SDcols=c('bmi','tc','swt')]
     # this also works for the overall unweighted mean

dt[,lapply(.SD, function(x)weighted.mean(x,swt, na.rm=TRUE)), by=key(dt), .SDcols=c('bmi','tc','swt')]

but this gives the error: Error in weighted.mean.default(x, swt, na.rm = TRUE) : object 'swt' not found

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.6

loaded via a namespace (and not attached):
[1] tools_2.15.2

Tags: r data.table

2 Answers

迷人小祖宗

Reply #2 · 2020-07-14 12:51

UPDATE (from Arun): This is now fixed in v1.8.11. From NEWS:

o DT[, lapply(.SD, function(), by=] did not see columns of DT when optimisation is "on". This is now fixed, #2381. Tests added and tested successfully. Thanks to David F for reporting on SO: data.table and stratified means

This is indeed a bug introduced somewhere between 1.8.2 and 1.8.6.

dt[,lapply(.SD, function(x) weighted.mean(x,swt, na.rm=TRUE)), by=key(dt),
    .SDcols=c('bmi','tc','swt')] 
Error in weighted.mean.default(x, swt, na.rm = TRUE) : 
    object 'swt' not found

To work around this in the meantime, either turn off optimization:

options(datatable.optimize=FALSE)
dt[,lapply(.SD, function(x)weighted.mean(x,swt, na.rm=TRUE)), by=key(dt),    
    .SDcols=c('bmi','tc','swt')]
   sex      bmi       tc      swt
1:   1 25.64376 206.0115 17171.20
2:   2 23.73566 193.8727 11467.47

or don't wrap the call in function():

options(datatable.optimize=TRUE)
dt[,lapply(.SD, weighted.mean, swt, na.rm=TRUE), by=key(dt),    
    .SDcols=c('bmi','tc','swt')] 
   sex      bmi       tc      swt
1:   1 25.64376 206.0115 17171.20
2:   2 23.73566 193.8727 11467.47

We are making more use of optimization now, but this case slipped through the test suite: tests 825.1, 825.2 and 825.3 didn't cover an anonymous function() whose argument is another column. The bug is a real problem when no suitable named function exists. In this case the function() wrapper can simply be omitted, since weighted.mean already exists and can be applied as-is.

You can see how optimization modifies j by setting verbose=TRUE (either per query or with the global option). In this case nothing would have been revealed as wrong by that verbose output, but just mentioning it as an aside.
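As a rough sketch of that aside, here is how verbose output can be requested per query or globally. The toy table and its values are invented for illustration and are not the asker's data:

```r
library(data.table)

# Toy data, made up for illustration only
toy <- data.table(sex = c(1L, 1L, 2L, 2L),
                  bmi = c(20, 30, 22, 26),
                  swt = c(1, 3, 2, 2))

# Per query: verbose=TRUE prints messages about how j is handled
res <- toy[, lapply(.SD, weighted.mean, swt), by = sex,
           .SDcols = "bmi", verbose = TRUE]

# Globally: the same kind of messages for every query until switched off
options(datatable.verbose = TRUE)
toy[, lapply(.SD, mean), by = sex, .SDcols = "bmi"]
options(datatable.verbose = FALSE)
```

The exact wording of the verbose messages differs between data.table versions, so none is reproduced here.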

Now filed as #2381: Optimization of lapply(.SD, function() ...) no longer sees columns inside .... Will fix and add tests so this can't regress again.

Thanks!


家丑人穷心不美

Reply #3 · 2020-07-14 12:59

I suggest keeping it simple:

dt[,list(bmi_m=weighted.mean(bmi,swt,na.rm=TRUE),
         tc_m=weighted.mean(tc,swt,na.rm=TRUE),
         swt_m=weighted.mean(swt,swt,na.rm=TRUE)),by=key(dt)]

I think this is also reasonably fast.
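A minimal sketch, on a made-up toy table (the values are illustrative, not the asker's data), suggesting that the explicit list() form agrees with the lapply(.SD, ...) form that omits the function() wrapper:

```r
library(data.table)

# Toy data, invented for illustration only
toy <- data.table(sex = c(1L, 1L, 2L, 2L),
                  bmi = c(20, 30, 22, 26),
                  tc  = c(200, 100, 150, 250),
                  swt = c(1, 3, 2, 2),
                  key = "sex")

# Explicit form: one weighted.mean() call per column
explicit <- toy[, list(bmi = weighted.mean(bmi, swt),
                       tc  = weighted.mean(tc, swt)), by = key(toy)]

# Generic form: apply weighted.mean to each column in .SDcols
generic <- toy[, lapply(.SD, weighted.mean, swt), by = key(toy),
               .SDcols = c("bmi", "tc")]

# Compare the two results column by column
all.equal(explicit$bmi, generic$bmi)
all.equal(explicit$tc, generic$tc)
```

The explicit form needs more typing as the number of columns grows, but it never relies on how lapply(.SD, ...) is optimized.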
