Why does median trip up data.table (integer versus

Posted 2019-01-17 06:55

Question:

I have a data.table called enc.per.day (encounters per day). It has 2403 rows, each giving a date of service (DOS) and the number of patients seen on that day (n). I wanted to see the median number of patients seen on each day of the week.

enc.per.day[,list(patient.encounters=median(n)),by=list(weekdays(DOS))]

That line gives an error

Error in `[.data.table`(enc.per.day, , list(patient.encounters = median(n)), : columns of j don't evaluate to consistent types for each group: result for group 4 has column 1 type 'integer' but expecting type 'double'

The following all work well

tapply(enc.per.day$n,weekdays(enc.per.day$DOS),median)
enc.per.day[,list(patient.encounters=round(median(n))),by=list(weekdays(DOS))]
enc.per.day[,list(patient.encounters=median(n)+0),by=list(weekdays(DOS))]

What is going on? It took me a long time to figure out why my code would not work.

By the way, the underlying vector enc.per.day$n is an integer vector:

storage.mode(enc.per.day$n)

returns "integer". Further, there are no NAs anywhere in the data.table.

Answer 1:

TL;DR: wrap median() in as.double().

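median() 'trips up' data.table because, even when only passed integer vectors, median() sometimes returns an integer value and sometimes returns a double. For an odd-length vector it returns the middle element unchanged (so an integer stays an integer); for an even-length vector it returns the mean of the two middle elements, which is always a double.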

## median of 1:3 is 2, of type "integer" 
typeof(median(1:3))
# [1] "integer"

## median of 1:2 is 1.5, of type "double"
typeof(median(1:2))
# [1] "double"
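To make the point concrete, here is a small sketch showing that the type depends on whether the length is odd or even, not on whether the median happens to be a whole number. The variable names are made up for illustration, and the behaviour shown is that of base R's default median() method on integer input.

```r
# For integer input, the type of median() depends on the parity of the length.
odd_len  <- c(1L, 2L, 3L)  # odd length: the middle element is returned as-is
even_len <- c(1L, 3L)      # even length: mean of the two middle elements

types <- c(odd  = typeof(median(odd_len)),
           even = typeof(median(even_len)))
print(types)

# The even-length median here is the whole number 2, yet it is stored as a double.
print(median(even_len))
```

So a group with an even number of rows yields a double even when its median looks like an integer.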

Reproducing your error message with a minimal example:

library(data.table)
dt <- data.table(patients = c(1:3, 1:2), 
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

dt[,median(patients), by=weekdays]
# Error in `[.data.table`(dt, , median(patients), by = weekdays) : 
#   columns of j don't evaluate to consistent types for each group: 
#   result for group 2 has column 1 type 'double' but expecting type 'integer'

data.table complains because, after inspecting the value of the first group to be processed, it has concluded that, OK, these results are going to be of type "integer". But then right away (or in your case in group 4), it gets passed a value of type "double", which won't fit in its "integer" results vector.


data.table could instead accumulate results until the end of the group-wise calculations, and then perform type conversions if necessary, but that would require a bunch of additional performance-degrading overhead; instead, it just reports what happened and lets you fix the problem. After the first group has run, and it knows the type of the result, it allocates a result vector of that type as long as the number of groups, and then populates it. If it later finds that some groups return more than 1 item, it will grow (i.e., reallocate) that result vector as needed. In most cases though, data.table's first guess for the final size of the result is right the first time (e.g., a 1-row result per group), and hence fast.
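The strategy described above can be imitated in a few lines of base R. This is only a toy sketch of the idea (the result type is fixed by the first group, then later groups are checked against it), not data.table's actual implementation; grouped_apply_strict is a hypothetical helper invented for this illustration.

```r
# Toy illustration only: fix the result type from the first group, then
# refuse any later group whose result has a different type.
# `grouped_apply_strict` is a hypothetical helper, not part of data.table.
grouped_apply_strict <- function(x, g, f) {
  # Split into groups, keeping groups in order of first appearance.
  groups <- split(x, factor(g, levels = unique(g)))
  first  <- f(groups[[1]])
  # Preallocate one slot per group, typed after the first group's result.
  out    <- vector(typeof(first), length(groups))
  out[1] <- first
  for (i in seq_along(groups)[-1]) {
    val <- f(groups[[i]])
    if (typeof(val) != typeof(out)) {
      stop(sprintf("result for group %d has type '%s' but expecting type '%s'",
                   i, typeof(val), typeof(out)))
    }
    out[i] <- val
  }
  setNames(out, names(groups))
}

patients <- c(1:3, 1:2)
day      <- c("Mon", "Mon", "Mon", "Tue", "Tue")

# Plain median(): group 1 gives an integer, group 2 a double, so this fails.
strict <- tryCatch(grouped_apply_strict(patients, day, median),
                   error = function(e) conditionMessage(e))
print(strict)

# Forcing a double in every group keeps the types consistent.
fixed <- grouped_apply_strict(patients, day, function(v) as.double(median(v)))
print(fixed)
```

The trade-off is the same one described above: preallocating by the first group's type avoids holding every result until the end, at the cost of requiring each group to return the same type.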

In this case, using as.double(median(X)) instead of median(X) provides a suitable fix.
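Applied to the minimal example from this answer, the fix might look like the sketch below. It assumes the data.table package is installed; the column name patient.encounters is borrowed from the question.

```r
library(data.table)

dt <- data.table(patients = c(1:3, 1:2),
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

# as.double() makes every group's result a double, so the types are consistent.
res <- dt[, list(patient.encounters = as.double(median(patients))), by = weekdays]
print(res)
```

The same wrapper should carry over to the original call, i.e. as.double(median(n)) grouped by weekdays(DOS).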

(By the way, your version using round() worked because it always returns values of type "double", as you can see by typing typeof(round(median(1:2))); typeof(round(median(1:3))).)