I recently upgraded data.table from 1.8.10 to 1.9.2, and I found the following difference between the two versions when grouping across large integers.
Is there a setting I need to change in 1.9.2 so that the first of the two grouping statements below works as it did in 1.8.10 (I presume 1.8.10 shows the correct behavior)?
Also, the two versions give the same result for the second grouping statement below, but is that behavior expected?
1.8.10
> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> foo = data.table(i = c(2884199399609098249, 2884199399608934409))
> lapply(foo, class)
$i
[1] "numeric"
> foo
i
1: 2884199399609098240
2: 2884199399608934400
> foo[, .N, by=i]
i N
1: 2884199399609098240 1
2: 2884199399608934400 1
> foo = data.table(i = c(9999999999999999999, 9999999999999999998))
> foo[, .N, by=i]
i N
1: 10000000000000000000 2
>
And 1.9.2
> library(data.table)
data.table 1.9.2 For help type: help("data.table")
> foo = data.table(i = c(2884199399609098249, 2884199399608934409))
> lapply(foo, class)
$i
[1] "numeric"
> foo
i
1: 2884199399609098240
2: 2884199399608934400
> foo[, .N, by=i]
i N
1: 2884199399609098240 2
> foo = data.table(i = c(9999999999999999999, 9999999999999999998))
> foo[, .N, by=i]
i N
1: 10000000000000000000 2
>
The numbers used for the first test (which shows the difference between the data.table versions) come from my actual dataset; they are the ones that caused a few of my regression tests to fail after upgrading data.table.
For the second test, where I increase the numbers by another order of magnitude, I'm curious whether it is expected in both versions of the data.table package to ignore small differences in the last significant digit.
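For reference, those last two values are already indistinguishable once R parses them as doubles; a quick sanity check (what I'd expect on any standard double-precision build of R):
# a double has only 53 bits of significand (~15-16 decimal digits),
# so both literals round to the same value, 1e19
9999999999999999999 == 9999999999999999998   # TRUE
.Machine$double.digits                        # 53
sprintf("%.0f", 9999999999999999999)          # "10000000000000000000"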
I'm assuming this all has to do with floating-point representation. Maybe the correct way for me to handle this is to represent these large integers as either integer64 or character? I'm hesitant to use integer64 because I'm not sure data.table and the R environment fully support it; for example, I've had to add this in previous data.table code:
options(datatable.integer64="character") # Until integer64 setkey is implemented
Maybe that has since been implemented, but regardless, changing that setting does not change the results of these tests, at least in my environment. I suppose that makes sense, given that these values are stored as numeric in the foo data table.
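As a sketch of the character route, which keeps the ids exact and is independent of any rounding settings (the numeric literals above have already lost precision by the time data.table sees them):
library(data.table)
# storing the ids as character preserves every digit
foo_chr = data.table(i = c("2884199399609098249", "2884199399608934409"))
foo_chr[, .N, by = i]   # two groups, as in 1.8.10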
Yes, the result in v1.8.10 was the correct behaviour. We improved the method of rounding in v1.9.2. That's best explained here:
Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2
That meant we went backwards on supporting integers > 2^31 stored in type numeric: the change in v1.9.2 was to round off the last 2 bytes of the significand when grouping and joining numeric columns, instead of the sqrt(.Machine$double.eps) tolerance used before. That's now addressed in v1.9.3 (available from R-Forge); see NEWS. So you can either call setNumericRounding(0) to switch off rounding globally for all numeric columns, or, better, use the more appropriate type for the column: bit64::integer64, now that it's supported.
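A minimal sketch of both options against your first example (assuming v1.9.3 or later; a hypothetical session, not verbatim output):
library(data.table)   # v1.9.3+ from R-Forge
library(bit64)
# Option 1: switch off rounding of numeric columns globally
setNumericRounding(0)
foo = data.table(i = c(2884199399609098249, 2884199399608934409))
foo[, .N, by = i]     # two groups again, as in v1.8.10
# Option 2 (preferred): store the ids exactly as integer64
foo64 = data.table(i = as.integer64(c("2884199399609098249", "2884199399608934409")))
foo64[, .N, by = i]   # two groups, independent of any rounding setting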
The example in ?setNumericRounding demonstrates the effect of rounding on numeric joins.
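Roughly along these lines (a sketch from memory, not the verbatim help page; the shipped example may use different values):
library(data.table)
DT = data.table(a = seq(0, 1, by = 0.2), b = 1:2, key = "a")
setNumericRounding(0)
DT[.(0.6)]   # may not match: seq()'s 0.6 and the literal 0.6 can differ in the last bits
setNumericRounding(2)   # round off the last 2 bytes of the significand
DT[.(0.6)]   # matches: the tiny representation difference is rounded away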