Large integers in data.table. Grouping results different in 1.9.2 compared to 1.8.10

Posted 2020-01-29 01:30

I recently upgraded data.table from 1.8.10 to 1.9.2, and I found the following difference between the two versions when grouping across large integers.

Is there a setting that I need to change in 1.9.2 to have the first of the following two group statements work as it did in 1.8.10 (and I presume 1.8.10 is the correct behavior)?

Also, the results for the second of the two group statements are the same in both versions, but is that behavior expected?

1.8.10

>   library(data.table)
data.table 1.8.10  For help type: help("data.table")
>   foo = data.table(i = c(2884199399609098249, 2884199399608934409))
>   lapply(foo, class)
$i
[1] "numeric"

>   foo
                     i
1: 2884199399609098240
2: 2884199399608934400
>   foo[, .N, by=i]
                     i N
1: 2884199399609098240 1
2: 2884199399608934400 1
>   foo = data.table(i = c(9999999999999999999, 9999999999999999998))
>   foo[, .N, by=i]
                      i N
1: 10000000000000000000 2
> 

And 1.9.2

>   library(data.table)
data.table 1.9.2  For help type: help("data.table")
>   foo = data.table(i = c(2884199399609098249, 2884199399608934409))
>   lapply(foo, class)
$i
[1] "numeric"

>   foo
                     i
1: 2884199399609098240
2: 2884199399608934400
>   foo[, .N, by=i]
                     i N
1: 2884199399609098240 2
>   foo = data.table(i = c(9999999999999999999, 9999999999999999998))
>   foo[, .N, by=i]
                      i N
1: 10000000000000000000 2
> 

The numbers used for the first test (the one that differs between the data.table versions) come from my actual dataset; they are the values that caused a few of my regression tests to fail after upgrading data.table.

For the second test, where I increase the numbers by another order of magnitude, I'm curious whether it is expected in both versions of data.table that minor differences in the last significant digit are ignored.

I'm assuming this all has to do with floating-point representation. Maybe the correct way for me to handle this is to represent these large integers as either integer64 or character? I'm hesitant to use integer64 because I'm not sure data.table and the wider R environment fully support it; e.g., I've had to add this in previous data.table code:

options(datatable.integer64="character") # Until integer64 setkey is implemented

Maybe that has since been implemented, but regardless, changing that setting does not change the results of these tests, at least in my environment. I suppose that makes sense, given that the values are stored as numeric in the foo data.table.
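
For comparison, here is a minimal sketch of the character route, which I'd expect to behave the same in both versions since the values never pass through a double:

library(data.table)
# character ids stay exact, so grouping distinguishes them regardless of version
foo = data.table(i = c("2884199399609098249", "2884199399608934409"))
foo[, .N, by=i]   # 2 rows, one per distinct id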

Tags: r data.table

1 Answer

chillily · answered 2020-01-29 02:08

Yes, the result in v1.8.10 was the correct behaviour. We improved the method of rounding in v1.9.2; that's best explained here:

Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2

That meant we went backwards on supporting integers > 2^31 stored in type numeric. That's now addressed in v1.9.3 (available from R-Forge); see NEWS:

o bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs and Clayton Stanley.
Reminder: fread() has been able to detect and read integer64 for a while.

o New function setNumericRounding() may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type numeric, #5369. See example in ?setNumericRounding and NEWS item from v1.9.2. getNumericRounding() returns the current setting.

So you can either call setNumericRounding(0) to switch off rounding globally for all numeric columns, or better, use the more appropriate type for the column: bit64::integer64 now that it's supported.
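
For example, applied to the data from the question (a minimal sketch, assuming v1.9.3+ where setNumericRounding() is available):

library(data.table)
foo = data.table(i = c(2884199399609098249, 2884199399608934409))
foo[, .N, by=i]         # default 2-byte rounding: 1 group, as in v1.9.2
setNumericRounding(0)   # switch off rounding for numeric columns
foo[, .N, by=i]         # 2 groups, matching the v1.8.10 result
setNumericRounding(2)   # restore the default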

The change in v1.9.2 was :

o Numeric data is still joined and grouped within tolerance as before but instead of tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default) the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle. A few functions provided a 'tolerance' argument but this wasn't being passed through so has been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release [DONE].
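
That also explains the second test in the question. Near 2.9e18 adjacent doubles are 2^(61-52) = 512 apart, so the two ids are stored as distinct doubles and only the 2-byte rounding (granularity roughly 512 * 2^16, about 3.4e7, at that magnitude) merges them; near 1e19 adjacent doubles are 2048 apart, so the second pair already collapse to the same double on input, which is why both versions report a single group there. A quick sketch to check this, using the values from the question:

x = c(2884199399609098249, 2884199399608934409)
y = c(9999999999999999999, 9999999999999999998)
diff(x)        # -163840: distinct as doubles, but within the 2-byte rounding tolerance
y[1] == y[2]   # TRUE: both literals round to the same double, 1e19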

The example in ?setNumericRounding is :

> DT = data.table(a=seq(0,1,by=0.2),b=1:2, key="a")
> DT
     a b
1: 0.0 1
2: 0.2 2
3: 0.4 1
4: 0.6 2
5: 0.8 1
6: 1.0 2
> setNumericRounding(0)   # turn off rounding; i.e. if we didn't round
> DT[.(0.4)]   # works
     a b
1: 0.4 1
> DT[.(0.6)]   # no match!, confusing to users
     a  b      # 0.6 is clearly there in DT, and 0.4 worked ok!
1: 0.6 NA
>     
> setNumericRounding(2)   # restore default
> DT[.(0.6)]   # now works as user expects
     a b
1: 0.6 2
>     
> # using type 'numeric' for integers > 2^31 (typically ids)
> DT = data.table(id = c(1234567890123, 1234567890124, 1234567890125), val=1:3)
> DT[,.N,by=id]   # 1 row (the last digit has been rounded)
             id N
1: 1.234568e+12 3
> setNumericRounding(0)  # turn off rounding
> DT[,.N,by=id]   # 3 rows (the last digit wasn't rounded)
             id N
1: 1.234568e+12 1
2: 1.234568e+12 1
3: 1.234568e+12 1
>  # but, better to use bit64::integer64 for such ids instead of numeric
>  setNumericRounding(2)  # restore default, preferred
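
And a sketch of the integer64 route that the last comment refers to (assuming bit64 is installed and data.table v1.9.3+, where integer64 grouping is supported), using the ids from the question:

library(bit64)
library(data.table)
# integer64 stores these 19-digit ids exactly; no rounding setting is involved
DT = data.table(id = as.integer64(c("2884199399609098249", "2884199399608934409")))
DT[, .N, by=id]   # 2 rows, one per distinct id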