Background
I tried to replace some CSV output files with rds files to improve efficiency. These are intermediate files that will serve as inputs to other R scripts.
Question
I started investigating when my scripts failed and found that readRDS() and load() do not return data tables that are identical to the original. Is this supposed to happen? Or did I miss something?
Sample code
library( data.table )
aDT <- data.table( a=1:10, b=LETTERS[1:10] )
saveRDS( aDT, file = "aDT.rds")
bDT <- readRDS( file = "aDT.rds" )
identical( aDT, bDT, ignore.environment = TRUE ) # Gives FALSE
aDF <- data.frame( a=1:10, b=LETTERS[1:10] )
saveRDS( aDF, file = "aDF.rds")
bDF <- readRDS( file = "aDF.rds" )
identical( aDF, bDF, ignore.environment = TRUE ) # Gives TRUE
# Using 'save' & 'load' doesn't help either
aDT2 <- data.table( a=1:10, b=LETTERS[1:10] )
save( aDT2, file = "aDT2.RData")
bDT2 <- aDT2; rm( aDT2 )
load( file = "aDT2.RData" )
identical( aDT2, bDT2, ignore.environment = TRUE ) # Gives FALSE
I am running R ver 3.2.0 on Linux Mint and have tested with data.table ver 1.9.4 and 1.9.5 (latest).
Searching SO and Google returned this and this, but I don't think they answer this issue. I am still trying to figure out why my scripts failed when I switched to rds, but I am starting with this.
Would appreciate it very much if knowledgeable SO members can help. Thanks!
Edit:
Hi everyone, I happened to find a way to resolve the issue - have posted the solution below. I apologise if it's rather inelegant. Now, I have 2 further questions:
(1) Is there a better way?
(2) Can something be done in the R and/or data.table code to resolve this? I mean, this issue causes unpredictable bugs and is not the first thing that comes to mind. My 2 cents worth.
Probably, this has to do with pointers:
> attributes(aDT)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: 0x0000000000390788>
> attributes(bDT)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: (nil)>
> attributes(bDF)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.frame"
> attributes(aDF)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.frame"
You can look closely at what's going on using the .Internal(inspect(.)) command:
.Internal(inspect(aDT))
.Internal(inspect(bDT))
The newly loaded data.table doesn't know the pointer value of the already loaded one. You could tell it with
attributes(bDT)$.internal.selfref <- attributes(aDT)$.internal.selfref
identical( aDT, bDT, ignore.environment = TRUE )
# [1] TRUE
A data.frame doesn't keep this attribute, probably because data frames are not modified in place.
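To check that only the self-reference differs and not the data itself, a minimal sketch along these lines may help. It assumes a temporary file from tempfile() instead of the file names used in the question, and it uses all.equal() and truelength() as exported by data.table. The truelength() values depend on the R and data.table versions, so no expected numbers are shown.

```r
library(data.table)

aDT <- data.table(a = 1:10, b = LETTERS[1:10])

# Assumption for this sketch: write to a temporary file
path <- tempfile(fileext = ".rds")
saveRDS(aDT, file = path)
bDT <- readRDS(path)

# all.equal() compares names, classes and column contents,
# and leaves the .internal.selfref attribute out of the comparison
contents_match <- isTRUE(all.equal(aDT, bDT))
print(contents_match)

# truelength() reports how many column slots are allocated;
# comparing the two values shows whether the loaded table
# still carries the over-allocation of the original
print(truelength(aDT))
print(truelength(bDT))
```

If the contents are all that matters for a check in a script, comparing with all.equal() rather than identical() sidesteps the pointer difference.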
I happened to find a way that resolves the issue (disclaimer: it's a rather inelegant way but it works!) - adding then deleting a dummy column in the loaded data table leads to identical() returning TRUE. I have also successfully replaced the csv intermediate files with rds files in my own code.
To be honest, I don't understand enough of the inner workings of R or data.table to know why it works, so any explanations and/or more elegant solutions would be welcome.
library( data.table )
aDT <- data.table( a=1:10, b=LETTERS[1:10] )
saveRDS( aDT, file = "aDT.rds")
bDT <- readRDS( file = "aDT.rds" )
identical( aDT, bDT, ignore.environment = TRUE ) # Gives FALSE
bDT[ , aaa := NA ]; bDT[ , aaa := NULL ]
identical( aDT, bDT, ignore.environment = TRUE ) # Now gives TRUE
# Using the add-del-col 'trick' works here too
aDT2 <- data.table( a=1:10, b=LETTERS[1:10] )
save( aDT2, file = "aDT2.RData")
bDT2 <- aDT2; rm( aDT2 )
load( file = "aDT2.RData" )
identical( aDT2, bDT2, ignore.environment = TRUE ) # Gives FALSE
aDT2[ , aaa := NA ]; aDT2[ , aaa := NULL ]
identical( aDT2, bDT2, ignore.environment = TRUE ) # Now gives TRUE
The solution is to use setDT after load or readRDS:
bDT <- readRDS("aDT.rds")   # file written with saveRDS()
setDT(bDT)
load("aDT2.RData")          # file written with save(); restores aDT2
setDT(aDT2)
source: Adding new columns to a data.table by-reference within a function not always working
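If many scripts read these intermediate files, the readRDS-then-setDT step could be wrapped once. The sketch below is only an illustration: read_rds_dt is a hypothetical helper name, not a data.table function, and the temporary file stands in for a real intermediate file.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read an rds file and
# call setDT() on the result so the table is ready for := afterwards.
read_rds_dt <- function(path) {
  x <- readRDS(path)
  if (is.data.frame(x)) setDT(x)  # setDT() modifies x by reference
  x
}

aDT <- data.table(a = 1:10, b = LETTERS[1:10])

# Assumption for this sketch: write to a temporary file
path <- tempfile(fileext = ".rds")
saveRDS(aDT, file = path)

bDT <- read_rds_dt(path)

# Add and remove a column by reference on the loaded table
bDT[, aaa := NA]
bDT[, aaa := NULL]

print(is.data.table(bDT))
print(isTRUE(all.equal(aDT, bDT)))
```

Keeping the setDT() call inside the reader means callers do not have to remember it after every readRDS().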