Background
I tried to replace some CSV output files with rds files to improve efficiency. These are intermediate files that will serve as inputs to other R scripts.
Question
I started investigating when my scripts failed and found that readRDS() and load() do not return data tables identical to the originals. Is this supposed to happen? Or did I miss something?
Sample code
library(data.table)
aDT <- data.table(a = 1:10, b = LETTERS[1:10])
saveRDS(aDT, file = "aDT.rds")
bDT <- readRDS(file = "aDT.rds")
identical(aDT, bDT, ignore.environment = TRUE) # Gives FALSE
aDF <- data.frame(a = 1:10, b = LETTERS[1:10])
saveRDS(aDF, file = "aDF.rds")
bDF <- readRDS(file = "aDF.rds")
identical(aDF, bDF, ignore.environment = TRUE) # Gives TRUE
# Using 'save' & 'load' doesn't help either
aDT2 <- data.table(a = 1:10, b = LETTERS[1:10])
save(aDT2, file = "aDT2.RData")
bDT2 <- aDT2; rm(aDT2)
load(file = "aDT2.RData")
identical(aDT2, bDT2, ignore.environment = TRUE) # Gives FALSE
I am running R ver 3.2.0 on Linux Mint and have tested with data.table ver 1.9.4 and 1.9.5 (latest).
Searching SO and Google returned this and this, but I don't think they answer this issue. I am still trying to figure out why my scripts failed when I switched to rds, but I am starting with this.
Would appreciate it very much if knowledgeable SO members can help. Thanks!
Edit:
Hi everyone, I happened to find a way to resolve the issue - have posted the solution below. I apologise if it's rather inelegant. Now, I have 2 further questions:
(1) Is there a better way?
(2) Can something be done in the R and/or data.table code to resolve this? I mean, this issue causes unpredictable bugs and is not the first thing that comes to mind. My 2 cents worth.
I happened to find a way that resolves the issue (disclaimer: it's a rather inelegant way, but it works!): adding and then deleting a dummy column in the loaded data table leads to identical being 'True'. I have also successfully replaced csv with rds intermediate files in my own code.
To be honest, I don't understand enough of the inner workings of R or data table to know why it works, so any explanations and/or more elegant solutions would be welcome.
The solution is to use setDT after load or readRDS.
source: Adding new columns to a data.table by-reference within a function not always working
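The two workarounds described above (the dummy-column trick and setDT) could be sketched roughly as follows. This assumes data.table is installed; rds_path is a hypothetical temporary file, not a path from the original post. Whether identical() reports TRUE afterwards may depend on the R and data.table versions in use, so the result is printed rather than assumed.

```r
library(data.table)

# Hypothetical temporary file; any writable path would do.
rds_path <- tempfile(fileext = ".rds")

aDT <- data.table(a = 1:10, b = LETTERS[1:10])
saveRDS(aDT, file = rds_path)

# Workaround 1: call setDT() right after reading the file back in.
bDT <- readRDS(rds_path)
setDT(bDT)
same_after_setDT <- identical(aDT, bDT)

# Workaround 2 (from the question's edit): add a dummy column, then drop it.
cDT <- readRDS(rds_path)
cDT[, dummy := 0L]
cDT[, dummy := NULL]
same_after_dummy <- identical(aDT, cDT)

# Print both comparison results instead of assuming their values.
print(c(setDT = same_after_setDT, dummy = same_after_dummy))
```

Of the two, setDT is a single call and leaves the columns untouched, which is probably why it reads as the cleaner option.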
The newly loaded data.table doesn't know the pointer value of the already loaded one, although you could supply it explicitly. data.frames don't keep this pointer attribute, probably because they don't do in-place modification.
Probably, this has to do with pointers.
You can look closely at what is going on using the .Internal(inspect(.)) command.
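As a rough illustration of the pointer explanation, the sketch below prints the self-reference attribute of an original and a reloaded data.table, then dumps both objects with .Internal(inspect(.)). It assumes data.table is installed and that the pointer is stored in the attribute named .internal.selfref; rds_path is a hypothetical temporary file. The exact addresses and the inspect output format vary by session and R version, so no output is shown here.

```r
library(data.table)

rds_path <- tempfile(fileext = ".rds")  # hypothetical temporary file

aDT <- data.table(a = 1:10, b = LETTERS[1:10])
saveRDS(aDT, file = rds_path)
bDT <- readRDS(rds_path)

# The self-reference is stored as an external pointer attribute.
ptr_original <- attr(aDT, ".internal.selfref", exact = TRUE)
ptr_reloaded <- attr(bDT, ".internal.selfref", exact = TRUE)
print(ptr_original)
print(ptr_reloaded)

# A plain data.frame carries no such attribute.
aDF <- data.frame(a = 1:10, b = LETTERS[1:10])
print(names(attributes(aDF)))

# Low-level dump of each object, including its attributes.
.Internal(inspect(aDT))
.Internal(inspect(bDT))
```

Comparing the two printed pointers side by side is usually enough; the inspect dump is mainly useful for seeing where the attribute sits on the object.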