I have a large data.table with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible.
I see two options:
1: Convert to a data.frame and use something like the sketch shown after this list
2: Some kind of cool data.table subsetting command
I'll be happy with a fairly efficient solution of type 1. Converting to a data.frame and then back to a data.table won't take too long.
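For option 1, what I have in mind is the usual data.frame idiom; a minimal sketch of the round trip (assuming `dt` is the large data.table):

```r
library(data.table)

# Option 1 sketch: convert to data.frame, replace NAs with 0, convert back.
df <- as.data.frame(dt)   # dt is assumed to be the large data.table
df[is.na(df)] <- 0
dt <- as.data.table(df)
```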
Here's a solution using data.table's `:=` operator, building on Andrie and Ramnath's answers.
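The original code block isn't preserved in this copy; a sketch consistent with the description below (a small `na.replace` helper applied column by column with `:=`; the helper's exact body is an assumption) might look like this:

```r
library(data.table)

# Sketch of f_dowle: for each column, replace its NAs via a small na.replace
# helper and assign the result back by reference with :=
f_dowle <- function(dt) {
  na.replace <- function(v, value = 0) { v[is.na(v)] <- value; v }
  for (col in names(dt)) {
    dt[, (col) := na.replace(get(col))]
  }
}
```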
Note that `f_dowle` updated `dt1` by reference. If a local copy is required then an explicit call to the `copy` function is needed to make a local copy of the whole dataset. data.table's `setkey`, `key<-` and `:=` do not copy-on-write.

Next, let's see where `f_dowle` is spending its time.
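The profiling output isn't reproduced here; a minimal way to generate that kind of breakdown (assuming `dt1` is the large test table) is:

```r
# Profile f_dowle on the large table; summaryRprof shows where the time goes.
Rprof("f_dowle.out")
f_dowle(dt1)
Rprof(NULL)
summaryRprof("f_dowle.out")
```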
There, I would focus on `na.replace` and `is.na`, where there are a few vector copies and vector scans. Those can fairly easily be eliminated by writing a small `na.replace` C function that updates `NA` by reference in the vector. That would at least halve the 20 seconds, I think. Does such a function exist in any R package?

The reason `f_andrie` fails may be because it copies the whole of `dt1`, or creates a logical matrix as big as the whole of `dt1`, a few times. The other two methods work on one column at a time (although I only briefly looked at `NAToUnknown`).

EDIT (more elegant solution, as requested by Ramnath in comments):
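That edit's code block is missing from this copy; a sketch of the more elegant per-column form (subassigning only the `NA` rows by reference; the name `f_dowle2` is an assumption) could be:

```r
library(data.table)

# Sketch of the more elegant version: for each column, select the NA rows and
# assign 0 to them by reference with := (no helper function needed).
f_dowle2 <- function(dt) {
  for (col in names(dt)) {
    dt[is.na(get(col)), (col) := 0]
  }
}
```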
I wish I did it that way to start with!
EDIT2 (over 1 year later, now)
There is also `set()`. This can be faster if there are a lot of columns being looped through, as it avoids the (small) overhead of calling `[,:=,]` in a loop. `set` is a loopable `:=`. See `?set`.
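A sketch of the `set()`-based version (the name `f_dowle3` and the exact loop are assumptions):

```r
library(data.table)

# Sketch of f_dowle3: set() is the low-overhead, loopable form of :=.
# For each column j, find the NA positions once and overwrite them with 0.
f_dowle3 <- function(dt) {
  for (j in seq_len(ncol(dt))) {
    set(dt, which(is.na(dt[[j]])), j, 0)
  }
}
```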
My understanding is that the secret to fast operations in R is to utilise vectors (or arrays, which are vectors under the hood).
In this solution I make use of a `data.matrix`, which is an array but behaves a bit like a `data.frame`. Because it is an array, you can use a very simple vector substitution to replace the `NA`s.

A little helper function to remove the `NA`s. The essence is a single line of code. I only do this to measure execution time.
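The helper itself isn't shown in this copy; a sketch matching that description (the name `remove_na` is an assumption) is:

```r
library(data.table)

# Sketch of the helper: data.matrix() gives an array, so a single vectorised
# substitution clears every NA; then convert back to a data.table.
remove_na <- function(x) {
  dm <- data.matrix(x)   # data.table -> numeric matrix (an array)
  dm[is.na(dm)] <- 0     # the one line that does the work
  data.table(dm)         # back to a data.table (note: this copies)
}
```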
A little helper function to create a `data.table` of a given size, followed by a demonstration on a tiny sample.
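Sketches of that helper and the demonstration (the name `create_dt`, its defaults, and the `propNA` argument are assumptions):

```r
library(data.table)

# Sketch: build an nrow x ncol data.table of uniform random numbers with
# roughly propNA of the cells set to NA.
create_dt <- function(nrow = 5, ncol = 5, propNA = 0.5) {
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow * ncol), floor(propNA * nrow * ncol))] <- NA
  data.table(matrix(v, ncol = ncol))
}

# Demonstration on a tiny sample:
dt <- create_dt(5, 5, 0.5)
dt             # contains NAs
remove_na(dt)  # NAs replaced by 0
```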
Here's the simplest one I could come up with:
dt[is.na(dt)] <- 0
It's efficient, and there's no need to write functions and other glue code.
For the sake of completeness, another way to replace `NA`s with 0 is to use:
To compare results and times I have incorporated all approaches mentioned so far.
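The comparison code isn't included in this copy; one possible harness (not the original code; it assumes the `create_dt`, `f_dowle2`, and `f_dowle3` sketches above and the microbenchmark package) would be:

```r
library(data.table)
library(microbenchmark)

# Hypothetical comparison harness: run each approach on a fresh copy so the
# by-reference methods don't hand later runs an already-cleaned table.
dt0 <- create_dt(2e5, 200, 0.1)

microbenchmark(
  dowle2 = f_dowle2(copy(dt0)),
  dowle3 = f_dowle3(copy(dt0)),
  simple = { d <- copy(dt0); d[is.na(d)] <- 0 },
  times  = 5
)
```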
So the new approach is slightly slower than `f_dowle3`, but faster than all the other approaches. To be honest, though, this is against my intuition of the data.table syntax and I have no idea why it works. Can anybody enlighten me?

Here is a solution using `NAToUnknown` in the `gdata` package. I have used Andrie's solution to create a huge data table and also included time comparisons with Andrie's solution.
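The code isn't included in this copy; a sketch of the `gdata` approach (assuming a gdata version that still exports `NAToUnknown`, plus the `create_dt` sketch from Andrie's answer above):

```r
library(gdata)
library(data.table)

# Sketch: NAToUnknown() replaces every NA with the 'unknown' value (0 here);
# wrap the result back into a data.table to compare with the other methods.
dt1 <- create_dt(2e5, 200, 0.1)
system.time(r2 <- as.data.table(NAToUnknown(dt1, unknown = 0)))
```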