Background
Several SQL languages (I mostly use postgreSQL) have a function called coalesce which returns the first non null column element for each row. This can be very efficient to use when tables have a lot of NULL
elements in them.
I encounter this in a lot of scenarios in R as well when dealing with not so structured data which has a lot of NA's in them.
I have made a naive implementation myself but it is ridiculously slow.
coalesce <- function(...) {
apply(cbind(...), 1, function(x) {
x[which(!is.na(x))[1]]
})
}
Example
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)
coalesce(a,b,c)
# [1] 1 2 NA 4 6
Question
Is there any efficient way to implement coalesce
in R?
A very simple solution is to use the
ifelse
function from thebase
package:Although it appears to be slower than
coalesce2
above:You can use
Reduce
to make it work for an arbitrary number of vectors:Another apply method, with
mapply
.This selects the first non-NA value if more than one exists. The last non-missing element could be selected using
tail
.Maybe a bit more speed could be squeezed out of this alternative using the bare bones
.mapply
function, which looks a little different..mapply
differs in important ways from its non-dotted cousin.Map
) and so must be wrapped in some function likeunlist
orc
to return a vector.mapply
, the moreArgs argument does not have a default, so must explicitly be fed NULL.On my machine, using
Reduce
gets a 5x performance improvement:Here is my solution:
coalesce <- function(x){ y <- head( x[is.na(x) == F] , 1) return(y) }
It returns first vaule which is not NA and it works ondata.table
, for example if you want to use coalesce on few columns and these column names are in vector of strings:column_names <- c("col1", "col2", "col3")
how to use:
ranking[, coalesce_column := coalesce( mget(column_names) ), by = 1:nrow(ranking)]
Looks like coalesce1 is still available
which is faster still (but more-or-less a hand re-write of
Reduce
, so less general)Or for larger data compare
showing that
which()
can sometimes be effective, even though it implies a second pass through the index.I have a ready-to-use implementation called
coalesce.na
in my misc package. It seems to be competitive, but not fastest. It will also work for vectors of different length, and has a special treatment for vectors of length one:Here's the code:
Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.