How to implement coalesce efficiently in R

2018-12-31 06:13发布

Background

Several SQL languages (I mostly use postgreSQL) have a function called coalesce which returns the first non null column element for each row. This can be very efficient to use when tables have a lot of NULL elements in them.

I encounter this in a lot of scenarios in R as well when dealing with not so structured data which has a lot of NA's in them.

I have made a naive implementation myself but it is ridiculously slow.

coalesce <- function(...) {
  apply(cbind(...), 1, function(x) {
          x[which(!is.na(x))[1]]
        })
}

Example

a <- c(1,  2,  NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7,  8,  NA, 9, 10)
coalesce(a,b,c)
# [1]  1  2 NA  4  6

Question

Is there any efficient way to implement coalesce in R?

标签: r coalesce
7条回答
余生请多指教
2楼-- · 2018-12-31 06:51

A very simple solution is to use the ifelse function from the base package:

coalesce3 <- function(x, y) {

    ifelse(is.na(x), y, x)
}

Although it appears to be slower than coalesce2 above:

test <- function(a, b, func) {

    for (i in 1:10000) {

        func(a, b)
    }
}

system.time(test(a, b, coalesce2))
user  system elapsed 
0.11    0.00    0.10 

system.time(test(a, b, coalesce3))
user  system elapsed 
0.16    0.00    0.15 

You can use Reduce to make it work for an arbitrary number of vectors:

coalesce4 <- function(...) {

    Reduce(coalesce3, list(...))
}
查看更多
春风洒进眼中
3楼-- · 2018-12-31 06:53

Another apply method, with mapply.

mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]}, a, b, c)
[1]  1  2 NA  4  6

This selects the first non-NA value if more than one exists. The last non-missing element could be selected using tail.

Maybe a bit more speed could be squeezed out of this alternative using the bare bones .mapply function, which looks a little different.

unlist(.mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]},
               dots=list(a, b, c), MoreArgs=NULL))
[1]  1  2 NA  4  6

.mapplydiffers in important ways from its non-dotted cousin.

  • it returns a list (like Map) and so must be wrapped in some function like unlist or c to return a vector.
  • the set of arguments to be fed in parallel to the function in FUN must be given in a list to the dots argument.
  • Finally, mapply, the moreArgs argument does not have a default, so must explicitly be fed NULL.
查看更多
孤独寂梦人
4楼-- · 2018-12-31 06:54

On my machine, using Reduce gets a 5x performance improvement:

coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x},
  list(...))
}

> microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
Unit: microseconds
               expr    min       lq   median       uq     max neval
  coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
 coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100
查看更多
与君花间醉酒
5楼-- · 2018-12-31 06:54

Here is my solution:

coalesce <- function(x){ y <- head( x[is.na(x) == F] , 1) return(y) } It returns first vaule which is not NA and it works on data.table, for example if you want to use coalesce on few columns and these column names are in vector of strings:

column_names <- c("col1", "col2", "col3")

how to use:

ranking[, coalesce_column := coalesce( mget(column_names) ), by = 1:nrow(ranking)]

查看更多
人间绝色
6楼-- · 2018-12-31 06:58

Looks like coalesce1 is still available

coalesce1 <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- is.na(ans)
        ans[i] <- elt[i]
    }
    ans
}

which is faster still (but more-or-less a hand re-write of Reduce, so less general)

> identical(coalesce(a, b, c), coalesce1(a, b, c))
[1] TRUE
> microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
Unit: microseconds
               expr     min       lq   median       uq     max neval
  coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
 coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
 coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100

Or for larger data compare

coalesce1a <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
    }
    ans
}

showing that which() can sometimes be effective, even though it implies a second pass through the index.

> aa <- sample(a, 100000, TRUE)
> bb <- sample(b, 100000, TRUE)
> cc <- sample(c, 100000, TRUE)
> microbenchmark(coalesce1(aa, bb, cc),
+                coalesce1a(aa, bb, cc),
+                coalesce2(aa,bb,cc), times=10)
Unit: milliseconds
                   expr       min        lq    median        uq       max neval
  coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
 coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
  coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10
查看更多
倾城一夜雪
7楼-- · 2018-12-31 07:03

I have a ready-to-use implementation called coalesce.na in my misc package. It seems to be competitive, but not fastest. It will also work for vectors of different length, and has a special treatment for vectors of length one:

                    expr        min          lq      median          uq         max neval
    coalesce(aa, bb, cc) 990.060402 1030.708466 1067.000698 1083.301986 1280.734389    10
   coalesce1(aa, bb, cc)  11.356584   11.448455   11.804239   12.507659   14.922052    10
  coalesce1a(aa, bb, cc)   2.739395    2.786594    2.852942    3.312728    5.529927    10
   coalesce2(aa, bb, cc)   2.929364    3.041345    3.593424    3.868032    7.838552    10
 coalesce.na(aa, bb, cc)   4.640552    4.691107    4.858385    4.973895    5.676463    10

Here's the code:

coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}

Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.

查看更多
登录 后发表回答