Idiomatic way to copy cell values “down” in an R vector

Published 2019-08-07 11:45

Question:

Possible Duplicate:
Populate NAs in a vector using prior non-NA values?

Is there an idiomatic way to copy cell values "down" in an R vector? By "copying down", I mean replacing NAs with the closest previous non-NA value.

While I can do this very simply with a for loop, it runs very slowly. Any advice on how to vectorise this would be appreciated.

# Test code
# Set up test data
len <- 1000000
data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
head(data, n=25)
tail(data, n=25)

# Time naive method
system.time({
  data.clean <- data;
  for (i in 2:length(data.clean)){
    if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
  }
})

# Print results
head(data.clean, n=25)
tail(data.clean, n=25)

Result of test run:

> # Set up test data
> len <- 1000000
> data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
> head(data, n=25)
 [1]  1 NA NA NA NA NA NA NA NA NA  2 NA NA NA NA NA NA NA NA NA  3 NA NA NA NA
> tail(data, n=25)
 [1]     NA     NA     NA     NA     NA  99999     NA     NA     NA     NA
[11]     NA     NA     NA     NA     NA 100000     NA     NA     NA     NA
[21]     NA     NA     NA     NA     NA
> 
> # Time naive method
> system.time({
+   data.clean <- data;
+   for (i in 2:length(data.clean)){
+     if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
+   }
+ })
   user  system elapsed 
   3.09    0.00    3.09 
> 
> # Print results
> head(data.clean, n=25)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
> tail(data.clean, n=25)
 [1]  99998  99998  99998  99998  99998  99999  99999  99999  99999  99999
[11]  99999  99999  99999  99999  99999 100000 100000 100000 100000 100000
[21] 100000 100000 100000 100000 100000
> 

Answer 1:

Use zoo::na.locf

Wrapping your code in function f (including returning data.clean at the end):
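For reference, `f` is just the loop from the question wrapped in a function:

```r
# The question's naive loop, wrapped as a function so it can be benchmarked
f <- function(x) {
  data.clean <- x
  for (i in 2:length(data.clean)) {
    if (is.na(data.clean[i])) data.clean[i] <- data.clean[i - 1]
  }
  data.clean
}
```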

library(rbenchmark)
library(zoo)

identical(f(data), na.locf(data))
## [1] TRUE

benchmark(f(data), na.locf(data), replications=10, columns=c("test", "elapsed", "relative"))
##            test elapsed relative
## 1       f(data)  21.460   14.471
## 2 na.locf(data)   1.483    1.000
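On a small vector you can see what `na.locf` ("last observation carried forward") does:

```r
library(zoo)

x <- c(1, NA, NA, 2, NA)
na.locf(x)
## [1] 1 1 1 2 2

# Leading NAs are dropped by default; na.rm = FALSE keeps them
na.locf(c(NA, 1, NA), na.rm = FALSE)
## [1] NA  1  1
```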


Answer 2:

I don't know about idiomatic, but here we identify the non-NA values (idx) and, for each position, the index of the most recent non-NA value (cumsum(idx)):

f1 <- function(x) {
    idx <- !is.na(x)
    x[idx][cumsum(idx)]
}
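To see why this works, trace it on a small vector:

```r
x <- c(5, NA, NA, 7, NA)
idx <- !is.na(x)      # TRUE FALSE FALSE TRUE FALSE
cumsum(idx)           # 1 1 1 2 2  -- index of the last non-NA seen so far
x[idx]                # 5 7        -- the non-NA values, in order
x[idx][cumsum(idx)]   # 5 5 5 7 7  -- each value repeated until the next one
```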

which seems to be about 6 times faster than na.locf for the example data. It drops leading NAs, as na.locf does by default, so

f2 <- function(x, na.rm=TRUE) {
    idx <- !is.na(x)
    cidx <- cumsum(idx)
    if (!na.rm)
        cidx[cidx==0] <- NA_integer_
    x[idx][cidx]
}
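With leading NAs the two modes differ (note that na.rm=TRUE shortens the result, since an index of 0 drops the element):

```r
f2 <- function(x, na.rm=TRUE) {   # as defined above
    idx <- !is.na(x)
    cidx <- cumsum(idx)
    if (!na.rm)
        cidx[cidx==0] <- NA_integer_
    x[idx][cidx]
}

x <- c(NA, NA, 1, NA)
f2(x)                 # leading NAs dropped, like na.locf's default
## [1] 1 1
f2(x, na.rm=FALSE)    # leading NAs preserved
## [1] NA NA  1  1
```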

which seems to add about 30% to the run time when na.rm=FALSE. Presumably na.locf has other merits, capturing more of the corner cases and allowing filling up instead of down (which is an interesting exercise in the cumsum world, anyway). It's also clear that we make at least five allocations of possibly large data -- is.na(x) and its complement idx, cumsum(idx), x[idx], and x[idx][cumsum(idx)] -- so there's room for further improvement, e.g., in C.
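Filling up rather than down needs no new machinery, by the way: reverse, fill down, reverse back. A minimal sketch built on f1 from above (fill_up is a name invented here; note the trailing NA is dropped, mirroring f1's treatment of leading NAs):

```r
f1 <- function(x) {            # as defined above
    idx <- !is.na(x)
    x[idx][cumsum(idx)]
}

# Fill "up": each NA takes the value of the next non-NA below it
fill_up <- function(x) rev(f1(rev(x)))

fill_up(c(NA, 1, NA, NA, 2, NA))
## [1] 1 1 2 2 2
```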