Idiomatic way to copy cell values "down" in an R vector

Posted 2019-08-07 11:58

Possible Duplicate:
Populate NAs in a vector using prior non-NA values?

Is there an idiomatic way to copy cell values "down" in an R vector? By "copying down", I mean replacing NAs with the closest previous non-NA value.

While I can do this very simply with a for loop, it runs very slowly. Any advice on how to vectorise this would be appreciated.

# Test code
# Set up test data
len <- 1000000
data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
head(data, n=25)
tail(data, n=25)

# Time naive method
system.time({
  data.clean <- data;
  for (i in 2:length(data.clean)){
    if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
  }
})

# Print results
head(data.clean, n=25)
tail(data.clean, n=25)

Result of test run:

> # Set up test data
> len <- 1000000
> data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
> head(data, n=25)
 [1]  1 NA NA NA NA NA NA NA NA NA  2 NA NA NA NA NA NA NA NA NA  3 NA NA NA NA
> tail(data, n=25)
 [1]     NA     NA     NA     NA     NA  99999     NA     NA     NA     NA
[11]     NA     NA     NA     NA     NA 100000     NA     NA     NA     NA
[21]     NA     NA     NA     NA     NA
> 
> # Time naive method
> system.time({
+   data.clean <- data;
+   for (i in 2:length(data.clean)){
+     if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
+   }
+ })
   user  system elapsed 
   3.09    0.00    3.09 
> 
> # Print results
> head(data.clean, n=25)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
> tail(data.clean, n=25)
 [1]  99998  99998  99998  99998  99998  99999  99999  99999  99999  99999
[11]  99999  99999  99999  99999  99999 100000 100000 100000 100000 100000
[21] 100000 100000 100000 100000 100000
> 

2 Answers
可以哭但决不认输i
#2 · 2019-08-07 12:28

Use zoo::na.locf

Wrapping your code in function f (including returning data.clean at the end):
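Something like the following, reusing the question's loop (the exact definition of f isn't shown in the answer, so this is an assumed reconstruction):

f <- function(data) {
  data.clean <- data
  for (i in 2:length(data.clean)) {
    if (is.na(data.clean[i])) data.clean[i] <- data.clean[i - 1]
  }
  data.clean  # return the filled vector
}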

library(rbenchmark)
library(zoo)

identical(f(data), na.locf(data))
## [1] TRUE

benchmark(f(data), na.locf(data), replications=10, columns=c("test", "elapsed", "relative"))
##            test elapsed relative
## 1       f(data)  21.460   14.471
## 2 na.locf(data)   1.483    1.000
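For reference, a small illustration of na.locf on a toy vector; note that leading NAs are dropped unless na.rm=FALSE is passed:

na.locf(c(NA, 1, NA, NA, 2, NA))
## [1] 1 1 1 2 2
na.locf(c(NA, 1, NA, NA, 2, NA), na.rm=FALSE)
## [1] NA  1  1  1  2  2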
[account banned]
#3 · 2019-08-07 12:48

I don't know about idiomatic, but here we identify the non-NA values (idx) and, for each position, the index of the most recent non-NA value (cumsum(idx)):

f1 <- function(x) {
    idx <- !is.na(x)
    x[idx][cumsum(idx)]
}

which seems to be about 6 times faster than na.locf for the example data. Like na.locf, it drops leading NAs by default, so

f2 <- function(x, na.rm=TRUE) {
    idx <- !is.na(x)
    cidx <- cumsum(idx)
    if (!na.rm)
        cidx[cidx==0] <- NA_integer_
    x[idx][cidx]
}

which seems to add about 30% to the running time when na.rm=FALSE. Presumably na.locf has other merits, covering more corner cases and allowing filling up instead of down (an interesting exercise in the cumsum world; a sketch follows below). It's also clear that we make at least five allocations of possibly large data -- is.na(x) and its complement idx, cumsum(idx), x[idx], and x[idx][cumsum(idx)] -- so there's room for further improvement, e.g., in C.
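A hypothetical sketch of filling up rather than down, simply reusing f2 on the reversed vector (the helper name f2_up is made up here; trailing NAs stay NA because they have no later non-NA value to borrow from):

f2_up <- function(x) {
    # reverse, carry the last (now "next") non-NA value forward, reverse back
    rev(f2(rev(x), na.rm=FALSE))
}

f2_up(c(1, NA, NA, 2, NA))
## [1]  1  2  2  2 NA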
