Idiomatic way to copy cell values "down" in an R vector

Posted 2019-08-07 11:58

Possible Duplicate:
Populate NAs in a vector using prior non-NA values?

Is there an idiomatic way to copy cell values "down" in an R vector? By "copying down", I mean replacing NAs with the closest previous non-NA value.

While I can do this very simply with a for loop, it runs very slowly. Any advice on how to vectorise this would be appreciated.

# Test code
# Set up test data
len <- 1000000
data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
head(data, n=25)
tail(data, n=25)

# Time naive method
system.time({
  data.clean <- data;
  for (i in 2:length(data.clean)){
    if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
  }
})

# Print results
head(data.clean, n=25)
tail(data.clean, n=25)

Result of test run:

> # Set up test data
> len <- 1000000
> data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
> head(data, n=25)
 [1]  1 NA NA NA NA NA NA NA NA NA  2 NA NA NA NA NA NA NA NA NA  3 NA NA NA NA
> tail(data, n=25)
 [1]     NA     NA     NA     NA     NA  99999     NA     NA     NA     NA
[11]     NA     NA     NA     NA     NA 100000     NA     NA     NA     NA
[21]     NA     NA     NA     NA     NA
> 
> # Time naive method
> system.time({
+   data.clean <- data;
+   for (i in 2:length(data.clean)){
+     if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
+   }
+ })
   user  system elapsed 
   3.09    0.00    3.09 
> 
> # Print results
> head(data.clean, n=25)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
> tail(data.clean, n=25)
 [1]  99998  99998  99998  99998  99998  99999  99999  99999  99999  99999
[11]  99999  99999  99999  99999  99999 100000 100000 100000 100000 100000
[21] 100000 100000 100000 100000 100000
> 

2 Answers
可以哭但决不认输i
#2 · 2019-08-07 12:28

Use zoo::na.locf

Wrapping your code in function f (including returning data.clean at the end):
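Something like the following, reusing the question's loop (the exact definition of f isn't shown in the answer, so this is an assumed reconstruction):

f <- function(data) {
  data.clean <- data
  for (i in 2:length(data.clean)) {
    if (is.na(data.clean[i])) data.clean[i] <- data.clean[i - 1]
  }
  data.clean  # return the filled vector
}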

library(rbenchmark)
library(zoo)

identical(f(data), na.locf(data))
## [1] TRUE

benchmark(f(data), na.locf(data), replications=10, columns=c("test", "elapsed", "relative"))
##            test elapsed relative
## 1       f(data)  21.460   14.471
## 2 na.locf(data)   1.483    1.000
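For reference, a small illustration of na.locf on a toy vector; note that leading NAs are dropped unless na.rm=FALSE is passed:

na.locf(c(NA, 1, NA, NA, 2, NA))
## [1] 1 1 1 2 2
na.locf(c(NA, 1, NA, NA, 2, NA), na.rm=FALSE)
## [1] NA  1  1  1  2  2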
[account banned]
#3 · 2019-08-07 12:48

I don't know about idiomatic, but here we identify the non-NA values (idx) and, for each position, the index of the most recent non-NA value (cumsum(idx)):

f1 <- function(x) {
    idx <- !is.na(x)
    x[idx][cumsum(idx)]
}

which seems to be about 6 times faster than na.locf for the example data. Like na.locf, it drops leading NAs by default, so

f2 <- function(x, na.rm=TRUE) {
    idx <- !is.na(x)
    cidx <- cumsum(idx)
    if (!na.rm)
        cidx[cidx==0] <- NA_integer_
    x[idx][cidx]
}

which seems to add about 30% to the running time when na.rm=FALSE. Presumably na.locf has other merits, covering more corner cases and allowing filling up instead of down (an interesting exercise in the cumsum world; a sketch follows below). It's also clear that we make at least five allocations of possibly large data -- is.na(x) and its complement idx, cumsum(idx), x[idx], and x[idx][cumsum(idx)] -- so there's room for further improvement, e.g., in C.
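A hypothetical sketch of filling up rather than down, simply reusing f2 on the reversed vector (the helper name f2_up is made up here; trailing NAs stay NA because they have no later non-NA value to borrow from):

f2_up <- function(x) {
    # reverse, carry the last (now "next") non-NA value forward, reverse back
    rev(f2(rev(x), na.rm=FALSE))
}

f2_up(c(1, NA, NA, 2, NA))
## [1]  1  2  2  2 NA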
