Replacing NAs with latest non-NA value

2018-12-31 01:08发布

In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of a data.frame) is the following:

> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

I would like a function fill.NAs() that allows me to construct yy such that:

> yy
[1] NA NA NA  2  2  2  2  3  3  3  4  4

I need to repeat this operation for many (total ~1 Tb) small sized data.frames (~30-50 Mb), where a row is NA is all its entries are. What is a good way to approach the problem?

The ugly solution I cooked up uses this function:

last <- function (x){
    x[length(x)]
}    

fill.NAs <- function(isNA){
if (isNA[1] == 1) {
    isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs 
                                              # can't be forward filled
}
isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
isNA.pos[isNA.diff < 0] <- 0
isNA.neg[isNA.diff > 0] <- 0
which.isNA.neg <- which(as.logical(isNA.neg))
if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
which.isNA.pos <- which(as.logical(isNA.pos))
which.isNA <- which(as.logical(isNA))
if (length(which.isNA.neg)==length(which.isNA.pos)){
    replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)], 
                                which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] - 
                                which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])      
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
} else {
    replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])     
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
}
replacement
}

The function fill.NAs is used as follows:

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
which.isNA <- which(as.logical(isNA))
to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
y[to.replace] <- y[replacement]
} 

Output

> y
[1] NA  2  2  2  2  3  3  3  4  4  4

... which seems to work. But, man, is it ugly! Any suggestions?

13条回答
牵手、夕阳
2楼-- · 2018-12-31 01:08

I tried the below:

nullIdx <- as.array(which(is.na(masterData$RequiredColumn)))
masterData$RequiredColumn[nullIdx] = masterData$RequiredColumn[nullIdx-1]

nullIdx gets the idx number where ever masterData$RequiredColumn has a Null/ NA value. In the next line we replace it with the corresponding Idx-1 value, i.e. the last good value before each NULL/ NA

查看更多
只靠听说
3楼-- · 2018-12-31 01:08

Here is a modification of @AdamO's solution. This one runs faster, because it bypasses the na.omit function. This will overwrite the NA values in vector y (except for leading NAs).

   z  <- !is.na(y)                  # indicates the positions of y whose values we do not want to overwrite
   z  <- z | !cumsum(z)             # for leading NA's in y, z will be TRUE, otherwise it will be FALSE where y has a NA and TRUE where y does not have a NA
   y  <- y[z][cumsum(z)]
查看更多
余欢
4楼-- · 2018-12-31 01:10

Try this function. It does not require the ZOO package:

# last observation moved forward
# replaces all NA values with last non-NA values
na.lomf <- function(x) {

    na.lomf.0 <- function(x) {
        non.na.idx <- which(!is.na(x))
        if (is.na(x[1L])) {
            non.na.idx <- c(1L, non.na.idx)
        }
        rep.int(x[non.na.idx], diff(c(non.na.idx, length(x) + 1L)))
    }

    dim.len <- length(dim(x))

    if (dim.len == 0L) {
        na.lomf.0(x)
    } else {
        apply(x, dim.len, na.lomf.0)
    }
}

Example:

> # vector
> na.lomf(c(1, NA,2, NA, NA))
[1] 1 1 2 2 2
> 
> # matrix
> na.lomf(matrix(c(1, NA, NA, 2, NA, NA), ncol = 2))
     [,1] [,2]
[1,]    1    2
[2,]    1    2
[3,]    1    2
查看更多
梦醉为红颜
5楼-- · 2018-12-31 01:15

Throwing my hat in:

library(Rcpp)
cppFunction('IntegerVector na_locf(IntegerVector x) {
  int n = x.size();

  for(int i = 0; i<n; i++) {
    if((i > 0) && (x[i] == NA_INTEGER) & (x[i-1] != NA_INTEGER)) {
      x[i] = x[i-1];
    }
  }
  return x;
}')

Setup a basic sample and a benchmark:

x <- sample(c(1,2,3,4,NA))

bench_em <- function(x,count = 10) {
  x <- sample(x,count,replace = TRUE)
  print(microbenchmark(
    na_locf(x),
    replace_na_with_last(x),
    na.lomf(x),
    na.locf(x),
    repeat.before(x)
  ), order = "mean", digits = 1)
}

And run some benchmarks:

bench_em(x,1e6)

Unit: microseconds
                    expr   min    lq  mean median    uq   max neval
              na_locf(x)   697   798   821    814   821 1e+03   100
              na.lomf(x)  3511  4137  5002   4214  4330 1e+04   100
 replace_na_with_last(x)  4482  5224  6473   5342  5801 2e+04   100
        repeat.before(x)  4793  5044  6622   5097  5520 1e+04   100
              na.locf(x) 12017 12658 17076  13545 19193 2e+05   100

Just in case:

all.equal(
     na_locf(x),
     replace_na_with_last(x),
     na.lomf(x),
     na.locf(x),
     repeat.before(x)
)
[1] TRUE

Update

For a numeric vector, the function is a bit different:

NumericVector na_locf_numeric(NumericVector x) {
  int n = x.size();
  LogicalVector ina = is_na(x);

  for(int i = 1; i<n; i++) {
    if((ina[i] == TRUE) & (ina[i-1] != TRUE)) {
      x[i] = x[i-1];
    }
  }
  return x;
}
查看更多
泛滥B
6楼-- · 2018-12-31 01:15

There are a bunch of packages offering na.locf (NA Last Observation Carried Forward) functions:

  • xts - xts::na.locf
  • zoo - zoo::na.locf
  • imputeTS - imputeTS::na.locf
  • spacetime - spacetime::na.locf

And also other packages where this function is named differently.

查看更多
还给你的自由
7楼-- · 2018-12-31 01:16

This has worked for me:

  replace_na_with_last<-function(x,a=!is.na(x)){
     x[which(a)[c(1,1:sum(a))][cumsum(a)+1]]
  }


> replace_na_with_last(c(1,NA,NA,NA,3,4,5,NA,5,5,5,NA,NA,NA))

[1] 1 1 1 1 3 4 5 5 5 5 5 5 5 5

> replace_na_with_last(c(NA,"aa",NA,"ccc",NA))

[1] "aa"  "aa"  "aa"  "ccc" "ccc"

speed is reasonable too:

> system.time(replace_na_with_last(sample(c(1,2,3,NA),1e6,replace=TRUE)))


 user  system elapsed 

 0.072   0.000   0.071 
查看更多
登录 后发表回答