In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of a data.frame
) is the following:
> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
I would like a function fill.NAs()
that allows me to construct yy
such that:
> yy
[1] NA NA NA 2 2 2 2 3 3 3 4 4
I need to repeat this operation for many (total ~1 Tb) small sized data.frame
s (~30-50 Mb), where a row is NA is all its entries are. What is a good way to approach the problem?
The ugly solution I cooked up uses this function:
last <- function (x){
x[length(x)]
}
fill.NAs <- function(isNA){
if (isNA[1] == 1) {
isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs
# can't be forward filled
}
isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
isNA.pos[isNA.diff < 0] <- 0
isNA.neg[isNA.diff > 0] <- 0
which.isNA.neg <- which(as.logical(isNA.neg))
if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
which.isNA.pos <- which(as.logical(isNA.pos))
which.isNA <- which(as.logical(isNA))
if (length(which.isNA.neg)==length(which.isNA.pos)){
replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)],
which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] -
which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])
replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
} else {
replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])
replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
}
replacement
}
The function fill.NAs
is used as follows:
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
which.isNA <- which(as.logical(isNA))
to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
y[to.replace] <- y[replacement]
}
Output
> y
[1] NA 2 2 2 2 3 3 3 4 4 4
... which seems to work. But, man, is it ugly! Any suggestions?
Sorry for digging up an old question. I couldn't look up the function to do this job on the train, so I wrote one myself.
I was proud to find out that it's a tiny bit faster.
It's less flexible though.
But it plays nice with
ave
, which is what I needed.Edit
As this became my most upvoted answer, I was reminded often that I don't use my own function, because I often need zoo's
maxgap
argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.I benchmarked my improved function and all the other entries here. For the basic set of features,
tidyr::fill
is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding ofall.equal
).If you need
maxgap
, my function below is faster than zoo (and doesn't have the weird problems with dates).I put up the documentation of my tests.
new function
I've also put the function in my formr package (Github only).
a
data.table
solution:this approach could work with forward filling zeros as well:
this method becomes very useful on data at scale and where you would want to perform a forward fill by group(s), which is trivial with
data.table
. just add the group(s) to theby
clause prior to thecumsum
logic.Having a leading
NA
is a bit of a wrinkle, but I find a very readable (and vectorized) way of doing LOCF when the leading term is not missing is:na.omit(y)[cumsum(!is.na(y))]
A slightly less readable modification works in general:
c(NA, na.omit(y))[cumsum(!is.na(y))+1]
gives the desired output:
c(NA, 2, 2, 2, 2, 3, 3, 4, 4, 4)
This worked for me, although I'm not sure whether it is more efficient than other suggestions.
Dealing with a big data volume, in order to be more efficient, we can use the data.table package.
You probably want to use the
na.locf()
function from the zoo package to carry the last observation forward to replace your NA values.Here is the beginning of its usage example from the help page: