Possible Duplicate:
Populate NAs in a vector using prior non-NA values?
Is there an idiomatic way to copy cell values "down" in an R vector? By "copying down", I mean replacing NAs with the closest previous non-NA value.
While I can do this very simply with a for loop, it runs very slowly. Any advice on how to vectorise this would be appreciated.
# Test code
# Set up test data
len <- 1000000
data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
head(data, n=25)
tail(data, n=25)
# Time naive method
system.time({
data.clean <- data;
for (i in 2:length(data.clean)){
if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
}
})
# Print results
head(data.clean, n=25)
tail(data.clean, n=25)
Result of test run:
> # Set up test data
> len <- 1000000
> data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
> head(data, n=25)
[1] 1 NA NA NA NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA 3 NA NA NA NA
> tail(data, n=25)
[1] NA NA NA NA NA 99999 NA NA NA NA
[11] NA NA NA NA NA 100000 NA NA NA NA
[21] NA NA NA NA NA
>
> # Time naive method
> system.time({
+ data.clean <- data;
+ for (i in 2:length(data.clean)){
+ if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
+ }
+ })
user system elapsed
3.09 0.00 3.09
>
> # Print results
> head(data.clean, n=25)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
> tail(data.clean, n=25)
[1] 99998 99998 99998 99998 99998 99999 99999 99999 99999 99999
[11] 99999 99999 99999 99999 99999 100000 100000 100000 100000 100000
[21] 100000 100000 100000 100000 100000
>
Use zoo::na.locf.
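A minimal usage sketch, assuming the data vector from the question (na.rm = FALSE keeps any leading NAs in place instead of dropping them):

library(zoo)
# Carry the last non-NA observation forward ("down") through the vector
data.clean <- na.locf(data, na.rm = FALSE)
head(data.clean, n = 25)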
Wrapping your code in a function f (including returning data.clean at the end) gives a baseline to time the alternatives against.
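A minimal sketch of that wrapper, reusing the loop from the question:

f <- function(x) {
  data.clean <- x
  for (i in 2:length(data.clean)) {
    if (is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
  }
  data.clean
}
system.time(res0 <- f(data))  # res0: reference result, name chosen here for comparison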
I don't know about idiomatic, but here we identify the non-NA values (idx) and the index of the last non-NA value seen at or before each position (cumsum(idx)).
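A sketch of that approach (the function name f1 is just illustrative):

f1 <- function(x) {
  idx <- !is.na(x)       # TRUE at non-NA positions
  x[idx][cumsum(idx)]    # for each position, pick the most recent non-NA value
}
system.time(res1 <- f1(data))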
This seems to be about 6 times faster than na.locf for the example data. It drops leading NAs, as na.locf does by default; an na.rm = FALSE option that keeps them (sketched below) seems to add about 30% to the run time.
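One way to support that option (a sketch; the name f2 and its na.rm argument mirror na.locf's convention):

f2 <- function(x, na.rm = TRUE) {
  idx <- !is.na(x)
  res <- x[idx][cumsum(idx)]
  if (!na.rm)  # pad the dropped leading NAs back onto the front
    res <- c(rep(NA, length(x) - length(res)), res)
  res
}
system.time(res2 <- f2(data, na.rm = FALSE))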
Presumably na.locf has other merits, capturing more of the corner cases and allowing filling up instead of down (which is an interesting exercise in the cumsum world, anyway). It's also clear that we're making at least five allocations of possibly large data -- idx (actually, we calculate is.na() and its complement), cumsum(idx), x[idx], and x[idx][cumsum(idx)] -- so there's room for further improvement, e.g., in C.