edit
The question was originally asked for data.table
. A solution with any package would be interesting.
I am a little stuck with a particular variation of a more general problem. I have panel data that I am using with data.table and I would like to fill in some missing values using the group by functionality of data.table. Unfortunately they are not numeric, so I can't simply interpolate, but they should only be filled in based on a condition. Is it possible to perform a kind of conditional na.locf in data.tables?
Essentially I only want to fill in the NAs if after the NAs the next observation is the previous ones, though the more general question is how to conditionally fill in NAs.
For example, in the following data I would like to fill in the associatedid variable by each id group. So id==1
, year==2003
would fill in as ABC123
because its the value before and after the NA, but not 2000 for the same id. id== 2
would not be changed because the next value is not the same as the one prior to the NAs. id==3
would fill in for 2003 and 2004.
mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))
mydf
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 <NA>
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 <NA>
#> 17 3 2004 <NA>
#> 18 3 2005 ABC123
dt = data.table(mydf, key = c("id"))
desired output
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
I've been trying to put together a two pass approach that on the first pass would change the NA's to paste "p_" in fron of the starting value (within an id) and then with the second pass check that the last of a sequence is in agreement with the next real value. I offer my code so far, which is not really an answer, so not expecting any upvotes. (Probably would have been easier to rename that
associatedid
asasid
.)Here's a pure tidyverse solution :
Or using
zoo::na.locf
:The same idea but using data.table :
And finally a fun idea using base, noting that you want to interpolate only if constant interpolation and linear interpolation are the same, if this character variable was numeric :
This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.
Once we have that function, we can use data.table as usual:
This returns:
which is what you are looking for as best I understand it.
I provided some hedging in the new.locf definition so you still might have a little thinking to do but this should get you started.
You could roll forwards and backwards for the missing rows, compare values and assign if they are equal:
If
na.locf0
applied forward and backwards are the same then usena.locf0
; otherwise, if they are not equal or if either is NA then use NA.giving:
Here is another attempt with
dplyr
: