Conditional NA filling by group

2020-08-24 16:59发布

问题:

edit
The question was originally asked for data.table. A solution with any package would be interesting.


I am a little stuck with a particular variation of a more general problem. I have panel data that I am using with data.table and I would like to fill in some missing values using the group by functionality of data.table. Unfortunately they are not numeric, so I can't simply interpolate, but they should only be filled in based on a condition. Is it possible to perform a kind of conditional na.locf in data.tables?

Essentially I only want to fill in the NAs if after the NAs the next observation is the previous ones, though the more general question is how to conditionally fill in NAs.

For example, in the following data I would like to fill in the associatedid variable by each id group. So id==1 , year==2003 would fill in as ABC123 because its the value before and after the NA, but not 2000 for the same id. id== 2 would not be changed because the next value is not the same as the one prior to the NAs. id==3 would fill in for 2003 and 2004.

mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))

mydf
#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003         <NA>
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003         <NA>
#> 17  3 2004         <NA>
#> 18  3 2005       ABC123

dt = data.table(mydf, key = c("id"))

desired output

#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003       ABC123
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003       ABC123
#> 17  3 2004       ABC123
#> 18  3 2005       ABC123

回答1:

This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.

new.locf <- function(x){
  # might want to think about the end of this loop
  # this works here but you might need to add another case
  # if there are NA's as the last value.
  #
  # anyway, loop through observations in a vector, x.
  for(i in 2:(length(x)-1)){
    nextval = i
    # find the next, non-NA value
    # again, not tested but might break if there isn't one?
    while(nextval <= length(x)-1 & is.na(x[nextval])){
      nextval = nextval + 1
    }
    # if the current value is not NA, great!
    if(!is.na(x[i])){
      x[i] <- x[i]
    }else{
      # if the current value is NA, and the last value is a value
      # (should given the nature of this loop), and
      # the next value, as calculated above, is the same as the last
      # value, then give us that value. 
      if(is.na(x[i]) & !is.na(x[i-1]) & x[i-1] == x[nextval]){
        x[i] <- x[nextval]
      }else{
        # finally, return NA if neither of these conditions hold
        x[i] <- NA
      }
    }
  }
  # return the new vector
  return(x) 
}

Once we have that function, we can use data.table as usual:

dt2 <- dt[,list(year = year,
                # when I read your data in, associatedid read as factor
                associatedid = new.locf(as.character(associatedid))
                ),
          by = "id"
          ]

This returns:

> dt2
    id year associatedid
 1:  1 2000           NA
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000           NA
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003           NA
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000           NA
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123

which is what you are looking for as best I understand it.

I provided some hedging in the new.locf definition so you still might have a little thinking to do but this should get you started.



回答2:

If na.locf0 applied forward and backwards are the same then use na.locf0; otherwise, if they are not equal or if either is NA then use NA.

library(data.table)
library(zoo)

dt[, associatedid := 
    ifelse(na.locf0(associatedid) == na.locf0(associatedid, fromLast=TRUE), 
      na.locf0(associatedid), NA), by = id]

giving:

> dt
    id year associatedid
 1:  1 2000         <NA>
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000         <NA>
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003         <NA>
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000         <NA>
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123


回答3:

Here's a pure tidyverse solution :

library(tidyverse)
mydf %>%
  mutate(up = associatedid, down = associatedid) %>%
  group_by(id) %>%
  fill(up,.direction = "up") %>%
  fill(down) %>%
  mutate_at("associatedid", ~if_else(is.na(.) & up == down, up, .)) %>%
  ungroup() %>%
  select(-up, - down)
#> # A tibble: 18 x 3
#>       id  year associatedid
#>    <int> <int> <fct>       
#>  1     1  2000 <NA>        
#>  2     1  2001 ABC123      
#>  3     1  2002 ABC123      
#>  4     1  2003 ABC123      
#>  5     1  2004 ABC123      
#>  6     1  2005 ABC123      
#>  7     2  2000 <NA>        
#>  8     2  2001 ABC123      
#>  9     2  2002 ABC123      
#> 10     2  2003 <NA>        
#> 11     2  2004 DEF456      
#> 12     2  2005 DEF456      
#> 13     3  2000 <NA>        
#> 14     3  2001 ABC123      
#> 15     3  2002 ABC123      
#> 16     3  2003 ABC123      
#> 17     3  2004 ABC123      
#> 18     3  2005 ABC123

Or using zoo::na.locf :

library(dplyr)
library(zoo)
mydf %>%
  group_by(id) %>%
  mutate_at("associatedid", ~if_else(
    is.na(.) & na.locf(.,F) == na.locf(.,F,fromLast = TRUE), na.locf(.,F), .)) %>%
  ungroup()
#> # A tibble: 18 x 3
#>       id  year associatedid
#>    <int> <int> <fct>       
#>  1     1  2000 <NA>        
#>  2     1  2001 ABC123      
#>  3     1  2002 ABC123      
#>  4     1  2003 ABC123      
#>  5     1  2004 ABC123      
#>  6     1  2005 ABC123      
#>  7     2  2000 <NA>        
#>  8     2  2001 ABC123      
#>  9     2  2002 ABC123      
#> 10     2  2003 <NA>        
#> 11     2  2004 DEF456      
#> 12     2  2005 DEF456      
#> 13     3  2000 <NA>        
#> 14     3  2001 ABC123      
#> 15     3  2002 ABC123      
#> 16     3  2003 ABC123      
#> 17     3  2004 ABC123      
#> 18     3  2005 ABC123

The same idea but using data.table :

library(zoo)
library(data.table)
setDT(mydf)
mydf[,associatedid := fifelse(
  is.na(associatedid) & na.locf(associatedid,F) == na.locf(associatedid,F,fromLast = TRUE), 
  na.locf(associatedid,F), associatedid),
  by = id]
mydf
#>     id year associatedid
#>  1:  1 2000         <NA>
#>  2:  1 2001       ABC123
#>  3:  1 2002       ABC123
#>  4:  1 2003       ABC123
#>  5:  1 2004       ABC123
#>  6:  1 2005       ABC123
#>  7:  2 2000         <NA>
#>  8:  2 2001       ABC123
#>  9:  2 2002       ABC123
#> 10:  2 2003         <NA>
#> 11:  2 2004       DEF456
#> 12:  2 2005       DEF456
#> 13:  3 2000         <NA>
#> 14:  3 2001       ABC123
#> 15:  3 2002       ABC123
#> 16:  3 2003       ABC123
#> 17:  3 2004       ABC123
#> 18:  3 2005       ABC123

And finally a fun idea using base, noting that you want to interpolate only if constant interpolation and linear interpolation are the same, if this character variable was numeric :

i <- ave( as.numeric(factor(mydf$associatedid)), mydf$id,FUN = function(x) ifelse(
  approx(x,xout = seq_along(x))$y == (z<- approx(x,xout = seq_along(x),method = "constant")$y),
  z, x))
mydf$associatedid <- levels(mydf$associatedid)[i]
mydf
#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003       ABC123
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003       ABC123
#> 17  3 2004       ABC123
#> 18  3 2005       ABC123


回答4:

You could roll forwards and backwards for the missing rows, compare values and assign if they are equal:

library(data.table)
DT = data.table(mydf)

w  = DT[is.na(associatedid), which=TRUE]
dn = DT[w, DT[-w][.SD, on=.(id, year), roll=TRUE, x.associatedid]]
up = DT[w, DT[-w][.SD, on=.(id, year), roll=-Inf, x.associatedid]]
ww = na.omit(w[up == dn])
DT[ww, associatedid := dn[ww]]

    id year associatedid
 1:  1 2000         <NA>
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000         <NA>
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003         <NA>
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000         <NA>
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003         <NA>
17:  3 2004         <NA>
18:  3 2005       ABC123


回答5:

Here is another attempt with dplyr :

library(dplyr)

mydf %>%
  #Detect NA values in associatedid
  mutate(isReplaced = is.na(associatedid), ans = associatedid) %>%
  group_by(id) %>%
  #Fill all NA values
  tidyr::fill(associatedid) %>%
  #Detect the NA values which were replaced
  mutate(isReplaced = isReplaced & !is.na(associatedid)) %>%
  #Group by id and associatedid 
  group_by(associatedid, add = TRUE) %>%
  #Add NA values if it was isReplaced and is first or last row of the group
  mutate(ans = replace(associatedid,row_number() %in% c(1, n()) & isReplaced, NA)) %>%
  ungroup() %>%
  select(-isReplaced, -associatedid)


# A tibble: 18 x 3
#      id  year ans   
#   <int> <int> <fct> 
# 1     1  2000 NA    
# 2     1  2001 ABC123
# 3     1  2002 ABC123
# 4     1  2003 ABC123
# 5     1  2004 ABC123
# 6     1  2005 ABC123
# 7     2  2000 NA    
# 8     2  2001 ABC123
# 9     2  2002 ABC123
#10     2  2003 NA    
#11     2  2004 DEF456
#12     2  2005 DEF456
#13     3  2000 NA    
#14     3  2001 ABC123
#15     3  2002 ABC123
#16     3  2003 ABC123
#17     3  2004 ABC123
#18     3  2005 ABC123


回答6:

I've been trying to put together a two pass approach that on the first pass would change the NA's to paste "p_" in fron of the starting value (within an id) and then with the second pass check that the last of a sequence is in agreement with the next real value. I offer my code so far, which is not really an answer, so not expecting any upvotes. (Probably would have been easier to rename that associatedid as asid.)

lapply( split(df, df$id), 
    function(d){ d$associatedid <- as.character(d$associatedid)
    missloc <- with( d, tapply(is.na(associatedid), id,  which))
    for (n in missloc) if( 
           d$associatedid[n+1] %in% c(d$associatedid[n-1],
                                   paste0("p_" , d$associatedid[n-1])&
    grepl( gsub("p\\_", "",  d$associatedid[n-1]), d$associatedid[n+1] )
                        { d$associatedid[n] <- d$associatedid[n-1]
                     } else{
               #tentative NA replacement
         d$associatedid[n] <- paste0("p_" , d$associatedid[n-1])}
 })