Select rows before a filtered row using dplyr

2019-07-19 05:16发布

I'm working on a study where we used a camera placed inside a nest box to determine when our study species laid its first egg. Some of the cameras weren't super reliable, and I'd like to see if there were continuous photos before the date where the first egg was laid. This way I can no for sure that this is the first egg date. There are >165,000 photos and >200 nests, so I grouped by nest box ID, filtered the rows down to those that have at least 1 egg, and then used the slice function to select the first row with data. Here's a reproducible example:

example <- structure(list(boxID = c("CA10", "CA10", "CA10", "CA10", "CA10", 
    "CA10", "CA10", "CA10", "CA10", "CA10", "CA10", "CA10", "CA10", 
    "CA10", "CA10"), visitType = c("Image", "Image", "Image", "Image", 
    "Image", "Image", "Image", "Image", "Image", "Image", "Image", 
    "Image", "Image", "Image", "Image"), day = c(25L, 25L, 25L, 26L, 
    26L, 26L, 27L, 27L, 27L, 28L, 28L, 28L, 29L, 29L, 29L), month = c("MAR", 
    "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", 
    "MAR", "MAR", "MAR", "MAR", "MAR"), year = c(2018, 2018, 2018, 
    2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
    2018), timeChecked = c("02:59", "09:06", "15:13", "02:59", "09:07", 
    "15:14", "02:59", "09:07", "15:13", "02:58", "09:06", "15:12", 
    "02:58", "09:06", "15:12"), species = c("Empty", "Empty", "Empty", 
    "Empty", "Empty", "Empty", "Empty", "Empty", "American Kestrel", 
    "Empty", "American Kestrel", "American Kestrel", "American Kestrel", 
    "American Kestrel", "American Kestrel"), sexAdult = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, "Female", NA, "Female", "Female", "Female", 
    NA, NA), numEggs = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    "1", "1", "1", "1", "1"), numNestlings = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), date = structure(c(17615, 
    17615, 17615, 17616, 17616, 17616, 17617, 17617, 17617, 17618, 
    17618, 17618, 17619, 17619, 17619), class = "Date")), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -15L), .Names = c("boxID", 
    "visitType", "day", "month", "year", "timeChecked", "species", 
    "sexAdult", "numEggs", "numNestlings", "date"))

Here's the code I have to find the first row that has at least 1 egg:

example %>%
      mutate_at(vars(numEggs, numNestlings), na_if, 'unknown') %>% # remove unknowns and other values that should be NA
      select(boxID, date, numEggs, visitType) %>%
      group_by(boxID) %>%
      filter(numEggs > 0) %>%
      slice(1) 

I'd like to look at the 5 or 10 rows before this first row with an egg to make sure there was continuous data up to this point in time. Is there a way to do this row indexing with slice or some other dplyr function?

2条回答
你好瞎i
2楼-- · 2019-07-19 05:47

Here a way based on slicing according to the position of the first non missing value of numEggs. You can modify the 5 value inside the last row according to the rows you want retain before the first non-NA numEggs

example %>%
      mutate_at(vars(numEggs, numNestlings), na_if, 'unknown') %>%
      select(boxID, date, numEggs, visitType) %>%
      group_by(boxID) %>%
      slice((min(which(!is.na(numEggs)))-5):min(which(!is.na(numEggs))))

    # A tibble: 6 x 4
# Groups:   boxID [1]
  boxID date       numEggs visitType
  <chr> <date>     <chr>   <chr>    
1 CA10  2018-03-26 NA      Image    
2 CA10  2018-03-27 NA      Image    
3 CA10  2018-03-27 NA      Image    
4 CA10  2018-03-27 NA      Image    
5 CA10  2018-03-28 NA      Image    
6 CA10  2018-03-28 1       Image 
查看更多
Viruses.
3楼-- · 2019-07-19 05:54

Here's one way. match returns the position of first numEggs > 0 and then we simply get additional n_previous rows from that position. We use max(1, ...) so that we don't get error if position of first numEggs > 0 < n_previous .

n_previous <- 5

example %>%
  mutate_at(vars(numEggs, numNestlings), na_if, 'unknown') %>% 
  select(boxID, date, numEggs, visitType) %>%
  group_by(boxID) %>%
  slice(max(1, match(TRUE, numEggs > 0) - n_previous):match(TRUE, numEggs > 0))

# A tibble: 6 x 4
# Groups:   boxID [1]
  boxID date       numEggs visitType
  <chr> <date>     <chr>   <chr>    
1 CA10  2018-03-26 <NA>    Image    
2 CA10  2018-03-27 <NA>    Image    
3 CA10  2018-03-27 <NA>    Image    
4 CA10  2018-03-27 <NA>    Image    
5 CA10  2018-03-28 <NA>    Image    
6 CA10  2018-03-28 1       Image  
查看更多
登录 后发表回答