R: Selecting first of n consecutive rows above a c

2019-01-19 06:41发布

问题:

I have a data frame with MRN, dates, and a test value.

I need to select all the first rows per MRN that have three consecutive values above 0.5.

This is an example version of the data:

   MRN Collected_Date   ANC
1  001     2015-01-02 0.345
2  001     2015-01-03 0.532
3  001     2015-01-04 0.843
4  001     2015-01-05 0.932
5  002     2015-03-03 0.012
6  002     2015-03-05 0.022
7  002     2015-03-06 0.543
8  002     2015-03-07 0.563
9  003     2015-08-02 0.343
10 003     2015-08-03 0.500
11 003     2015-08-04 0.734
12 003     2015-08-05 0.455
13 004     2014-01-02 0.001
14 004     2014-01-03 0.500
15 004     2014-01-04 0.562
16 004     2014-01-05 0.503

Example code:

df <- data.frame(MRN = c('001','001','001','001',
                         '002','002','002','002',
                         '003','003','003','003',
                         '004','004','004','004'), 
                 Collected_Date = as.Date(c('01-02-2015','01-03-2015','01-04-2015','01-05-2015',
                                            '03-03-2015','03-05-2015','03-06-2015','03-07-2015',
                                            '08-02-2015','08-03-2015','08-04-2015','08-05-2015',
                                            '01-02-2014','01-03-2014','01-04-2014','01-05-2014'), 
                                            format = '%m-%d-%Y'), 
                 ANC = as.numeric(c('0.345','0.532','0.843','0.932',
                         '0.012','0.022','0.543','0.563',
                         '0.343','0.500','0.734','0.455',
                         '0.001','0.500','0.562','0.503')))

Currently, I am using a very awkward approach using the lag function to calculate the date difference, then filter for all values >= 0.5, and then sum up the values, which helps to select the date of the THIRD value. I then substract two days to get the date of the first value:

   df %>% group_by(MRN) %>% 
    mutate(., days_diff = abs(Collected_Date[1] - Collected_Date)) %>% 
        filter(ANC >= 0.5) %>%
            mutate(days = days_diff + lag((days_diff))) %>%
                filter(days == 5) %>%
                    mutate(Collected_Date = Collected_Date - 2) %>%
                        select(MRN, Collected_Date)

Output:

Source: local data frame [2 x 2] Groups: MRN

  MRN Collected_Date
1 001     2015-01-03
2 004     2014-01-03

There must be a way simpler / more elegant way. Also, it does not give accurate results if there are gaps between the test dates.

My desired output for this example is:

   MRN Collected_Date   ANC     
1  001     2015-01-03 0.532
2  004     2014-01-03 0.500

So if at least three consecutive test values are >= 0.5, the date of the FIRST value should be returned.

If there are not at least three consecutive values >= 0.5, NA should be returned.

Any help is greatly appreciated!

Thank you very much!

回答1:

The easiest way is to use the zoo library in conjunction with dplyr. Within the zoo package there is a function called rollapply, we can use this to calculate a function value for a window of time.

In this example, we could apply the window to calculate the minimum of the next three values, and then apply the logic specified.

df %>% group_by(MRN) %>%
  mutate(ANC=rollapply(ANC, width=3, min, align="left", fill=NA, na.rm=TRUE)) %>%
  filter(ANC >= 0.5) %>%  
  filter(row_number() == 1)

#   MRN Collected_Date   ANC
# 1 001     2015-01-03 0.532
# 2 004     2014-01-03 0.500

In the code above we have used rollapply to calculate the minimum of the next 3 items. To see how this works compare the following:

rollapply(1:6, width=3, min, align="left", fill=NA) # [1]  1  2  3  4 NA NA
rollapply(1:6, width=3, min, align="center", fill=NA) # [1] NA  1  2  3  4 NA
rollapply(1:6, width=3, min, align="right", fill=NA) # [1] NA NA  1  2  3  4

So in our example, we have aligned from the left, so it starts from the current location and looks forward to the next 2 values.

Lastly we filter by the appropriate values, and take the first observation of each group.



回答2:

Base approach:

Use rle to find sequences of 3 or more and grab the first one

df <- data.frame(MRN = c('001','001','001','001','002','002','002','002','003','003','003','003','004','004','004','004'), Collected_Date = as.Date(c('01-02-2015','01-03-2015','01-04-2015','01-05-2015', '03-03-2015','03-05-2015','03-06-2015','03-07-2015', '08-02-2015','08-03-2015','08-04-2015','08-05-2015', '01-02-2014','01-03-2014','01-04-2014','01-05-2014'), format = '%m-%d-%Y'), ANC = as.numeric(c('0.345','0.532','0.843','0.932', '0.012','0.022','0.543','0.563', '0.343','0.500','0.734','0.455', '0.001','0.500','0.562','0.503')))

df[as.logical(with(df, ave(ANC, MRN, FUN = function(x)
   cumsum(x >= .5 & with(rle(x >= .5), rep(lengths, lengths)) >= 3) == 1))), ]

#    MRN Collected_Date   ANC 
# 2  001     2015-01-03 0.532
# 14 004     2014-01-03 0.500

Maybe this version is easier to understand

df[as.logical(with(df, ave(ANC, MRN, FUN = function(x) {
     r <- rle(x >= .5)
     r <- rep(r$lengths, r$lengths)
     cumsum(r == 3 & x >= .5) == 1
    }))), ]

edit

df <- df[c(1:4,4,4,4,5,5,5,5:16), ]
df[as.logical(with(df, ave(ANC, MRN, FUN = function(x)
  cumsum(x >= .5 & with(rle(x >= .5), rep(lengths, lengths)) >= 3) == 1))), ]

#    MRN Collected_Date   ANC
# 2  001     2015-01-03 0.532
# 14 004     2014-01-03 0.500


回答3:

We can create an auxiliary function which given a vector x returns a vector indicating the number of consecutive values above a given threshold:

high_run <- function(x, threshold) {
    high <- x >= threshold
    streak <- high[1]
    for(h in high[2:length(high)]){
        streak <- c(streak, streak[length(streak)]*h + h)
    }
    run
}

as well as a function which returns the starting index of the first run of a particular length:

high_run_start <- function(x, threshold, run){
    match(run, high_run(x, threshold)) - run + 1
}

We can then use this latter function to select the appropriate rows of the original dataframe:

> df %>% group_by(MRN) %>%
+ filter(row_number()==high_run_start(ANC,0.5,3))
Source: local data frame [2 x 3]
Groups: MRN

  MRN Collected_Date   ANC
1 001     2015-01-03 0.532
2 004     2014-01-03 0.500


回答4:

Here's a ddply solution (sorry, I'm not up-to-date with the %>% syntax, but perhaps it could also be applied).

I'm unsure if it's "elegant" in the sense that you mean, but it will make sense upon reading it a second time (which to me is more important than a one-liner), and is robust to missing dates etc.

The key is to use rle (run length encoding) to look for 'runs' of ANC >= 0.5 where the run is at least length 3. This takes care of the 'consecutive' part. we save this into r.

Then r.i gives the index in the first run that is of length 3 or more, and where the value of the run is TRUE.

To get the index in x you just sum the run lengths up to but not including the run we are interested in, and add 1 to get to the start (that's the sum(r$lengths[1:(r.i - 1)]) and the +1).

ddply(df,
.(MRN),
function (x) {
    r <- rle(x$ANC >= 0.5) # find 'runs' of x$ANC >= 0.5
    # find index of first run of length >=3 with ANC >= .5
    r.i <- which(r$lengths >= 3 & r$values)[1] 
    if (!is.na(r.i)) {
        # get index of first row in that run and return it.
        return(x[sum(r$lengths[seq_len(r.i - 1)]) + 1, ])
    }
    return(NULL)
})

It will make better sense if you extract e.g. x <- subset(df, MRN == '001') and step through to see what r, r.i look like.