可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a data frame with MRN, dates, and a test value.
I need to select all the first rows per MRN that have three consecutive values above 0.5.
This is an example version of the data:
MRN Collected_Date ANC
1 001 2015-01-02 0.345
2 001 2015-01-03 0.532
3 001 2015-01-04 0.843
4 001 2015-01-05 0.932
5 002 2015-03-03 0.012
6 002 2015-03-05 0.022
7 002 2015-03-06 0.543
8 002 2015-03-07 0.563
9 003 2015-08-02 0.343
10 003 2015-08-03 0.500
11 003 2015-08-04 0.734
12 003 2015-08-05 0.455
13 004 2014-01-02 0.001
14 004 2014-01-03 0.500
15 004 2014-01-04 0.562
16 004 2014-01-05 0.503
Example code:
df <- data.frame(MRN = c('001','001','001','001',
'002','002','002','002',
'003','003','003','003',
'004','004','004','004'),
Collected_Date = as.Date(c('01-02-2015','01-03-2015','01-04-2015','01-05-2015',
'03-03-2015','03-05-2015','03-06-2015','03-07-2015',
'08-02-2015','08-03-2015','08-04-2015','08-05-2015',
'01-02-2014','01-03-2014','01-04-2014','01-05-2014'),
format = '%m-%d-%Y'),
ANC = as.numeric(c('0.345','0.532','0.843','0.932',
'0.012','0.022','0.543','0.563',
'0.343','0.500','0.734','0.455',
'0.001','0.500','0.562','0.503')))
Currently, I am using a very awkward approach using the lag function to calculate the date difference, then filter for all values >= 0.5, and then sum up the values, which helps to select the date of the THIRD value. I then substract two days to get the date of the first value:
df %>% group_by(MRN) %>%
mutate(., days_diff = abs(Collected_Date[1] - Collected_Date)) %>%
filter(ANC >= 0.5) %>%
mutate(days = days_diff + lag((days_diff))) %>%
filter(days == 5) %>%
mutate(Collected_Date = Collected_Date - 2) %>%
select(MRN, Collected_Date)
Output:
Source: local data frame [2 x 2]
Groups: MRN
MRN Collected_Date
1 001 2015-01-03
2 004 2014-01-03
There must be a way simpler / more elegant way. Also, it does not give accurate results if there are gaps between the test dates.
My desired output for this example is:
MRN Collected_Date ANC
1 001 2015-01-03 0.532
2 004 2014-01-03 0.500
So if at least three consecutive test values are >= 0.5, the date of the FIRST value should be returned.
If there are not at least three consecutive values >= 0.5, NA should be returned.
Any help is greatly appreciated!
Thank you very much!
回答1:
The easiest way is to use the zoo
library in conjunction with dplyr
. Within the zoo
package there is a function called rollapply
, we can use this to calculate a function value for a window of time.
In this example, we could apply the window to calculate the minimum of the next three values, and then apply the logic specified.
df %>% group_by(MRN) %>%
mutate(ANC=rollapply(ANC, width=3, min, align="left", fill=NA, na.rm=TRUE)) %>%
filter(ANC >= 0.5) %>%
filter(row_number() == 1)
# MRN Collected_Date ANC
# 1 001 2015-01-03 0.532
# 2 004 2014-01-03 0.500
In the code above we have used rollapply
to calculate the minimum of the next 3 items. To see how this works compare the following:
rollapply(1:6, width=3, min, align="left", fill=NA) # [1] 1 2 3 4 NA NA
rollapply(1:6, width=3, min, align="center", fill=NA) # [1] NA 1 2 3 4 NA
rollapply(1:6, width=3, min, align="right", fill=NA) # [1] NA NA 1 2 3 4
So in our example, we have aligned from the left, so it starts from the current location and looks forward to the next 2 values.
Lastly we filter by the appropriate values, and take the first observation of each group.
回答2:
Base approach:
Use rle
to find sequences of 3 or more and grab the first one
df <- data.frame(MRN = c('001','001','001','001','002','002','002','002','003','003','003','003','004','004','004','004'), Collected_Date = as.Date(c('01-02-2015','01-03-2015','01-04-2015','01-05-2015', '03-03-2015','03-05-2015','03-06-2015','03-07-2015', '08-02-2015','08-03-2015','08-04-2015','08-05-2015', '01-02-2014','01-03-2014','01-04-2014','01-05-2014'), format = '%m-%d-%Y'), ANC = as.numeric(c('0.345','0.532','0.843','0.932', '0.012','0.022','0.543','0.563', '0.343','0.500','0.734','0.455', '0.001','0.500','0.562','0.503')))
df[as.logical(with(df, ave(ANC, MRN, FUN = function(x)
cumsum(x >= .5 & with(rle(x >= .5), rep(lengths, lengths)) >= 3) == 1))), ]
# MRN Collected_Date ANC
# 2 001 2015-01-03 0.532
# 14 004 2014-01-03 0.500
Maybe this version is easier to understand
df[as.logical(with(df, ave(ANC, MRN, FUN = function(x) {
r <- rle(x >= .5)
r <- rep(r$lengths, r$lengths)
cumsum(r == 3 & x >= .5) == 1
}))), ]
edit
df <- df[c(1:4,4,4,4,5,5,5,5:16), ]
df[as.logical(with(df, ave(ANC, MRN, FUN = function(x)
cumsum(x >= .5 & with(rle(x >= .5), rep(lengths, lengths)) >= 3) == 1))), ]
# MRN Collected_Date ANC
# 2 001 2015-01-03 0.532
# 14 004 2014-01-03 0.500
回答3:
We can create an auxiliary function which given a vector x
returns a vector indicating the number of consecutive values above a given threshold:
high_run <- function(x, threshold) {
high <- x >= threshold
streak <- high[1]
for(h in high[2:length(high)]){
streak <- c(streak, streak[length(streak)]*h + h)
}
run
}
as well as a function which returns the starting index of the first run of a particular length:
high_run_start <- function(x, threshold, run){
match(run, high_run(x, threshold)) - run + 1
}
We can then use this latter function to select the appropriate rows of the original dataframe:
> df %>% group_by(MRN) %>%
+ filter(row_number()==high_run_start(ANC,0.5,3))
Source: local data frame [2 x 3]
Groups: MRN
MRN Collected_Date ANC
1 001 2015-01-03 0.532
2 004 2014-01-03 0.500
回答4:
Here's a ddply
solution (sorry, I'm not up-to-date with the %>%
syntax, but perhaps it could also be applied).
I'm unsure if it's "elegant" in the sense that you mean, but it will make sense upon reading it a second time (which to me is more important than a one-liner), and is robust to missing dates etc.
The key is to use rle
(run length encoding) to look for 'runs' of ANC >= 0.5
where the run is at least length 3. This takes care of the 'consecutive' part. we save this into r
.
Then r.i
gives the index in the first run that is of length 3 or more, and where the value of the run is TRUE
.
To get the index in x
you just sum
the run lengths up to but not including the run we are interested in, and add 1 to get to the start (that's the sum(r$lengths[1:(r.i - 1)])
and the +1
).
ddply(df,
.(MRN),
function (x) {
r <- rle(x$ANC >= 0.5) # find 'runs' of x$ANC >= 0.5
# find index of first run of length >=3 with ANC >= .5
r.i <- which(r$lengths >= 3 & r$values)[1]
if (!is.na(r.i)) {
# get index of first row in that run and return it.
return(x[sum(r$lengths[seq_len(r.i - 1)]) + 1, ])
}
return(NULL)
})
It will make better sense if you extract e.g. x <- subset(df, MRN == '001')
and step through to see what r
, r.i
look like.