Each observations in my dataframe contains a different "before date" and "after date instance". The problem is some dates overlap for each ID. For instance, in the table below, ID's 1 and 4 contain overlapping date values.
ID before date after date
1 10/1/1996 12/1/1996
1 1/1/1998 9/30/2003
1 1/1/2000 12/31/2004
2 1/1/2001 3/31/2006
3 1/1/2001 9/30/2006
4 1/1/2001 9/30/2005
4 10/1/2004 12/30/2004
4 10/3/2004 11/28/2004
I am trying to get something like this:
ID before date after date
1 10/1/1996 12/1/1996
1 1/1/1998 12/31/2004
2 1/1/2001 3/31/2006
3 1/1/2001 9/30/2006
4 1/1/2001 9/30/2005
Basically, I would like to replace any overlapping date values with the date range of the values with the overlap, leave the non-overlapping values alone, and delete any unnecessary rows. Not sure how to go about doing this
Firstly, you should convert your string dates into
Date
-classed values, which will make comparison possible. Here's how I've defined and coerced your data:Now, my solution involves computing an "overlapping grouping" vector which I've called
og
. It makes the assumption that the inputdf
is ordered byID
and thenbefore.date
, which it is in your example data. If not, this could be achieved bydf[order(df$ID,df$before.date),]
. Here's how I computeog
:Unfortunately, the base R
cummax()
function doesn't work onDate
-classed objects, so I had to write acummax.Date()
shim. I'll explain the need for theave()
andcummax()
business at the end of the post.As you can see, the above computation lags the RHS of each of the two vectorized comparisons by excluding the first element via
[-1]
. This allows us to compare a record'sID
for equality with the following record'sID
, and also compare if itsafter.date
is after thebefore.date
of the following record. The resulting logical vectors are ANDed (&
) together. The negation of that logical vector then represents adjacent pairs of records that do not overlap, and thus we cancumsum()
the result (and prepend zero, as the first record must start with zero) to get our grouping vector.Finally, for the final piece of the solution, I've used
by()
to work with each overlapping group independently:Since all records in a group must have the same
ID
, and we've made the assumption that records are ordered bybefore.date
(after being ordered byID
, which is no longer relevant), we can get the correctID
andbefore.date
values from the first record in the group. That's why I started withg[1,]
. Then we just need to get the greatestafter.date
from the group viamax(g$after.date)
, and overwrite the first record'safter.date
with that, which I've done withtransform()
.A word about performance: The assumption about ordering aids performance, because it allows us to simply compare each record against the immediately following record via lagged vectorized comparisons, rather than comparing every record in a group with every other record.
Now, for the
ave()
andcummax()
business. I realized after writing the initial version of my answer that there was a flaw in my solution, which happens to not be exposed by your example data. Say there are three records in a group. If the first record has a range that overlaps with both of the following two records, and then the middle record does not overlap with the third record, then my (original) code would fail to identify that the third record is part of the same overlapping group of the previous two records.The solution is to not simply use the
after.date
of the current record when comparing against the following record, but instead use the cumulative maximumafter.date
within the group. If any earlier record sprawled completely beyond its immediately following record, then it obviously overlapped with that record, and itsafter.date
is what's important in considering overlapping groups for subsequent records.Here's a demonstration of input data that requires this fix, using your
df
as a base:Now record 6 overlaps with both records 7 and 8, but record 7 does not overlap with record 8. The solution still works:
Here's a proof that the
og
calculation would be wrong without theave()
/cummax()
fix:Minor adjustment to the solution, to overwrite
after.date
in advance of theog
computation, and avoid themax()
call (makes more sense if you're planning on overwriting the originaldf
with the new aggregation):