Range overlap/intersect by group and between years

2019-07-02 21:05发布

问题:

I have a list of marked individuals (column Mark) which have been captured various years (column Year) within a range of the river (LocStart and LocEnd). Location on the river is in meters.

I would like to know if a marked individual has used overlapping range between years i.e. if the individual has gone to the same segment of the river from year to year.

Here is an example of the original data set:

IDMark YearLocStartLocEnd
11081199221,72922,229
21081199221,20321,703
31081200521,50822,008
41126199419,22219,522
51126199418,81119,311
61283200521,75422,254
71283200722,02522,525

Here is what I would like the final answer to look like:

MarkYear1Year2IDs
1081199220051, 3
1081199220052, 3
1283200520076, 7

In this case, individual 1126 would not be in the final output as the only two ranges available were the same year. I realize it would be easy to remove all the records where Year1 = Year2.

I would like to do this in R and have looked into the >IRanges package but have not been able to consider the group = Mark and been able to extract the Year1 and Year2 information.

回答1:

Using foverlaps() function from data.table package:

require(data.table)
setkey(setDT(dt), Mark, LocStart, LocEnd)               ## (1)
olaps = foverlaps(dt, dt, type="any", which=TRUE)       ## (2)
olaps = olaps[dt$Year[xid] != dt$Year[yid]]             ## (3)
olaps[, `:=`(Mark  = dt$Mark[xid], 
             Year1 = dt$Year[xid],
             Year2 = dt$Year[yid],
             xid   = dt$ID[xid], 
             yid   = dt$ID[yid])]                       ## (4)
olaps = olaps[xid < yid]                                ## (5)
#    xid yid Mark Year1 Year2
# 1:   2   3 1081  1992  2005
# 2:   1   3 1081  1992  2005
# 3:   6   7 1283  2005  2007
  1. We first convert the data.frame to data.table by reference using setDT. Then, we key the data.table on columns Mark, LocStart and LocEnd, which will allow us to perform overlapping range joins.

  2. We calculate self overlaps (dt with itself) with any type of overlap. But we return matching indices here using which = TRUE.

  3. Remove all indices where Year corresponding to xid and yid are identical.

  4. Add all the other columns and replace xid and yid with corresponding ID values, by reference.

  5. Remove all indices where xid >= yid. If row 1 overlaps with row 3, then row 3 also overlaps with row 1. We don't need both. foverlaps() doesn't have a way to remove this by default yet.