Find which interval row in a data frame that each

Published 2019-01-09 04:30


Question:

I have a vector of numeric elements, and a dataframe with two columns that define the start and end points of intervals. Each row in the dataframe is one interval. I want to find out which interval each element in the vector belongs to.

Here's some example data:

    # Find which interval each element of the vector belongs in

    library(tidyverse)
    elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

    # tribble() is the current name for the deprecated frame_data()
    intervals <- tribble(~phase, ~start, ~end,
                         "a",     0,     0.5,
                         "b",     1,     1.9,
                         "c",     2,     2.5)

The same example data for those who object to the tidyverse:

elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

intervals <- structure(list(phase = c("a", "b", "c"), 
                            start = c(0, 1, 2), 
                            end = c(0.5, 1.9, 2.5)), 
                       .Names = c("phase", "start", "end"), 
                       row.names = c(NA, -3L), 
                       class = "data.frame")

Here's one way to do it:

    library(intrval) 
    phases_for_elements <- 
    map(elements, ~.x %[]% data.frame(intervals[, c('start', 'end')])) %>% 
      map(., ~unlist(intervals[.x, 'phase']))

Here's the output:

    [[1]]
    phase 
      "a" 

    [[2]]
    phase 
      "a" 

    [[3]]
    phase 
      "a" 

    [[4]]
    character(0)

    [[5]]
    phase 
      "b" 

    [[6]]
    phase 
      "b" 

    [[7]]
    phase 
      "c"

But I'm looking for a simpler method with less typing. I've seen findInterval in related questions, but I'm not sure how I can use it in this situation.

Answer 1:

Here's a possible solution using the new "non-equi" joins in data.table (v>=1.9.8). While I doubt you'll like the syntax, it should be a very efficient solution.

Also, regarding findInterval, this function assumes continuity in your intervals, while this isn't the case here, so I doubt there is a straightforward solution using it.

library(data.table) #v1.10.0
setDT(intervals)[data.table(elements), on = .(start <= elements, end >= elements)]
#    phase start end
# 1:     a   0.1 0.1
# 2:     a   0.2 0.2
# 3:     a   0.5 0.5
# 4:    NA   0.9 0.9
# 5:     b   1.1 1.1
# 6:     b   1.9 1.9
# 7:     c   2.1 2.1

Regarding the above code, I find it pretty self-explanatory: join intervals and elements by the condition specified in the on argument. That's pretty much it.
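
To make the join condition concrete, here is a minimal base R sketch of what `start <= elements, end >= elements` selects. It is not how data.table implements the join; the names `hit`, `idx` and `phase_for_element` are made up for illustration, and the sketch assumes each element falls in at most one interval.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# hit[i, j] is TRUE when element i lies in the closed interval j
hit <- outer(elements, intervals$start, `>=`) &
       outer(elements, intervals$end,   `<=`)

# Index of the first matching interval per element, or NA when none matches
idx <- apply(hit, 1, function(row) which(row)[1])

phase_for_element <- intervals$phase[idx]
phase_for_element
#[1] "a" "a" "a" NA  "b" "b" "c"
```

This builds a full elements-by-intervals matrix, so it is only meant to show the logic on small inputs.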

There is one caveat here, though: start, end and elements should all be of the same type, so if one of them is integer, it should be converted to numeric first.
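
As a sketch of that conversion, assuming a variant of the example data in which `start` happens to be stored as integer (that variant, and the helper names `bounds`, `types_before` and `types_after`, are invented for illustration), the boundary columns could be coerced before the join like this:

```r
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0L, 1L, 2L),   # integer on purpose
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

bounds <- c("start", "end")
types_before <- vapply(intervals[bounds], typeof, character(1))

# Coerce both boundary columns to double so they match a double `elements`
intervals[bounds] <- lapply(intervals[bounds], as.numeric)

types_after <- vapply(intervals[bounds], typeof, character(1))
```

After this step both columns report `"double"` from `typeof()`, which is the same storage type as the `elements` vector in the question.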

Answer 2:

cut is possibly useful here.

out <- cut(elements, t(intervals[c("start","end")]))
levels(out)[c(FALSE,TRUE)]  <- NA
intervals$phase[out]
#[1] "a" "a" "a" NA  "b" "b" "c"

Answer 3:

David Arenburg's mention of non-equi joins was very helpful for understanding what general kind of problem this is (thanks!). I can see now that it's not implemented for dplyr. Thanks to this answer, I see that there is a fuzzyjoin package that can do it in the same idiom. But it's barely any simpler than my map solution above (though more readable, in my view), and doesn't hold a candle to thelatemail's cut answer for brevity.

For my example above, the fuzzyjoin solution would be

library(fuzzyjoin)
library(tidyverse)

fuzzy_left_join(data.frame(elements), intervals, 
                by = c("elements" = "start", "elements" = "end"), 
                match_fun = list(`>=`, `<=`)) %>% 
  distinct()

Which gives:

    elements phase start end
1      0.1     a     0   0.5
2      0.2     a     0   0.5
3      0.5     a     0   0.5
4      0.9  <NA>    NA    NA
5      1.1     b     1   1.9
6      1.9     b     1   1.9
7      2.1     c     2   2.5

Answer 4:

Just lapply works:

l <- lapply(elements, function(x){
    intervals$phase[x >= intervals$start & x <= intervals$end]
})

str(l)
## List of 7
##  $ : chr "a"
##  $ : chr "a"
##  $ : chr "a"
##  $ : chr(0) 
##  $ : chr "b"
##  $ : chr "b"
##  $ : chr "c"

or in purrr, if you purrrfurrr,

elements %>% 
    map(~intervals$phase[.x >= intervals$start & .x <= intervals$end]) %>% 
    # Clean up a bit. Shorter, but less readable: map_chr(~.x[1] %||% NA)
    map_chr(~ifelse(length(.x) == 0, NA, .x))
## [1] "a" "a" "a" NA  "b" "b" "c"

Answer 5:

Inspired by @thelatemail's cut solution, here is one using findInterval which still requires a lot of typing:

out <- findInterval(elements, t(intervals[c("start","end")]), left.open = TRUE)
out[!(out %% 2)] <- NA
intervals$phase[out %/% 2L + 1L]
#[1] "a" "a" "a" NA  "b" "b" "c"

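Caveat: cut (by default) and findInterval (with left.open = TRUE, as used here) treat the intervals as left-open, so an element equal to a start value is not matched. Therefore, solutions using cut and findInterval are not equivalent to Ben's using intrval, David's non-equi join using data.table, and my other solution using foverlaps.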
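
The difference only shows up for elements that sit exactly on a boundary, which the original example data does not include at a start value. A small base R sketch, using boundary values chosen for illustration (the names `via_cut` and `via_closed` are hypothetical), compares the cut approach with a plain closed-interval comparison:

```r
elements <- c(0, 0.5, 1, 1.9)   # all four sit exactly on a boundary
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# cut() with its default right = TRUE builds (lower, upper] bins
breaks <- as.vector(t(intervals[c("start", "end")]))
out <- cut(elements, breaks)
levels(out)[c(FALSE, TRUE)] <- NA   # drop the gaps between intervals
via_cut <- intervals$phase[out]

# Closed intervals: both end points count as inside
via_closed <- vapply(elements, function(x) {
  hit <- intervals$phase[x >= intervals$start & x <= intervals$end]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

via_cut
#[1] NA  "a" NA  "b"
via_closed
#[1] "a" "a" "b" "b"
```

The start values 0 and 1 are lost under cut, while the end values 0.5 and 1.9 are matched by both approaches.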

Answer 6:

Here is kind of a "one-liner" which (mis-)uses foverlaps from the data.table package, but David's non-equi join is still more concise:

library(data.table) #v1.10.0
foverlaps(data.table(start = elements, end = elements), 
          setDT(intervals, key = c("start", "end")))
#   phase start end i.start i.end
#1:     a     0 0.5     0.1   0.1
#2:     a     0 0.5     0.2   0.2
#3:     a     0 0.5     0.5   0.5
#4:    NA    NA  NA     0.9   0.9
#5:     b     1 1.9     1.1   1.1
#6:     b     1 1.9     1.9   1.9
#7:     c     2 2.5     2.1   2.1

Answer 7:

For completeness' sake, here is another way, using the intervals package:

library(tidyverse)
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

# tribble() and tibble() replace the deprecated frame_data() and data_frame()
intervalsDF <- 
  tribble(  ~phase, ~start, ~end,
            "a",     0,      0.5,
            "b",     1,      1.9,
            "c",     2,      2.5
  )

library(intervals)
library(rlist)

interval_overlap(
  Intervals(intervalsDF %>% select(-phase) %>% as.matrix, closed = c(TRUE, TRUE)),
  Intervals(tibble(start = elements, end = elements), closed = c(TRUE, TRUE))
) %>% 
  list.map(tibble(interval_index = .i, element_index = .)) %>% 
  do.call(what = bind_rows)

# A tibble: 6 × 2
#  interval_index element_index
#           <int>         <int>
#1              1             1
#2              1             2
#3              1             3
#4              2             5
#5              2             6
#6              3             7