Checking if value in vector is in range of values

Posted 2019-06-28 04:24

Question:

This question already has an answer here: Overlap join with start and end positions (3 answers)

I'm working in R and have a large data frame with a column of genome positions like this:

2655180
2657176
2658869 

And a second data frame that has a range of positions and a gene, like this:

chr1    100088228   100162167   AGL
chr1    107599438   107600565   PRMT6
chr1    115215635   115238091   AMPD1
chr1    11850637    11863073    MTHFR
chr1    119958143   119965343   HSD3B2
chr1    144124628   144128902   HFE2
chr1    150769175   150779181   CTSK
chr1    154245300   154248277   HAX1
chr1    155204686   155210803   GBA
chr1    156084810   156108997   LMNA

The second and third columns are the start and end of the gene, respectively. I want to check whether each position in the first data frame falls within one of the ranges in the second data frame and, if so, add the gene (column 4 of the second data frame) to the first data frame.

My current implementation uses nested for loops to check each entry in the first dataframe against all entries in the second dataframe. Are there any R functions that could help me with accomplishing this task?

In short: I need to check if a value in a row in a first vector is within a range specified in a differently sized second vector and then extract a value from the second vector.

Answer 1:

Using dplyr:

library(dplyr)

# Return the gene(s) whose range [V2, V3] contains position x
getValue <- function(x, data) {
  tmp <- data %>%
    filter(V2 <= x, x <= V3)
  return(tmp$V4)
}

x <- c(107599440, 150769180, 155204690)
sapply(x, getValue, data=df)

Which returns:

[1] "PRMT6" "CTSK"  "GBA" 

Note: I copied your data into a dataframe df that has column names V1, V2, V3, and V4. The columns V2 and V3 are the lower and upper values of the range.

df <- read.table(text="chr1    100088228   100162167   AGL
chr1    107599438   107600565   PRMT6
chr1    115215635   115238091   AMPD1
chr1    11850637    11863073    MTHFR
chr1    119958143   119965343   HSD3B2
chr1    144124628   144128902   HFE2
chr1    150769175   150779181   CTSK
chr1    154245300   154248277   HAX1
chr1    155204686   155210803   GBA
chr1    156084810   156108997   LMNA", stringsAsFactors=FALSE)

Update:

In case of multiple matches, this will return the first match:

getValue <- function(x, data) {
  tmp <- data %>%
    filter(V2 <= x, x <= V3) %>%
    filter(row_number() == 1)
  return(tmp$V4)
}

There are multiple ranking functions. Check out ?row_number for more info.
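
One caveat worth sketching: if a position falls outside every range, tmp$V4 is empty, and sapply then returns a list rather than a character vector. Below is a minimal base R variation (not dplyr) that returns NA in that case. The helper name getValueOrNA and the three-row stand-in for df are hypothetical; only the V2/V3/V4 column layout is taken from the answer.

```r
# Small stand-in for df, using the same column layout (V1..V4) as above.
df <- data.frame(
  V1 = "chr1",
  V2 = c(107599438, 150769175, 155204686),
  V3 = c(107600565, 150779181, 155210803),
  V4 = c("PRMT6", "CTSK", "GBA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first matching gene, or NA when no range contains x.
getValueOrNA <- function(x, data) {
  hits <- data$V4[data$V2 <= x & x <= data$V3]
  if (length(hits) == 0) NA_character_ else hits[1]
}

# The second position (taken from the question) lies in none of the ranges.
x <- c(107599440, 2655180, 155204690)

# vapply enforces exactly one character value per position.
res <- vapply(x, getValueOrNA, character(1), data = df)
print(res)
```

Because every call returns exactly one value, the result has the same length as x and can be assigned directly as a new column.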



Answer 2:

Here you go. This answer depends on the assumptions discussed in the comments, namely that the ranges neither overlap nor abut one another.

d <- read.table(text='chr1    100088228   100162167   AGL
chr1    107599438   107600565   PRMT6
chr1    115215635   115238091   AMPD1
chr1    11850637    11863073    MTHFR
chr1    119958143   119965343   HSD3B2
chr1    144124628   144128902   HFE2
chr1    150769175   150779181   CTSK
chr1    154245300   154248277   HAX1
chr1    155204686   155210803   GBA
chr1    156084810   156108997   LMNA')

# Since your original vector does not contain positions 
# that are in any of the ranges in your second data.frame, 
# I chose new values and noted in a comment the range each should belong to.
v <- read.table(text="
119958153 # HSD3B2
154245310 # HAX1
156084820 # LMNA")

# order the first data.frame by the ranges
d <- d[order(d[[2]]), ]

# create a vector breaks from the interval ranges
breaks <- as.vector(do.call(rbind, d[c(2,3)]))
ints <- ceiling(findInterval(v[[1]], breaks)/2)

v$gene <- d[ints, 4]
#          V1   gene
# 1 119958153 HSD3B2
# 2 154245310   HAX1
# 3 156084820   LMNA
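
Note that the findInterval approach assigns a gene even to a position that sits in the gap between two ranges, because the interval index is rounded up to the preceding range. Here is a rough sketch of one way to guard against that by checking each candidate row against its own start and end. The toy ranges, the column names start/end/gene, and the variable names are made up for illustration.

```r
# Toy ranges (hypothetical values), sorted by start and free of overlaps.
d <- data.frame(
  start = c(10, 30, 50),
  end   = c(20, 40, 60),
  gene  = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
d <- d[order(d$start), ]

# 25 lies in a gap, 5 is before the first range, 70 is after the last one.
pos <- c(15, 25, 40, 5, 70)

# Interleave starts and ends: start1, end1, start2, end2, ...
breaks <- as.vector(rbind(d$start, d$end))

# Candidate row for each position (0 if it is before the first start).
k <- ceiling(findInterval(pos, breaks) / 2)
k_safe <- pmax(k, 1)  # avoid indexing with 0

# Keep the candidate only if the position really lies within [start, end].
inside <- k > 0 & pos >= d$start[k_safe] & pos <= d$end[k_safe]
gene <- ifelse(inside, d$gene[k_safe], NA_character_)
print(gene)
```

Positions outside every range come back as NA instead of silently inheriting the neighbouring gene.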


Answer 3:

I realize you asked for a function, but here's a way that doesn't need nested loops, using some fake data.

x <- c(1:3, 6:9)                 # Create a vector with values 1 to 3, and 6 to 9
y <- 1:5                         # Create a vector with values 1 to 5

inrange <- numeric(length(x))    # Create a result vector the same length as x
for (i in seq_along(x)) {
    # 1 if x[i] is greater than/equal to the minimum of y and
    # less than/equal to the maximum of y, otherwise 0
    inrange[i] <- ifelse(x[i] >= min(y) & x[i] <= max(y), 1, 0)
}

"inrange" now holds 1 where the value of x is within the range of y, and 0 where it is not.

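Since comparison operators in R are vectorized, the same check can probably be done without any loop at all. A minimal sketch using the same fake data:

```r
x <- c(1:3, 6:9)   # values 1 to 3 and 6 to 9
y <- 1:5           # values 1 to 5

# Compare every element of x against the bounds of y in one step;
# as.integer() turns TRUE/FALSE into 1/0.
inrange <- as.integer(x >= min(y) & x <= max(y))
print(inrange)
```

This tests against a single overall range (min(y) to max(y)), as the loop does; it does not handle several separate ranges.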


Answer 4:

Suppose v is your vector and df is the data frame with columns chr, start, stop, gene; then another simple base R solution is

sapply(v, function(v.element) df[v.element >= df$start & v.element <= df$stop, "gene"])
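
To illustrate what this returns, here is a small usage sketch. The three-row df and the values in v are stand-ins chosen for the example (the ranges are copied from the question's data), and lookup is a hypothetical name for the anonymous function above.

```r
# Stand-in data with the column names assumed in this answer.
df <- data.frame(
  chr   = "chr1",
  start = c(107599438, 150769175, 155204686),
  stop  = c(107600565, 150779181, 155210803),
  gene  = c("PRMT6", "CTSK", "GBA"),
  stringsAsFactors = FALSE
)

# The last position falls in none of the ranges.
v <- c(107599440, 150769180, 2655180)

# Hypothetical named version of the anonymous function.
lookup <- function(v.element) {
  df[v.element >= df$start & v.element <= df$stop, "gene"]
}

# Every position matches exactly one range: sapply simplifies to a vector.
matched <- sapply(v[1:2], lookup)

# One position matches nothing: the results differ in length,
# so sapply falls back to returning a list.
mixed <- sapply(v, lookup)
print(matched)
print(lengths(mixed))
```

So the one-liner gives a plain character vector only when every position hits exactly one range; otherwise the list needs to be handled before it can be added as a column.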