I'm working in R and have a large data frame with a column of genome positions like this:
2655180
2657176
2658869
And a second data frame that has a range of positions and a gene, like this:
chr1 100088228 100162167 AGL
chr1 107599438 107600565 PRMT6
chr1 115215635 115238091 AMPD1
chr1 11850637 11863073 MTHFR
chr1 119958143 119965343 HSD3B2
chr1 144124628 144128902 HFE2
chr1 150769175 150779181 CTSK
chr1 154245300 154248277 HAX1
chr1 155204686 155210803 GBA
chr1 156084810 156108997 LMNA
Where the second and third columns are the start and end of the gene respectively. What I want to do is check if a row in the first data frame fits within the range of the second data frame and if so add the gene (column 4 of the second data frame) to the first data frame.
My current implementation uses nested for loops to check each entry in the first dataframe against all entries in the second dataframe. Are there any R functions that could help me with accomplishing this task?
In short: I need to check if a value in a row in a first vector is within a range specified in a differently sized second vector and then extract a value from the second vector.
Using dplyr:
library(dplyr)

# Return the gene(s) whose range [V2, V3] contains position x
getValue <- function(x, data) {
  tmp <- data %>%
    filter(V2 <= x, x <= V3)
  return(tmp$V4)
}
x <- c(107599440, 150769180, 155204690)
sapply(x, getValue, data=df)
Which returns:
[1] "PRMT6" "CTSK" "GBA"
Note: I copied your data into a data frame df that has column names V1, V2, V3, and V4. The columns V2 and V3 are the lower and upper values of the range.
df <- read.table(text="chr1 100088228 100162167 AGL
chr1 107599438 107600565 PRMT6
chr1 115215635 115238091 AMPD1
chr1 11850637 11863073 MTHFR
chr1 119958143 119965343 HSD3B2
chr1 144124628 144128902 HFE2
chr1 150769175 150779181 CTSK
chr1 154245300 154248277 HAX1
chr1 155204686 155210803 GBA
chr1 156084810 156108997 LMNA", stringsAsFactors=FALSE)
Update:
In case of multiple matches, this will return the first match:
getValue <- function(x, data) {
  tmp <- data %>%
    filter(V2 <= x, x <= V3) %>%
    filter(row_number() == 1)
  return(tmp$V4)
}
There are multiple ranking functions. Check out ?row_number for more info.
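One case the functions above do not cover is a position that falls in no range at all: the filter then returns zero rows, and sapply gives back a list instead of a character vector. A minimal base R sketch of one way to handle that, assuming the same V2/V3/V4 column layout; the helper name getValueOrNA is hypothetical and not part of dplyr:

```r
# Three of the ranges from the question (V2 = start, V3 = end, V4 = gene)
df <- read.table(text = "chr1 107599438 107600565 PRMT6
chr1 150769175 150779181 CTSK
chr1 155204686 155210803 GBA", stringsAsFactors = FALSE)

# Hypothetical helper: first matching gene, or NA when nothing matches
getValueOrNA <- function(x, data) {
  hits <- data$V4[data$V2 <= x & x <= data$V3]
  if (length(hits) == 0) NA_character_ else hits[1]
}

# The second position (taken from the question) lies outside every range
x <- c(107599440, 2655180, 155204690)

# vapply guarantees a character vector with one entry per position
genes <- vapply(x, getValueOrNA, character(1), data = df)
genes
# [1] "PRMT6" NA      "GBA"
```

Because every position yields exactly one value, the result can be assigned straight to a new column of the first data frame.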
Here you go. This answer relies on the assumptions discussed in the comments, namely that the ranges neither overlap nor abut one another.
d <- read.table(text='chr1 100088228 100162167 AGL
chr1 107599438 107600565 PRMT6
chr1 115215635 115238091 AMPD1
chr1 11850637 11863073 MTHFR
chr1 119958143 119965343 HSD3B2
chr1 144124628 144128902 HFE2
chr1 150769175 150779181 CTSK
chr1 154245300 154248277 HAX1
chr1 155204686 155210803 GBA
chr1 156084810 156108997 LMNA')
# Since your original vector does not contain positions
# that are in any of the ranges in your second data.frame,
# I chose new values and noted in a comment the range each should belong to.
v <- read.table(text="
119958153 # HSD3B2
154245310 # HAX1
156084820 # LMNA")
# order the first data.frame by the ranges
d <- d[order(d[[2]]), ]
# create a vector breaks from the interval ranges
breaks <- as.vector(do.call(rbind, d[c(2,3)]))
ints <- ceiling(findInterval(v[[1]], breaks)/2)
v$gene <- d[ints, 4]
# V1 gene
# 1 119958153 HSD3B2
# 2 154245310 HAX1
# 3 156084820 LMNA
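The findInterval approach above interleaves starts and ends into one sorted vector, so an odd interval index means the position is inside a range, while an even index means it sits on an end point or in a gap after a range. Taking ceiling(index / 2) alone would assign a gap position to the preceding gene. The sketch below, on made-up toy ranges (all names and values here are illustrative assumptions), shows one way to return NA for such positions:

```r
# Toy ranges: sorted, non-overlapping, not touching (illustrative values)
d <- data.frame(start = c(10, 30, 50),
                end   = c(20, 40, 60),
                gene  = c("A", "B", "C"),
                stringsAsFactors = FALSE)

# Interleave starts and ends: 10 20 30 40 50 60
breaks <- as.vector(rbind(d$start, d$end))

# 15: inside A, 25: gap, 35: inside B, 60: on the end of C, 5: before all ranges
v <- c(15, 25, 35, 60, 5)
idx <- findInterval(v, breaks)   # 1 2 3 6 0

# Inside a range if the index is odd, or if the value sits exactly on an end point
inside <- idx %% 2 == 1 | (idx > 0 & v == breaks[pmax(idx, 1)])

# Row of the candidate range; pmax avoids a zero index for values before all ranges
row <- pmax(ceiling(idx / 2), 1)
gene <- ifelse(inside, d$gene[row], NA_character_)
gene
# [1] "A" NA  "B" "C" NA
```

The ranges are treated as closed on both sides here, matching the `<=` comparisons used elsewhere in this thread.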
I realize you asked for a function, but here's a way that doesn't need nested loops, using some fake data.
x <- c(1:3, 6:9)               # Create a vector with values 1 to 3, and 6 to 9
y <- 1:5                       # Create a vector with values 1 to 5
inrange <- numeric(length(x))  # Create an empty vector the same length as x
for (i in seq_along(x)) {
  # Set inrange[i] to 1 if x[i] is greater than/equal to the minimum
  # and less than/equal to the maximum of y, otherwise to 0
  inrange[i] <- ifelse(x[i] >= min(y) & x[i] <= max(y), 1, 0)
}
"inrange" now takes on a value of 1 where the value of x is within the range of y, and 0 where it is not.
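Since comparison operators in R are vectorized, the same check can probably be written without any loop. A minimal sketch on the same fake data:

```r
x <- c(1:3, 6:9)   # values 1 to 3, and 6 to 9
y <- 1:5           # values 1 to 5

# Compare every element of x against the range of y in one step;
# as.integer turns TRUE/FALSE into 1/0
inrange <- as.integer(x >= min(y) & x <= max(y))
inrange
# [1] 1 1 1 0 0 0 0
```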
Suppose v is your vector and df the data frame with columns chr, start, stop, gene; then another simple base R solution is:
sapply(v, function(v.element) df[v.element >= df$start & v.element <= df$stop, "gene"])
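Note that sapply returns a list rather than a vector when some position matches no range or more than one. As a rough alternative, the comparison can be done for all positions and ranges at once with outer; this is only a sketch, it builds a length(v) by nrow(df) logical matrix, so it may not suit very large inputs. The data below is a small subset of the question's ranges, with the column names assumed above:

```r
df <- data.frame(chr   = "chr1",
                 start = c(107599438, 150769175, 155204686),
                 stop  = c(107600565, 150779181, 155210803),
                 gene  = c("PRMT6", "CTSK", "GBA"),
                 stringsAsFactors = FALSE)
v <- c(107599440, 2655180, 155204690)

# hits[i, j] is TRUE when position v[i] lies inside range j
hits <- outer(v, df$start, ">=") & outer(v, df$stop, "<=")

# Index of the first matching range per position, NA when there is none
first_hit <- apply(hits, 1, function(r) which(r)[1])

# Indexing with NA yields NA, so unmatched positions get NA
gene <- df$gene[first_hit]
gene
# [1] "PRMT6" NA      "GBA"
```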