I'm working in R and have a large data frame with a column of genome positions, like this:
2655180
2657176
2658869
And a second data frame that has a range of positions and a gene name, like this:
chr1 100088228 100162167 AGL
chr1 107599438 107600565 PRMT6
chr1 115215635 115238091 AMPD1
chr1 11850637 11863073 MTHFR
chr1 119958143 119965343 HSD3B2
chr1 144124628 144128902 HFE2
chr1 150769175 150779181 CTSK
chr1 154245300 154248277 HAX1
chr1 155204686 155210803 GBA
chr1 156084810 156108997 LMNA
The second and third columns are the start and end of the gene, respectively. What I want to do is check whether each position in the first data frame falls within one of the ranges in the second data frame and, if so, add the gene (column 4 of the second data frame) to the first data frame.
My current implementation uses nested for loops to check each entry in the first dataframe against all entries in the second dataframe. Are there any R functions that could help me with accomplishing this task?
In short: I need to check if a value in a row in a first vector is within a range specified in a differently sized second vector and then extract a value from the second vector.
Here you go. This answer depends on the assumptions discussed in the comments, namely, that the ranges neither overlap nor abut one another.
Using dplyr:
Which returns:
Note: I copied your data into a data frame df that has column names V1, V2, V3, and V4. The columns V2 and V3 are the lower and upper values of the range.
Update: In case of multiple matches, this will return the first match:
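A minimal sketch of what a first-match lookup along these lines could look like, assuming dplyr 1.1.0 or later (needed for join_by() with between()). The positions data frame, its pos column, and the id helper column are hypothetical names, and the positions are made up so that two fall inside a range and one does not; df uses the V1 to V4 column names described in this answer.

```r
library(dplyr)

# Gene ranges with the column names V1..V4 (V2 = start, V3 = end).
df <- data.frame(
  V1 = c("chr1", "chr1", "chr1"),
  V2 = c(100088228, 107599438, 115215635),
  V3 = c(100162167, 107600565, 115238091),
  V4 = c("AGL", "PRMT6", "AMPD1"),
  stringsAsFactors = FALSE
)

# Hypothetical positions (not from the question): two hits and one miss.
positions <- data.frame(pos = c(100100000, 107600000, 2655180))

first_match <- positions %>%
  mutate(id = row_number()) %>%                          # keeps duplicate positions distinct
  left_join(df, by = join_by(between(pos, V2, V3))) %>%  # NA where no range contains pos
  group_by(id) %>%
  filter(row_number() == 1) %>%                          # first matching range per position
  ungroup() %>%
  select(pos, gene = V4)
```

Positions that fall in no range keep an NA gene, and "first" here means first in the row order of df.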
There are multiple ranking functions. Check out ?row_number for more info.
Suppose v is your vector and df the data frame with columns chr, start, stop, gene; then another simple solution in plain R is:
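One way such a plain R lookup might be written, as a sketch only: it assumes v and df are laid out as described (columns chr, start, stop, gene), the helper gene_for is a hypothetical name, and the values in v are made up to show two hits and one miss.

```r
# Gene ranges with the column names assumed in this answer.
df <- data.frame(
  chr   = c("chr1", "chr1", "chr1"),
  start = c(100088228, 107599438, 115215635),
  stop  = c(100162167, 107600565, 115238091),
  gene  = c("AGL", "PRMT6", "AMPD1"),
  stringsAsFactors = FALSE
)

# Hypothetical positions: two inside a range, one outside all ranges.
v <- c(100100000, 107600000, 2655180)

# Return the first gene whose [start, stop] range contains x, or NA if none does.
gene_for <- function(x) {
  hit <- df$gene[df$start <= x & x <= df$stop]
  if (length(hit) == 0) NA_character_ else hit[1]
}

genes <- vapply(v, gene_for, character(1))
```

This still compares every position against every range, but the inner comparison is vectorised, so only one explicit pass over v remains.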
I realize you asked for a function, but here's a way that doesn't need nested loops, using some fake data.
"inrange" now takes on a value of 1 if the value of x falls within a range in y, and 0 if it does not.
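A sketch of how an indicator like this could be built without explicit loops, using outer() on made-up data. The names x, y, lower, upper and hits are assumptions for illustration, not necessarily those of the original code.

```r
# Fake data: values to test, and a table of closed ranges [lower, upper].
x <- c(5, 12, 25)
y <- data.frame(lower = c(1, 20), upper = c(10, 30))

# outer() compares every x against every bound, giving a
# length(x)-by-nrow(y) logical matrix; TRUE where x lies in that range.
hits <- outer(x, y$lower, ">=") & outer(x, y$upper, "<=")

# 1 if x falls in at least one range, 0 otherwise.
inrange <- as.integer(rowSums(hits) > 0)
```

To recover which range matched rather than just whether one did, max.col(hits, ties.method = "first") gives the column index of the first TRUE in each row, though it is only meaningful where inrange is 1. Note that the matrix has length(x) * nrow(y) cells, so this can use a lot of memory on large inputs.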