I wish to match two dataframes based on conditionals on more than one column but cannot figure out how. So if there are my data sets:
df1 <- data.frame(lower=c(0,5,10,15,20), upper=c(4,9,14,19,24), x=c(12,45,67,89,10))
df2 <- data.frame(age=c(12, 14, 5, 2, 9, 19, 22, 18, 23))
I wish to match age from df2 that falls into the range between lower and upper in df1 with the aim to add an extra column to df2 containing the value of x in df1 where age lies between upper and lower. i.e. I want df2 to look like
age x
12 67
14 67
5 45
....etc.
How can I achieve such a match ?
I would go with a simple sapply
and a "anded" condition in the df1$x
selection like this:
df2$x <- sapply( df2$age, function(x) { df1$x[ x >= df1$lower & x <= df1$upper ] })
which gives:
> df2
age x
1 12 67
2 14 67
3 5 45
4 2 12
5 9 45
6 19 89
7 22 10
8 18 89
9 23 10
For age 12 for example the selection inside the brackets gives:
> 12 >= df1$lower & 12 <= df1$upper
[1] FALSE FALSE TRUE FALSE FALSE
So getting df1$x
by this logical vector is easy as your ranges don't overlap
Using foverlaps
from data.table
is what you are looking for:
library(data.table)
setDT(df1)
setDT(df2)[,age2:=age]
setkey(df1,lower,upper)
foverlaps(df2, df1, by.x = names(df2),by.y=c("lower","upper"))[,list(age,x)]
# age x
# 1: 12 67
# 2: 14 67
# 3: 5 45
# 4: 2 12
# 5: 9 45
# 6: 19 89
# 7: 22 10
# 8: 18 89
# 9: 23 10
Here's another vectorized approach using findInterval
on a melted data set
library(data.table)
df2$x <- melt(setDT(df1), "x")[order(value), x[findInterval(df2$age, value)]]
# age x
# 1 12 67
# 2 14 67
# 3 5 45
# 4 2 12
# 5 9 45
# 6 19 89
# 7 22 10
# 8 18 89
# 9 23 10
The idea here is to
- First, tidy up you data so
lower
and upper
will be in the same column and x
will have corresponding values to that new column,
- Then, sort the data according to these ranges (necessary for
findInterval
).
- Finally, run
findInterval
within the x
column in order to find the correct incidences
And here's a possible dplyr
/tidyr
version
library(tidyr)
library(dplyr)
df1 %>%
gather(variable, value, -x) %>%
arrange(value) %>%
do(data.frame(x = .$x[findInterval(df2$age, .$value)])) %>%
cbind(df2, .)
# age x
# 1 12 67
# 2 14 67
# 3 5 45
# 4 2 12
# 5 9 45
# 6 19 89
# 7 22 10
# 8 18 89
# 9 23 10