Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
                           time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
                           measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
                          id = rep(1:4,5),
                          time1 = rep(c(2,7,12,17,22), each=4),
                          time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named `id`, which links the data frames. I want to have the `measurement`s from `observations` for which `time` is between `time1` and `time2` from the `sampletimes` data frame. Additionally, I'd like to connect the appropriate `location` to each measurement.
I have successfully done this by converting my `sampletimes` to a wide format (i.e. all the `time1` and `time2` information in one row per entry for `id`), merging the two data frames by the `id` variable, and using conditional statements to take only instances when the `time` falls between at least one of the time intervals in the row, and then assigning `location` to the appropriate measurement.
However, I have around 2 million rows in `observations` and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per `id`.
For our example, the data frame I would hope to get back would be as follows:
id time measurement location
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with `merge`. It is not efficient, but it does the job.
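A minimal sketch of what a `merge`-based approach could look like, using the example data from the question. The `set.seed()` call and the names `merged` and `res` are my own additions for illustration, and I am assuming the interval bounds are exclusive, since the expected output only keeps times 3, 8, 13, 18 and 23:

```r
# Example data from the question (seed added only for reproducibility)
set.seed(1)
observations <- data.frame(
  id = rep(rep(c(1, 2, 3, 4), each = 5), 5),
  time = c(rep(1:5, 4), rep(6:10, 4), rep(11:15, 4), rep(16:20, 4), rep(21:25, 4)),
  measurement = rnorm(100, 5, 7)
)
sampletimes <- data.frame(
  location = letters[1:20],
  id = rep(1:4, 5),
  time1 = rep(c(2, 7, 12, 17, 22), each = 4),
  time2 = rep(c(4, 9, 14, 19, 24), each = 4)
)

# Pair every observation with every interval that shares its id
merged <- merge(observations, sampletimes, by = "id")

# Keep only the pairs where time lies strictly inside (time1, time2)
inside <- merged$time > merged$time1 & merged$time < merged$time2
res <- merged[inside, c("id", "time", "measurement", "location")]

# Sort as in the expected output and drop the leftover row names
res <- res[order(res$time, res$id), ]
rownames(res) <- NULL
res
```

The intermediate `merged` table holds one row per observation/interval pair for each `id` (500 rows for this example), which is why this approach gets slow and memory-hungry as `observations` grows.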
EDIT
Since you have millions of rows, you should give a `data.table` solution a try.
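One way this might be done in `data.table` is a non-equi join, which matches rows on `id` and on the interval condition in a single step while keeping the data in long format. This is a sketch under my own assumptions: the names `obs`, `st` and `res_dt` are illustrative, the bounds are treated as exclusive to match the expected output, and it needs a `data.table` version that supports non-equi joins (1.9.8 or later; `nomatch = NULL` needs 1.12.0 or later):

```r
library(data.table)

# Example data from the question, built directly as data.tables
set.seed(1)
obs <- data.table(
  id = rep(rep(c(1, 2, 3, 4), each = 5), 5),
  time = c(rep(1:5, 4), rep(6:10, 4), rep(11:15, 4), rep(16:20, 4), rep(21:25, 4)),
  measurement = rnorm(100, 5, 7)
)
st <- data.table(
  location = letters[1:20],
  id = rep(1:4, 5),
  time1 = rep(c(2, 7, 12, 17, 22), each = 4),
  time2 = rep(c(4, 9, 14, 19, 24), each = 4)
)

# Non-equi join: for each interval in st, find the observations with the
# same id whose time is strictly between time1 and time2.
# x.time refers to the original time column of obs; without the prefix the
# join column would show the interval bound instead.
res_dt <- obs[st,
              on = .(id, time > time1, time < time2),
              nomatch = NULL,
              .(id, time = x.time, measurement, location)]

# Sort as in the expected output
setorder(res_dt, time, id)
res_dt
```

Unlike the `merge` approach, this never materialises every observation/interval combination per `id`, so it should scale much better to a few million rows and to a variable number of intervals per `id`.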