R: subset a data frame based on conditions from an

Posted 2019-02-18 14:51

Here is a problem I am trying to solve. Say, I have two data frames like the following:

observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
    time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
    measurement = rnorm(100,5,7))

sampletimes <- data.frame(location = letters[1:20], 
    id = rep(1:4,5),
    time1 = rep(c(2,7,12,17,22), each=4), 
    time2 = rep(c(4,9,14,19,24), each=4))

They both contain a column named id, which links the data frames. I want the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to attach the appropriate location to each measurement.

I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.

However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.

For our example, the data frame I would hope to get back would be as follows:

id time measurement      location
1    3  10.5163892             a
2    3   5.5774119             b
3    3  10.5057060             c
4    3  14.1563179             d
1    8   2.2653761             e
2    8  -1.0905546             f
3    8  12.7434161             g
4    8  17.6129261             h
1   13  10.9234673             i
2   13   1.6974481             j
3   13  -0.3664951             k
4   13  13.8792198             l
1   18   6.5038847             m
2   18   1.2032935             n
3   18  15.0889469             o
4   18   0.8934357             p
1   23   3.6864527             q
2   23   0.2404074             r
3   23  11.6028766             s
4   23  20.7466908             t

2 Answers
Lonely孤独者°
#2 · 2019-02-18 15:10

Here's a proposal with merge:

# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
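# keep rows whose time lies strictly between time1 and time2;
# seq(4) selects the first four columns: id, time, measurement, location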
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]

The result:

    id time measurement location
11   1    3    7.086246        a
141  2    3    6.893162        b
251  3    3   16.052627        c
376  4    3   -6.559494        d
47   1    8   11.506810        e
137  2    8   10.959782        f
267  3    8   11.079759        g
402  4    8   11.082015        h
83   1   13    5.584257        i
218  2   13   -1.714845        j
283  3   13  -11.196792        k
418  4   13    8.887907        l
99   1   18    1.656558        m
234  2   18   16.573179        n
364  3   18    6.522298        o
454  4   18    1.005123        p
125  1   23   -1.995719        q
250  2   23   -6.676464        r
360  3   23   10.514282        s
490  4   23    3.863357        t
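As a rough illustration of why this approach gets slow on large inputs, the sketch below reruns the merge on the question's toy data and counts rows at each step. The set.seed() call and the variable name keep are additions for this example only; they are not part of the original answer.

```r
# Toy data from the question; the seed is an assumption added for repeatability
set.seed(1)
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
    time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
    measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
    id = rep(1:4,5),
    time1 = rep(c(2,7,12,17,22), each=4),
    time2 = rep(c(4,9,14,19,24), each=4))

# merging on id alone pairs every observation with every interval of that id
dat <- merge(observations, sampletimes, by = "id")

# logical index of the rows that fall strictly inside their interval
keep <- dat$time > dat$time1 & dat$time < dat$time2

nrow(observations)  # 100 input rows
nrow(dat)           # 500 rows after the merge (25 observations x 5 intervals x 4 ids)
sum(keep)           # 20 rows survive the filter
```

The intermediate data frame grows with the number of intervals per id, so with millions of observations most of the time and memory goes into rows that are discarded right away.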
走好不送
#3 · 2019-02-18 15:26

Not efficient, but it does the job:

 subset(merge(observations,sampletimes), time > time1 & time < time2)
        id time measurement location time1 time2
    11   1    3    3.180321        a     2     4
    47   1    8    6.040612        e     7     9
    83   1   13   -5.999317        i    12    14
    99   1   18    2.689414        m    17    19
    125  1   23   12.514722        q    22    24
    137  2    8    4.420679        f     7     9
    141  2    3   11.492446        b     2     4
    218  2   13    6.672506        j    12    14
    234  2   18   12.290339        n    17    19
    250  2   23   12.610828        r    22    24
    251  3    3    8.570984        c     2     4
    267  3    8   -7.112291        g     7     9
    283  3   13    6.287598        k    12    14
    360  3   23   11.941846        s    22    24
    364  3   18   -4.199001        o    17    19
    376  4    3    7.133370        d     2     4
    402  4    8   13.477790        h     7     9
    418  4   13    3.967293        l    12    14
    454  4   18   12.845535        p    17    19
    490  4   23   -1.016839        t    22    24

EDIT

Since you have millions of rows, you should try a data.table solution:

library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
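A possible variation, not part of the original answer: data.table versions that support non-equi joins can apply the interval condition inside the join itself, so the full cartesian merge is never materialised. The sketch below is a minimal example under that assumption. It rebuilds the toy data with integer columns to keep the join types identical, and the names res and the seed are introduced only for illustration.

```r
library(data.table)

# Toy data from the question, rebuilt as data.tables with integer columns;
# the seed is an assumption added for repeatability
set.seed(1)
OBS <- data.table(id = rep(rep(1:4, each = 5), 5),
                  time = c(rep(1:5, 4), rep(6:10, 4), rep(11:15, 4),
                           rep(16:20, 4), rep(21:25, 4)),
                  measurement = rnorm(100, 5, 7))
SAM <- data.table(location = letters[1:20],
                  id = rep(1:4, 5),
                  time1 = rep(c(2L, 7L, 12L, 17L, 22L), each = 4),
                  time2 = rep(c(4L, 9L, 14L, 19L, 24L), each = 4))

# non-equi join: match on id, and require time1 < time < time2;
# x.time refers to the time column of OBS, location comes from SAM
res <- OBS[SAM, on = .(id, time > time1, time < time2), nomatch = 0L,
           .(id, time = x.time, measurement, location)]

# same ordering as the expected output in the question
setorder(res, time, id)
res
```

Whether this is faster than the merge followed by a filter depends on the data, so it is worth timing both on a realistic subset before committing to one.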