How to conditionally partition observations into g

Posted 2019-09-26 06:06

Question:

I have the following input:

C1  C2
1   1
1   1
1   2
1   3
1   4
2   1
.   .

C1 and C2 are grouping variables, with C2 nested within C1. I'd like to build subgroups within each C1 group that have a minimum size of 2. The groups in C2 must not be split, and I'd like to end up with as many subgroups as possible. Manually, I would look at group C1 = 1, take subgroup C2 = 1 as one group (G = 1) because it already has two rows, and join subgroups 2, 3 and 4 into another (G = 2). The expected output would be (where G is the group I'm trying to create):

C1  C2  G
1   1   1
1   1   1
1   2   2
1   3   2
1   4   2
2   1   3
.   .   .

I hope it's clear what I mean. Any help is highly appreciated.

Answer 1:

Using:

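library(data.table)
# Within each C1, label rows in consecutive pairs; with an odd .N the last row joins the final pair.
# This pairs rows by position only, so it relies on the C2 blocks lining up with the pairs.
# rleid(C1, G) then turns the per-C1 labels into one running id that never spans two C1 groups.
setDT(mydf)[, G := {r <- rep(1:floor(.N/2), each = 2); if(length(r) != .N) c(r, tail(r,1)) else r}
            , by = C1
            ][, G := rleid(C1, G)][]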

you get:

    C1 C2 G
 1:  1  1 1
 2:  1  1 1
 3:  1  2 2
 4:  1  3 2
 5:  1  4 2
 6:  2  1 3
 7:  2  1 3
 8:  2  2 4
 9:  2  3 4
10:  2  4 4
11:  3  1 5
12:  3  2 5
13:  3  3 6
14:  3  4 6
15:  3  5 6
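
The constraints from the question can be checked mechanically: every G needs at least 2 rows, no C2 block may be spread over several G values, and no G may span two C1 groups. The following is a minimal sketch in base R, with the result table above retyped by hand; `check_grouping` is a hypothetical helper name, not a function from any package.

```r
# Result table from above, retyped as a plain data.frame
res <- data.frame(
  C1 = rep(1:3, each = 5),
  C2 = c(1, 1, 2, 3, 4,  1, 1, 2, 3, 4,  1, 2, 3, 4, 5),
  G  = c(1, 1, 2, 2, 2,  3, 3, 4, 4, 4,  5, 5, 6, 6, 6)
)

# check_grouping is a hypothetical helper, not part of any package
check_grouping <- function(d, min_size = 2) {
  # every group must contain at least min_size rows
  size_ok <- all(table(d$G) >= min_size)
  # every (C1, C2) block must map to exactly one G
  g_per_block <- tapply(d$G, interaction(d$C1, d$C2, drop = TRUE),
                        function(g) length(unique(g)))
  # every G must stay inside a single C1
  c1_per_g <- tapply(d$C1, d$G, function(x) length(unique(x)))
  c(min_size  = size_ok,
    c2_intact = all(g_per_block == 1),
    within_c1 = all(c1_per_g == 1))
}

check_grouping(res)
```

For the table above all three checks come back TRUE.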

Used data:

mydf <- structure(list(C1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), 
                       C2 = c(1L, 1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L)), 
                  .Names = c("C1", "C2"), class = "data.frame", row.names = c(NA, -15L))
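
The pairing in the answer works by row position, so a C2 block with three or more rows could end up in two different groups. If that matters, one possible alternative is to merge whole C2 blocks greedily. The sketch below uses base R only and rests on several assumptions that are not in the original post: rows are sorted by C1 and then C2, blocks may only be merged with their neighbours in that order, and a C1 group with fewer than `min_size` rows in total is left as one undersized group. `assign_groups` is a hypothetical name.

```r
# Sample data from the answer above
mydf <- data.frame(
  C1 = rep(1:3, each = 5),
  C2 = c(1, 1, 2, 3, 4,  1, 1, 2, 3, 4,  1, 2, 3, 4, 5)
)

# assign_groups is a hypothetical helper; assumes rows are sorted by C1, then C2
assign_groups <- function(d, min_size = 2) {
  G <- integer(nrow(d))
  next_id <- 0L
  for (idx in split(seq_len(nrow(d)), d$C1)) {
    blocks <- rle(d$C2[idx])$lengths   # sizes of the C2 blocks, in order
    ids <- integer(length(blocks))
    acc <- 0L                          # rows collected in the currently open group
    for (b in seq_along(blocks)) {
      if (acc == 0L) next_id <- next_id + 1L   # open a new group
      ids[b] <- next_id
      acc <- acc + blocks[b]
      if (acc >= min_size) acc <- 0L           # group is big enough: close it
    }
    # a leftover group that is too small is folded into the previous one
    if (acc > 0L && length(unique(ids)) > 1L) {
      ids[ids == next_id] <- next_id - 1L
      next_id <- next_id - 1L
    }
    G[idx] <- rep(ids, times = blocks)         # expand block ids back to rows
  }
  d$G <- G
  d
}

assign_groups(mydf)$G
```

For the sample data this gives 1 1 2 2 2 3 3 4 4 4 5 5 6 6 6, the same G column as in the question. Closing a group as soon as it reaches the minimum size is what keeps the number of groups as large as possible under the neighbour-merging assumption.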


Tags: r dplyr