Randomly draw rows from dataframe based on unique

Posted 2019-07-13 09:05

Question:

I have a dataframe with several descriptor variables (trt, individual, session). I want to randomly select a fraction of the possible trt x individual combinations while controlling for the session variable, so that no two sampled rows share a session number. Here is what my dataframe looks like:

trt <- c(rep(c(rep("A", 3), rep("B", 3), rep("C", 3)), 9))
individual <- rep(c("Bob", "Nancy", "Tim"), 27)
session <- rep(1:27, each = 3)
data <- rnorm(81, mean = 4, sd = 1)
df <- data.frame(trt, individual, session, data)
df
   trt individual session             data
1    A        Bob       1 3.72013685581385
2    A      Nancy       1 3.97225419000673
3    A        Tim       1 4.44714175686225
4    B        Bob       2 5.00024599458127
5    B      Nancy       2 3.43615965145765
6    B        Tim       2  6.7920094635501
7    C        Bob       3 4.36315054477571
8    C      Nancy       3 5.07117348146375
9    C        Tim       3 4.38503325758969
10   A        Bob       4 4.30677162933005
11   A      Nancy       4 1.89311687510669
12   A        Tim       4 3.09084920968413
13   B        Bob       5 3.10436190897144
14   B      Nancy       5 3.59454992439722
15   B        Tim       5 3.40778069131207
16   C        Bob       6 4.00171937800892
17   C      Nancy       6 0.14578811080644
18   C        Tim       6 4.20754733296227
19   A        Bob       7 3.69131009783284
20   A      Nancy       7  4.7025756891679
21   A        Tim       7 4.46196017363017
22   B        Bob       8 3.97573281432736
23   B      Nancy       8  4.5373185942686
24   B        Tim       8 2.40937847038141
25   C        Bob       9 4.57519884980087
26   C      Nancy       9 5.19143914630448
27   C        Tim       9 4.83144732833874
28   A        Bob      10 3.01769965527235
29   A      Nancy      10 5.17300616827746
30   A        Tim      10 4.65432284571663
31   B        Bob      11 4.50892032922527
32   B      Nancy      11 3.38082717995663
33   B        Tim      11 4.92022245677209
34   C        Bob      12 4.54149796547394
35   C      Nancy      12 3.21992774137179
36   C        Tim      12 3.74507360931023
37   A        Bob      13 3.39524949548056
38   A      Nancy      13 4.17518916890901
39   A        Tim      13 3.02932375225388
40   B        Bob      14 3.59660910672907
41   B      Nancy      14 2.08784850191654
42   B        Tim      14 3.98446125755258
43   C        Bob      15 4.01837496797085
44   C      Nancy      15 3.40610126858125
45   C        Tim      15 4.57107635588582
46   A        Bob      16 3.15839276840723
47   A      Nancy      16 2.19932140340504
48   A        Tim      16 4.77588798035668
49   B        Bob      17  4.3524768657397
50   B      Nancy      17 4.49071625925856
51   B        Tim      17 4.02576463486266
52   C        Bob      18 3.74783360762117
53   C      Nancy      18 2.84123227236184
54   C        Tim      18  3.2024114782253
55   A        Bob      19 4.93837445490921
56   A      Nancy      19  4.7103051496802
57   A        Tim      19 6.22083635045134
58   B        Bob      20  4.5177747677824
59   B      Nancy      20 1.78839270771153
60   B        Tim      20 5.07140678136995
61   C        Bob      21 3.47818616035335
62   C      Nancy      21 4.28526474048439
63   C        Tim      21 4.22597602946575
64   A        Bob      22 1.91700925257901
65   A      Nancy      22 2.96317997587458
66   A        Tim      22 2.53506974227672
67   B        Bob      23 5.52714403395316
68   B      Nancy      23  3.3618513551059
69   B        Tim      23 4.85869007113978
70   C        Bob      24  3.4367068543959
71   C      Nancy      24 4.47769879000349
72   C        Tim      24 5.77340483757836
73   A        Bob      25 4.78524317734622
74   A      Nancy      25 3.55373702554664
75   A        Tim      25 2.88541465503637
76   B        Bob      26 4.62885302019139
77   B      Nancy      26 3.59430293369092
78   B        Tim      26 2.29610255924296
79   C        Bob      27 4.38433001299722
80   C      Nancy      27 3.77825207859976
81   C        Tim      27 2.12163194694365

How do I pull out 2 rows for each trt x individual combination, each with a unique session number? This is an example of what I want the dataframe to look like:

       trt individual session             data
    1    A        Bob       1 3.72013685581385
    5    B      Nancy       2 3.43615965145765
    7    C        Bob       3 4.36315054477571
    12   A        Tim       4 3.09084920968413
    15   B        Tim       5 3.40778069131207
    17   C      Nancy       6 0.14578811080644
    19   A        Bob       7 3.69131009783284
    29   A      Nancy      10 5.17300616827746
    31   B        Bob      11 4.50892032922527
    34   C        Bob      12 4.54149796547394
    39   A        Tim      13 3.02932375225388
    40   B        Bob      14 3.59660910672907
    47   A      Nancy      16 2.19932140340504
    51   B        Tim      17 4.02576463486266
    54   C        Tim      18  3.2024114782253
    59   B      Nancy      20 1.78839270771153
    71   C      Nancy      24 4.47769879000349
    81   C        Tim      27 2.12163194694365

I have tried a couple things with no luck.

I have tried to just randomly select two rows for each trt x individual combination, but I end up with duplicate session values:

library(data.table)
setDT(df)
df[ , .SD[sample(.N, 2)] , keyby = .(trt, individual)]
    trt individual session             data
 1:   A        Bob      25  2.7560788894668
 2:   A        Bob      19 4.12040841647523
 3:   A      Nancy       4 5.35362338127901
 4:   A      Nancy      19 5.51636882737692
 5:   A        Tim      19 5.10553640201998
 6:   A        Tim       1 2.77380671625473
 7:   B        Bob      23 3.50585105164409
 8:   B        Bob       8 3.58167259470814
 9:   B      Nancy      23 2.85301307507985
10:   B      Nancy       8 2.85179395539781
11:   B        Tim      26 2.40666507132474
12:   B        Tim      20 3.31276311351286
13:   C        Bob      24 3.19076007024549
14:   C        Bob       3 3.59146613276121
15:   C      Nancy       9 4.46606667880457
16:   C      Nancy      15 2.25405252536256
17:   C        Tim      12 4.43111661206133
18:   C        Tim      27 4.23868848646589

I have tried randomly selecting one row for each session number and then pulling 2 rows per trt x individual combination, but it typically comes back with an error since the random selection doesn't grab an equal number of trt x individual combinations:

ind <- sapply( unique(df$session ) , function(x) sample( which(df$session == x) , 1) )
df.unique <- df[ind, ]
df.sub <- df.unique[, .SD[sample(.N, 2)] , by = .(trt, individual)]
Error in `[.data.frame`(df.unique, , .SD[sample(.N, 2)], by = .(trt, individual)) : 
  unused argument (by = .(trt, individual))

Thanks in advance for your help!

Answer 1:

Perhaps there is a clever way to sample, but here's a straightforward idea to get you started in the meantime:

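library(data.table)
setDT(df)
setkey(df, session)  # key on session so .SD[!.(...)] can exclude by session

usedsessions = 0 # some value that's not a session number
df[, {
       # drop rows whose session is already taken, then draw 2 of the rest
       res = .SD[!.(usedsessions)][sample(.N, 2)]
       # remember the sessions just drawn so later groups skip them
       usedsessions = c(usedsessions, res$session)
       res
     }
   , by = .(trt, individual)]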
#    trt individual session     data
# 1:   A        Bob       7 4.256668
# 2:   A        Bob      25 2.431821
# 3:   A      Nancy      16 4.785859
# 4:   A      Nancy      19 4.865248
# 5:   A        Tim       4 3.303689
# 6:   A        Tim      13 3.550261
# 7:   B        Bob      26 3.987136
# 8:   B        Bob      17 3.283055
# 9:   B      Nancy      14 3.177226
#10:   B      Nancy       2 3.639542
#11:   B        Tim       8 2.168447
#12:   B        Tim       5 3.521123
#13:   C        Bob      21 3.284245
#14:   C        Bob      12 5.773098
#15:   C      Nancy      24 4.624428
#16:   C      Nancy       9 3.235467
#17:   C        Tim      18 4.001395
#18:   C        Tim      27 5.002110

You'll probably need to add corner-case handling (e.g. for when a group has fewer than two unused sessions left, so no valid sample exists).
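One way to handle that corner case is to detect the dead end and retry from scratch. The following is only a sketch in base R, not the method from the answer above: `sample_unique_sessions`, `n`, and `max_tries` are hypothetical names introduced here, and it assumes the column names `trt`, `individual`, and `session` from the question.

```r
# Hypothetical helper: greedily draw `n` rows per trt x individual group,
# never reusing a session; restart the whole draw if a group runs short.
sample_unique_sessions <- function(df, n = 2, max_tries = 100) {
  # row indices of each trt x individual combination
  groups <- split(seq_len(nrow(df)), list(df$trt, df$individual), drop = TRUE)
  for (attempt in seq_len(max_tries)) {
    used <- df$session[0]          # sessions taken so far in this attempt
    picked <- integer(0)           # row indices chosen so far
    ok <- TRUE
    for (rows in groups[sample(length(groups))]) {  # random group order
      avail <- rows[!df$session[rows] %in% used]    # rows with unused sessions
      avail <- avail[sample.int(length(avail))]     # shuffle the candidates
      avail <- avail[!duplicated(df$session[avail])]  # one row per session
      if (length(avail) < n) {     # corner case: not enough sessions left
        ok <- FALSE
        break
      }
      take <- avail[seq_len(n)]
      used <- c(used, df$session[take])
      picked <- c(picked, take)
    }
    if (ok) return(df[sort(picked), ])
  }
  stop("no valid sample found after ", max_tries, " tries")
}

# Usage with data shaped like the question's example
set.seed(1)
trt <- rep(rep(c("A", "B", "C"), each = 3), 9)
individual <- rep(c("Bob", "Nancy", "Tim"), 27)
session <- rep(1:27, each = 3)
df <- data.frame(trt, individual, session, data = rnorm(81, mean = 4, sd = 1))

res <- sample_unique_sessions(df)
# every combination appears twice and no session is repeated
stopifnot(all(table(res$trt, res$individual) == 2),
          anyDuplicated(res$session) == 0)
```

Retrying is a simple choice rather than an exhaustive one: a greedy draw can fail even when a valid sample exists, so hitting `max_tries` does not prove that no solution is possible.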