from data table, randomly select one row per group

2019-01-12 04:53发布

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.

Let me propose a simple example:

require(data.table)

y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )

my objective is to subset data table dt by sampling one row for each letter a-h in column z.

3条回答
叛逆
2楼-- · 2019-01-12 05:00

OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.

dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
查看更多
啃猪蹄的小仙女
3楼-- · 2019-01-12 05:13

You can use dplyr

library(dplyr)

dt %>%
  group_by(z) %%
  sample_n(1)
查看更多
Viruses.
4楼-- · 2019-01-12 05:27

I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:

# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")

Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun ' solution seems to give the same grouping:

set.seed(2017)
dt <- data.table(c1 = sample(52*10^6), 
                 c2 = sample(LETTERS, replace = TRUE), 
                 c3 = sample(10^5, replace = TRUE), 
                 c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user  system elapsed 
# 13.87    0.49   14.33 

# @akrun' solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user  system elapsed 
# 11.89    0.10   12.01 

# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)], 
          y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE

For sampling more than one row per group check here

查看更多
登录 后发表回答