I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
my objective is to subset data table dt by sampling one row for each letter a-h in column z.
OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z',
sample
1 row from the sequence of rows per group, get the row index (.I
), extract the column with the row index ($V1
) and use that to subset the rows of 'dt'.You can use dplyr
I think that shuffling the data.table row-wise and then applying
unique(...,by)
could also work. Groups are formed withby
and the previous shuffling trickles down inside each group:Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun ' solution seems to give the same grouping:
For sampling more than one row per group check here