Question:

Suppose I have a matrix in R as follows:

ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...

What I need is a random sample in which every ID is represented once and only once.

That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...

There can be more than two duplicates.

What is the most R-esque way to do this without subsetting and then sampling each subset?

Thanks!

Answer 1:

Use tapply over the row names and draw a sample of size 1 within each ID group:

dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]

#  ID Value
#1  1    10
#3  2     8
#4  3    15
#6  4     9

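If your data is truly a matrix and not a data.frame, index by row position instead. A matrix has no $ accessor and, by default, no row names, so select the ID column with [,"ID"] and convert the sampled positions back to integers:

dat[as.integer(tapply(as.character(seq_len(nrow(dat))),dat[,"ID"],FUN=sample,1)),]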

Don't be tempted to remove the as.character: when sample is passed a single number n (n >= 1), it samples from 1:n instead of returning n, so an ID group with only one row could yield the wrong row index. E.g.

replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
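
The single-value pitfall can also be avoided by sampling positions rather than values. The following is a minimal sketch modeled on the resample pattern from the examples in ?sample; the helper name resample and the sample data are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper: always samples from the elements of x,
# even when x is a single number.
resample <- function(x, ...) x[sample.int(length(x), ...)]

# Sample data matching the question.
dat <- data.frame(ID = c(1, 2, 2, 3, 4, 4),
                  Value = c(10, 5, 8, 15, 7, 9))

# Draw one row position per ID group; no character conversion is needed.
idx <- tapply(seq_len(nrow(dat)), dat$ID, FUN = resample, 1)

# Strip the array attributes before indexing the rows.
picked <- dat[as.integer(idx), ]
```

Because resample indexes into x, a group holding the single position 4 always returns 4, whereas sample(4, 1) would return any of 1 to 4.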

Alternatively, with dplyr (in dplyr 1.0.0 and later, sample_n is superseded by slice_sample(n = 1)):

library(dplyr)
df %>% group_by(ID) %>% sample_n(1)

Another approach is to reorder the rows randomly and then drop duplicated IDs in that order, which keeps the first (randomly placed) row of each ID.

df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)

df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
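
The shuffle-then-deduplicate idea can be wrapped in a small function. This is only a sketch: the function name sample_one_per_id and its id argument are hypothetical, and the final re-sorting by ID is an optional extra. It assumes the ID column is addressable by name, which holds for a data.frame and for a matrix with column names.

```r
# Hypothetical wrapper: keeps one randomly chosen row per ID.
sample_one_per_id <- function(d, id = "ID") {
  # Put the rows in a uniformly random order.
  shuffled <- d[sample.int(nrow(d)), , drop = FALSE]
  # Keep only the first occurrence of each ID in that random order.
  out <- shuffled[!duplicated(shuffled[, id]), , drop = FALSE]
  # Optional: restore ascending ID order.
  out[order(out[, id]), , drop = FALSE]
}

df <- read.table(text = "ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header = TRUE)

res <- sample_one_per_id(df)
```

After a uniform shuffle, every row of an ID group is equally likely to come first, so each row is picked with probability one over its group size.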