Randomly sample data frame into 3 groups in R

Posted 2020-02-11 07:17

Question:

Objective: Randomly divide a data frame into 3 samples.

  • one sample with 60% of the rows
  • two other samples, each with 20% of the rows
  • no row should appear in more than one sample (i.e. sample without replacement).

Here's a clunky solution:

allrows <- 1:nrow(mtcars)

set.seed(7)
trainrows <- sample(allrows, replace = F, size = 0.6*length(allrows))
test_cvrows <- allrows[-trainrows]
testrows <- sample(test_cvrows, replace=F, size = 0.5*length(test_cvrows))
cvrows <- test_cvrows[-which(test_cvrows %in% testrows)]

train <- mtcars[trainrows,]
test <- mtcars[testrows,]
cvr <- mtcars[cvrows,]

There must be something easier, perhaps in a package. dplyr has the sample_frac function, but that seems to target a single sample, not a split into multiple.

Close, but not quite the answer to this question: Random Sample with multiple probabilities in R

Answer 1:

Do you need the partitioning to be exact? If not,

set.seed(7)
ss <- sample(1:3,size=nrow(mtcars),replace=TRUE,prob=c(0.6,0.2,0.2))
train <- mtcars[ss==1,]
test <- mtcars[ss==2,]
cvr <- mtcars[ss==3,]

should do it.

Or, as @Frank says in comments, you can split() the original data to keep them as elements of a list:

mycars <- setNames(split(mtcars,ss), c("train","test","cvr"))
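To see what "not exact" means in practice, here is a quick sketch (the seeds 1 to 5 are arbitrary, picked only for illustration) that tabulates the group sizes this approach gives on the 32 rows of mtcars. Each row is assigned independently, so the sizes change from seed to seed and only average out to 60/20/20:

```r
n <- nrow(mtcars)  # 32 rows

# Group sizes for a few arbitrary seeds; each column is one random assignment
sizes <- sapply(1:5, function(seed) {
  set.seed(seed)
  ss <- sample(1:3, size = n, replace = TRUE, prob = c(0.6, 0.2, 0.2))
  tabulate(ss, nbins = 3)  # counts for groups 1, 2, 3 (0 if a group is empty)
})
rownames(sizes) <- c("train", "test", "cvr")
sizes
```

Every column sums to 32. On a small data set a group can occasionally come out empty, in which case split() returns fewer than three elements and the setNames() call above would fail.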


Answer 2:

Not the prettiest solution (especially for larger samples), but it works.

n = nrow(mtcars)
# rep() truncates fractional counts; use different rounding for different sizes/proportions
times = rep(1:3, c(0.6*n, 0.2*n, 0.2*n))
ntimes = length(times)
if (ntimes < n)
    times = c(times,sample(1:3,n-ntimes,prob=c(0.6,0.2,0.2),replace=FALSE))
sets = sample(times)
df1 = mtcars[sets==1,]
df2 = mtcars[sets==2,]
df3 = mtcars[sets==3,]


Answer 3:

Options without replacement

Using the caret package.

library(caret)

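# Stratified on mpg; about 60% of the rows go to train
inTrain <- createDataPartition(mtcars$mpg, p = 0.6, list = FALSE)
train <- mtcars[inTrain, ]
# p defaults to 0.5, so the remaining rows are split roughly in half
inTest <- createDataPartition(mtcars$mpg[-inTrain], list = FALSE)
test <- mtcars[-inTrain,][inTest, ]
cvr <- mtcars[-inTrain,][-inTest, ]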

Using base R.

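## splitData
# y: column of the data to split on (only its length is used)
# p: vector of proportions, one per split; if they sum to less than 1,
#    the leftover rows are returned as one extra split
splitData <- function(y, p = c(0.5)) {
  if (sum(p) > 1) {
    stop("sum of p cannot exceed 1")
  }

  rows <- seq_along(y)

  res <- list()

  n_sample <- round(length(rows) * p)
  for (size in n_sample) {
    # draw positions among the rows that are still unassigned
    inSplit <- sample.int(length(rows), size)
    res <- c(res, list(rows[inSplit]))
    # setdiff() rather than rows[-inSplit], which would drop every row when size is 0
    rows <- setdiff(rows, rows[inSplit])
  }

  if (sum(p) < 1) {
    res <- c(res, list(rows))
  }

  res
}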

split_example_2 <- splitData(mtcars$mpg, p = c(0.6, 0.2))
split_example_3 <- splitData(mtcars$mpg)
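One caveat with this function, shown here as a small arithmetic check rather than a fix: the sizes come from round(), so if p lists all three proportions, the rounded sizes need not add up to the number of rows. For the 32 rows of mtcars:

```r
n <- nrow(mtcars)                          # 32

# All three proportions given: 19.2, 6.4 and 6.4 are each rounded down
sizes_full <- round(n * c(0.6, 0.2, 0.2))
sizes_full                                 # 19 6 6
n - sum(sizes_full)                        # 1 row is left over

# Last proportion left out, as in split_example_2:
# whatever remains after the first two draws becomes the third group
sizes_partial <- round(n * c(0.6, 0.2))
c(sizes_partial, n - sum(sizes_partial))   # 19 6 7
```

Passing only the first two proportions, as split_example_2 does, sidesteps this because the leftover rows form the last group.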


Answer 4:

If you want exact group sizes that come out the same on every run (as close to the requested proportions as whole-number sizes allow), rather than sizes that vary randomly each time you perform the split, try:

sample_size <- nrow(mtcars)
set_proportions <- c(Training = 0.6, Validation = 0.2, Test = 0.2)
set_frequencies <- diff(floor(sample_size * cumsum(c(0, set_proportions))))
mtcars$set <- sample(rep(names(set_proportions), times = set_frequencies))

Then you can split into a list of data frames simply by

mtcars_sets <- split(mtcars, mtcars$set)

so e.g. the data frame for the validation set is now accessed as mtcars_sets$Validation. (Assigning the result to a new name keeps mtcars itself a data frame, so the code below still works.) Alternatively, you can split into separate data frames as:

mtcars_train <- mtcars[mtcars$set == "Training", ]
mtcars_validation <- mtcars[mtcars$set == "Validation", ]
mtcars_test <- mtcars[mtcars$set == "Test", ]

In some cases, like this one, you can't split the data exactly 60%, 20%, 20%, but this method guarantees that the sizes of the two 20% sets differ by at most one:

> set_frequencies
  Training Validation       Test 
        19          6          7

Check it has worked as expected:

> table(mtcars$set)

      Test   Training Validation 
         7         19          6 

(Based on the answer by Ben Bolker and the comment by liori.)
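For reuse, the same steps could be wrapped in a function. The following is only a sketch: split_exact and its seed argument are hypothetical names that do not appear in the answer, and it assumes the proportions are named and sum to 1.

```r
# Hypothetical helper (not from the answer): split df into a named list of
# data frames whose sizes follow `proportions` as closely as whole numbers allow
split_exact <- function(df, proportions, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(df)
  # Cumulative cut points, floored, then differenced to get whole-number sizes
  sizes <- diff(floor(n * cumsum(c(0, proportions))))
  # Fail loudly if the sizes do not cover every row (proportions should sum to 1)
  stopifnot(sum(sizes) == n)
  # Shuffle the group labels; the factor levels keep the groups in the given order
  labels <- factor(sample(rep(names(proportions), times = sizes)),
                   levels = names(proportions))
  split(df, labels)
}

sets <- split_exact(mtcars, c(Training = 0.6, Validation = 0.2, Test = 0.2), seed = 7)
sapply(sets, nrow)
```

Unlike the code above, this version leaves the input data frame untouched and returns the groups in the order the proportions were given, rather than alphabetically.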