A common way to sample or split data in R is to use sample, e.g., on row numbers. For example:
require(data.table)
set.seed(1)
population <- as.character(1e5:(1e6-1)) # some made up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
test <- sample(N - 1, N/2, replace = FALSE) # test-set row indices (drawn from 1..N-1 so they stay valid after dropping a row)
test1 <- sample1[test, .(id)]
The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:
sample2 <- sample1[-sample(N, 1)]
Samples 1 and 2 are still all but identical:
nrow(merge(sample1, sample2))
[1] 9999
Yet applying the same row indices yields a very different test set, even though we've set the seed:
test2 <- sample2[test, .(id)]
nrow(test1)
[1] 5000
nrow(merge(test1, test2))
[1] 2653
One could sample specific IDs instead of row numbers, but that would be just as fragile when observations are dropped or added.
What would be a way to make the split robust to changes in the data? Namely: keep the test assignment unchanged for unchanged observations, drop the assignment for removed observations, and assign new observations?
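One direction that satisfies these properties, sketched here as an assumption rather than a definitive answer, is to derive the assignment from a stable hash of the ID itself instead of from row positions. The helper hash_frac below is hypothetical, and it relies on the digest package; the idea is that each ID maps to a fixed fraction in [0, 1), so its train/test assignment depends only on the ID:

```r
require(data.table)
require(digest)  # assumed available; provides stable hashes of R objects

# Hypothetical helper: map an id to a deterministic fraction in [0, 1)
# using the first 7 hex digits of its MD5 hash (7 digits keep strtoi
# within integer range).
hash_frac <- function(id) {
  strtoi(substr(digest(id, algo = "md5"), 1, 7), base = 16L) / 16^7
}

# Assign to the test set iff the hashed fraction falls below the test
# share. The assignment depends only on the id, so it is unchanged for
# ids that stay, vanishes with dropped ids, and extends to new ones.
sample1[, in_test := sapply(id, hash_frac) < 0.5]
```

The trade-off is that the realized test share is only approximately 0.5 (each ID lands in the test set independently), rather than exactly N/2 as with index sampling.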