A common way to sample or split data in R is to use sample, e.g., on row numbers. For example:
require(data.table)
set.seed(1)
population <- as.character(1e5:(1e6-1)) # some made up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
test <- sample(N-1, N/2, replace = FALSE) # row indices for the test set; N-1 keeps every index valid once a row is dropped
test1 <- sample1[test, .(id)]
The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:
sample2 <- sample1[-sample(N, 1)]
Samples 1 and 2 are still all but identical:
nrow(merge(sample1, sample2))
[1] 9999
Yet the same row splitting yields very different test sets, even though we've set the seed:
test2 <- sample2[test, .(id)]
nrow(test1)
[1] 5000
nrow(merge(test1, test2))
[1] 2653
One could sample specific IDs, but this would not be robust in case observations are omitted or added.
What would be a way to make the split more robust to changes in the data? Namely: keep the test assignment unchanged for unchanged observations, assign nothing for dropped observations, and assign new observations without disturbing the existing ones?
Use a hash function on each ID and assign the observation to test or train based on the last digit of the hash, taken mod the number of buckets:
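A minimal sketch of this idea, not necessarily the exact code behind the figures quoted in this answer. It assumes the digest package is installed and uses MD5; hash_bucket is a hypothetical helper name, and the 50:50 split comes from taking the last hex digit mod 2.

```r
library(data.table)
library(digest)  # assumption: digest is available for hashing strings

# Hypothetical helper: hash each id with MD5 and return the last hex digit mod m.
# The result depends only on the id itself, not on its position in the table.
hash_bucket <- function(ids, m = 2L) {
  hashes <- vapply(ids, digest, character(1),
                   algo = "md5", serialize = FALSE, USE.NAMES = FALSE)
  last_digit <- substr(hashes, nchar(hashes), nchar(hashes))
  strtoi(last_digit, base = 16L) %% m
}

# Same setup as in the question
set.seed(1)
population <- as.character(1e5:(1e6 - 1))
N <- 1e4
sample1 <- data.table(id = sort(sample(population, N)))
sample2 <- sample1[-sample(N, 1)]  # drop one observation

# Bucket 0 is the test set; everything else is train
test1 <- sample1[hash_bucket(id) == 0L, .(id)]
test2 <- sample2[hash_bucket(id) == 0L, .(id)]

nrow(test1)
nrow(merge(test1, test2))  # differs from nrow(test1) by at most 1
```

The exact counts depend on the hash function and digit chosen, so they need not match the numbers shown in this answer. For a split other than 50:50, one option is to use more hex digits and compare the resulting integer against a threshold.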
Hash splitting works better in this case because the test/train assignment is determined by the hash of each observation, not by its relative position in the data.
[1] 5057
[1] 5057
The sample size is not exactly 5000 because the assignment is probabilistic, but this shouldn't be a problem in large samples: by the law of large numbers, the realized share in the test set converges to the target share as the sample grows.
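As a rough check on how far from 5000 the test set can be expected to land, here is a back-of-the-envelope sketch. It assumes each ID falls in the test bucket independently with probability 0.5, which is an idealization of how a good hash behaves, so the test-set size is treated as binomial.

```r
N <- 1e4   # number of observations
p <- 0.5   # assumed probability that an id hashes into the test bucket

expected <- N * p                  # 5000
std_dev  <- sqrt(N * p * (1 - p))  # 50

(5057 - expected) / std_dev  # 1.14 standard deviations above the mean
std_dev / expected           # relative spread: 0.01, shrinking like 1/sqrt(N)
```

Under that assumption, a test set of 5057 is well within ordinary sampling variation.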
See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo