Reproducible splitting of data into training and t

Posted 2019-04-17 09:43

Question:

A common way to sample or split data in R is to use sample, e.g., on row numbers. For example:

library(data.table)
set.seed(1)

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
test <- sample(N-1, N/2, replace = FALSE)  # row indices assigned to the test set
test1 <- sample1[test, .(id)]

The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:

sample2 <- sample1[-sample(N, 1)]  # drop one randomly chosen row

Samples 1 and 2 are still all but identical:

nrow(merge(sample1, sample2))

[1] 9999

Yet the same row splitting yields very different test sets, even though we've set the seed:

test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653
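
The reason for the small overlap can be seen in a toy case: after a row is dropped, every later row moves up by one position, so the same row indices point at different IDs. A minimal sketch in base R, using made-up IDs and index values that are not from the original post:

```r
ids <- letters[1:10]              # toy IDs "a" to "j" (made up for illustration)
idx <- c(2, 5, 7, 9)              # fixed row positions used as the "test" rows

test_before <- ids[idx]           # "b" "e" "g" "i"

ids_dropped <- ids[-3]            # drop "c"; all later IDs shift up by one
test_after <- ids_dropped[idx]    # "b" "f" "h" "j"

intersect(test_before, test_after)  # only "b" is in both test sets
```

Only the index that comes before the dropped row still selects the same ID.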

One could instead sample specific IDs, but that would not be robust either when observations are omitted or added.

What would be a way to make the split more robust to changes in the data? Namely: keep the test assignment unchanged for unchanged observations, make no assignment for dropped observations, and assign newly added observations?

Answer 1:

Use a hash function and split on the first hex digit of each ID's hash, taken modulo m:

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}
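
A single hex digit takes only 16 values, so the function above can only produce test fractions that are multiples of 1/16 (for example, m = 3 puts 6 of the 16 digit values in class 0, i.e. 37.5% rather than a third). For other proportions, one possible variant is to turn several hex digits into a number in [0, 1) and compare it with a threshold. This is only a sketch: hash_unif, is_test, and test_frac are hypothetical names, not part of the original answer, and the choice of 7 hex digits is an assumption made so the value fits in an R integer.

```r
library(openssl)  # same package the answer uses for md5()

# Hypothetical helper: map each id to a pseudo-uniform number in [0, 1)
# using the first 7 hex digits (28 bits) of its md5 hash.
hash_unif <- function(x) {
  h <- substr(as.character(md5(as.character(x))), 1, 7)
  strtoi(h, base = 16L) / 2^28
}

# Hypothetical helper: TRUE for ids that fall in the test set.
is_test <- function(x, test_frac = 0.2) hash_unif(x) < test_frac

ids <- as.character(1e5:(1e5 + 9999))   # made-up ids, as in the question
in_test <- is_test(ids, test_frac = 0.2)
mean(in_test)                           # close to 0.2, but not exactly 0.2
```

As with md5_bit_mod, each assignment depends only on the ID itself, so dropping or adding rows does not change the assignment of the remaining ones.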

Hash splitting works better in this case because the test/train assignment is determined by the hash of each observation's ID, not by its relative position in the data:

test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]

nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

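The test set size is not exactly 5000 because each ID is assigned on its own, according to its hash, rather than by drawing a fixed number of rows. This shouldn't be a problem in large samples: by the law of large numbers, the test share will be close to the target.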
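
To get a rough sense of how far the test set size can drift from the target, one can treat each assignment as an independent coin flip with probability 1/m. That is an assumption about how the hash behaves, not something the hash guarantees. Under it, the test set size is binomial, and the 5057 observed above is about one standard deviation from 5000:

```r
N <- 1e4          # sample size from the question
p <- 0.5          # target test fraction with m = 2
observed <- 5057  # test set size reported above

expected <- N * p                       # 5000
sd_count <- sqrt(N * p * (1 - p))       # binomial standard deviation: 50
z <- (observed - expected) / sd_count   # 1.14 standard deviations

sd_count / N                            # 0.005: spread of the test share
```

The spread of the test share shrinks like 1/sqrt(N), which is why the deviation matters less as the sample grows.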

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo