Reproducible splitting of data into training and t

A common way to sample or split data in R is to call sample(), e.g., on row numbers. For example:

library(data.table)
set.seed(1)

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
test <- sample(N-1, N/2, replace = FALSE)  # test-set row numbers; N-1 so they also index sample2 below
test1 <- sample1[test, .(id)]

The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:

sample2 <- sample1[-sample(N, 1)]

Samples 1 and 2 are still all but identical:

nrow(merge(sample1, sample2))

[1] 9999

Yet the same row splitting yields very different test sets, even though we've set the seed:

test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653

One could sample specific IDs, but this would not be robust in case observations are omitted or added.

What would be a way to make the split more robust to changes in the data? Namely, keep the test assignment unchanged for unchanged observations, not assign dropped observations, and assign new observations?

Tags: r cross-validation sampling reproducible-research robustness

1 answer

兄弟一词,经得起流年.

#2 · 2019-04-17 10:10

Use a hash function and split on the first hex digit of each id's hash, modulo m:

md5_bit_mod <- function(x, m = 2L) {
  # Inputs:
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50;
  #     the digit ranges over 0-15, so groups are equal-sized only if m divides 16)
  # Output: remainders from dividing the first hex digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

Hash splitting works better in this case because the test/train assignment is determined by the hash of each observation's id, not by its relative position in the data:

test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]

nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057
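The same idea can be extended to other split proportions and checked against added observations as well as dropped ones. The following is only a sketch: `hash_unif` is a hypothetical helper (not part of the answer above) that turns the first 7 hex digits of the md5 hash into a number in [0, 1), and the ids and the 20% threshold are made up for illustration.

```r
library(openssl)

# Hypothetical helper: map each id to a number in [0, 1) using the
# first 7 hex digits (28 bits) of its md5 hash. 28 bits fit in R's
# 32-bit integers, so strtoi() does not overflow.
hash_unif <- function(x) {
  hex <- substr(as.character(md5(as.character(x))), 1, 7)
  strtoi(hex, base = 16L) / 16^7
}

ids <- as.character(100000:109999)  # made-up ids
in_test <- hash_unif(ids) < 0.2     # roughly 20% of ids go to test
mean(in_test)

# Drop one id and append five new ones: ids present in both versions
# keep their assignment, because it depends only on the id itself.
ids2 <- c(ids[-42], as.character(200000:200004))
in_test2 <- hash_unif(ids2) < 0.2
unchanged <- identical(in_test[-42], in_test2[seq_len(length(ids) - 1)])
unchanged
```

Using several hex digits instead of one allows thresholds such as 0.2 that a single digit with 16 possible values cannot represent exactly.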

The test set does not contain exactly 5000 ids because each id is assigned by its own hash, so the split behaves like an independent coin flip per observation. This shouldn't be a problem in large samples, thanks to the law of large numbers.
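To get a rough sense of how far the test set size can drift from N/2, one can treat the assignment as a binomial draw. This is an idealization: it assumes the hash digits behave like independent fair coin flips, which is not guaranteed for any particular set of ids.

```r
N <- 1e4   # number of ids, as in the question
p <- 0.5   # assumed probability that an id lands in the test set

# Standard deviation of the test set size under a Binomial(N, p) model
sd_size <- sqrt(N * p * (1 - p))
sd_size                            # 50

# The observed size of 5057 in standard-deviation units
z <- (5057 - N * p) / sd_size
z                                  # 1.14

# Central 95% range of test set sizes under the same model
qbinom(c(0.025, 0.975), N, p)
```

Under this model the relative deviation shrinks like 1/sqrt(N), which is why the imbalance matters less as the sample grows.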
