I have a matrix in which each row is a sample from a distribution. I want to do a rolling comparison of the distributions using ks.test
and save the test statistic in each case. The simplest way to implement this conceptually is with a loop:
set.seed(1942)
mt <- rbind(rnorm(5), rnorm(5), rnorm(5), rnorm(5))
results <- matrix(as.numeric(rep(NA, nrow(mt))))
for (i in 2 : nrow(mt)) {
results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
}
However, my real data has ~400 columns and ~300,000 rows for a single example, and I have a lot of examples. So I'd like this to be fast. The Kolmogorov-Smirnov test isn't all that mathematically complicated, so if the answer is "implement it in Rcpp
," I'll grudgingly accept that, but I'd be somewhat surprised -- it's already very fast to compute on a single pair in R.
Methods I've tried but have been unable to get working: dplyr
using rowwise/do/lag
, zoo
using rollapply
(which is what I use to generate the distributions), and populating a data.table
in a loop (edit: this one works, but it's still slow).
Here's a
dplyr
solution that gets the same result as your loop. I have my doubts if this is actually faster than the loop, but perhaps it can serve as a first step towards a solution.One source of speed up is to write a smaller version of
ks.test
that does less.ks.test2
below is more restrictive thanks.test
. It assumes, for example, that you have no missing values and that you always want the statistic associated with a two-sided test.Verify that the output is consistent with
ks.test
.Now determine the savings from the smaller function:
A quick and dirty implementation in Rcpp
for matrix of size
(400, 30000)
, it completes under 1s.And the result seems to be accurate.
I was able to compute the pairwise Kruskal-Wallis statistic using
ks.test()
withrollapplyr()
.This gets the expected result, but it's slow for a dataset of your size. Slow slow slow. This may be because
ks.test()
is computing a lot more than just the statistic at each iteration; it also gets the p-value and does a lot of error checking.Indeed, if we simulate a large dataset like so:
The
rollapplyr()
solution takes a long time; I halted execution after about 2 hours, at which point it had computed nearly all (but not all) results.It seems that while
rollapplyr()
is likely faster than afor
loop, it will not likely be the best overall solution in terms of performance.