Rolling variance algorithm

2019-01-20 23:15发布

I'm trying to find an efficient, numerically stable algorithm to calculate a rolling variance (for instance, a variance over a 20-period rolling window). I'm aware of the Welford algorithm that efficiently computes the running variance for a stream of numbers (it requires only one pass), but am not sure if this can be adapted for a rolling window. I would also like the solution to avoid the accuracy problems discussed at the top of this article by John D. Cook. A solution in any language is fine.

11条回答
唯我独甜
2楼-- · 2019-01-20 23:48

Actually Welfords algorithm can AFAICT easily be adapted to compute weighted Variance. And by setting weights to -1, you should be able to effectively cancel out elements. I havn't checked the math whether it allows negative weights though, but at a first look it should!

I did perform a small experiment using ELKI:

void testSlidingWindowVariance() {
MeanVariance mv = new MeanVariance(); // ELKI implementation of weighted Welford!
MeanVariance mc = new MeanVariance(); // Control.

Random r = new Random();
double[] data = new double[1000];
for (int i = 0; i < data.length; i++) {
  data[i] = r.nextDouble();
}

// Pre-roll:
for (int i = 0; i < 10; i++) {
  mv.put(data[i]);
}
// Compare to window approach
for (int i = 10; i < data.length; i++) {
  mv.put(data[i-10], -1.); // Remove
  mv.put(data[i]);
  mc.reset(); // Reset statistics
  for (int j = i - 9; j <= i; j++) {
    mc.put(data[j]);
  }
  assertEquals("Variance does not agree.", mv.getSampleVariance(),
    mc.getSampleVariance(), 1e-14);
}
}

I get around ~14 digits of precision compared to the exact two-pass algorithm; this is about as much as can be expected from doubles. Note that Welford does come at some computational cost because of the extra divisions - it takes about twice as long as the exact two-pass algorithm. If your window size is small, it may be much more sensible to actually recompute the mean and then in a second pass the variance every time.

I have added this experiment as unit test to ELKI, you can see the full source here: http://elki.dbs.ifi.lmu.de/browser/elki/trunk/test/de/lmu/ifi/dbs/elki/math/TestSlidingVariance.java it also compares to the exact two-pass variance.

However, on skewed data sets, the behaviour might be different. This data set obviously is uniform distributed; but I've also tried a sorted array and it worked.

查看更多
走好不送
3楼-- · 2019-01-20 23:48

Here's another O(log k) solution: find squares the original sequence, then sum pairs, then quadruples, etc.. (You'll need a bit of a buffer to be able to find all of these efficiently.) Then add up those values that you need to to get your answer. For example:

|||||||||||||||||||||||||  // Squares
| | | | | | | | | | | | |  // Sum of squares for pairs
|   |   |   |   |   |   |  // Pairs of pairs
|       |       |       |  // (etc.)
|               |
   ^------------------^    // Want these 20, which you can get with
        |       |          // one...
    |   |       |   |      // two, three...
                    | |    // four...
   ||                      // five stored values.

Now you use your standard E(x^2)-E(x)^2 formula and you're done. (Not if you need good stability for small sets of numbers; this was assuming that it was only accumulation of rolling error that was causing issues.)

That said, summing 20 squared numbers is very fast these days on most architectures. If you were doing more--say, a couple hundred--a more efficient method would clearly be better. But I'm not sure that brute force isn't the way to go here.

查看更多
祖国的老花朵
4楼-- · 2019-01-20 23:49

I've run across this problem as well. There are some great posts out there in computing the running cumulative variance such as John Cooke's Accurately computing running variance post and the post from Digital explorations, Python code for computing sample and population variances, covariance and correlation coefficient. Just could not find any that were adapted to a rolling window.

The Running Standard Deviations post by Subluminal Messages was critical in getting the rolling window formula to work. Jim takes the power sum of the squared differences of the values versus Welford’s approach of using the sum of the squared differences of the mean. Formula as follows:

PSA today = PSA(yesterday) + (((x today * x today) - x yesterday)) / n

  • x = value in your time series
  • n = number of values you've analyzed so far.

But, to convert the Power Sum Average formula to a windowed variety you need tweak the formula to the following:

PSA today = PSA yesterday + (((x today * x today) - (x yesterday * x Yesterday) / n

  • x = value in your time series
  • n = number of values you've analyzed so far.

You'll also need the Rolling Simple Moving Average formula:

SMA today = SMA yesterday + ((x today - x today - n) / n

  • x = value in your time series
  • n = period used for your rolling window.

From there you can compute the Rolling Population Variance:

Population Var today = (PSA today * n - n * SMA today * SMA today) / n

Or the Rolling Sample Variance:

Sample Var today = (PSA today * n - n * SMA today * SMA today) / (n - 1)

I've covered this topic along with sample Python code in a blog post a few years back, Running Variance.

Hope this helps.

Please note: I provided links to all the blog posts and math formulas in Latex (images) for this answer. But, due to my low reputation (< 10); I'm limited to only 2 hyperlinks and absolutely no images. Sorry about this. Hope this doesn't take away from the content.

查看更多
来,给爷笑一个
5楼-- · 2019-01-20 23:52

For only 20 values, it's trivial to adapt the method exposed here (I didn't say fast, though).

You can simply pick up an array of 20 of these RunningStat classes.

The first 20 elements of the stream are somewhat special, however once this is done, it's much more simple:

  • when a new element arrives, clear the current RunningStat instance, add the element to all 20 instances, and increment the "counter" (modulo 20) which identifies the new "full" RunningStat instance
  • at any given moment, you can consult the current "full" instance to get your running variant.

You will obviously note that this approach isn't really scalable...

You can also note that there is some redudancy in the numbers we keep (if you go with the RunningStat full class). An obvious improvement would be to keep the 20 lasts Mk and Sk directly.

I cannot think of a better formula using this particular algorithm, I am afraid that its recursive formulation somewhat ties our hands.

查看更多
兄弟一词,经得起流年.
6楼-- · 2019-01-20 23:53

I look forward to be proven wrong on this but I don't think this can be done "quickly." That said, a large part of the calculation is keeping track of the EV over the window which can be done easily.

I'll leave with the question: are you sure you need a windowed function? Unless you are working with very large windows it is probably better to just use a well known predefined algorithm.

查看更多
登录 后发表回答