Rolling variance algorithm

Posted 2019-01-20 23:15

I'm trying to find an efficient, numerically stable algorithm to calculate a rolling variance (for instance, a variance over a 20-period rolling window). I'm aware of the Welford algorithm that efficiently computes the running variance for a stream of numbers (it requires only one pass), but am not sure if this can be adapted for a rolling window. I would also like the solution to avoid the accuracy problems discussed at the top of this article by John D. Cook. A solution in any language is fine.

11 answers
Anthone
#2 · 2019-01-20 23:36

This is just a minor addition to the excellent answer provided by DanS. The following equations are for removing the oldest sample from the window and updating the mean and variance. This is useful, for example, if you want to take smaller windows near the right edge of your input data stream (i.e. just remove the oldest window sample without adding a new sample).

window_size = window_size - 1   % one fewer sample in the window
new_mean = prev_mean + (prev_mean - x_old) / window_size
varSum = varSum - (prev_mean - x_old) * (new_mean - x_old)

Here, x_old is the oldest sample in the window you wish to remove.
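
As a rough sketch (not part of the original answer), the same removal step could be written in Python as follows; var_sum plays the role of varSum above, i.e. the running sum of squared deviations (Welford's M2), and the function assumes the window still holds at least two samples before the removal:

def remove_oldest(x_old, prev_mean, var_sum, window_size):
    # Drop the oldest sample from a window of `window_size` samples and
    # return the updated (window_size, mean, var_sum).
    window_size -= 1                                   # one fewer sample in the window
    new_mean = prev_mean + (prev_mean - x_old) / window_size
    var_sum -= (prev_mean - x_old) * (new_mean - x_old)
    return window_size, new_mean, var_sum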

Viruses.
#3 · 2019-01-20 23:40

Here's a divide and conquer approach that has O(log k)-time updates, where k is the number of samples. It should be relatively stable for the same reasons that pairwise summation and FFTs are stable, but it's a bit complicated and the constant isn't great.

Suppose we have a sequence A of length m with mean E(A) and variance V(A), and a sequence B of length n with mean E(B) and variance V(B). Let C be the concatenation of A and B. We have

p = m / (m + n)
q = n / (m + n)
E(C) = p * E(A) + q * E(B)
V(C) = p * (V(A) + (E(A) + E(C)) * (E(A) - E(C))) + q * (V(B) + (E(B) + E(C)) * (E(B) - E(C)))

Now, stuff the elements in a red-black tree, where each node is decorated with mean and variance of the subtree rooted at that node. Insert on the right; delete on the left. (Since we're only accessing the ends, a splay tree might be O(1) amortized, but I'm guessing amortized is a problem for your application.) If k is known at compile-time, you could probably unroll the inner loop FFTW-style.
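
The answer gives no code, but as an illustration (the function name and tuple layout are my own), the combine step that each internal node would perform on its children's (count, mean, variance) summaries might look like this in Python:

def combine(m, mean_a, var_a, n, mean_b, var_b):
    # Merge the summaries of two disjoint sequences A (length m) and B (length n),
    # using the E(C) and V(C) formulas above.
    p = m / (m + n)
    q = n / (m + n)
    mean_c = p * mean_a + q * mean_b
    var_c = (p * (var_a + (mean_a + mean_c) * (mean_a - mean_c))
             + q * (var_b + (mean_b + mean_c) * (mean_b - mean_c)))
    return m + n, mean_c, var_c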

女痞
#4 · 2019-01-20 23:41

I have been dealing with the same issue.

The mean is simple to update iteratively, but you still need to keep the window's values in a circular buffer so you know which sample is about to leave.

next_index = (index + 1) % window_size;    // oldest x value is at next_index, wrapping if necessary.

new_mean = mean + (x_new - xs[next_index])/window_size;

I have adapted Welford's algorithm and it works for all the values that I have tested with.

varSum = varSum + (x_new - mean) * (x_new - new_mean) - (xs[next_index] - mean) * (xs[next_index] - new_mean);

xs[next_index] = x_new;
mean = new_mean;
index = next_index;

To get the current (population) variance, just divide varSum by the window size: variance = varSum / window_size; (divide by window_size - 1 instead if you want the sample variance).
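
For completeness, here is one way the fragments above might be assembled into a self-contained Python function (my own sketch; xs is the circular buffer of window values and var_sum corresponds to varSum above):

def rolling_update(xs, index, mean, var_sum, x_new):
    # Replace the oldest sample in the circular buffer xs with x_new and
    # return the updated (index, mean, var_sum).  len(xs) is the window size.
    window_size = len(xs)
    next_index = (index + 1) % window_size        # oldest value sits at next_index
    x_old = xs[next_index]
    new_mean = mean + (x_new - x_old) / window_size
    var_sum += (x_new - mean) * (x_new - new_mean) - (x_old - mean) * (x_old - new_mean)
    xs[next_index] = x_new
    return next_index, new_mean, var_sum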

虎瘦雄心在
#5 · 2019-01-20 23:44

I know this question is old, but in case someone else is interested, here is the Python code. It is inspired by John D. Cook's blog post and by @Joachim's and @DanS's code and @Jaime's comments. The code below still shows small imprecision for small window sizes. Enjoy.

from __future__ import division
import collections
import math


class RunningStats:
    def __init__(self, WIN_SIZE=20):
        self.n = 0
        self.mean = 0
        self.run_var = 0
        self.WIN_SIZE = WIN_SIZE

        self.windows = collections.deque(maxlen=WIN_SIZE)

    def clear(self):
        self.n = 0
        self.mean = 0
        self.run_var = 0
        self.windows.clear()

    def push(self, x):
        if self.n < self.WIN_SIZE:
            # Calculating first variance (plain Welford until the window fills up)
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.run_var += delta * (x - self.mean)
        else:
            # Adjusting variance: drop the oldest sample, add the new one
            x_removed = self.windows[0]  # oldest sample; append() below evicts it (deque maxlen)
            old_m = self.mean
            self.mean += (x - x_removed) / self.WIN_SIZE
            self.run_var += (x + x_removed - old_m - self.mean) * (x - x_removed)

        self.windows.append(x)

    def get_mean(self):
        return self.mean if self.n else 0.0

    def get_var(self):
        # Sample variance over the samples currently tracked (at most WIN_SIZE)
        return self.run_var / (self.n - 1) if self.n > 1 else 0.0

    def get_std(self):
        return math.sqrt(self.get_var())

    def get_all(self):
        return list(self.windows)

    def __str__(self):
        return "Current window values: {}".format(list(self.windows))
Evening l夕情丶
#6 · 2019-01-20 23:46

If you prefer code over words (heavily based on DanS' post): http://calcandstuff.blogspot.se/2014/02/rolling-variance-calculation.html

public IEnumerable<double> RollingSampleVariance(IEnumerable<double> data, int sampleSize)
{
    double mean = 0;
    double accVar = 0;

    int n = 0;
    var queue = new Queue<double>(sampleSize);

    foreach(var observation in data)
    {
        queue.Enqueue(observation);
        if (n < sampleSize)
        {
            // Calculating first variance
            n++;
            double delta = observation - mean;
            mean += delta / n;
            accVar += delta * (observation - mean);
        }
        else
        {
            // Adjusting variance
            double then = queue.Dequeue();
            double prevMean = mean;
            mean += (observation - then) / sampleSize;
            accVar += (observation - prevMean) * (observation - mean) - (then - prevMean) * (then - mean);
        }

        if (n == sampleSize)
            yield return accVar / (sampleSize - 1);
    }
}
ら.Afraid
#7 · 2019-01-20 23:46

I guess keeping track of your 20 samples, Sum(X^2 from 1..20), and Sum(X from 1..20) and then successively recomputing the two sums at each iteration isn't efficient enough? It's possible to recompute the new variance without adding up, squaring, etc., all of the samples each time.

As in:

Sum(X^2 from 2..21) = Sum(X^2 from 1..20) - X_1^2 + X_21^2
Sum(X from 2..21) = Sum(X from 1..20) - X_1 + X_21
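
As an illustration only (my own sketch, not from the answer), here is the running-sums idea in Python. Note that the textbook identity var = (sum_sq - sum^2/n) / (n - 1) it relies on is exactly the formula prone to the cancellation problems mentioned in the question, so it can lose precision when the mean is large relative to the spread:

from collections import deque

def rolling_variance_sums(data, window_size=20):
    # Yield the sample variance of each full window, maintained via running sums.
    # Requires window_size >= 2 for the sample-variance divisor.
    window = deque(maxlen=window_size)
    total = 0.0       # Sum(X) over the current window
    total_sq = 0.0    # Sum(X^2) over the current window
    for x in data:
        if len(window) == window_size:
            x_old = window[0]            # about to be evicted by append()
            total -= x_old
            total_sq -= x_old * x_old
        window.append(x)
        total += x
        total_sq += x * x
        if len(window) == window_size:
            yield (total_sq - total * total / window_size) / (window_size - 1)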