I'm trying to find an efficient, numerically stable algorithm to calculate a rolling variance (for instance, a variance over a 20-period rolling window). I'm aware of the Welford algorithm that efficiently computes the running variance for a stream of numbers (it requires only one pass), but am not sure if this can be adapted for a rolling window. I would also like the solution to avoid the accuracy problems discussed at the top of this article by John D. Cook. A solution in any language is fine.
Actually, Welford's algorithm can AFAICT easily be adapted to compute a weighted variance. By setting a weight to -1, you should be able to effectively cancel out elements. I haven't checked whether the math allows negative weights, but at first look it should!
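A minimal sketch of that idea in Python, assuming the usual weighted Welford update with the weight fixed at +1 for the incoming value and -1 for the value leaving the window. The class name `SlidingWelford` and its methods are made up for illustration; this is not the ELKI code.

```python
from collections import deque


class SlidingWelford:
    """Rolling sample variance via Welford updates with add and remove steps."""

    def __init__(self, window):
        self.window = window
        self.values = deque()  # values currently inside the window
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations from the mean

    def _add(self, x):
        # Standard Welford update (weight +1).
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def _remove(self, x):
        # Same update with weight -1: cancels a value added earlier.
        self.n -= 1
        if self.n == 0:
            self.mean, self.m2 = 0.0, 0.0
            return
        delta = x - self.mean
        self.mean -= delta / self.n
        self.m2 -= delta * (x - self.mean)

    def push(self, x):
        # Drop the oldest value once the window is full, then add the new one.
        if len(self.values) == self.window:
            self._remove(self.values.popleft())
        self.values.append(x)
        self._add(x)

    def variance(self):
        if self.n < 2:
            return 0.0
        # Clamp tiny negative values caused by rounding.
        return max(self.m2, 0.0) / (self.n - 1)
```

Each `push` costs O(1); the deque is only needed to know which value leaves the window.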
I ran a small experiment using ELKI.
I get around 14 digits of precision compared to the exact two-pass algorithm; this is about as much as can be expected from doubles. Note that Welford does come at some computational cost because of the extra divisions; it takes about twice as long as the exact two-pass algorithm. If your window size is small, it may be much more sensible to recompute the mean, and then the variance in a second pass, every time.
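For a small window, the recompute-every-time approach could look like the following sketch. The function name `rolling_variance_two_pass` is hypothetical, and it assumes a window of at least 2 and the sample (n - 1) denominator.

```python
from collections import deque


def rolling_variance_two_pass(values, window):
    """Recompute mean and sample variance from scratch for each full window."""
    buf = deque(maxlen=window)  # the oldest value is dropped automatically
    out = []
    for x in values:
        buf.append(x)
        if len(buf) == window:
            mean = sum(buf) / window                      # first pass
            ss = sum((v - mean) ** 2 for v in buf)        # second pass
            out.append(ss / (window - 1))
    return out
```

This costs O(window) per element, but no rounding error carries over from one window to the next.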
I have added this experiment as a unit test to ELKI; you can see the full source here: http://elki.dbs.ifi.lmu.de/browser/elki/trunk/test/de/lmu/ifi/dbs/elki/math/TestSlidingVariance.java. It also compares to the exact two-pass variance.
However, on skewed data sets, the behaviour might be different. This data set is obviously uniformly distributed, but I've also tried a sorted array and it worked.
Here's another O(log k) solution: square the original sequence, then sum pairs, then quadruples, etc. (You'll need a bit of a buffer to be able to find all of these efficiently.) Then add up those values that you need to get your answer.
Now you use your standard E(x^2)-E(x)^2 formula and you're done. (Not if you need good stability for small sets of numbers; this was assuming that it was only accumulation of rolling error that was causing issues.)
That said, summing 20 squared numbers is very fast these days on most architectures. If you were doing more, say a couple hundred, a more efficient method would clearly be better. But I'm not sure that brute force isn't the way to go here.
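One way to realize the pairs/quadruples idea, though not necessarily the layout this answer had in mind, is a fixed-size binary sum tree over a ring buffer of the last k values. Replacing a leaf only touches its O(log k) ancestors, and the root always holds a freshly summed total rather than a drifting running sum. The names `WindowSumTree` and `rolling_variance_tree` are hypothetical, and the result is the population variance from E(x^2)-E(x)^2.

```python
class WindowSumTree:
    """Binary sum tree whose leaves form a ring buffer of the last k values."""

    def __init__(self, k):
        self.k = k
        self.size = 1
        while self.size < k:          # round leaf count up to a power of two
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        self.pos = 0                  # next leaf to overwrite
        self.count = 0                # number of values stored, at most k

    def push(self, value):
        i = self.size + self.pos
        self.tree[i] = value          # overwrite the oldest leaf
        i //= 2
        while i >= 1:                 # re-sum pairs, quadruples, ... up to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2
        self.pos = (self.pos + 1) % self.k
        self.count = min(self.count + 1, self.k)

    def total(self):
        return self.tree[1]


def rolling_variance_tree(values, k):
    """Population variance of each full window via E(x^2) - E(x)^2."""
    sums, squares = WindowSumTree(k), WindowSumTree(k)
    out = []
    for x in values:
        sums.push(x)
        squares.push(x * x)
        if sums.count == k:
            mean = sums.total() / k
            out.append(squares.total() / k - mean * mean)
    return out
```

As the answer notes, this only addresses accumulated rolling error; the E(x^2)-E(x)^2 subtraction itself can still lose precision when the variance is small relative to the mean.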
I've run across this problem as well. There are some great posts out there on computing the running cumulative variance, such as John D. Cook's "Accurately computing running variance" post and the post from Digital explorations, "Python code for computing sample and population variances, covariance and correlation coefficient". I just could not find any that were adapted to a rolling window.
The Running Standard Deviations post by Subluminal Messages was critical in getting the rolling window formula to work. Jim takes the power sum of the squared differences of the values versus Welford’s approach of using the sum of the squared differences from the mean. Formula as follows:
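Written out under assumed notation (the symbols here are illustrative, not taken from the original post), with $\mathrm{PSA}_t$ the running average of the squared values and $t$ the number of values seen so far, the cumulative update presumably has this form:

$$
\mathrm{PSA}_t = \mathrm{PSA}_{t-1} + \frac{x_t^2 - \mathrm{PSA}_{t-1}}{t}
$$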
But to convert the Power Sum Average formula to a windowed variety, you need to tweak the formula to the following:
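Assuming $\mathrm{PSA}_t$ is the average of the squared values over the last $n$ observations and $x_{t-n}$ is the value that just left the window (notation assumed here), the windowed update would be:

$$
\mathrm{PSA}_t = \mathrm{PSA}_{t-1} + \frac{x_t^2 - x_{t-n}^2}{n}
$$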
You'll also need the Rolling Simple Moving Average formula:
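With $\mathrm{SMA}_t$ as the mean of the last $n$ values and $x_{t-n}$ as the value leaving the window (again, assumed notation), the rolling mean update is:

$$
\mathrm{SMA}_t = \mathrm{SMA}_{t-1} + \frac{x_t - x_{t-n}}{n}
$$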
From there you can compute the Rolling Population Variance:
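In assumed notation, where $\mathrm{PSA}_t$ is the windowed average of squared values, $\mathrm{SMA}_t$ the windowed mean, and $n$ the window size, this amounts to:

$$
\sigma_t^2 = \frac{n\,\mathrm{PSA}_t - n\,\mathrm{SMA}_t^2}{n} = \mathrm{PSA}_t - \mathrm{SMA}_t^2
$$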
Or the Rolling Sample Variance:
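In assumed notation, where $\mathrm{PSA}_t$ is the windowed average of squared values, $\mathrm{SMA}_t$ the windowed mean, and $n$ the window size, the sample version only changes the denominator:

$$
s_t^2 = \frac{n\,\mathrm{PSA}_t - n\,\mathrm{SMA}_t^2}{n - 1}
$$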
I've covered this topic along with sample Python code in a blog post a few years back, Running Variance.
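For readers without access to that post, here is a minimal sketch of the windowed power-sum approach. It is not the code from the blog; the function name `rolling_variance_power_sum` and the warm-up handling for the first n values are assumptions made for illustration.

```python
from collections import deque


def rolling_variance_power_sum(values, n):
    """Yield (population, sample) variance for each full window of size n."""
    window = deque()
    sma = 0.0   # rolling simple moving average
    psa = 0.0   # rolling power sum average (mean of squared values)
    out = []
    for x in values:
        window.append(x)
        if len(window) <= n:
            # Warm-up: plain cumulative averages until the window fills.
            count = len(window)
            sma += (x - sma) / count
            psa += (x * x - psa) / count
        else:
            # Steady state: swap the oldest value for the newest.
            old = window.popleft()
            sma += (x - old) / n
            psa += (x * x - old * old) / n
        if len(window) == n:
            population = psa - sma * sma
            out.append((population, population * n / (n - 1)))
    return out
```

Because this relies on the difference of two averages, it can lose precision when the variance is small compared to the mean.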
Hope this helps.
For only 20 values, it's trivial to adapt the method described here (I didn't say fast, though).
You can simply keep an array of 20 of these RunningStat instances.
The first 20 elements of the stream are somewhat special; however, once this is done, it's much simpler: for each new element, clear the current "full" RunningStat instance, add the element to all 20 instances, and increment the "counter" (modulo 20), which identifies the new "full" RunningStat instance.
You will obviously note that this approach isn't really scalable...
You can also note that there is some redundancy in the numbers we keep (if you go with the full RunningStat class). An obvious improvement would be to keep the last 20 Mk and Sk values directly.
I cannot think of a better formula using this particular algorithm; I am afraid that its recursive formulation somewhat ties our hands.
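The staggered-instances idea can be sketched as follows. The `RunningStat` class here is a small Welford accumulator written for this example, modeled loosely on the class of the same name rather than copied from it, and `StaggeredWindowVariance` is a hypothetical wrapper name.

```python
class RunningStat:
    """Welford accumulator for the mean and sum of squared deviations."""

    def __init__(self):
        self.clear()

    def clear(self):
        self.n = 0
        self.mean = 0.0
        self.s = 0.0

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.s += delta * (x - self.mean)

    def variance(self):
        return self.s / (self.n - 1) if self.n > 1 else 0.0


class StaggeredWindowVariance:
    """k accumulators, each restarted k steps apart; one is always 'full'."""

    def __init__(self, k):
        self.k = k
        self.stats = [RunningStat() for _ in range(k)]
        self.t = 0  # number of elements seen so far

    def push(self, x):
        # Restart the instance whose turn it is, then feed x to every instance.
        self.stats[self.t % self.k].clear()
        for stat in self.stats:
            stat.push(x)
        self.t += 1
        if self.t < self.k:
            return None  # the first k - 1 elements do not fill a window
        # This instance was restarted k - 1 steps ago, so it holds exactly k values.
        return self.stats[self.t % self.k].variance()
```

Every element costs O(k) work, which is the scalability problem mentioned above, but each reported variance comes from a fresh Welford pass over exactly k values.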
I look forward to being proven wrong on this, but I don't think this can be done "quickly." That said, a large part of the calculation is keeping track of the EV over the window, which can be done easily.
I'll leave you with the question: are you sure you need a windowed function? Unless you are working with very large windows, it is probably better to just use a well-known predefined algorithm.