Calculating variance with large numbers

Posted 2019-02-26 00:40

I haven't really used variance calculation that much, and I don't know quite what to expect. Actually I'm not too good with math at all.

I have an array of 1000000 random numeric values in the range 0-10000.

The array could grow even larger, so I use a 64-bit integer for the sum.

I have tried to find code on how to calc variance, but I don't know if I get correct output.

The mean is 4692 and median is 4533. I get variance 1483780.469308 using the following code:

// size is the element count, in this case 1000000
// value_sum is __int64

double p2 = pow( (double)(value_sum - (value_sum/size)), (double)2.0 );
double variance = sqrt( (double)(p2 / (size-1)) );

Am I getting a reasonable value?

Is anything wrong with the calculation?

7 answers
Anthone
#2 · 2019-02-26 01:19

Just for fun, a slightly different route to the same result, using std::valarray instead of std::vector and (various) algorithms:

#include <valarray>

template <class T>
T variance(std::valarray<T> const &v) {
    if (v.size() == 0)
        return T(0.0);
    T average = v.sum() / v.size();
    std::valarray<T> diffs = v-average;
    diffs *= diffs;
    return diffs.sum()/diffs.size();
}

As Jacob hinted, there are really two possible versions of a variance calculation. As it stands, this assumes your inputs are the entire population (the "universe"). If you've taken only a sample of the overall population, the last line should divide by (diffs.size()-1) instead of diffs.size().

Rolldiameter
#3 · 2019-02-26 01:20

Since variance is the square of the standard deviation, the answers to SO 1174984 should help out. The short diagnosis is that you need to compute the sum of the squares of the values as well as the sum of the values, and you don't seem to be doing that.

Since you have 10^6 values, and the square of any value can be up to 10^8, you could end up with a sum of squares up to 10^14; your 64-bit integers can store up to 10^18, so you could still handle ten thousand times as many inputs, or values ranging up to one million instead of only ten thousand, without running into overflows. There's no urgent need, therefore, to move to pure double computations.

迷人小祖宗
#4 · 2019-02-26 01:22

Use a different formula maybe?

#include <functional>
#include <algorithm>
#include <numeric>
#include <vector>
#include <iostream>
int main()
{
 using namespace std;

 vector<double> num( 3 );
 num[ 0 ] = 4000.9, num[ 1 ] = 11111.221, num[ 2 ] = -2;


 double mean = std::accumulate(num.begin(), num.end(), 0.0) / num.size();
 vector<double> diff(num.size());
 // Subtract the mean from every element (std::bind2nd was removed in C++17).
 std::transform(num.begin(), num.end(), diff.begin(), 
                [mean](double x) { return x - mean; });
 double variance = std::inner_product(diff.begin(), diff.end(), 
                                     diff.begin(), 0.0) / (num.size() - 1);
 cout << "mean = " << mean << endl
      << "variance = " << variance << endl;
}

Outputs:

mean = 5036.71
variance = 3.16806e+07

来,给爷笑一个
#5 · 2019-02-26 01:31

Since you're working with large numbers and then doing floating-point operations on them, you might want to do everything in doubles; that would save you a lot of casts.

Using pow(x, 2.0) to calculate a square seems a bit awkward. You could calculate your number first, then multiply it by itself to get the square.

If you're doing division and feel the need to cast, cast the operands (i.e. the numerator and/or denominator) to double rather than the result. You're losing accuracy if you divide integers.
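
To illustrate the two points above, here is a small sketch. The parameter names value_sum and size follow the question; the function names are made up for this example.

```cpp
#include <cstdint>

// Integer division happens first, so the fractional part is already gone
// by the time the result is cast to double.
double mean_truncated(std::int64_t value_sum, std::int64_t size) {
    return static_cast<double>(value_sum / size);
}

// Casting an operand first makes the division happen in floating point.
double mean_exact(std::int64_t value_sum, std::int64_t size) {
    return static_cast<double>(value_sum) / static_cast<double>(size);
}

// Squaring by plain multiplication instead of pow(x, 2.0).
double square(double x) {
    return x * x;
}
```

With value_sum = 7 and size = 2, mean_truncated returns 3.0 while mean_exact returns 3.5.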

I'm not sure if your formula for variance is correct. You may want to look at the explanation in Wikipedia, for example. But I'm no math expert either, so I'm not sure you have a mistake.

男人必须洒脱
#6 · 2019-02-26 01:40

Sample Variance calculation:

#include <math.h>
#include <vector>

double Variance(std::vector<double>);

int main()
{
     std::vector<double> samples;
     samples.push_back(2.0);
     samples.push_back(3.0);
     samples.push_back(4.0);
     samples.push_back(5.0);
     samples.push_back(6.0);
     samples.push_back(7.0);

     double variance = Variance(samples);
     return 0;
}

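double Variance(std::vector<double> samples)
{
     int size = static_cast<int>(samples.size());

     // The sample variance is undefined for fewer than two samples; return 0.
     if (size < 2)
          return 0.0;

     double variance = 0;   // running sum of squared deviations
     double t = samples[0]; // running sum of the samples
     for (int i = 1; i < size; i++)
     {
          t += samples[i];
          // Compare the new sample, scaled by the count so far, with the running sum.
          double diff = ((i + 1) * samples[i]) - t;
          variance += (diff * diff) / ((i + 1.0) *i);
     }

     // Divide by n - 1 to get the sample variance.
     return variance / (size - 1);
}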
在下西门庆
#7 · 2019-02-26 01:44

First of all, if you're just looking to get a handle on what is a "reasonable" variance, keep in mind that variance is basically standard deviation squared. Standard deviation roughly measures the typical distance from a data point to its expected value.

So if your data has mean 4692, and your calculated variance is coming out to 1483780, that means your standard deviation is about 1218, which would suggest your numbers tend to be somewhere in the vicinity of the range 3474 - 5910. So that variance actually seems a bit low to me if the range of your numbers is 0 - 10000; but it obviously depends on the distribution of your data.
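
The arithmetic behind that sanity check can be sketched as follows. The uniform-distribution reference is an assumption added here for comparison only (the question does not say the data is uniform), and both function names are hypothetical.

```cpp
#include <cmath>

// Standard deviation implied by a given variance.
double std_dev_from_variance(double variance) {
    return std::sqrt(variance);
}

// Variance of a continuous uniform distribution on [lo, hi]: (hi - lo)^2 / 12.
// Assumption: used only as a rough yardstick; the actual data need not be uniform.
double uniform_variance(double lo, double hi) {
    const double width = hi - lo;
    return width * width / 12.0;
}
```

For the question's numbers, std_dev_from_variance(1483780.0) is roughly 1218, while uniform_variance(0, 10000) is about 8.33 million, which corresponds to a standard deviation of roughly 2887. That is in line with the impression that the reported variance is on the low side for values spread over 0 - 10000.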

As for the calculation itself: You can calculate the variance using a running calculation as you're reading your data the first time around (you don't have to know the mean in advance) using Welford's Method:

Initialize M_1 = x_1 and S_1 = 0.

For subsequent x's, use the recurrence formulas

M_k = M_{k-1} + (x_k - M_{k-1})/k
S_k = S_{k-1} + (x_k - M_{k-1})*(x_k - M_k)

For 2 ≤ k ≤ n, the kth estimate of the variance is s^2 = S_k/(k - 1).
