
Why does statistics.variance use 'unbiased'

Posted 2019-07-23 17:57

Question:

I have recently started using the statistics module for Python.

I've noticed that by default the variance() function returns the 'unbiased' variance, i.e. the sample variance:

import statistics as st
from random import randint

def myVariance(data):
    # population variance of the data (divides by N)
    xbar = st.mean(data)
    return sum((x - xbar)**2 for x in data) / len(data)

def myUnbiasedVariance(data):
    # 'unbiased' (sample) variance of the data (divides by N-1)
    xbar = st.mean(data)
    return sum((x - xbar)**2 for x in data) / (len(data) - 1)

population = [randint(0, 1000) for i in range(100)]

print(myVariance(population))

print(myUnbiasedVariance(population))

print(st.variance(population))

output:

81295.8011
82116.9708081
82116.9708081

This seems odd to me. I guess a lot of the time people are working with samples, so they want a sample variance, but I would expect the default function to calculate a population variance. Does anyone know why this is?

Answer 1:

I would argue that almost all the time when people estimate the variance from data, they are working with a sample. And, by the definition of an unbiased estimator, the expected value of the unbiased estimate of the variance equals the population variance.
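The unbiasedness claim can be checked exactly on a tiny example. The following is a minimal sketch, assuming a made-up population [1, 2, 3, 4] and samples of size 2 drawn with replacement (both chosen only for illustration). Averaging statistics.variance() over every possible sample reproduces the population variance, while averaging statistics.pvariance() falls short:

```python
import itertools
import statistics as st

# Hypothetical toy population, chosen only for illustration.
population = [1, 2, 3, 4]

# Every ordered sample of size 2 drawn with replacement (16 in total).
samples = list(itertools.product(population, repeat=2))

# Average each estimator over all equally likely samples.
mean_unbiased = sum(st.variance(s) for s in samples) / len(samples)
mean_biased = sum(st.pvariance(s) for s in samples) / len(samples)

print(mean_unbiased)              # 1.25
print(mean_biased)                # 0.625
print(st.pvariance(population))   # 1.25
```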

In your code, you use random.randint(0, 1000), which samples from a discrete uniform distribution with 1001 possible values and variance 1000*1002/12 = 83500 (see, e.g., MathWorld). Here is code that shows that, on average and when using samples as input, statistics.variance() gets closer to the population variance than statistics.pvariance():

import random
import statistics as st

import numpy as np

var, pvar = [], []
for i in range(10000):
    # draw a sample of size 10 and record both variance estimates
    smpl = [random.randint(0, 1000) for j in range(10)]
    var.append(st.variance(smpl))
    pvar.append(st.pvariance(smpl))

print("mean variance(sample):  %.1f" % np.mean(var))
print("mean pvariance(sample): %.1f" % np.mean(pvar))
print("pvariance(population):  %.1f" % st.pvariance(range(1001)))

Here is sample output:

mean variance(sample):  83626.0
mean pvariance(sample): 75263.4
pvariance(population):  83500.0
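
The gap in that output is in line with the known bias of the divide-by-N estimator: for independent samples of size n, its expected value is (n-1)/n times the population variance. A small sketch of that arithmetic, where expected_pvariance is a hypothetical helper name and not part of the statistics module:

```python
def expected_pvariance(population_variance, n):
    # Expected value of the divide-by-n variance for independent samples of size n.
    # Hypothetical helper for illustration; not part of the statistics module.
    return population_variance * (n - 1) / n

# The population variance of randint(0, 1000) is 83500; the samples have size 10.
print(expected_pvariance(83500, 10))  # 75150.0
```

The simulated mean of 75263.4 sits close to this value; the remaining difference is what one would expect from sampling noise.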


Answer 2:

Here is another great post. I was wondering exactly the same thing, and its answer really cleared it up for me. With np.var you can pass the argument ddof=1 to get the unbiased estimator. Check it out:

What is the difference between numpy var() and statistics variance() in python?

import numpy as np
print(np.var([1, 2, 3, 4], ddof=1))
1.66666666667
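
To make the role of ddof concrete, here is a minimal sketch in plain Python. The helper var_with_ddof is hypothetical (it is not a NumPy or statistics function); it only mirrors the idea of dividing the sum of squared deviations by N - ddof. Under that reading, ddof=0 (the np.var default) corresponds to statistics.pvariance() and ddof=1 to statistics.variance():

```python
import math
import statistics as st

def var_with_ddof(data, ddof=0):
    # Hypothetical helper: divide the sum of squared deviations by N - ddof.
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - ddof)

data = [1, 2, 3, 4]

# ddof=0 divides by N (population variance); ddof=1 divides by N - 1 (sample variance).
print(math.isclose(var_with_ddof(data, ddof=0), st.pvariance(data)))  # True
print(math.isclose(var_with_ddof(data, ddof=1), st.variance(data)))   # True
```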