compute mean in python for a generator

2020-06-09 05:40发布

I'm doing some statistics work, I have a (large) collection of random numbers to compute the mean of, I'd like to work with generators, because I just need to compute the mean, so I don't need to store the numbers.

The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this?

It would be nice if I could say "sum(values)/len(values)", but len doesn't work for genetators, and sum already consumed values.

here's an example:

import numpy 

def my_mean(values):
    n = 0
    Sum = 0.0
    try:
        while True:
            Sum += next(values)
            n += 1
    except StopIteration: pass
    return float(Sum)/n

X = [k for k in range(1,7)]
Y = (k for k in range(1,7))

print numpy.mean(X)
print my_mean(Y)

these both give the same, correct, answer, buy my_mean doesn't work for lists, and numpy.mean doesn't work for generators.

I really like the idea of working with generators, but details like this seem to spoil things.

10条回答
闹够了就滚
2楼-- · 2020-06-09 05:48

Your approach is a good one, but you should instead use the for x in y idiom instead of repeatedly calling next until you get a StopIteration. This works for both lists and generators:

def my_mean(values):
    n = 0
    Sum = 0.0

    for value in values:
        Sum += value
        n += 1
    return float(Sum)/n
查看更多
爷的心禁止访问
3楼-- · 2020-06-09 05:48

Try:

import itertools

def mean(i):
    (i1, i2) = itertools.tee(i, 2)
    return sum(i1) / sum(1 for _ in i2)

print mean([1,2,3,4,5])

tee will duplicate your iterator for any iterable i (e.g. a generator, a list, etc.), allowing you to use one duplicate for summing and the other for counting.

(Note that 'tee' will still use intermediate storage).

查看更多
我命由我不由天
4楼-- · 2020-06-09 05:51
def my_mean(values):
    n = 0
    sum = 0
    for v in values:
        sum += v
        n += 1
    return sum/n

The above is very similar to your code, except by using for to iterate values you are good no matter if you get a list or an iterator. The python sum method is however very optimized, so unless the list is really, really long, you might be more happy temporarily storing the data.

(Also notice that since you are using python3, you don't need float(sum)/n)

查看更多
太酷不给撩
5楼-- · 2020-06-09 05:54

Just one simple change to your code would let you use both. Generators were meant to be used interchangeably to lists in a for-loop.

def my_mean(values):
    n = 0
    Sum = 0.0
    for v in values:
        Sum += v
        n += 1
    return Sum / n
查看更多
来,给爷笑一个
6楼-- · 2020-06-09 06:00

You can use reduce without knowing the size of the array:

from itertools import izip, count
reduce(lambda c,i: (c*(i[1]-1) + float(i[0]))/i[1], izip(values,count(1)),0)
查看更多
【Aperson】
7楼-- · 2020-06-09 06:01
def my_mean(values):
    total = 0
    for n, v in enumerate(values, 1):
        total += v
    return total / n

print my_mean(X)
print my_mean(Y)

There is statistics.mean() in Python 3.4 but it calls list() on the input:

def mean(data):
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    return _sum(data)/n

where _sum() returns an accurate sum (math.fsum()-like function that in addition to float also supports Fraction, Decimal).

查看更多
登录 后发表回答