An statistical accumulator allows one to perform incremental calculations. For instance, for computing the arithmetic mean of a stream of numbers given at arbitrary times one could make an object which keeps track of the current number of items given, n
and their sum, sum
. When one requests the mean, the object simply returns sum/n
.
An accumulator like this allows you to compute incrementally in the sense that, when given a new number, you don't need to recompute the entire sum and count.
Similar accumulators can be written for other statistics (cf. boost library for a C++ implementation).
How would you implement accumulators in Python? The code I came up with is:
class Accumulator(object):
"""
Used to accumulate the arithmetic mean of a stream of
numbers. This implementation does not allow to remove items
already accumulated, but it could easily be modified to do
so. also, other statistics could be accumulated.
"""
def __init__(self):
# upon initialization, the numnber of items currently
# accumulated (_n) and the total sum of the items acumulated
# (_sum) are set to zero because nothing has been accumulated
# yet.
self._n = 0
self._sum = 0.0
def add(self, item):
# the 'add' is used to add an item to this accumulator
try:
# try to convert the item to a float. If you are
# successful, add the float to the current sum and
# increase the number of accumulated items
self._sum += float(item)
self._n += 1
except ValueError:
# if you fail to convert the item to a float, simply
# ignore the exception (pass on it and do nothing)
pass
@property
def mean(self):
# the property 'mean' returns the current mean accumulated in
# the object
if self._n > 0:
# if you have more than zero items accumulated, then return
# their artithmetic average
return self._sum / self._n
else:
# if you have no items accumulated, return None (you could
# also raise an exception)
return None
# using the object:
# Create an instance of the object "Accumulator"
my_accumulator = Accumulator()
print my_accumulator.mean
# prints None because there are no items accumulated
# add one (a number)
my_accumulator.add(1)
print my_accumulator.mean
# prints 1.0
# add two (a string - it will be converted to a float)
my_accumulator.add('2')
print my_accumulator.mean
# prints 1.5
# add a 'NA' (will be ignored because it cannot be converted to float)
my_accumulator.add('NA')
print my_accumulator.mean
# prints 1.5 (notice that it ignored the 'NA')
Interesting design questions arise:
- How to make the accumulator thread-safe?
- How to safely remove items?
- How to architect in a way that allows other statistics to be plugged in easily (a factory for statistics)
If I were doing this in Python, there are two things I would do differently:
For the first one, I would likely want to come up with an API for performing an accumulation, perhaps something like:
Then I would create a AccumulatorRegistry that holds onto these accumulators, and allows the user to call actions and add to all of them. The code may look like:
I should probably speak to your thread-safe question. Python's GIL protects you from a lot of threading issues. There are a few things you may way to do to protect yourself though:
For a generalized, threadsafe higher-level function, you could use something like the following in combination with the
Queue.Queue
class and some other bits:This generator function evades most of its purported responsibility by delegating it to other entities:
Queue.Queue
to supply its source elements in a thread-safe mannercollections.deque
object can be passed in as the value of thestorage
parameter; this provides, among other things, a convenient way to only use the lastn
(in this case 3) valuesmean
) is passed as a parameter. This will result in less-than-optimally efficient code in some cases, but is readily applied to all sorts of situations.Note that there is a possibility of the accumulator timing out if your producer thread takes longer than 0.1 seconds per value. This is easily remedied by passing a longer timeout or by removing the timeout parameter entirely. In the latter case the function will block indefinitely at the end of the queue; this usage makes more sense in a case where it's being used in a sub thread (usually a
daemon
thread). Of course you can also parametrize the arguments that are passed toq.get
as a fourth argument toAccumulator
.If you want to communicate end of queue, i.e. that there are no more values to come, from the producer thread (here
putting_thread
), you can pass and check for a sentinel value or use some other method. There is more info in this thread; I opted to write a subclass of Queue.Queue called CloseableQueue that provides aclose
method.There are various other ways you could customize the behaviour of such a function, for example by limiting the queue size; this is just an example of usage.
edit
As mentioned above, this loses some efficiency because of the necessity of recalculation and also, I think, doesn't really answer your question.
A generator function can also accept values through its
send
method. So you can write a mean generator function likeHere the yield expression is both bringing values —the arguments to
send
— into the function, while simultaneously passing the calculated values out as the return value ofsend
.You can pass the generator returned by a call to that function to a more optimizable accumulator generator function like this one: