Most pythonic way of counting matching elements in

2020-03-09 07:00发布

问题:

I have an iterable of entries on which I would like to gather some simple statistics, say the count of all numbers divisible by two and the count of all numbers divisible by three.

My first alternative, While only iterating through the list once and avoiding the list expansion (and keeping the split loop refactoring in mind), looks rather bloated:

(alt 1)

r = xrange(1, 10)

twos = 0
threes = 0

for v in r:
  if v % 2 == 0:
    twos+=1
  if v % 3 == 0:
    threes+=1

print twos
print threes

This looks rather nice, but has the drawback of expanding the expression to a list:

(alt 2)

r = xrange(1, 10)

print len([1 for v in r if v % 2 == 0])
print len([1 for v in r if v % 3 == 0])

What I would really like is something like a function like this:

(alt 3)

def count(iterable):
  n = 0
  for i in iterable:
    n += 1
  return n

r = xrange(1, 10)

print count(1 for v in r if v % 2 == 0)
print count(1 for v in r if v % 3 == 0)

But this looks a lot like something that could be done without a function. The final variant is this:

(alt 4)

r = xrange(1, 10)

print sum(1 for v in r if v % 2 == 0)
print sum(1 for v in r if v % 3 == 0)

and while the smallest (and in my book probably the most elegant) it doesn't feel like it expresses the intent very well.

So, my question to you is:

Which alternative do you like best to gather these types of stats? Feel free to supply your own alternative if you have something better.

To clear up some confusion below:

  • In reality my filter predicates are more complex than just this simple test.
  • The objects I iterate over are larger and more complex than just numbers
  • My filter functions are more different and hard to parameterize into one predicate

回答1:

Having to iterate over the list multiple times isn't elegant IMHO.

I'd probably create a function that allows doing:

twos, threes = countmatching(xrange(1,10),
                             lambda a: a % 2 == 0,
                             lambda a: a % 3 == 0)

A starting point would be something like this:

def countmatching(iterable, *predicates):
    v = [0] * len(predicates)
    for e in iterable:
        for i,p in enumerate(predicates):
            if p(e):
                v[i] += 1
    return tuple(v)

Btw, "itertools recipes" has a recipe for doing much like your alt4.

def quantify(seq, pred=None):
    "Count how many times the predicate is true in the sequence"
    return sum(imap(pred, seq))


回答2:

Alt 4! But maybe you should refactor the code to a function that takes an argument which should contain the divisible number (two and three). And then you could have a better functionname.

def methodName(divNumber, r):
  return sum(1 for v in r if v % divNumber == 0)


print methodName(2, xrange(1, 10))
print methodName(3, xrange(1, 10))


回答3:

You could use the filter function.

It filters a list (or strictly an iterable) producing a new list containing only the items for which the specified function evaluates to true.

r = xrange(1, 10)

def is_div_two(n):
    return n % 2 == 0

def is_div_three(n):
    return n % 3 == 0

print len(filter(is_div_two,r))
print len(filter(is_div_three,r))

This is good as it allows you keep your statistics logic contained in a function and the intent of the filter should be pretty clear.



回答4:

I would choose a small variant of your (alt 4):

def count(predicate, list):
    print sum(1 for x in list if predicate(x))

r = xrange(1, 10)

count(lambda x: x % 2 == 0, r)
count(lambda x: x % 3 == 0, r)
# ...

If you want to change what count does, change its implementation in one place.

Note: since your predicates are complex, you'll probably want to define them in functions instead of lambdas. And so you'll probably want to put all this in a class rather than the global namespace.



回答5:

Well you could do one list comprehension/expression to get a set of tuples with that stat test in them and then reduce that down to get the sums.


r=xrange(10)
s=( (v % 2 == 0, v % 3 == 0) for v in r )
def add_tuples(t1,t2):
     return tuple(x+y for x,y in zip(t1, t2))
sums=reduce(add_tuples, s, (0,0)) # (0,0) is starting amount

print sums[0] # sum of numbers divisible by 2
print sums[1] # sum of numbers divisible by 3

Using generator expression etc should mean you'll only run through the iterator once (unless reduce does anything odd?). Basically you'd be doing map/reduce...



回答6:

True booleans are coerced to unit integers, and false booleans to zero integers. So if you're happy to use scipy or numpy, make an array of integers for each element of your sequence, each array containing one element for each of your tests, and sum over the arrays. E.g.

>>> sum(scipy.array([c % 2 == 0, c % 3 == 0]) for c in xrange(10))
array([5, 4])


回答7:

I would definitely be looking at a numpy array instead of an iterable list if you just have numbers. You will almost certainly be able to do what you want with some terse arithmetic on the array.



回答8:

Not as terse as you are looking for, but more efficient, it actually works with any iterable, not just iterables you can loop over multiple times, and you can expand the things to check for without complicating it further:

r = xrange(1, 10)

counts = {
   2: 0,
   3: 0,
}

for v in r:
    for q in counts:
        if not v % q:
            counts[q] += 1
        # Or, more obscure:
        #counts[q] += not v % q

for q in counts:
    print "%s's: %s" % (q, counts[q])


回答9:

from itertools import groupby
from collections import defaultdict

def multiples(v):
    return 2 if v%2==0 else 3 if v%3==0 else None
d = defaultdict(list)

for k, values in groupby(range(10), multiples):
    if k is not None:
        d[k].extend(values)


回答10:

Inspired by the OO-stab above, I had to try my hands on one as well (although this is way overkill for the problem I'm trying to solve :)

class Stat(object):
  def update(self, n):
    raise NotImplementedError

  def get(self):
    raise NotImplementedError


class TwoStat(Stat):
  def __init__(self):
    self._twos = 0

  def update(self, n):
    if n % 2 == 0: self._twos += 1

  def get(self):
    return self._twos


class ThreeStat(Stat):
  def __init__(self):
    self._threes = 0

  def update(self, n):
    if n % 3 == 0: self._threes += 1

  def get(self):
    return self._threes


class StatCalculator(object):
  def __init__(self, stats):
    self._stats = stats

  def calculate(self, r):
    for v in r:
      for stat in self._stats:
        stat.update(v)
    return tuple(stat.get() for stat in self._stats)


s = StatCalculator([TwoStat(), ThreeStat()])

r = xrange(1, 10)
print s.calculate(r)


回答11:

Alt 3, for the reason that it doesn't use memory proportional to the number of "hits". Given a pathological case like xrange(one_trillion), many of the other offered solutions would fail badly.



回答12:

The idea here is to use reduction to avoid repeated iterations. Also, this does not create any extra data structures, if memory is an issue for you. You start with a dictionary with your counters ({'div2': 0, 'div3': 0}) and increment them along the iteration.

def increment_stats(stats, n):
    if n % 2 == 0: stats['div2'] += 1
    if n % 3 == 0: stats['div3'] += 1
    return stats

r = xrange(1, 10)
stats = reduce(increment_stats, r, {'div2': 0, 'div3': 0})
print stats

If you want to count anything more complicated than divisors, it would be appropriate to use a more object-oriented approach (with the same advantages), encapsulating the logic for stats extraction.

class Stats:

    def __init__(self, div2=0, div3=0):
        self.div2 = div2
        self.div3 = div3

    def increment(self, n):
        if n % 2 == 0: self.div2 += 1
        if n % 3 == 0: self.div3 += 1
        return self

    def __repr__(self):
        return 'Stats(%d, %d)' % (self.div2, self.div3)

r = xrange(1, 10)
stats = reduce(lambda stats, n: stats.increment(n), r, Stats())
print stats

Please point out any mistakes.

@Henrik: I think the first approach is less maintainable since you have to control initialization of the dictionary in one place and update in another, as well as having to use strings to refer to each stat (instead of having attributes). And I do not think OO is overkill in this case, for you said the predicates and objects will be complex in your application. In fact if the predicates were really simple, I wouldn't even bother to use a dictionary, a single fixed size list would be just fine. Cheers :)