How to sum coupled values in a dict-like structure

Posted 2019-08-14 18:45

Question:

I have an xlsx that I'm parsing with openpyxl.

Column A is Product Name, Column B is Revenue, and I want to extract each product-revenue pair into a dict. Were there no duplicate products, it would simply be a matter of creating a dict by mapping ws.columns appropriately.

The problem is, there are multiple entries for some (but not all) products. For these, I need to sum the values in question, and just return a single key for those products (as for the rest). So if my revenue spreadsheet contains the following:

I want to sum the values of Revenue for Banana before returning the dict. The desired outcome then is:

{'Banana': 7.2, 'Apple': 1.7, 'Pear': 6.2, 'Kiwi': 1.2}

The following would work OK were there no duplicates:

revenue = {}
for i, product in enumerate(ws.columns[0]):
    # A later row with the same product overwrites the earlier value
    revenue[product.value] = ws.columns[1][i].value

But obviously it breaks down when it encounters duplicates. I could try using a MultiDict(), which will give a structure from which I can perform the addition and create my final dict:

d = MultiDict()
for i in range(len(ws.columns[1])):
    d.add(ws.columns[0][i].value, ws.columns[1][i].value)

This leaves me with a MultiDict, which itself is actually a list of tuples, and it all gets a tad convoluted. Is there a neater or standard-library way of achieving the same-key-multiple-times data structure? What about employing zip()? Doesn't necessarily have to be dict-like. I just need to be able to create a dict from it (and then perform the addition).

Answer 1:

This should be close to what you want, assuming you can transform your data to a list of key-value tuples:

list_key_value_tuples = [("A", 1), ("B", 2), ("A", 3)]

d = {}
for key, value in list_key_value_tuples:
    d[key] = d.get(key, 0) + value

print(d)
# {'A': 4, 'B': 2}
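
To get the list of key-value tuples from the two columns, zip() is probably enough. The following is a minimal sketch, assuming each cell exposes a .value attribute as openpyxl cells do. The Cell namedtuple, the helper names, and the sample columns are hypothetical stand-ins so that the snippet runs without a workbook:

```python
from collections import namedtuple

# Hypothetical stand-in for an openpyxl cell; only .value is used.
Cell = namedtuple("Cell", "value")


def column_pairs(name_cells, revenue_cells):
    """Pair up two columns of cells as (name, revenue) tuples."""
    return [(n.value, r.value) for n, r in zip(name_cells, revenue_cells)]


def sum_pairs(pairs):
    """Sum values per key, using dict.get with a default of 0."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


# Made-up sample data standing in for ws.columns[0] and ws.columns[1].
names = [Cell("A"), Cell("B"), Cell("A")]
revenues = [Cell(1), Cell(2), Cell(3)]

pairs = column_pairs(names, revenues)
totals = sum_pairs(pairs)
print(totals)
```

Note that zip() stops at the shorter of the two columns, so a trailing product without a revenue cell would be dropped rather than counted as 0.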


Answer 2:

collections.defaultdict was made for this type of use case.

>>> import collections
>>> d = collections.defaultdict(float)
>>> p = [('Kiwi', 1.2), ('Banana', 3.2), ('Pear', 6.2), ('Banana', 2.3), ('Apple', 1.7), ('Banana', 1.7)]
>>> for k, v in p:
...     d[k] += v
...
>>> d
defaultdict(<type 'float'>, {'Kiwi': 1.2, 'Pear': 6.2, 'Banana': 7.2, 'Apple': 1.7})
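
If a plain dict is needed at the end, the defaultdict can be converted with dict(). Here is a small sketch that wraps the loop above in a helper. The function name sum_by_key is only illustrative, not something from the question:

```python
from collections import defaultdict


def sum_by_key(pairs):
    """Return a plain dict mapping each key to the sum of its values."""
    totals = defaultdict(float)  # missing keys start at 0.0
    for key, value in pairs:
        totals[key] += value
    return dict(totals)  # drop the defaultdict wrapper


p = [('Kiwi', 1.2), ('Banana', 3.2), ('Pear', 6.2),
     ('Banana', 2.3), ('Apple', 1.7), ('Banana', 1.7)]
revenue = sum_by_key(p)
print(sorted(revenue))
```

Converting at the end also means a later lookup of a missing product raises KeyError instead of silently inserting 0.0.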


Answer 3:

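Assuming the second column is no longer than the first, you can group the rows by the value in the first column and sum the rest. Note that groupby only merges adjacent rows, so the rows are sorted by product first:

from itertools import groupby, izip_longest  # zip_longest on Python 3
from operator import itemgetter

# Pair the cell values; pad a shorter revenue column with 0.
rows = izip_longest((c.value for c in ws.columns[0]),
                    (c.value for c in ws.columns[1]), fillvalue=0)

# Sort by product name so that duplicates end up next to each other.
result = dict((k, sum(g[1] for g in v))
              for k, v in groupby(sorted(rows, key=itemgetter(0)), itemgetter(0)))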
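
One detail worth checking with groupby: it only merges adjacent items that share a key, so unsorted input can yield several groups for the same product. Here is a short sketch on plain tuples showing the difference. The sample rows and the helper name sum_groups are made up for illustration:

```python
from itertools import groupby
from operator import itemgetter

rows = [("Banana", 3), ("Pear", 6), ("Banana", 2)]


def sum_groups(rows):
    """Sum the second field over each run of equal first fields."""
    return [(key, sum(r[1] for r in grp))
            for key, grp in groupby(rows, itemgetter(0))]


unsorted_groups = sum_groups(rows)
sorted_groups = sum_groups(sorted(rows, key=itemgetter(0)))

print(unsorted_groups)  # Banana shows up in two separate groups
print(sorted_groups)    # one entry per product
```

Passing the unsorted groups to dict() would silently keep only the last Banana group, which is why the sort matters.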