Subset list based on value of dictionary element

2019-06-13 17:36发布

问题:

I have a list which is made up of dictionaries. I wish to subset the list, selecting the dictionaries based a comparison of element values (in this case, selecting only one dictionary per date, with the dict that's selected being the one with the largest realtime_start value).

An example list is:

obs = [{'date': '2012-10-01',
  'realtime_end': '2013-02-18',
  'realtime_start': '2012-11-15',
  'value': '231.751'},
 {'date': '2012-10-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2012-12-19',
  'value': '231.623'},
 {'date': '2012-11-01',
  'realtime_end': '2013-02-18',
  'realtime_start': '2012-12-14',
  'value': '231.025'},
 {'date': '2012-11-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-01-19',
  'value': '231.071'},
 {'date': '2012-12-01',
  'realtime_end': '2013-02-18',
  'realtime_start': '2013-01-16',
  'value': '230.979'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-02-19',
  'value': '231.137'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-19',
  'value': '231.197'},
 {'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-02-21',
  'value': '231.198'},
 {'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-21',
  'value': '231.222'}]

I wish to subset the list, such that it only contains one dict for each date, with the dict selected which has the largest realtime_start value.

In this case, after the list has been subset, it would be:

sub = [ {'date': '2012-10-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2012-12-19',
  'value': '231.623'},
 {'date': '2012-11-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-01-19',
  'value': '231.071'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-19',
  'value': '231.197'},
 {'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-21',
  'value': '231.222'}]

Additionally, say i specified a maximum date:

maxDate = "2013-02-21"

How would i subset such that the realtime_start value was not greater than maxDate? In this case, i'd expect the following subset:

sub2 = [ {'date': '2012-10-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2012-12-19',
  'value': '231.623'},
 {'date': '2012-11-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-01-19',
  'value': '231.071'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-02-19',
  'value': '231.137'},
 {'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-02-21',
  'value': '231.198'} ]

How would i write such a subset operation in Python 2.7.3? Is this possible in Python?

thanks

回答1:

You basically want to group your entries by the date field, and then perform operations on the groups of entries associated with each of those dates. The way I do this sort of thing is to use a plain ol' dict. In such cases, I think of a dict as a special kind of set—a "decorated set", if you will—, whose every (necessarily hashable) element is "decorated" with some (in general non-hashable) payload (i.e., the associated dictionary value). In your example, each element of this "decorated set" is one of the possible values for the date field among all the dicts in obs, and its associated payload is the list of all the dicts in obs that have that key as their date field.

Thus,

In [4]: dobs = dict()
In [5]: for o in obs:
   ...:     d = o['date']
   ...:     if d not in dobs:
   ...:         dobs[d] = []
   ...:     dobs[d].append(o)
   ...: 

One can use dict.setdefault to write the body of the for-loop more succinctly, like this:

In [7]: for o in obs:
   ...:     dobs.setdefault(o['date'], []).append(o)
   ...: 

Or one can pre-load the dictionary with empty lists, and then just append to them without the need to check if the key is already in the dict:

In [9]: dobs = dict([(d, []) for d in set([e['date'] for e in obs])])
In [10]: for o in obs:
   ....:     dobs[o['date']].append(o)
   ....: 

After any of the above, you'll end up with a dictionary, dobs, whose keys are date's and whose values are lists of all the dicts in obs that have the corresponding key as the date value.

Now you can go to town with this dict, and apply any kind of function to its values. For example, to extract, for each date, the dict having the most recent realtime_start, you could do this:

In [11]: rts = lambda x: x['realtime_start']
In [12]: [sorted(e, key=rts)[-1] for e in dobs.values() if e]
Out[12]: 
[{'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-21',
  'value': '231.222'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-19',
  'value': '231.197'},
 {'date': '2012-10-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2012-12-19',
  'value': '231.623'},
 {'date': '2012-11-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-01-19',
  'value': '231.071'}]

(The if e qualifier at the end of the comprehension above is not necessary here, but I include it in the name of "defensive programming". Without it, the code above would fail if any of the values in dobs happens to be empty. We know that this won't be the case for dobs, but it may become an issue in a more general setting. More about this below.)

You also ask how to perform the selection described above while capping realtime_start at 2013-02-21. For this problem, I find it conceptually cleaner to split the problem into two subproblems: first, generate subset of dobs that meet the specified constraint on realtime_start; then, perform the same operation as before on the restricted dict. Thus:

In [13]: dobs2 = dict([(k, [d for d in v if d['realtime_start'] <= maxDate])
   ....:               for k, v in dobs.items()])
In [14]: [sorted(e, key=rts)[-1] for e in dobs2.values() if e]
Out[14]: 
[{'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-02-21',
  'value': '231.198'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-02-19',
  'value': '231.137'},
 {'date': '2012-10-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2012-12-19',
  'value': '231.623'},
 {'date': '2012-11-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-01-19',
  'value': '231.071'}]

Once again, the if e qualifier is not necessary in this case, but if maxDate were low enough that some groups ended up empty, then it would be essential. (Without it, the attempt to access the last element of the first-encountered empty list would raise an IndexError exception.)

As you may have noticed, the ordering of the results above differs from yours. This is because the built-in Python dicts do not preserve ordering. If the ordering of the original obs list is significant, then you can replace all calls to dict with calls to collections.OrderedDict. E.g.:

In [15]: from collections import OrderedDict
In [16]: dobs = OrderedDict()
In [17]: for o in obs:
   ....:     dobs.setdefault(o['date'], []).append(o)
   ....: 
In [18]: [sorted(e, key=rts)[-1] for e in dobs.values()]
Out[18]: 
[{'date': '2012-10-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2012-12-19',
  'value': '231.623'},
 {'date': '2012-11-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-01-19',
  'value': '231.071'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-19',
  'value': '231.197'},
 {'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-21',
  'value': '231.222'}]


回答2:

You could use itertools.groupby:

>>> import itertools
>>> # sort so that the same dates are contiguous
>>> obs.sort(key=lambda x: x['date'])
>>> grouped = itertools.groupby(obs, lambda x: x['date'])
>>> m = [max(g, key=lambda x: x['realtime_start']) for k, g in grouped]
>>> 
>>> import pprint
>>> pprint.pprint(m)
[{'date': '2012-10-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2012-12-19',
  'value': '231.623'},
 {'date': '2012-11-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-01-19',
  'value': '231.071'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-19',
  'value': '231.197'},
 {'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-03-21',
  'value': '231.222'}]

You can add other conditions too:

>>> grouped = itertools.groupby(obs, lambda x: x['date'])
>>> m = [max((w for w in g if w['realtime_start'] <= maxDate),
         key=lambda x: x['realtime_start']) for k, g in grouped]
>>> pprint.pprint(m)
[{'date': '2012-10-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2012-12-19',
  'value': '231.623'},
 {'date': '2012-11-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-01-19',
  'value': '231.071'},
 {'date': '2012-12-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-02-19',
  'value': '231.137'},
 {'date': '2013-01-01',
  'realtime_end': '9999-12-31',
  'realtime_start': '2013-02-21',
  'value': '231.198'}]

But I'd recommend looking at my favourite Python data manipulation library, pandas: it's great for tabular and time-series data, and data manipulation using it would be much easier (and more R-like in functionality) than anything you could roll yourself.