I have a list which is made up of dictionaries. I wish to subset the list, selecting the dictionaries based a comparison of element values (in this case, selecting only one dictionary per date, with the dict that's selected being the one with the largest realtime_start
value).
An example list is:
obs = [{'date': '2012-10-01',
'realtime_end': '2013-02-18',
'realtime_start': '2012-11-15',
'value': '231.751'},
{'date': '2012-10-01',
'realtime_end': '9999-12-31',
'realtime_start': '2012-12-19',
'value': '231.623'},
{'date': '2012-11-01',
'realtime_end': '2013-02-18',
'realtime_start': '2012-12-14',
'value': '231.025'},
{'date': '2012-11-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-01-19',
'value': '231.071'},
{'date': '2012-12-01',
'realtime_end': '2013-02-18',
'realtime_start': '2013-01-16',
'value': '230.979'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-02-19',
'value': '231.137'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-19',
'value': '231.197'},
{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-02-21',
'value': '231.198'},
{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-21',
'value': '231.222'}]
I wish to subset the list, such that it only contains one dict for each date, with the dict selected which has the largest realtime_start
value.
In this case, after the list has been subset, it would be:
sub = [ {'date': '2012-10-01',
'realtime_end': '9999-12-31',
'realtime_start': '2012-12-19',
'value': '231.623'},
{'date': '2012-11-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-01-19',
'value': '231.071'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-19',
'value': '231.197'},
{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-21',
'value': '231.222'}]
Additionally, say i specified a maximum date:
maxDate = "2013-02-21"
How would i subset such that the realtime_start
value was not greater than maxDate? In this case, i'd expect the following subset:
sub2 = [ {'date': '2012-10-01',
'realtime_end': '9999-12-31',
'realtime_start': '2012-12-19',
'value': '231.623'},
{'date': '2012-11-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-01-19',
'value': '231.071'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-02-19',
'value': '231.137'},
{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-02-21',
'value': '231.198'} ]
How would i write such a subset operation in Python 2.7.3? Is this possible in Python?
thanks
You basically want to group your entries by the date
field, and then perform operations on the groups of entries associated with each of those date
s. The way I do this sort of thing is to use a plain ol' dict
. In such cases, I think of a dict
as a special kind of set
—a "decorated set", if you will—, whose every (necessarily hashable) element is "decorated" with some (in general non-hashable) payload (i.e., the associated dictionary value). In your example, each element of this "decorated set" is one of the possible values for the date
field among all the dicts in obs
, and its associated payload is the list of all the dicts in obs
that have that key as their date
field.
Thus,
In [4]: dobs = dict()
In [5]: for o in obs:
...: d = o['date']
...: if d not in dobs:
...: dobs[d] = []
...: dobs[d].append(o)
...:
One can use dict.setdefault
to write the body of the for
-loop more succinctly, like this:
In [7]: for o in obs:
...: dobs.setdefault(o['date'], []).append(o)
...:
Or one can pre-load the dictionary with empty lists, and then just append to them without the need to check if the key is already in the dict:
In [9]: dobs = dict([(d, []) for d in set([e['date'] for e in obs])])
In [10]: for o in obs:
....: dobs[o['date']].append(o)
....:
After any of the above, you'll end up with a dictionary, dobs
, whose keys are date
's and whose values are lists of all the dicts in obs
that have the corresponding key as the date
value.
Now you can go to town with this dict, and apply any kind of function to its values. For example, to extract, for each date
, the dict having the most recent realtime_start
, you could do this:
In [11]: rts = lambda x: x['realtime_start']
In [12]: [sorted(e, key=rts)[-1] for e in dobs.values() if e]
Out[12]:
[{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-21',
'value': '231.222'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-19',
'value': '231.197'},
{'date': '2012-10-01',
'realtime_end': '9999-12-31',
'realtime_start': '2012-12-19',
'value': '231.623'},
{'date': '2012-11-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-01-19',
'value': '231.071'}]
(The if e
qualifier at the end of the comprehension above is not necessary here, but I include it in the name of "defensive programming". Without it, the code above would fail if any of the values in dobs
happens to be empty. We know that this won't be the case for dobs
, but it may become an issue in a more general setting. More about this below.)
You also ask how to perform the selection described above while capping realtime_start
at 2013-02-21
. For this problem, I find it conceptually cleaner to split the problem into two subproblems: first, generate subset of dobs
that meet the specified constraint on realtime_start
; then, perform the same operation as before on the restricted dict. Thus:
In [13]: dobs2 = dict([(k, [d for d in v if d['realtime_start'] <= maxDate])
....: for k, v in dobs.items()])
In [14]: [sorted(e, key=rts)[-1] for e in dobs2.values() if e]
Out[14]:
[{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-02-21',
'value': '231.198'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-02-19',
'value': '231.137'},
{'date': '2012-10-01',
'realtime_end': '9999-12-31',
'realtime_start': '2012-12-19',
'value': '231.623'},
{'date': '2012-11-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-01-19',
'value': '231.071'}]
Once again, the if e
qualifier is not necessary in this case, but if maxDate
were low enough that some groups ended up empty, then it would be essential. (Without it, the attempt to access the last element of the first-encountered empty list would raise an IndexError
exception.)
As you may have noticed, the ordering of the results above differs from yours. This is because the built-in Python dict
s do not preserve ordering. If the ordering of the original obs
list is significant, then you can replace all calls to dict
with calls to collections.OrderedDict
. E.g.:
In [15]: from collections import OrderedDict
In [16]: dobs = OrderedDict()
In [17]: for o in obs:
....: dobs.setdefault(o['date'], []).append(o)
....:
In [18]: [sorted(e, key=rts)[-1] for e in dobs.values()]
Out[18]:
[{'date': '2012-10-01',
'realtime_end': '9999-12-31',
'realtime_start': '2012-12-19',
'value': '231.623'},
{'date': '2012-11-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-01-19',
'value': '231.071'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-19',
'value': '231.197'},
{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-21',
'value': '231.222'}]
You could use itertools.groupby
:
>>> import itertools
>>> # sort so that the same dates are contiguous
>>> obs.sort(key=lambda x: x['date'])
>>> grouped = itertools.groupby(obs, lambda x: x['date'])
>>> m = [max(g, key=lambda x: x['realtime_start']) for k, g in grouped]
>>>
>>> import pprint
>>> pprint.pprint(m)
[{'date': '2012-10-01',
'realtime_end': '9999-12-31',
'realtime_start': '2012-12-19',
'value': '231.623'},
{'date': '2012-11-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-01-19',
'value': '231.071'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-19',
'value': '231.197'},
{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-03-21',
'value': '231.222'}]
You can add other conditions too:
>>> grouped = itertools.groupby(obs, lambda x: x['date'])
>>> m = [max((w for w in g if w['realtime_start'] <= maxDate),
key=lambda x: x['realtime_start']) for k, g in grouped]
>>> pprint.pprint(m)
[{'date': '2012-10-01',
'realtime_end': '9999-12-31',
'realtime_start': '2012-12-19',
'value': '231.623'},
{'date': '2012-11-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-01-19',
'value': '231.071'},
{'date': '2012-12-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-02-19',
'value': '231.137'},
{'date': '2013-01-01',
'realtime_end': '9999-12-31',
'realtime_start': '2013-02-21',
'value': '231.198'}]
But I'd recommend looking at my favourite Python data manipulation library, pandas: it's great for tabular and time-series data, and data manipulation using it would be much easier (and more R-like in functionality) than anything you could roll yourself.