What is the best way to count observations by date in a Pandas DataFrame when the timestamps are non-unique?
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'User': ['A', 'B', 'C'] * 40,
                   'Value': np.random.randn(120),
                   'Time': [np.random.choice(pd.date_range(datetime.datetime(2013, 1, 1, 0, 0, 0), datetime.datetime(2013, 1, 3, 0, 0, 0), freq='H')) for i in range(120)]})
Ideally, the output would provide the number of observations per day (or some other higher-order unit of time). This could then be used to plot the activity over time.
2013-01-01 60
2013-01-02 60
Edit: another solution which is faster is to use value_counts (and normalize); it's sketched below alongside the resample version.

I had thought this was more concisely written as a resample, if you use a DatetimeIndex. However it seems to be significantly slower, and (surprisingly) the Counter solution is fastest!
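Sketches of both, assuming the example df above (resample's API has changed across pandas versions: at the time this read ts.resample('D', how=len), while modern pandas spells it .size()):

ts = df.set_index('Time')  # resample buckets on the index, so move the timestamps there

# count rows per calendar day
per_day = ts.resample('D').size()

# value_counts route: normalize() floors each timestamp to midnight,
# so tallying equal values counts observations per day
per_day_vc = ts.index.normalize().value_counts().sort_index()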
It's always worth checking out some timeits for these. Unfortunately the set_index makes the resample approach more expensive. Comparing them, I had suspected that for more dates it would be different... but Counter still wins...!

Edit: but it is smashed by value_counts.
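A sketch of how the comparison can be reproduced in IPython; the larger frame at the end uses a made-up size and date span, just to exercise the "more dates" case, and the actual numbers will vary by machine:

from collections import Counter

ts = df.set_index('Time')

%timeit pd.Series(Counter(df['Time'].dt.date))            # Counter
%timeit df.groupby(df['Time'].dt.date)['Value'].agg(len)  # groupby
%timeit df.set_index('Time').resample('D').size()         # resample, paying for set_index
%timeit ts.resample('D').size()                           # resample alone
%timeit ts.index.normalize().value_counts()               # value_counts

# a frame with many more distinct dates (sizes are arbitrary):
big = pd.DataFrame({'Time': np.random.choice(pd.date_range('2000-01-01', '2013-01-01', freq='H'), 10 ** 5)})
%timeit pd.Series(Counter(big['Time'].dt.date))
%timeit big['Time'].dt.normalize().value_counts()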
len(Series.unique()) might be even faster. On my PC, len(Series.unique()) is usually much faster than Series.nunique(): for small arrays with up to a few thousand items it's 10-15 times faster, while for larger arrays with millions of items it's 3-4 times faster.
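A sketch of that micro-benchmark in IPython (array sizes and value ranges are arbitrary choices):

small = pd.Series(np.random.randint(0, 100, 1000))
large = pd.Series(np.random.randint(0, 10 ** 6, 10 ** 7))

%timeit len(small.unique())
%timeit small.nunique()
%timeit len(large.unique())
%timeit large.nunique()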
The "un-Panda-ic" way of doing this would be using a Counter object on the series of datetimes converted to dates, converting this counter back to a series, and coercing the index on this series to datetimes.
A more "Panda-ic" way would be to use a groupby operation on the series and then aggregate the output by length.
EDIT: Another highly concise possibility, borrowed from here, is to use the nunique method:
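The borrowed example isn't shown, so this is only a guess at the intended usage; nunique counts distinct values, which per day might mean distinct users (the choice of the User column is an assumption):

# distinct users active per day; swap in whichever column makes sense
df.groupby(df['Time'].dt.date)['User'].nunique()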
Besides stylistic differences, does one have significant performance advantages over the other? Are there other built-in methods that I've overlooked?