What is the best way to count observations by date in a Pandas DataFrame when the timestamps are non-unique?
import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame({'User' : ['A', 'B', 'C'] * 40,
                   'Value' : np.random.randn(120),
                   'Time' : [np.random.choice(pd.date_range(datetime.datetime(2013,1,1,0,0,0), datetime.datetime(2013,1,3,0,0,0), freq='H')) for i in range(120)]})
Ideally, the output would provide the number of observations per day (or some other higher order unit of time). This could then be used to plot the activity over time.
2013-01-01 60
2013-01-02 60
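(To illustrate the plotting step: a minimal sketch, assuming the daily counts end up in a Series called daily_counts; the variable name is just a placeholder.)

import matplotlib.pyplot as plt

# daily_counts is assumed to be a Series of per-day counts, indexed by date
daily_counts.plot(kind='bar')
plt.ylabel('observations per day')
plt.show()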
Edit: another, faster solution is to use value_counts (and normalize):
In [41]: %timeit df1 = df.set_index('Time'); pd.value_counts(df1.index.normalize(), sort=False)
1000 loops, best of 3: 586 µs per loop
I had thought this was more concisely written as a resample, if you use a DatetimeIndex. However, it turns out to be significantly slower, and (surprisingly) the Counter solution is fastest!
In [11]: df1 = df.set_index('Time')
In [12]: df1.User.resample('D', how=len)
Out[12]:
Time
2013-01-01 59
2013-01-02 58
2013-01-03 3
Freq: D, Name: User, dtype: int64
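(Side note: the how= argument to resample was later removed from pandas; on recent versions the equivalent would be spelled roughly as below, assuming the same df1.)

df1.User.resample('D').size()    # counts all rows per day
df1.User.resample('D').count()   # same, but ignores NaN values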
It's always worth checking out some timeits for these:
In [21]: %timeit df1.User.resample('D', how=len)
1000 loops, best of 3: 720 µs per loop
Unfortunately, the set_index makes this more expensive:
In [22]: %timeit df1 = df.set_index('Time'); df1.User.resample('D', how=len)
1000 loops, best of 3: 1.1 ms per loop
Comparing:
In [23]: %%timeit
....: grouped_dates = df.groupby(df['Time'].apply(lambda x : x.date()))
....: grouped_dates['Time'].aggregate(len)
....:
1000 loops, best of 3: 788 µs per loop
In [24]: %%timeit
....: counted_dates = Counter(df['Time'].apply(lambda x: x.date()))
....: counted_series = pd.Series(counted_dates)
....: counted_series.index = pd.to_datetime(counted_series.index)
....:
1000 loops, best of 3: 568 µs per loop
I had suspected that with more dates it would be different...
In [31]: df = pd.DataFrame({'User' : ['A', 'B', 'C'] * 400,
   ....:                    'Value' : np.random.randn(1200),
   ....:                    'Time' : [np.random.choice(pd.date_range(datetime.datetime(1992,1,1,0,0,0), datetime.datetime(2014,1,1,0,0,0), freq='H')) for i in range(1200)]})
In [32]: %timeit df1 = df.set_index('Time'); df1.User.resample('D', how=len)
10 loops, best of 3: 28.7 ms per loop
In [33]: %%timeit
....: grouped_dates = df.groupby(df['Time'].apply(lambda x : x.date()))
....: grouped_dates['Time'].aggregate(len)
....:
100 loops, best of 3: 6.82 ms per loop
In [34]: %%timeit
....: counted_dates = Counter(df['Time'].apply(lambda x: x.date()))
....: counted_series = pd.Series(counted_dates)
....: counted_series.index = pd.to_datetime(counted_series.index)
....:
100 loops, best of 3: 3.04 ms per loop
But Counter still wins...!
Edit: but it is smashed by value_counts:
In [42]: %timeit df1 = df.set_index('Time'); pd.value_counts(df1.index.normalize(), sort=False)
1000 loops, best of 3: 989 µs per loop
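(On recent pandas the same value_counts idea can also be written without the set_index step, via the .dt accessor; a rough sketch, with counts depending on the random data.)

# count rows per calendar day directly from the 'Time' column
daily_counts = df['Time'].dt.normalize().value_counts(sort=False).sort_index()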
The "un-Panda-ic" way of doing this would be using a Counter object on the series of datetimes converted to dates, converting this counter back to a series, and coercing the index on this series to datetimes.
In[1]: from collections import Counter
In[2]: counted_dates = Counter(df['Time'].apply(lambda x: x.date()))
In[3]: counted_series = pd.Series(counted_dates)
In[4]: counted_series.index = pd.to_datetime(counted_series.index)
In[5]: counted_series
Out[5]:
2013-01-01 60
2013-01-02 60
A more "Panda-ic" way would be to use a groupby operation on the series and then aggregate the output by length.
In[1]: grouped_dates = df.groupby(df['Time'].apply(lambda x : x.date()))
In[2]: grouped_dates['Time'].aggregate(len)
Out[2]:
2013-01-01 60
2013-01-02 60
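(A variant of the same groupby idea, sketched with pd.Grouper so the grouping happens on the datetime column itself; like resample, this keeps a DatetimeIndex and fills in days with no observations.)

df.groupby(pd.Grouper(key='Time', freq='D')).size()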
EDIT: Another highly concise possibility, borrowed from here, is to use the nunique method:
In[1]: df.groupby(df['Time'].apply(lambda x : x.date())).agg({'Time':pd.Series.nunique})
Out[1]:
2013-01-01 60
2013-01-02 60
Besides stylistic differences, does one have significant performance advantages over the other? Are there other methods built-in that I've overlooked?
len(Series.unique()) might be even faster.
On my PC:
%timeit df1 = df.set_index('Time'); pd.value_counts(df1.index.normalize(), sort=False)
1000 loops, best of 3: 2.06 ms per loop
while
%timeit df1 = df.set_index('Time'); len(df1.index.normalize().unique())
1000 loops, best of 3: 1.04 ms per loop
Interestingly, len(Series.unique()) is usually much faster than Series.nunique().
For small arrays with up to a few thousand items it's 10-15 times faster; for larger arrays with millions of items it's 3-4 times faster.
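(If you want to check this claim on your own data, a quick sketch; the array size and value range below are just illustrative.)

s = pd.Series(np.random.randint(0, 1000, 1000000))
%timeit len(s.unique())
%timeit s.nunique()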