Python: Bar chart - plot sum of values by a) year

Posted 2019-08-14 05:52

Question:

I have time series data by date (YYYY-MM-DD) with three columns: returns, pnl and number of trades:

date             returns       pnl      no_trades
1998-01-01         0.01        0.05         5
1998-01-02        -0.04        0.12         2
...
2010-12-31         0.05        0.25         3

Now I would like to show horizontal bar charts with a) the average of the returns b) sum of the pnls

by:

1) year, i.e. 1998, 1999, ..., 2010

2) quarter across all years, i.e. Q1 (YYYY-01-01 to YYYY-03-31), Q2, .., Q4

Additionally, the total number of trades for each group in 1) and 2) should be shown as a number next to each horizontal bar.

So in my opinion there need to be two separate steps:

1) Get the data in the right format

2) Feed the data to the plot, overlaying multiple plots where needed.

Sample data:

from datetime import datetime

import numpy as np
import pandas as pd

start = datetime(1998, 1, 1)
end = datetime(2001, 12, 31)
dates = pd.date_range(start, end, freq = 'D')

df = pd.DataFrame(np.random.randn(len(dates), 3), index = dates, 
                  columns = ['returns', 'pnl', 'no_trades'])
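
Note that `np.random.randn` fills `no_trades` with fractional and negative values, unlike the integer counts in the table above. A minimal sketch of sample data closer to that table, assuming a fixed seed and arbitrary distribution parameters (both are illustrative choices, not taken from the real data):

```python
import numpy as np
import pandas as pd

# Seed and distribution parameters are arbitrary, chosen only for illustration.
rng = np.random.default_rng(0)
dates = pd.date_range("1998-01-01", "2001-12-31", freq="D")

sample = pd.DataFrame(
    {
        "returns": rng.normal(0.0, 0.02, len(dates)),   # daily returns
        "pnl": rng.normal(0.0, 0.1, len(dates)),        # daily pnl
        "no_trades": rng.integers(0, 10, len(dates)),   # whole trades, 0 to 9
    },
    index=dates,
)
```

With integer trade counts, the per-year and per-quarter sums of `no_trades` are whole numbers that can be printed next to the bars as they are.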

So that could be two horizontal bar charts for year and quarter each:

1) one for returns: bar chart, number in the middle of the bar, sum of no_trades at the end of the bar

2) one for pnl: bar chart, number in the middle of the bar, sum of no_trades at the end of the bar

Plus a dotted vertical line across the bars showing the average returns and pnl.

I could do it in Excel (which in fact means adding columns with the respective view and then building a pivot chart), but I would prefer an "automated" way that I can reproduce (or at least understand) in Python.

Edit: as discussed in the comment below, this is how far I've got; however, I am not sure whether this is the fastest approach with regard to 1). I am currently working on 2).

# 'date' is assumed to be a datetime column here; only the value column is aggregated.
df_ret_year = df.groupby(df['date'].dt.year)[['returns']].mean()
df_ret_quarter = df.groupby(df['date'].dt.quarter)[['returns']].mean()

df_pnl_year = df.groupby(df['date'].dt.year)[['pnl']].sum()
df_pnl_quarter = df.groupby(df['date'].dt.quarter)[['pnl']].sum()

# Total number of trades per year and per quarter.
df_trades_year = df.groupby(df['date'].dt.year)[['no_trades']].sum()
df_trades_quarter = df.groupby(df['date'].dt.quarter)[['no_trades']].sum()

Answer 1:

from datetime import datetime

import numpy as np
import pandas as pd

start = datetime(1998, 1, 1)
end = datetime(2001, 12, 31)
dates = pd.date_range(start, end, freq = 'D')

Create the DataFrame with a (year, quarter) MultiIndex:

index = pd.MultiIndex.from_tuples([(thing.year, thing.quarter) for thing in dates])
df = pd.DataFrame(np.random.randn(len(dates), 3), index = index, 
                  columns = ['returns', 'pnl', 'no_trades'])
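
The list comprehension can also be replaced by building the index straight from the `DatetimeIndex` attributes. A sketch, assuming the same date range; the level names `year` and `quarter` are illustrative additions, since the index above is unnamed:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1998-01-01", "2001-12-31", freq="D")

# Build the (year, quarter) index directly from the DatetimeIndex attributes.
# The level names are illustrative; the original index is unnamed.
index = pd.MultiIndex.from_arrays(
    [dates.year, dates.quarter], names=["year", "quarter"]
)
df = pd.DataFrame(
    np.random.randn(len(dates), 3),
    index=index,
    columns=["returns", "pnl", "no_trades"],
)

# Named levels can then be used instead of positions.
gb_yr = df.groupby(level="year")
gb_qtr = df.groupby(level="quarter")
```

Grouping by level name gives the same groups as grouping by position, and the code stays readable if the level order ever changes.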

Then you can group by year, quarter or year and quarter:

gb_yr = df.groupby(level=0)
gb_qtr = df.groupby(level=1)
gb_yr_qtr = df.groupby(level=(0,1))

>>> 
>>> # yearly means
>>> gb_yr.mean()
       returns       pnl  no_trades
1998  0.080989 -0.019115   0.142576
1999 -0.040881 -0.005331   0.029815
2000 -0.036227 -0.100028  -0.009175
2001  0.097230 -0.019342  -0.089498
>>> 
>>> # quarterly means across all years
>>> gb_qtr.mean()
    returns       pnl  no_trades
1  0.036992  0.023923   0.048497
2  0.053445 -0.039583   0.076721
3  0.003891 -0.016180   0.004619
4  0.007145 -0.111050  -0.054988
>>> 
>>> # means by year and quarter
>>> gb_yr_qtr.mean()
         returns       pnl  no_trades
1998 1 -0.062570  0.139856   0.105288
     2  0.044946 -0.008685   0.200393
     3  0.152209  0.007341   0.119093
     4  0.185858 -0.211401   0.145347
1999 1  0.085799  0.072655   0.054060
     2  0.111595  0.002972   0.068792
     3 -0.194506 -0.093435   0.107210
     4 -0.161999 -0.001732  -0.109851
2000 1  0.001543 -0.083488   0.174226
     2 -0.064343 -0.158431  -0.071415
     3 -0.036334 -0.037008  -0.068717
     4 -0.045669 -0.121640  -0.069474
2001 1  0.123592 -0.032138  -0.140982
     2  0.121582  0.005810   0.109115
     3  0.094194  0.058382  -0.139110
     4  0.050388 -0.109429  -0.185975
>>>
>>> # operate on single columns
>>> gb_yr['pnl'].sum()
1998    -6.976917
1999    -1.945935
2000   -36.610206
2001    -7.060010
Name: pnl, dtype: float64
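
Since the question wants a different statistic per column (mean of returns, sum of pnl, sum of no_trades), one option is to pass a column-to-function mapping to `agg`. A sketch on a tiny hand-made frame, indexed by (year, quarter) like the one above; the values are made up for illustration:

```python
import pandas as pd

# Tiny made-up frame with a (year, quarter) MultiIndex, for illustration only.
index = pd.MultiIndex.from_tuples(
    [(1998, 1), (1998, 1), (1998, 2), (1999, 1)]
)
df = pd.DataFrame(
    {
        "returns": [0.01, -0.04, 0.05, 0.02],
        "pnl": [0.05, 0.12, 0.25, 0.10],
        "no_trades": [5, 2, 3, 4],
    },
    index=index,
)

# One call per grouping: a different aggregation for each column.
how = {"returns": "mean", "pnl": "sum", "no_trades": "sum"}
by_year = df.groupby(level=0).agg(how)
by_quarter = df.groupby(level=1).agg(how)
print(by_year)
```

Each result is a single DataFrame holding everything one chart needs: the bar lengths and the trade counts to annotate them with.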

>>> # plotting
>>> from matplotlib import pyplot as plt
>>> gb_yr.mean().plot()
<matplotlib.axes._subplots.AxesSubplot object at 0x000000000C04BF28>
>>> plt.show()
>>> plt.close()
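
The call above draws a line chart. For the horizontal bars described in the question (value inside the bar, trade count at the end, dotted vertical line at the average), a sketch along these lines might work. The aggregated numbers are made up, the `Agg` backend is chosen only so the script runs without a display, and `Axes.bar_label` requires matplotlib 3.4 or newer:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly aggregates standing in for a groupby result.
by_year = pd.DataFrame(
    {"pnl": [6.9, 1.9, 3.6, 7.0], "no_trades": [1204, 1130, 1377, 1251]},
    index=[1998, 1999, 2000, 2001],
)

fig, ax = plt.subplots()
bars = ax.barh([str(year) for year in by_year.index], by_year["pnl"])

# Value in the middle of each bar, trade count just past its end.
ax.bar_label(bars, fmt="%.2f", label_type="center")
ax.bar_label(bars, labels=[f"n={n}" for n in by_year["no_trades"]], padding=4)

# Dotted vertical line at the average across the bars.
avg = by_year["pnl"].mean()
ax.axvline(avg, linestyle=":", color="black")
ax.set_xlabel("sum of pnl")

# Call plt.show() (interactive backend) or fig.savefig(...) to view the result.
```

The same code can be reused for the quarterly view and for the returns chart by swapping in the corresponding aggregated frame and column.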