Let's say that I have a value that I've measured every day for the past 90 days. I would like to plot a histogram of the values, but I want to make it easy for the viewer to see where the measurements have accumulated over certain non-overlapping subsets of the past 90 days. I want to do this by "subdividing" each bar of the histogram into chunks. One chunk for the earliest observations, one for more recent, one for the most recent.
This sounds like a job for df.plot(kind='bar', stacked=True), but I'm having trouble getting the details right.
Here's what I have so far:
import numpy as np
import pandas as pd
import seaborn as sbn
np.random.seed(0)
data = pd.DataFrame({'values': np.random.randn(90)})
data['bin'] = pd.cut(data['values'], 15, labels=False)
# Count the observations per bin within each period
forhist = pd.DataFrame({'first70': data[:70].groupby('bin').size(),
                        'next15': data[70:85].groupby('bin').size(),
                        'last5': data[85:].groupby('bin').size()})
forhist.plot(kind='bar', stacked=True)
And that gives me:
This graph has some shortcomings:
- The bars are stacked in the wrong order. last5 should be on top and next15 in the middle, i.e. they should be stacked in the order of the columns in forhist.
- There is horizontal space between the bars.
- The x-axis is labeled with integers rather than something indicative of the values the bins represent. My "first choice" would be to have the x-axis labelled exactly as it would be if I just ran data['values'].hist(). My "second choice" would be to have the x-axis labelled with the "bin names" that I would get if I did pd.cut(data['values'], 15). In my code, I used labels=False because if I didn't do that, it would have used the bin edge labels (as strings) as the bar labels, and it would have put these in alphabetical order, making the graph basically useless.
What's the best way to approach this? I feel like I'm using very clumsy functions so far.
Ok, here's one way to attack it, using features from the matplotlib hist function itself:
Which gives:
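A minimal sketch of this approach, assuming the same random data as in the question. Passing a list of arrays to hist with stacked=True, and taking shared bin edges from np.histogram, is one plausible reading of it rather than a definitive implementation; the names chunks, labels, and edges are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
values = np.random.randn(90)

# Split the 90 daily measurements into three non-overlapping periods,
# listed in the order they should be stacked (first at the bottom).
chunks = [values[:70], values[70:85], values[85:]]
labels = ['first70', 'next15', 'last5']

# Compute the bin edges once from the full data set so that every
# period is counted against the same 15 bins.
total_counts, edges = np.histogram(values, bins=15)

fig, ax = plt.subplots()
# With a list of arrays and stacked=True, hist draws one stacked
# segment per array, on a numeric x-axis with adjacent bars.
counts, edges, patches = ax.hist(chunks, bins=edges, stacked=True, label=labels)
ax.legend()
# Call plt.show() here to display the figure interactively.
```

Because hist draws on a numeric axis, the x-axis is labelled with the measured values, as with data['values'].hist(), and the bars touch each other. The segments are stacked in the order of the list, so last5 ends up on top.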