I would like to compare a set of distributions of scores (score
), grouped by some categories (centrality
) and colored by some other (model
). I've tried the following with seaborn:
plt.figure(figsize=(14,6))
seaborn.boxplot(x="centrality", y="score", hue="model", data=data, palette=seaborn.color_palette("husl", len(models) +1))
seaborn.despine(offset=10, trim=True)
plt.savefig("/home/i11/staudt/Eval/properties-replication-test.pdf", bbox_inches="tight")
There are some problems I have with this plot:
- There is a large amount of outliers and I don't like how they are drawn here. Can I remove them? Can I change the appearance to show less clutter? Can I color them at least so that their color matches the box color?
- The
model
valueoriginal
is special because all other distributions should be compared to the distribution oforiginal
. This should be visually reflected in the plot. Can I makeoriginal
the first box of every group? Can I offset or mark it differently somehow? Would it be possible to draw a horizontal line through the median of eachoriginal
distribution and through the group of boxes? - some of the values of
score
are very small, how to do proper scaling of the y-axis to show them?
EDIT:
Here is an example with a log-scaled y-axis - also not yet ideal. Why do the some boxes seem cut off at the low end?
Outlier display
You should be able to pass any arguments to
seaborn.boxplot
that you can pass toplt.boxplot
(see documentation), so you could adjust the display of the outliers by settingflierprops
. Here are some examples of what you can do with your outliers.If you don't want to display them, you could do
or you could make them light gray like so:
Order of groups
You can set the order of the groups manually with
hue_order
, e.g.Scaling of y-axis
You could just get the minimum and maximum values of all y-values and set
y_lim
accordingly? Something like this:EDIT: This last point doesn't really make sense since the automatic
y_lim
range will already include all the values, but I'm leaving it just as an example of how to adjust these settings. As mentioned in the comments, log-scaling probably makes more sense.