I am learning the DropWizard Metrics library (formerly Coda Hale Metrics) and I am confused as to when I should be using Meters vs Timers. According to the docs:
Meter: A meter measures the rate at which a set of events occur
and:
Timer: A timer is basically a histogram of the duration of a type of event and a meter of the rate of its occurrence
Based on these definitions, I can't discern the difference between the two. What's confusing me is that Timer is not used the way I would have expected it to be used. To me, a Timer is just that: a timer; it should measure the time difference between a start() and a stop(). But it appears that Timers also capture the rates at which events occur, which feels like they are stepping on Meters' toes.
If I could see an example of what each component outputs, that might help me understand when/where to use either of these.
You're confused in part because a DW Metrics Timer IS, among other things, a DW Metrics Meter.
A Meter is exclusively concerned with rates, measured in Hz (events per second). Each Meter results in 4 distinct metrics being published:

- a mean (average) rate of events since DW Metrics started
- 1-, 5-, and 15-minute exponentially weighted moving average rates
You use a Meter by recording a value at different points in your code -- DW Metrics automatically jots down the wall time of each call along with the value you gave it, and uses these to calculate the rate at which that value is increasing:
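A minimal sketch (the registry setup, metric name, and sleep are my own scaffolding):

```java
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class MeterExample {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();
        Meter meter = registry.meter("operations"); // name is arbitrary

        meter.mark();          // t = 0 s: record the first event
        Thread.sleep(10_000);  // ... 10 seconds pass while work happens ...
        meter.mark(332);       // t = 10 s: record 332 more (333 total)

        // mean rate ≈ 333 events / 10 s = 33.3 Hz
        System.out.printf("mean rate: %.1f Hz%n", meter.getMeanRate());
    }
}
```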
We would expect our rates to be 33.3 Hz, as 333 operations occurred and the time between the two calls to mark() was 10 seconds.
A Timer calculates the above 4 metrics (considering each Timer.Context to be one event) and adds to them a number of additional metrics:

- a count of the number of events seen
- the min, mean, and max durations observed
- the standard deviation of the durations
- a "histogram" of durations: the 50th (median), 75th, 95th, 98th, 99th, and 99.9th percentiles
There are something like 15 total metrics reported for each Timer.
In short: Timers report a LOT of metrics, and they can be tricky to understand, but once you do, they're quite a powerful way to spot spiky behavior.
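To actually see all of these metrics, attach a reporter to the registry. A minimal sketch with the built-in ConsoleReporter (the 1-minute period matches the reporting window assumed below):

```java
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.TimeUnit;

public class ReporterSetup {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)          // rates in Hz
                .convertDurationsTo(TimeUnit.MILLISECONDS) // durations in ms
                .build();
        reporter.start(1, TimeUnit.MINUTES); // print every metric once a minute
    }
}
```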
Fact is, just collecting the time spent between two points isn't a terribly useful metric. Consider: you have a block of code like this:
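Something like this, say (costlyOperation() is a stand-in for whatever work you're measuring):

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class TimerExample {
    static final MetricRegistry registry = new MetricRegistry();
    static final Timer timer = registry.timer("costly-operation");

    public static void main(String[] args) {
        while (true) {
            final Timer.Context context = timer.time(); // start the clock
            try {
                costlyOperation(); // a constant ~10 ms per call
            } finally {
                context.stop(); // record the elapsed duration as one event
            }
        }
    }

    static void costlyOperation() {
        // stand-in for real work taking ~10 ms
        try { Thread.sleep(10); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```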
Let's assume that costlyOperation() has a constant cost, runs under a constant load, and operates on a single thread. At 10 ms per call, inside a 1-minute reporting period we should expect to time this operation 6000 times. Obviously, we will not be reporting the actual service time over the wire 6000x -- instead we need some way to summarize all those operations to fit our desired reporting window. DW Metrics' Timer does this for us, automatically, once a minute (our reporting period). After 5 minutes, our metrics registry would be reporting:

- a mean rate of 100 Hz (6000 events per 60 seconds)
- 1- and 5-minute rates of ~100 Hz
- a count of 30,000 events
- min, mean, and max durations of 10 ms
- every percentile (p50 through p999) at 10 ms
Now, let's consider we enter a period where occasionally our operation goes completely off the rails and blocks for an extended period:
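Say every 1000th call now stalls for a full second (the 1-in-1000 stall is purely illustrative):

```java
// Same timing loop as above; only costlyOperation() changes.
static int calls = 0;

static void costlyOperation() {
    calls++;
    long millis = (calls % 1000 == 0)
            ? 1000  // pathological case: blocks for a full second
            : 10;   // normal case: the usual 10 ms
    try {
        Thread.sleep(millis);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}
```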
Over a 1-minute collection period, we would now see fewer than 6000 executions, as every 1000th execution takes longer -- it works out to about 5505 (5500 fast calls at 10 ms plus 5 slow calls at 1 s fills the 60 seconds). After the first minute of this (6 minutes total system time), we would now see:

- a count of 35,505 events
- a mean rate of roughly 98.6 Hz
- a 1-minute rate of roughly 91.75 Hz
- a min, p50, p75, and p99 still at 10 ms
- a max (and p999) of 1000 ms
If you graphed this, you'd see that most requests (the p50, p75, p99, etc.) were completing in 10 ms, but one request out of 1000 (the p999) was taking 1 s. This would also be seen as a slight reduction in the overall mean rate (a percent or two) and a sizable reduction in the 1-minute rate (nearly 9%).
If you only look at the long-run means (of either rate or duration), you'll never spot these spikes -- they get dragged into the background noise when averaged with a lot of successful operations. Similarly, just knowing the max isn't helpful, because it doesn't tell you how frequently the max occurs. This is why histograms are a powerful tool for tracking performance, and why DW Metrics' Timer publishes both a rate AND a histogram.
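To put numbers on it: with one 1 s stall per 1000 calls, the mean duration is (999 × 10 ms + 1 × 1000 ms) / 1000 ≈ 11 ms -- barely distinguishable from the healthy 10 ms -- while the p999 jumps all the way to 1000 ms. The histogram screams; the mean whispers.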