I currently have a Prometheus alert that fires when my success rate drops below 85%.
I would like to add the absolute numbers of the ratio to the alert description. How do I do that?
My YAML currently looks like this (I cleaned up some extraneous details):
groups:
- name: recording_rules
  rules:
  - record: number_of_successes_24h
    expr: avg(sum by(instance)(my_status{kubernetes_name="my-prom",timeRange="1d",status=~"success"}))
  - record: number_of_total_24h
    expr: avg(sum by(instance)(my_status{kubernetes_name="my-prom",timeRange="1d"}))
  - record: success_rate_24h
    expr: clamp_max(number_of_successes_24h / number_of_total_24h * 100, 100)
- name: alerting_rules
  rules:
  - alert: LowSuccessRate24H
    expr: success_rate_24h < 85
    labels:
      severity: critical
    annotations:
      summary: "CRITICAL: Low success rate 24h"
      description: "Success rate in the last 24 hours went below 85% (value: {{ $value }}%)"
My question is, how do I add the number_of_successes_24h and number_of_total_24h into the description?
I read the official documentation at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/, but I got lost; I searched SO, but I didn't find anything relevant.
I read that there were extra details available in $labels, so I tried printing that as an example to see what was in it, but I got map[__name__:success_rate_24h], and I couldn't figure out how to see inside that.
Partial answers and guides welcome. Thanks.
My TasksMissing alert does something similar. A simplified version of it outputs the number of missing tasks, the total number of tasks, and the affected instances in the summary.
The resulting description is expected to read something like "2 of 3 foo-service instances are missing in prod: foo01.prod.foo.org:8080 foo02.prod.foo.org:8080".
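A minimal sketch of an alert in this style is shown below. It is not the exact rule described above: it assumes the standard `up` metric, and the `job` and `env` labels, the group name and the alert expression are hypothetical stand-ins chosen for illustration.

```yaml
groups:
- name: example_alerts            # hypothetical group name
  rules:
  - alert: TasksMissing
    # Hypothetical expression: number of down instances per job and environment.
    expr: count by (job, env) (up == 0) > 0
    annotations:
      # $value is the number of missing instances. The first query counts all
      # instances of the job; the second lists the instances that are down.
      description: >-
        {{ $value }} of
        {{ with printf "count(up{job='%s',env='%s'})" $labels.job $labels.env | query }}{{ . | first | value }}{{ end }}
        {{ $labels.job }} instances are missing in {{ $labels.env }}:
        {{ range printf "up{job='%s',env='%s'} == 0" $labels.job $labels.env | query }}{{ .Labels.instance }} {{ end }}
```

Single quotes are used inside the PromQL selectors so that they do not clash with the double-quoted Go template string passed to `printf`.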
The idea is that you use Go templates to generate a query (by populating a template with values from $labels using printf) and then pipe that into the Prometheus-defined query function and get back either one result (that you can handle using with) or multiple values (that you can iterate over using range). Then you can print either the time series value directly or some label (e.g. the instance name).