How to make Prometheus alert description give both

Posted 2020-03-30 05:50

I currently have a Prometheus alert that fires when my success rate drops below 85%.

I would like to add the absolute numbers of the ratio to the alert description. How do I do that?

My YAML currently looks like this (I cleaned up some extraneous details):

groups:
  - name: recording_rules
    rules:
    - record: number_of_successes_24h
      expr: avg(sum by(instance)(my_status{kubernetes_name="my-prom",timeRange="1d",status=~"success"}))
    - record: number_of_total_24h
      expr: avg(sum by(instance)(my_status{kubernetes_name="my-prom",timeRange="1d"}))
    - record: success_rate_24h
      expr: clamp_max(number_of_successes_24h / number_of_total_24h * 100, 100)

  - name: alerting_rules
    rules:
    - alert: LowSuccessRate24H
      expr: success_rate_24h < 85
      labels:
        severity: critical
      annotations:
        summary: "CRITICAL: Low success rate 24h"
        description: "Success rate in the last 24 hours went below 85% (value: {{ $value }}%)"

My question is, how do I add the number_of_successes_24h and number_of_total_24h into the description?
I read the official documentation at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/, but I got lost; I searched SO, but I didn't find anything relevant.

I read that there were extra details available in $labels, so I tried printing that as an example to see what was in it, but I got map[__name__:success_rate_24h], and I couldn't figure out how to see inside that.

Partial answers and guides welcome. Thanks.

1 answer
We Are One
#2 · 2020-03-30 06:52

Here's a simplified version of my TasksMissing alert, which outputs the number of missing tasks, the total number of tasks, and the affected instances in the description:

  - alert: TasksMissing
    expr: |
      job_env:up:ratio < .7
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
      description:
       '{{ with printf `job_env:up:count{job="%s",env="%s"} - job_env:up:sum{job="%s",env="%s"}` $labels.job $labels.env $labels.job $labels.env | query }}
          {{- . | first | value -}}
        {{ end }}
        of
        {{ with printf `job_env:up:count{job="%s",env="%s"}` $labels.job $labels.env | query }}
          {{- . | first | value -}}
        {{ end }}
        {{ $labels.job }} instances are missing in {{ $labels.env }}:
        {{ range printf `up{job="%s",env="%s"}==0` $labels.job $labels.env | query }}
          {{- .Labels.instance }}
        {{ end }}'

The resulting description is expected to read something like "2 of 3 foo-service instances are missing in prod: foo01.prod.foo.org:8080 foo02.prod.foo.org:8080".
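
The alert above relies on three recorded series (job_env:up:ratio, job_env:up:count, job_env:up:sum) whose definitions are not shown. A minimal sketch of what they might look like, assuming they are plain aggregations of the built-in up metric by job and env; the group name and the expressions are guesses for illustration, not the actual rules behind the alert:

```yaml
groups:
  - name: up_recording_rules   # hypothetical group name
    rules:
      # Total number of scrape targets per job and environment (assumed definition).
      - record: job_env:up:count
        expr: count by (job, env) (up)
      # Number of targets currently up (up == 1) per job and environment (assumed definition).
      - record: job_env:up:sum
        expr: sum by (job, env) (up)
      # Fraction of targets that are up; the alert fires when this drops below 0.7.
      - record: job_env:up:ratio
        expr: job_env:up:sum / job_env:up:count
```

With definitions like these, count minus sum is the number of missing targets, which is what the first query in the description computes.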

The idea is that you use Go templates to generate a query (by populating a template with values from $labels using printf) and then pipe that into the Prometheus-defined query function and get back either one result (that you can handle using with) or multiple values (that you can iterate over using range). Then you can print either the timeseries value directly or some label (e.g. the instance name).
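
Applied to the alert from the question, the same approach might look like the sketch below. It assumes the recorded series number_of_successes_24h and number_of_total_24h each return exactly one sample with no distinguishing labels (the avg(...) in the recording rules aggregates them away, which is also why $labels only shows __name__), so the query strings can be passed literally without printf:

```yaml
- alert: LowSuccessRate24H
  expr: success_rate_24h < 85
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Low success rate 24h"
    # Each "with" block runs an instant query when the alert is evaluated and
    # prints the value of the first returned sample; if the query returns
    # nothing, the block is skipped and prints nothing.
    description: >-
      Success rate in the last 24 hours went below 85%
      (value: {{ $value }}%,
      {{ with query "number_of_successes_24h" }}{{ . | first | value }}{{ end }}
      successes out of
      {{ with query "number_of_total_24h" }}{{ . | first | value }}{{ end }}
      total)
```

The folded block scalar (>-) joins the lines with single spaces, so the description renders as one sentence. If the recorded series carried labels such as instance, the selectors would need to be built with printf from $labels, as in the TasksMissing example.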
