Alert resolution timeouts in Prometheus

Posted on Dec 11, 2016 12:09:28 +01:00
Tags: prometheus monitoring

Horst Gutmann, software engineer from Graz, Austria.

I’m currently using Prometheus for quite a few services, and especially in combination with Grafana and Alertmanager it has proven to be an extremely handy tool. For instance, we usually have alerts for every major component of a service. If a component becomes unreachable, an alert is sent to a specific Slack channel.
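For illustration, an "unreachable component" alert of this kind might look roughly like the sketch below. It uses the YAML rule-file format that arrived with Prometheus 2.0, which is newer than this post. The group name, alert name, job label, severity label and the `for` duration are all assumptions made for the example, not the actual rules in use here.

```yaml
# Hypothetical rule file, e.g. rules/availability.yml (name is an assumption)
groups:
  - name: availability            # assumed group name
    rules:
      - alert: ComponentDown      # assumed alert name
        # `up` is set to 0 by Prometheus when a scrape of the target fails
        expr: up{job="my-service"} == 0   # "my-service" is a placeholder job
        # Only fire once the condition has held for this long (assumed value)
        for: 15s
        labels:
          severity: critical      # assumed label, useful for routing
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is unreachable"
```

Alertmanager then decides where the firing and resolved notifications go.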

What bothered me, though, was how long it took for the resolution message to arrive. By default, it takes 5 minutes. Luckily, you can customize this in Alertmanager’s global settings:

global:
  resolve_timeout: 20s

This sets the timeout to only 20 seconds, which feels much more usable to me given that most of our check intervals are somewhere in the 5-15s range and the alerts themselves are set to somewhere between 10 and 20 seconds. I’m pretty sure I will be tuning this setting in the future, but for now this should do 😉
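To put the setting in context, here is a minimal sketch of a complete Alertmanager configuration that sends both firing and resolved notifications to Slack. The webhook URL, receiver name and channel are placeholders, not the real values. Two caveats apply. First, `send_resolved` defaults to `false` for Slack receivers, so it has to be enabled for resolution messages to be sent at all. Second, `resolve_timeout` only applies to alerts that arrive without an explicit end time. Newer Prometheus releases set one themselves, so the setting may have little effect there.

```yaml
global:
  # Treat an alert as resolved if it has not been refreshed for this long
  resolve_timeout: 20s
  # Placeholder webhook URL; replace with the one for your Slack workspace
  slack_api_url: "https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK"

route:
  # Single catch-all route; the receiver name is an assumption
  receiver: slack-alerts

receivers:
  - name: slack-alerts
    slack_configs:
      - channel: "#service-alerts"   # assumed channel name
        # Off by default for Slack; needed to get resolution messages
        send_resolved: true
```

If resolution messages still seem slow with a setup like this, the route’s `group_interval` may be worth a look, since notifications for a group are only flushed at that interval.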

I have no idea why I didn’t see this setting in the documentation the first time around, but now I’m glad I looked again (albeit using a detour through the source code 😉)
