Alerts that continue to fire appear and disappear #1

Open
dougmeredith opened this issue Jan 28, 2025 · 5 comments
Comments

@dougmeredith

Describe the bug
I have three alerts firing for two different rules. Sometimes all three show up, but mostly it's a subset, and they come and go. I can't say for certain, but I think that either both alerts for the same rule show up at the same time, or neither.

Additional context

I can see from the UAE log that all three are being reported to UAE.

@jamesread
Owner

Heya @dougmeredith, thanks for taking the time to report this potential issue, and sorry that you're facing it. Happy to look into this.

I need to better understand the 3 alerts for 2 different rules - could you give me an example of your config? By the time alertmanager tells UAE, it should not matter which rules were used - UAE should just render 3 alerts.

Do they have identical descriptions or names?

Does UAE always log the same 3 received alerts? Is it possible to share those logs?

Thanks.

@dougmeredith
Author

Thanks for the quick response, Brian. I think you've got a useful app here!

I've done some experimenting, and I believe I know how this problem can be reproduced. My root route had group_by: ['severity'] set and this was being inherited by the child route for UAE. If I remove this setting from my root route, UAE behaves as expected.

Now here is where it gets fun: The obvious solution was to add group_by: [] to the route that uses UAE. Nope. While Alertmanager allows for overriding an inherited group_by, my experimentation shows me that it ignores all attempts to override with an empty set.

For now the only workaround that I can see is to not group at the root level, or any level that is a parent to a route that uses UAE.
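To illustrate why the inherited group_by matters, here is a minimal sketch of how grouping partitions firing alerts into separate notifications. It is not Alertmanager's code: the function `group_alerts`, the alert names, and the label values are all hypothetical, and it assumes each rule carries a single severity.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Partition alerts into groups keyed by the values of the group_by labels.

    Simplified model: each resulting group stands for one notification.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert["labels"]["alertname"])
    return dict(groups)


# Hypothetical data: three firing alerts produced by two rules.
alerts = [
    {"labels": {"alertname": "DiskFull", "instance": "a", "severity": "warning"}},
    {"labels": {"alertname": "DiskFull", "instance": "b", "severity": "warning"}},
    {"labels": {"alertname": "HostDown", "instance": "c", "severity": "critical"}},
]

# Grouping by severity yields two groups, so two separate notifications.
by_severity = group_alerts(alerts, ["severity"])

# With no grouping labels, every alert lands in a single group.
ungrouped = group_alerts(alerts, [])
```

Under these assumptions, both alerts from the same rule always travel in the same notification, which would match the "both or neither" pattern described in the original report.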

@jamesread
Owner

> Thanks for the quick response, Brian. I think you've got a useful app here!

You're most welcome, :-) My name is James, though! haha. No worries.

> I've done some experimenting, and I believe I know how this problem can be reproduced. My root route had group_by: ['severity'] set and this was being inherited by the child route for UAE. If I remove this setting from my root route, UAE behaves as expected.
>
> Now here is where it gets fun: The obvious solution was to add group_by: [] to the route that uses UAE. Nope. While Alertmanager allows for overriding an inherited group_by, my experimentation shows me that it ignores all attempts to override with an empty set.

Okay dokey, I'll do some exploring with my Alertmanager config, that's easy and quick for me to check now you've pointed me in the right direction!

I'd be happy to add some flexibility to get around this in UAE if it circumvents the problem.

@dougmeredith
Author

Ha! Sorry, I was just reading a message by a "Brian", when I sent that. lol

FYI, I've noted the Alertmanager issue with that project: prometheus/alertmanager#4221

@dougmeredith
Author

I've discovered another problem, but I think it has the same root cause as this one, so I'm going to describe it here, rather than create a new issue.

If all alerts are resolved, the final alert that was firing never disappears. The "Last result" time then increases forever, until another alert fires.

Based on the instruction to use send_resolved: false, I suspect that your algorithm is this: When Alertmanager sends a notification of firing alerts, you take this as the definitive list of what is firing, and replace your stored list of alerts.

It's obvious from this why the last alert doesn't disappear when it is resolved, as in this situation, Alertmanager stops sending anything to UAE. It also explains the disappearing alerts when grouping is enabled. With grouping, Alertmanager sends a separate message for each group, and I'd guess you process them each as if they are the complete list of alerts. The most recently sent group wins each time the display is refreshed.
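The suspected behaviour can be modelled in a few lines. This is only a sketch of the guess made in this comment, not UAE's actual code: the class `ReplacingReceiver` and the payloads are hypothetical, with the payloads reduced to the `alerts`, `status`, and `labels` fields of an Alertmanager webhook message.

```python
class ReplacingReceiver:
    """Treats every notification as the complete list of firing alerts."""

    def __init__(self):
        self.displayed = []

    def handle(self, payload):
        # Overwrite the stored list instead of merging into it.
        self.displayed = [
            f'{a["labels"]["alertname"]}@{a["labels"]["instance"]}'
            for a in payload["alerts"]
            if a["status"] == "firing"
        ]


# Hypothetical notifications, one per severity group.
warning_group = {"alerts": [
    {"status": "firing", "labels": {"alertname": "DiskFull", "instance": "a"}},
    {"status": "firing", "labels": {"alertname": "DiskFull", "instance": "b"}},
]}
critical_group = {"alerts": [
    {"status": "firing", "labels": {"alertname": "HostDown", "instance": "c"}},
]}

receiver = ReplacingReceiver()
receiver.handle(warning_group)
receiver.handle(critical_group)
after_critical = list(receiver.displayed)  # only the critical group is left

receiver.handle(warning_group)
after_warning = list(receiver.displayed)   # now only the warning group
```

In this model the display flips between groups depending on which notification arrived last, and once Alertmanager stops sending notifications the last stored list is never cleared.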

I think the only real way to solve this is to handle resolved messages from Alertmanager, and to selectively remove alerts when they are reported as resolved.
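A minimal sketch of that approach, assuming the webhook is configured with send_resolved: true and that each alert in the payload carries the `fingerprint` and `status` fields from Alertmanager's webhook format. The class `MergingReceiver` and the sample payloads and fingerprint values are hypothetical, and this is not how UAE is known to be implemented.

```python
class MergingReceiver:
    """Keeps a per-alert store and updates it from each notification."""

    def __init__(self):
        self.firing = {}  # fingerprint -> alert

    def handle(self, payload):
        for alert in payload["alerts"]:
            fingerprint = alert["fingerprint"]
            if alert["status"] == "resolved":
                # Remove only the alert that was reported as resolved.
                self.firing.pop(fingerprint, None)
            else:
                self.firing[fingerprint] = alert

    def names(self):
        return sorted(a["labels"]["alertname"] for a in self.firing.values())


def make_alert(fingerprint, name, status):
    # Hypothetical helper that builds a reduced alert entry.
    return {"fingerprint": fingerprint, "status": status,
            "labels": {"alertname": name}}


merging = MergingReceiver()
# Two grouped notifications no longer overwrite each other.
merging.handle({"alerts": [make_alert("f1", "DiskFull", "firing")]})
merging.handle({"alerts": [make_alert("f2", "HostDown", "firing")]})
both_firing = merging.names()

# A resolved notification removes just that alert, including the last one.
merging.handle({"alerts": [make_alert("f1", "DiskFull", "resolved")]})
merging.handle({"alerts": [make_alert("f2", "HostDown", "resolved")]})
none_firing = merging.names()
```

Keying the store by fingerprint means notifications from different groups add to the display rather than replace it, and the final resolved message empties it.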
