Skip to content

Conversation

@eyazici90
Copy link
Contributor

@eyazici90 eyazici90 commented Apr 3, 2025

Context

Given resolvePeers is context aware, yet, it is used by background context that may lead to blocking call. This is due to the fact that LookupIPAddr is called under the hood that does dns look up which may be blocked in some platforms.

In case of dns lookups are blocked, it may lead to hanging goroutine due to the fact that main goroutine is being used for the call that leads to application hanging & being unresponsive as well as keep getting restarted by kube with no info in the app.

Below logs are taken from one of the alertmanager kube/pods.

│ ts=2025-04-03T11:12:38.668Z caller=main.go:181 level=info msg="Starting Alertmanager" version="(version=0.27.0, branch=HEAD, revision=x)" │
│ ts=2025-04-03T11:12:38.668Z caller=main.go:182 level=info build_context="(go=go1.21.7, platform=linux/amd64, user=root@x, date=20240228-11:51:20, tags=netgo)"

Cillumpolicy was not applied to allow to dns resolve yet. This led to app hangs with no logs & keep getting restarted due to liveness probes(/healthy, /ready) were not registered. Which was pretty painful to troubleshoot.

This was due to LookupIPAddr hanging on main goroutine.

select {
	case <-ctx.Done():
		// Our context was canceled. If we are the only
		// goroutine looking up this key, then drop the key
		// from the lookupGroup and cancel the lookup.
		// If there are other goroutines looking up this key,
		// let the lookup continue uncanceled, and let later
		// lookups with the same key share the result.
		// See issues 8602, 20703, 22724.
		if r.getLookupGroup().ForgetUnshared(lookupKey) {
			lookupGroupCancel()
			go dnsWaitGroupDone(ch, func() {})
		} else {
			go dnsWaitGroupDone(ch, lookupGroupCancel)
		}
		err := newDNSError(mapErr(ctx.Err()), host, "")
		if trace != nil && trace.DNSDone != nil {
			trace.DNSDone(nil, false, err)
		}
		return nil, err
	case r := <-ch:
		dnsWaitGroup.Done()
		lookupGroupCancel()
		err := r.Err
		if err != nil {
			if _, ok := err.(*DNSError); !ok {
				err = newDNSError(mapErr(err), host, "")
			}
		}
		if trace != nil && trace.DNSDone != nil {
			addrs, _ := r.Val.([]IPAddr)
			trace.DNSDone(ipAddrsEface(addrs), r.Shared, err)
		}
		return lookupIPReturn(r.Val, err, r.Shared)
	}

Letting the control flow continues will lead to l.Debug("resolved peers to following addresses", "peers", strings.Join(resolvedPeers, ",")) that will not block the app to log as well as registering liveness probes(/healthy, /ready).

@eyazici90
Copy link
Contributor Author

eyazici90 commented Apr 3, 2025

Hey @grobinson-grafana 👋
Could u take a quick glance at that, lmk what u think? Based on ur response, we may push it further or not.

resolvedPeers, err := resolvePeers(context.Background(), knownPeers, advertiseAddr, &net.Resolver{}, waitIfEmpty)
ctx, cancel := context.WithTimeout(context.Background(), resolveTimeout)
defer cancel()
resolvedPeers, err := resolvePeers(ctx, knownPeers, advertiseAddr, &net.Resolver{}, waitIfEmpty)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I understand. If this timesout, Alertmanager will still fail to start because the peer cannot be resolved, but at least this time it emits an error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think AM will fail to start. See https://github.com/prometheus/alertmanager/blob/main/cluster/cluster.go#L737-L740. It just takes the direct addr & moves on with the loop.

// inlining
ips, err := res.LookupIPAddr(ctx, host)
		if err != nil {
			// Assume direct address.
			resolvedPeers = append(resolvedPeers, peer)
			continue
		}

Afterwards we will see debug logs https://github.com/prometheus/alertmanager/blob/main/cluster/cluster.go#L173 resolved addr for peers.

	l.Debug("resolved peers to following addresses", "peers", strings.Join(resolvedPeers, ","))

Far better than hanging app with no logs emitted, imho 😓

Copy link
Collaborator

@grobinson-grafana grobinson-grafana Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So how does Alertmanager behave if it continues without having resolved peers? I haven't had time to look into it I'm afraid so I'm asking here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there is a test for initially failed peers. I reckon, am(alertmngr) tries to reconnect for certain amount of time and eventually drops failed peers. Nevertheless, health probes and rest would work same. Only gossip settlings wont. That is my understanding at least :D

@eyazici90 eyazici90 marked this pull request as ready for review April 9, 2025 08:52
Given resolvePeers is context aware, yet, uses background that may lead to blocking call. This is due to the fact that `LookupIPAddr` is called under the hood that does dns look up which may be blocked in some platforms.

In case of dns lookups are blocked may lead to hanging goroutine due to the fact that main goroutine is being used for the call that leads to application hanging & being unresponsive as well as readiness check to fail.

Signed-off-by: emreya <[email protected]>
@eyazici90
Copy link
Contributor Author

Could u help me with review @SuperQ ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants