Describe the bug
Distributor.doBatch (pkg/distributor/distributor.go:956-958) creates a timer-backed context via context.WithTimeout, but cancel() is not deferred in doBatch. Instead, it is only called from the cleanup callback passed to ring.DoBatch:
// pkg/distributor/distributor.go:956-957
// Use a background context to make sure all ingesters get samples even if we return early
localCtx, cancel := context.WithTimeout(context.Background(), d.cfg.RemoteTimeout)
defer func() {
if errors.Is(localCtx.Err(), context.DeadlineExceeded) {
d.distributorIngesterPushTimeout.Inc()
}
}()
Note the defer only checks deadline status — it does not call cancel(). The only call to cancel() is inside the cleanup callback:
// pkg/distributor/distributor.go:993-998
return d.send(localCtx, ingester, timeseries, metadata, req.Source, req.DiscardOutOfOrder)
}, func() {
cortexpb.ReuseSlice(req.Timeseries)
req.Free()
cancel()
})
ring.DoBatch runs that cleanup callback only after all per-instance goroutines finish, via a background goroutine:
// pkg/ring/batch.go:134-139
// Perform cleanup at the end.
go func() {
wg.Wait()
cleanup()
}()
However, DoBatch itself can return before that cleanup goroutine runs:
// pkg/ring/batch.go:141-148
select {
case err := <-tracker.err:
return err
case <-tracker.done:
return nil
case <-ctx.Done():
return ctx.Err()
}
Leak paths
-
Early return via ctx.Done(): When the parent request context is cancelled, DoBatch returns immediately. The cleanup goroutine is still waiting on wg.Wait(). The localCtx timer remains active until all outstanding ingester goroutines finish (which could take up to RemoteTimeout). The timer resource and request buffers (req.Timeseries, req.Free()) are delayed.
-
Callback panic: If callback() panics inside a submitted goroutine (pkg/ring/batch.go:128), wg.Done() is never called (it is not deferred — line 130). This causes wg.Wait() to block forever, and cleanup() / cancel() is never called. The timer leaks until RemoteTimeout expires, and the request buffers are permanently leaked.
-
e.Submit() panic: If the executor's Submit() panics (e.g., WorkerPool.Stop() closes the channel, then Submit() sends on it — pkg/util/worker_pool.go:61,69), the cleanup goroutine at lines 135-139 may never even start.
Underlying DoBatch robustness issue
The submitted closures in DoBatch (pkg/ring/batch.go:127-131) do not use defer wg.Done():
wg.Add(len(instances))
for _, i := range instances {
e.Submit(func() {
err := callback(i.desc, i.indexes)
tracker.record(i, err)
wg.Done() // <-- not deferred; skipped on panic
})
}
This means any panic inside callback or tracker.record permanently blocks the cleanup goroutine. This is a systemic issue affecting all three production callers of DoBatch:
pkg/distributor/distributor.go:981 (this bug)
pkg/alertmanager/distributor.go:164
pkg/alertmanager/multitenant.go:1040
Expected behavior
doBatch should always release the timer associated with localCtx when it returns, regardless of how ring.DoBatch exits. ring.DoBatch should guarantee its cleanup callback is called on all paths, including panics.
Suggested fix direction
Two complementary fixes:
1. doBatch: defer cancel() with coordination
A simple defer cancel() would cancel localCtx while in-flight d.send() calls are still using it. Two approaches to handle this:
-
Split concerns: Let the defer only stop the timer (cancel() releases timer resources), and accept that in-flight sends using localCtx will observe a cancelled context. Since localCtx is derived from context.Background(), the sends are already bounded by RemoteTimeout — cancelling the context early just stops them sooner on the error/cancellation path, which is arguably correct behavior.
-
Use context.AfterFunc (Go 1.21+): Register cleanup on the parent context to cancel localCtx when the parent is done, while still allowing sends to complete normally.
2. ring.DoBatch: make wg.Done() panic-safe
Change the submitted closure to use defer wg.Done():
e.Submit(func() {
defer wg.Done()
err := callback(i.desc, i.indexes)
tracker.record(i, err)
})
This ensures cleanup() is always called even if callback panics, fixing the guarantee for all callers.
Environment
- Cortex master at current HEAD.
- Found during a static audit of context/timer/goroutine leak risks across
pkg/.
Additional context
The request buffer cleanup (cortexpb.ReuseSlice(req.Timeseries) and req.Free()) shares the same leak path — on panic, these are also never called, leaking protobuf allocations. A related pattern exists in pkg/ingester/client/client.go:164 where context.WithCancel's cancel is not deferred on the Run() error path.
🤖 Reported with help from Claude Code
Describe the bug
Distributor.doBatch(pkg/distributor/distributor.go:956-958) creates a timer-backed context viacontext.WithTimeout, butcancel()is not deferred indoBatch. Instead, it is only called from the cleanup callback passed toring.DoBatch:Note the
deferonly checks deadline status — it does not callcancel(). The only call tocancel()is inside the cleanup callback:ring.DoBatchruns that cleanup callback only after all per-instance goroutines finish, via a background goroutine:However,
DoBatchitself can return before that cleanup goroutine runs:Leak paths
Early return via
ctx.Done(): When the parent request context is cancelled,DoBatchreturns immediately. The cleanup goroutine is still waiting onwg.Wait(). ThelocalCtxtimer remains active until all outstanding ingester goroutines finish (which could take up toRemoteTimeout). The timer resource and request buffers (req.Timeseries,req.Free()) are delayed.Callback panic: If
callback()panics inside a submitted goroutine (pkg/ring/batch.go:128),wg.Done()is never called (it is not deferred — line 130). This causeswg.Wait()to block forever, andcleanup()/cancel()is never called. The timer leaks untilRemoteTimeoutexpires, and the request buffers are permanently leaked.e.Submit()panic: If the executor'sSubmit()panics (e.g.,WorkerPool.Stop()closes the channel, thenSubmit()sends on it —pkg/util/worker_pool.go:61,69), the cleanup goroutine at lines 135-139 may never even start.Underlying
DoBatchrobustness issueThe submitted closures in
DoBatch(pkg/ring/batch.go:127-131) do not usedefer wg.Done():This means any panic inside
callbackortracker.recordpermanently blocks the cleanup goroutine. This is a systemic issue affecting all three production callers ofDoBatch:pkg/distributor/distributor.go:981(this bug)pkg/alertmanager/distributor.go:164pkg/alertmanager/multitenant.go:1040Expected behavior
doBatchshould always release the timer associated withlocalCtxwhen it returns, regardless of howring.DoBatchexits.ring.DoBatchshould guarantee its cleanup callback is called on all paths, including panics.Suggested fix direction
Two complementary fixes:
1.
doBatch: defercancel()with coordinationA simple
defer cancel()would cancellocalCtxwhile in-flightd.send()calls are still using it. Two approaches to handle this:Split concerns: Let the defer only stop the timer (
cancel()releases timer resources), and accept that in-flight sends usinglocalCtxwill observe a cancelled context. SincelocalCtxis derived fromcontext.Background(), the sends are already bounded byRemoteTimeout— cancelling the context early just stops them sooner on the error/cancellation path, which is arguably correct behavior.Use
context.AfterFunc(Go 1.21+): Register cleanup on the parent context to cancellocalCtxwhen the parent is done, while still allowing sends to complete normally.2.
ring.DoBatch: makewg.Done()panic-safeChange the submitted closure to use
defer wg.Done():This ensures
cleanup()is always called even ifcallbackpanics, fixing the guarantee for all callers.Environment
pkg/.Additional context
The request buffer cleanup (
cortexpb.ReuseSlice(req.Timeseries)andreq.Free()) shares the same leak path — on panic, these are also never called, leaking protobuf allocations. A related pattern exists inpkg/ingester/client/client.go:164wherecontext.WithCancel's cancel is not deferred on theRun()error path.🤖 Reported with help from Claude Code