[BUG] Distributor: context.WithTimeout timer leak in doBatch — cancel() not deferred, relies solely on ring.DoBatch cleanup callback

## Describe the bug

`Distributor.doBatch` (`pkg/distributor/distributor.go:956-958`) creates a timer-backed context via `context.WithTimeout`, but `cancel()` is **not deferred** in `doBatch`. Instead, it is only called from the cleanup callback passed to `ring.DoBatch`:

```go
// pkg/distributor/distributor.go:956-957
// Use a background context to make sure all ingesters get samples even if we return early
localCtx, cancel := context.WithTimeout(context.Background(), d.cfg.RemoteTimeout)
defer func() {
    if errors.Is(localCtx.Err(), context.DeadlineExceeded) {
        d.distributorIngesterPushTimeout.Inc()
    }
}()
```

Note the `defer` only checks deadline status — it does **not** call `cancel()`. The only call to `cancel()` is inside the cleanup callback:

```go
// pkg/distributor/distributor.go:993-998
return d.send(localCtx, ingester, timeseries, metadata, req.Source, req.DiscardOutOfOrder)
}, func() {
    cortexpb.ReuseSlice(req.Timeseries)
    req.Free()
    cancel()
})
```

`ring.DoBatch` runs that cleanup callback **only** after all per-instance goroutines finish, via a background goroutine:

```go
// pkg/ring/batch.go:134-139
// Perform cleanup at the end.
go func() {
    wg.Wait()
    cleanup()
}()
```

However, `DoBatch` itself can return **before** that cleanup goroutine runs:

```go
// pkg/ring/batch.go:141-148
select {
case err := <-tracker.err:
    return err
case <-tracker.done:
    return nil
case <-ctx.Done():
    return ctx.Err()
}
```

### Leak paths

1. **Early return via `ctx.Done()`**: When the parent request context is cancelled, `DoBatch` returns immediately. The cleanup goroutine is still waiting on `wg.Wait()`. The `localCtx` timer remains active until all outstanding ingester goroutines finish (which could take up to `RemoteTimeout`). The timer resource and request buffers (`req.Timeseries`, `req.Free()`) are delayed.

2. **Callback panic**: If `callback()` panics inside a submitted goroutine (`pkg/ring/batch.go:128`), `wg.Done()` is never called (it is not deferred — line 130). This causes `wg.Wait()` to block forever, and `cleanup()` / `cancel()` is **never** called. The timer leaks until `RemoteTimeout` expires, and the request buffers are permanently leaked.

3. **`e.Submit()` panic**: If the executor's `Submit()` panics (e.g., `WorkerPool.Stop()` closes the channel, then `Submit()` sends on it — `pkg/util/worker_pool.go:61,69`), the cleanup goroutine at lines 135-139 may never even start.

### Underlying `DoBatch` robustness issue

The submitted closures in `DoBatch` (`pkg/ring/batch.go:127-131`) do not use `defer wg.Done()`:

```go
wg.Add(len(instances))
for _, i := range instances {
    e.Submit(func() {
        err := callback(i.desc, i.indexes)
        tracker.record(i, err)
        wg.Done()     // <-- not deferred; skipped on panic
    })
}
```

This means any panic inside `callback` or `tracker.record` permanently blocks the cleanup goroutine. This is a systemic issue affecting all three production callers of `DoBatch`:
- `pkg/distributor/distributor.go:981` (this bug)
- `pkg/alertmanager/distributor.go:164`
- `pkg/alertmanager/multitenant.go:1040`

## Expected behavior

`doBatch` should always release the timer associated with `localCtx` when it returns, regardless of how `ring.DoBatch` exits. `ring.DoBatch` should guarantee its cleanup callback is called on all paths, including panics.

## Suggested fix direction

Two complementary fixes:

**1. `doBatch`: defer `cancel()` with coordination**

A simple `defer cancel()` would cancel `localCtx` while in-flight `d.send()` calls are still using it. Two approaches to handle this:

- **Split concerns**: Let the defer only stop the timer (`cancel()` releases timer resources), and accept that in-flight sends using `localCtx` will observe a cancelled context. Since `localCtx` is derived from `context.Background()`, the sends are already bounded by `RemoteTimeout` — cancelling the context early just stops them sooner on the error/cancellation path, which is arguably correct behavior.

- **Use `context.AfterFunc`** (Go 1.21+): Register cleanup on the parent context to cancel `localCtx` when the parent is done, while still allowing sends to complete normally.

**2. `ring.DoBatch`: make `wg.Done()` panic-safe**

Change the submitted closure to use `defer wg.Done()`:

```go
e.Submit(func() {
    defer wg.Done()
    err := callback(i.desc, i.indexes)
    tracker.record(i, err)
})
```

This ensures `cleanup()` is always called even if `callback` panics, fixing the guarantee for all callers.

## Environment

- Cortex master at current HEAD.
- Found during a static audit of context/timer/goroutine leak risks across `pkg/`.

## Additional context

The request buffer cleanup (`cortexpb.ReuseSlice(req.Timeseries)` and `req.Free()`) shares the same leak path — on panic, these are also never called, leaking protobuf allocations. A related pattern exists in `pkg/ingester/client/client.go:164` where `context.WithCancel`'s cancel is not deferred on the `Run()` error path.

🤖 Reported with help from Claude Code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Distributor: context.WithTimeout timer leak in doBatch — cancel() not deferred, relies solely on ring.DoBatch cleanup callback #7558

Describe the bug

Leak paths

Underlying `DoBatch` robustness issue

Expected behavior

Suggested fix direction

Environment

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG] Distributor: context.WithTimeout timer leak in doBatch — cancel() not deferred, relies solely on ring.DoBatch cleanup callback #7558

Description

Describe the bug

Leak paths

Underlying DoBatch robustness issue

Expected behavior

Suggested fix direction

Environment

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Underlying `DoBatch` robustness issue