Describe the problem
The temp object cleaner runs concurrently on every pod of a multi-pod tenant
with no coordination. It relies on ListSessions fanout to identify which
sessions are still alive, but the fanout uses each pod's cached view of
system.sql_instances and is not consistent across pods. A cleaner whose
SQL instance reader cache is briefly stale (missing a peer pod) does not
fan out to that peer, omits the sessions hosted there from its "active"
set, and deletes the live temp schemas for those sessions — silently
destroying the user's temp data while the session is still using it.
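To make the race concrete, here is a minimal, self-contained sketch (all names hypothetical; instanceCache, sessionsOn, and activeSchemas do not exist in the codebase) of how a stale per-pod instance cache shrinks the "active" set and turns a live schema into a drop candidate:

```go
package main

import "fmt"

// instanceCache models one pod's cached view of system.sql_instances;
// it can briefly omit a live peer.
type instanceCache []int

// sessionsOn maps instance ID -> temp schemas owned by sessions hosted there.
var sessionsOn = map[int][]string{
	1: {"pg_temp_0_1"},
	2: {"pg_temp_0_2"}, // hosted on the peer missing from the stale cache
}

// activeSchemas fans out only to cached instances, so any instance absent
// from the cache contributes nothing to the active set.
func activeSchemas(cache instanceCache) map[string]bool {
	active := make(map[string]bool)
	for _, id := range cache {
		for _, schema := range sessionsOn[id] {
			active[schema] = true
		}
	}
	return active
}

func main() {
	allTempSchemas := []string{"pg_temp_0_1", "pg_temp_0_2"}
	staleCache := instanceCache{1} // this pod has not yet seen instance 2
	active := activeSchemas(staleCache)
	for _, schema := range allTempSchemas {
		if !active[schema] {
			// The cleaner would DROP pg_temp_0_2 even though its
			// owning session is alive on instance 2.
			fmt.Println("would drop live schema:", schema)
		}
	}
}
```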
Background
pkg/sql/temporary_schema.go:692-707 gates system-tenant cleanup on the
meta1 leaseholder, so only one cleaner runs per cycle. Tenant cleanup is
intentionally ungated; the comment claims "no harm in executing this
logic without any type of coordination" because each cleaner consults
ListSessions.
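For contrast, a condensed sketch of the gating asymmetry (shouldRunCleanup and the isMeta1Leaseholder callback are illustrative stand-ins, not the code's actual signatures):

```go
package cleaner

import "context"

// shouldRunCleanup condenses the gate described above: in the system
// tenant only the meta1 leaseholder runs a cleanup cycle, while every
// tenant pod takes the ungated branch and runs concurrently.
func shouldRunCleanup(
	ctx context.Context,
	isSystemTenant bool,
	isMeta1Leaseholder func(context.Context) (bool, error), // stand-in for the real check
) (bool, error) {
	if !isSystemTenant {
		// Tenant path: no coordination; correctness rests entirely on
		// the ListSessions fanout seeing every live session.
		return true, nil
	}
	return isMeta1Leaseholder(ctx)
}
```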
ListSessions for tenants resolves nodes via
tenantFanoutClient.getAllNodes,
which calls sqlInstanceReader.GetAllInstances(). That returns a cache
populated by an initial scan + rangefeed watcher — no consistency
guarantee across pods.
waitForInstances
in the cleaner is sqlInstanceReader.WaitForStarted, which only blocks
on the initial scan, not on convergence to current state.
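A condensed sketch of why the wait does not help (the interface below is a stand-in for the real sqlinstance reader API, reduced to the two properties that matter here):

```go
package cleaner

import "context"

// instanceReader stands in for the sqlinstance reader: the wait covers
// only the initial scan, and reads come from a rangefeed-maintained cache.
type instanceReader interface {
	WaitForStarted(ctx context.Context) error           // blocks on initial scan only
	GetAllInstances(ctx context.Context) ([]int, error) // cached view; may lag peers
}

// fanoutTargets mirrors the cleaner's flow: after WaitForStarted returns
// there is still no guarantee the cache has converged, so the returned
// instance list (and therefore the ListSessions fanout) can miss a pod.
func fanoutTargets(ctx context.Context, r instanceReader) ([]int, error) {
	if err := r.WaitForStarted(ctx); err != nil {
		return nil, err
	}
	return r.GetAllInstances(ctx)
}
```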
When was this introduced
21b5854b6077
("sql: add support cleaning temporary tables on different pods",
2021-08-27) enabled the multi-pod path and rewrote the comment to assume
coordination-free execution was safe.
Reproduction / evidence
#169663 — TestTemporaryObjectCleaner failure with auto-injected
external-process VC. The investigation in #169663's bot comment shows
nsql2 dropping the active session's pg_temp_..._3 schema while the
session was confirmed alive at teardown 45s later. Three concurrent VC
cleaners all called ListSessions; one returned an inconsistent set.
Production exposure
Low at default settings (sql.temp_object_cleaner.cleanup_interval and
sql.temp_object_cleaner.wait_interval both default to 30 minutes, so by
the time a schema is eligible its session has been alive long enough for
cache convergence under normal conditions). Real if:
- A customer lowers
sql.temp_object_cleaner.wait_interval to seconds.
- Pods come back from restart and cleanup fires before reader
convergence.
- Network/rangefeed pathology wedges a pod's cache for an extended
period.
Failure mode is corruption-class: an active SQL session's temp tables,
views, and sequences are silently dropped while the session is still
using them.
Suggested fix directions (not prescribing one)
- Tenant-level coordination. Elect a single cleaner per tenant per
cycle. Options: smallest-live-instance-ID heuristic (cheap but assumes
converged cache to elect deterministically), sqlliveness-backed
singleton lease, or a system table row used as a lease.
- Stronger reader convergence guarantee in the cleaner. Probe
system.sql_instances directly (bypassing cache) and verify the cache
matches before fanning out, or wait for a rangefeed checkpoint past a
recent timestamp.
- Defense-in-depth. Re-verify ListSessions just before issuing each
schema delete, with a small delay between the two calls (see the sketch
after this list).
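A sketch of the defense-in-depth option (all names hypothetical; listActiveSessionIDs stands in for the ListSessions fanout, dropSchema for the schema delete):

```go
package cleaner

import (
	"context"
	"time"
)

// dropIfStillInactive requires a schema's owning session to be absent
// from two ListSessions snapshots taken recheckDelay apart before the
// drop is issued.
func dropIfStillInactive(
	ctx context.Context,
	sessionID string,
	schema string,
	recheckDelay time.Duration,
	listActiveSessionIDs func(context.Context) (map[string]bool, error),
	dropSchema func(context.Context, string) error,
) error {
	first, err := listActiveSessionIDs(ctx)
	if err != nil {
		return err
	}
	if first[sessionID] {
		return nil // session is alive; keep the schema
	}
	select {
	case <-time.After(recheckDelay):
	case <-ctx.Done():
		return ctx.Err()
	}
	second, err := listActiveSessionIDs(ctx)
	if err != nil {
		return err
	}
	if second[sessionID] {
		return nil // session showed up on recheck: the first view was stale
	}
	return dropSchema(ctx, schema)
}
```

This narrows the window but does not close it: a cache wedged for longer than the delay still misses the session, which is why the coordination or convergence options above are stronger.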
Environment
Multi-pod tenant deployments only. Not applicable to system tenant.
Jira issue: CRDB-63710