Describe the problem
The temp object cleaner runs concurrently on every pod of a multi-pod tenant
with no coordination. It relies on ListSessions fanout to identify which
sessions are still alive, but the fanout uses each pod's cached view of
system.sql_instances and is not consistent across pods. A cleaner whose
SQL instance reader cache is briefly stale (missing a peer pod) does not
fan out to that peer, omits the sessions hosted there from its "active"
set, and deletes the live temp schemas for those sessions — silently
destroying the user's temp data while the session is still using it.
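To make the race concrete, here is a minimal, self-contained sketch (all names hypothetical; instanceCache, sessionsOn, and activeSchemas do not exist in the codebase) of how a stale per-pod instance cache shrinks the "active" set and turns a live schema into a drop candidate:

```go
package main

import "fmt"

// instanceCache models one pod's cached view of system.sql_instances;
// it can briefly omit a live peer.
type instanceCache []int

// sessionsOn maps instance ID -> temp schemas owned by sessions hosted there.
var sessionsOn = map[int][]string{
	1: {"pg_temp_0_1"},
	2: {"pg_temp_0_2"}, // hosted on the peer missing from the stale cache
}

// activeSchemas fans out only to cached instances, so any instance absent
// from the cache contributes nothing to the active set.
func activeSchemas(cache instanceCache) map[string]bool {
	active := make(map[string]bool)
	for _, id := range cache {
		for _, schema := range sessionsOn[id] {
			active[schema] = true
		}
	}
	return active
}

func main() {
	allTempSchemas := []string{"pg_temp_0_1", "pg_temp_0_2"}
	staleCache := instanceCache{1} // this pod has not yet seen instance 2
	active := activeSchemas(staleCache)
	for _, schema := range allTempSchemas {
		if !active[schema] {
			// The cleaner would DROP pg_temp_0_2 even though its
			// owning session is alive on instance 2.
			fmt.Println("would drop live schema:", schema)
		}
	}
}
```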
Background
pkg/sql/temporary_schema.go:692-707 gates system-tenant cleanup on the
meta1 leaseholder, so only one cleaner runs per cycle. Tenant cleanup is
intentionally ungated; the comment claims "no harm in executing this
logic without any type of coordination" because each cleaner consults
ListSessions.
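For contrast, a condensed sketch of the gating asymmetry (shouldRunCleanup and the isMeta1Leaseholder callback are illustrative stand-ins, not the code's actual signatures):

```go
package cleaner

import "context"

// shouldRunCleanup condenses the gate described above: in the system
// tenant only the meta1 leaseholder runs a cleanup cycle, while every
// tenant pod takes the ungated branch and runs concurrently.
func shouldRunCleanup(
	ctx context.Context,
	isSystemTenant bool,
	isMeta1Leaseholder func(context.Context) (bool, error), // stand-in for the real check
) (bool, error) {
	if !isSystemTenant {
		// Tenant path: no coordination; correctness rests entirely on
		// the ListSessions fanout seeing every live session.
		return true, nil
	}
	return isMeta1Leaseholder(ctx)
}
```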
ListSessions for tenants resolves nodes via
tenantFanoutClient.getAllNodes,
which calls sqlInstanceReader.GetAllInstances(). That returns a cache
populated by an initial scan + rangefeed watcher — no consistency
guarantee across pods.
waitForInstances
in the cleaner is sqlInstanceReader.WaitForStarted, which only blocks
on the initial scan, not on convergence to current state.
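A condensed sketch of why the wait does not help (the interface below is a stand-in for the real sqlinstance reader API, reduced to the two properties that matter here):

```go
package cleaner

import "context"

// instanceReader stands in for the sqlinstance reader: the wait covers
// only the initial scan, and reads come from a rangefeed-maintained cache.
type instanceReader interface {
	WaitForStarted(ctx context.Context) error           // blocks on initial scan only
	GetAllInstances(ctx context.Context) ([]int, error) // cached view; may lag peers
}

// fanoutTargets mirrors the cleaner's flow: after WaitForStarted returns
// there is still no guarantee the cache has converged, so the returned
// instance list (and therefore the ListSessions fanout) can miss a pod.
func fanoutTargets(ctx context.Context, r instanceReader) ([]int, error) {
	if err := r.WaitForStarted(ctx); err != nil {
		return nil, err
	}
	return r.GetAllInstances(ctx)
}
```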
When was this introduced
21b5854b6077
("sql: add support cleaning temporary tables on different pods",
2021-08-27) enabled the multi-pod path and rewrote the comment to assume
coordination-free execution was safe.
Reproduction / evidence
#169663 — TestTemporaryObjectCleaner failure with auto-injected
external-process VC. The investigation in #169663's bot comment shows
nsql2 dropping the active session's pg_temp_..._3 schema while the
session was confirmed alive at teardown 45s later. Three concurrent VC
cleaners all called ListSessions; one returned an inconsistent set.
Production exposure
Low at default settings (sql.temp_object_cleaner.cleanup_interval and
sql.temp_object_cleaner.wait_interval both default to 30 minutes, so by
the time a schema is eligible its session has been alive long enough for
cache convergence under normal conditions). Real if:
- A customer lowers
sql.temp_object_cleaner.wait_interval to seconds.
- Pods come back from restart and cleanup fires before reader
convergence.
- Network/rangefeed pathology wedges a pod's cache for an extended
period.
Failure mode is corruption-class: an active SQL session's temp tables,
views, and sequences are silently dropped while the session is still
using them.
Suggested fix directions (not prescribing one)
- Tenant-level coordination. Elect a single cleaner per tenant per
cycle. Options: smallest-live-instance-ID heuristic (cheap but assumes
converged cache to elect deterministically), sqlliveness-backed
singleton lease, or a system table row used as a lease.
- Stronger reader convergence guarantee in the cleaner. Probe
system.sql_instances directly (bypassing cache) and verify the cache
matches before fanning out, or wait for a rangefeed checkpoint past a
recent timestamp.
- Defense-in-depth. Re-verify ListSessions just before issuing each
schema delete, with a small delay between the two calls (see the sketch
after this list).
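A sketch of the defense-in-depth option (all names hypothetical; listActiveSessionIDs stands in for the ListSessions fanout, dropSchema for the schema delete):

```go
package cleaner

import (
	"context"
	"time"
)

// dropIfStillInactive requires a schema's owning session to be absent
// from two ListSessions snapshots taken recheckDelay apart before the
// drop is issued.
func dropIfStillInactive(
	ctx context.Context,
	sessionID string,
	schema string,
	recheckDelay time.Duration,
	listActiveSessionIDs func(context.Context) (map[string]bool, error),
	dropSchema func(context.Context, string) error,
) error {
	first, err := listActiveSessionIDs(ctx)
	if err != nil {
		return err
	}
	if first[sessionID] {
		return nil // session is alive; keep the schema
	}
	select {
	case <-time.After(recheckDelay):
	case <-ctx.Done():
		return ctx.Err()
	}
	second, err := listActiveSessionIDs(ctx)
	if err != nil {
		return err
	}
	if second[sessionID] {
		return nil // session showed up on recheck: the first view was stale
	}
	return dropSchema(ctx, schema)
}
```

This narrows the window but does not close it: a cache wedged for longer than the delay still misses the session, which is why the coordination or convergence options above are stronger.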
Environment
Multi-pod tenant deployments only. Not applicable to system tenant.
Jira issue: CRDB-63710