
sql: multi-pod tenant temp object cleanup races on stale ListSessions, can drop active session's temp data #169912

@DrewKimball

Description

Describe the problem

The temp object cleaner runs concurrently on every pod of a multi-pod tenant
with no coordination. It relies on ListSessions fanout to identify which
sessions are still alive, but the fanout uses each pod's cached view of
system.sql_instances and is not consistent across pods. A cleaner whose
SQL instance reader cache is briefly stale (missing a peer pod) does not
fan out to that peer, omits the sessions hosted there from its "active"
set, and deletes the live temp schemas for those sessions — silently
destroying the user's temp data while the session is still using it.

Background

  • pkg/sql/temporary_schema.go:692-707 gates system-tenant cleanup on the
    meta1 leaseholder, so only one cleaner runs per cycle. Tenant cleanup is
    intentionally ungated; the comment claims there is "no harm in executing
    this logic without any type of coordination" because each cleaner
    consults ListSessions.
  • ListSessions for tenants resolves nodes via
    tenantFanoutClient.getAllNodes, which calls
    sqlInstanceReader.GetAllInstances(). That returns a cache populated by an
    initial scan plus a rangefeed watcher, with no consistency guarantee
    across pods.
  • waitForInstances in the cleaner is sqlInstanceReader.WaitForStarted,
    which only blocks on the initial scan, not on convergence to the current
    state. (A simplified model of the resulting race follows this list.)
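
To make the stale-cache interleaving concrete, the following is a small,
self-contained Go model of the race. The names (instanceCache, sessionsOn,
session-3) are illustrative only and do not correspond to the CockroachDB
APIs referenced above; they just mirror the shape of a ListSessions fanout
driven by a cached instance list.

```go
package main

import "fmt"

type sessionID string

// instanceCache models one pod's rangefeed-fed view of system.sql_instances.
// It can lag behind the authoritative instance list.
type instanceCache struct {
	knownInstances []int
}

// sessionsOn models the ListSessions fanout: it returns only the sessions
// hosted on instances the caller's cache knows about.
func sessionsOn(cache instanceCache, sessionsByInstance map[int][]sessionID) map[sessionID]bool {
	active := make(map[sessionID]bool)
	for _, inst := range cache.knownInstances {
		for _, s := range sessionsByInstance[inst] {
			active[s] = true
		}
	}
	return active
}

func main() {
	// Authoritative state: three pods, one live session with a temp schema
	// on pod 3.
	sessionsByInstance := map[int][]sessionID{
		1: {},
		2: {},
		3: {"session-3"},
	}
	tempSchemas := map[string]sessionID{
		"pg_temp_..._3": "session-3",
	}

	// The cleaner on pod 2 runs with a stale cache that is missing pod 3.
	staleCache := instanceCache{knownInstances: []int{1, 2}}

	active := sessionsOn(staleCache, sessionsByInstance)
	for schema, owner := range tempSchemas {
		if !active[owner] {
			// session-3 looks dead to this cleaner, so its live temp schema
			// gets dropped out from under it.
			fmt.Printf("dropping %s owned by %s (wrongly considered inactive)\n", schema, owner)
		}
	}
}
```

Run as-is, this prints a drop for pg_temp_..._3 even though session-3 is
alive on pod 3; the real cleaner makes the same decision whenever its cached
instance list misses a peer pod.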

When was this introduced

Commit 21b5854b6077
("sql: add support cleaning temporary tables on different pods",
2021-08-27) enabled the multi-pod path and rewrote the comment to assume
coordination-free execution was safe.

Reproduction / evidence

#169663: TestTemporaryObjectCleaner failure with an auto-injected
external-process VC. The investigation in #169663's bot comment shows
nsql2 dropping the active session's pg_temp_..._3 schema while the
session was confirmed alive at teardown 45s later. Three concurrent VC
cleaners all called ListSessions; one returned an inconsistent set.

Production exposure

Low at default settings (sql.temp_object_cleaner.cleanup_interval and
sql.temp_object_cleaner.wait_interval both default to 30 minutes, so
any eligible schema's session has been alive long enough for cache
convergence in normal conditions). Real if:

  • A customer lowers sql.temp_object_cleaner.wait_interval to seconds.
  • Pods come back from restart and cleanup fires before reader
    convergence.
  • Network/rangefeed pathology wedges a pod's cache for an extended
    period.

The failure mode is corruption-class: an active SQL session's temp tables,
views, and sequences are silently dropped while the session is still using
them.

Suggested fix directions (not prescribing one)

  • Tenant-level coordination. Elect a single cleaner per tenant per
    cycle. Options: smallest-live-instance-ID heuristic (cheap but assumes
    converged cache to elect deterministically), sqlliveness-backed
    singleton lease, or a system table row used as a lease.
  • Stronger reader convergence guarantee in the cleaner. Probe
    system.sql_instances directly (bypassing cache) and verify the cache
    matches before fanning out, or wait for a rangefeed checkpoint past a
    recent timestamp.
  • Defense-in-depth. Re-verify ListSessions just before issuing each
    schema delete, with a small delay between the two calls (a sketch of
    this option follows this list).
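
Here is a minimal sketch of the defense-in-depth direction, assuming
hypothetical listActiveSessions and dropTempSchema callbacks that stand in
for the cleaner's existing ListSessions fanout and schema-drop step; the
double-check-with-delay structure is the point, not the names.

```go
package tempcleanup

import (
	"context"
	"time"
)

// verifyThenDrop drops a candidate temp schema only if its owning session is
// absent from two ListSessions-style snapshots taken recheckDelay apart. A
// session missed once because of a stale instance cache is likely to
// reappear in the second snapshot once the cache has converged.
func verifyThenDrop(
	ctx context.Context,
	schema, owner string,
	listActiveSessions func(context.Context) (map[string]bool, error),
	dropTempSchema func(context.Context, string) error,
	recheckDelay time.Duration,
) error {
	first, err := listActiveSessions(ctx)
	if err != nil {
		return err
	}
	if first[owner] {
		return nil // session is alive; leave the schema alone
	}

	// Give the instance reader cache a chance to converge before confirming.
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(recheckDelay):
	}

	second, err := listActiveSessions(ctx)
	if err != nil {
		return err
	}
	if second[owner] {
		return nil // first snapshot was stale; the session is alive after all
	}
	return dropTempSchema(ctx, schema)
}
```

This does not eliminate the race (a cache can stay stale across both
snapshots), so it complements rather than replaces the coordination or
convergence options above.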

Environment

Multi-pod tenant deployments only. Not applicable to system tenant.

Jira issue: CRDB-63710

Labels

A-multitenancy: Related to multi-tenancy
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-agent: Filed by an AI agent; usually the result of a human/agent investigation session
T-sql-queries: SQL Queries Team
