How to handle subactor UDS sock-files leaked by hard_kill()/SIGKILL #454

@goodboy

Summary

Under fork-spawn backends (especially main_thread_forkserver), when
tractor.spawn._spawn.hard_kill falls through to proc.kill()
(SIGKILL), which it does whenever the graceful cancel exceeds
terminate_after=1.6s, the killed subactor's UDS sock-file at
${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock is never unlinked, so
these files accumulate on disk indefinitely.

Distinct from the discovery-client CLOSE_WAIT TCP fd leak in #452:
different layer (spawn vs discovery), different transport (UDS vs
TCP), different lifecycle. Filing separately so each can be tracked
independently.

Root cause

The subactor's IPC server unlink lives in
tractor.ipc._server::_serve_ipc_eps's finally: block, which calls
tractor.ipc._uds.close_listener → os.unlink(addr.sockpath). SIGKILL
bypasses ALL Python execution → no finally blocks fire → sock-file
remains on disk forever.

$ ls $XDG_RUNTIME_DIR/tractor/
sleeper@492837.sock      # binder pid 492837 is dead
sync_blocking_sub@492901.sock
namesucka@491847.sock
... (accumulates over a test session)
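
A minimal stdlib-only sketch of the root cause (not tractor code): the
child binds a UDS listener and unlinks the sock-file in a finally:
block, the parent sends SIGKILL, and the file is still there afterwards.
The function name, the child script and the sleeper.sock path are all
made up for illustration; POSIX is assumed.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Child: bind a UDS listener and unlink the sock-file in a `finally:`,
# mimicking (not reproducing) the server-side cleanup described above.
CHILD_SRC = textwrap.dedent("""
    import os, socket, sys, time
    path = sys.argv[1]
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(1)
    print("bound", flush=True)
    try:
        time.sleep(60)
    finally:
        sock.close()
        os.unlink(path)
""")


def sockfile_survives_sigkill() -> bool:
    """Return True if the child's sock-file is still on disk after SIGKILL."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sleeper.sock")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_SRC, path],
            stdout=subprocess.PIPE,
        )
        proc.stdout.readline()  # block until the child reports it has bound
        proc.kill()             # SIGKILL on POSIX: no Python cleanup runs
        proc.wait()
        proc.stdout.close()
        assert proc.returncode == -signal.SIGKILL
        return os.path.exists(path)
```

Swapping proc.kill() for a signal the child can handle (and does handle)
would let the finally: run; with SIGKILL there is nothing to hook.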

Reproducer

tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep
under --tpt-proto=uds --spawn-backend=trio reliably leaks
sleeper@<pid>.sock. Mechanism:

  • Parent calls Portal.cancel_actor() → IPC cancel-req msg sent to
    sleeper.
  • sleeper is blocked in sync time.sleep(3) → trio scheduler can't
    deliver the Cancelled until the sleep returns.
  • hard_kill's move_on_after(1.6s) deadline fires.
  • proc.kill() → SIGKILL → no Python cleanup.
  • sleeper@<pid>.sock orphaned in $XDG_RUNTIME_DIR/tractor/.
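
The "cancel can't land during a sync sleep" step can be illustrated with
a single-process analogue. This sketch deliberately swaps trio for
stdlib asyncio so it runs without extra deps; the function names and the
0.3s/0.05s timings are illustrative only. The point carries over: a
cooperative scheduler can only deliver a cancellation at an await, so a
blocking time.sleep() delays it until the sleep returns.

```python
import asyncio
import time


async def sync_sleeper(block_s: float) -> None:
    time.sleep(block_s)        # sync sleep: the event loop cannot run
    await asyncio.sleep(3600)  # first point where a cancel can be delivered


async def time_until_cancelled(block_s: float, deadline_s: float) -> float:
    """Seconds until a `deadline_s` timeout actually cancels the sleeper."""
    start = time.monotonic()
    task = asyncio.create_task(sync_sleeper(block_s))
    try:
        await asyncio.wait_for(task, timeout=deadline_s)
    except asyncio.TimeoutError:
        pass
    return time.monotonic() - start


# The deadline is 0.05s, but the cancel only takes effect once the
# 0.3s blocking sleep has returned.
elapsed = asyncio.run(time_until_cancelled(block_s=0.3, deadline_s=0.05))
```

In the real scenario the deadline lives in the parent process, so it
fires on time at 1.6s while the child is still stuck in its 3s sleep,
and the parent escalates to SIGKILL.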

Tests that consistently leak under --tpt-proto=uds --spawn-backend=trio:

  • test_cancel_via_SIGINT_other_task[trio] — leaks 3
    namesucka@<pid>.sock
  • test_cancel_while_childs_child_in_sync_sleep — leaks
    sleeper@<pid>.sock (and sync_blocking_sub@<pid>.sock in the
    True variant)
  • test_fast_graceful_cancel_when_spawn_task_in_soft_proc_wait_for_daemon[trio]
    — leaks fast_boi@<pid>.sock

Side effects

  • fd-table pressure across long pytest sessions — eventually
    EMFILE.
  • Test-suite flakiness amplifier — under --tpt-proto=uds, a
    single hard-killed subactor leaves a sock file that a sibling
    test's wait_for_actor/find_actor discovery probes can
    accidentally hit (FileExistsError on rebind, or epoll_register
    on a half-closed peer-FIN'd fd).
  • Kernel inode accumulation — though tractor uses
    XDG_RUNTIME_DIR (tmpfs on most distros), sock inodes still
    consume kernel resources until the filesystem is unmounted.
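
For reference, a small stdlib sketch of what a stale sock-file looks
like to a plain socket client, outside of tractor. It binds and closes a
listener without unlinking (standing in for a hard-killed owner), then
tries to connect and to rebind. The function name and file name are
hypothetical, and how tractor's discovery layer surfaces these errors
may differ from the raw OS errors shown here.

```python
import errno
import os
import socket
import tempfile


def probe_stale_sockfile() -> tuple[str, int]:
    """Return (connect outcome, bind errno) for an ownerless sock-file."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sleeper@12345.sock")

        # Bind and close WITHOUT unlinking: simulates a hard-killed owner.
        owner = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        owner.bind(path)
        owner.listen(1)
        owner.close()

        # A connect probe finds the file but nobody is listening.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            client.connect(path)
            connect_result = "connected"
        except ConnectionRefusedError:
            connect_result = "refused"
        finally:
            client.close()

        # Rebinding the same path fails while the stale file exists.
        rebinder = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            rebinder.bind(path)
            bind_errno = 0
        except OSError as exc:
            bind_errno = exc.errno
        finally:
            rebinder.close()

        return connect_result, bind_errno
```

On Linux the probe gets a refused connect and EADDRINUSE on rebind.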

Detection (autouse fixture)

tractor._testing._reap._track_orphaned_uds_per_test (committed in
1cdc7fb3) snapshots $XDG_RUNTIME_DIR/tractor/ before+after each
test and emits the warning
"UserWarning: UDS sock-file LEAK detected from test (reaping)"
when new orphaned sockfiles appear. Per-test scoping makes
blame obvious vs a session-end blanket sweep.
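
The before+after snapshot idea can be sketched as a plain context
manager. This is not the committed fixture: track_new_sockfiles and its
warning text are hypothetical, it only diffs directory listings, and it
skips both the "is the owner pid dead" check and the reaping step. An
autouse pytest fixture could wrap the same logic around each test.

```python
import contextlib
import os
import warnings


@contextlib.contextmanager
def track_new_sockfiles(sockdir: str):
    """Warn about *.sock files that appear in `sockdir` during the block."""

    def snapshot() -> set[str]:
        try:
            return {n for n in os.listdir(sockdir) if n.endswith(".sock")}
        except FileNotFoundError:
            return set()  # dir not created yet: nothing to track

    before = snapshot()
    leaked: set[str] = set()
    try:
        yield leaked
    finally:
        # Anything present now but not before was left behind by the block.
        leaked |= snapshot() - before
        if leaked:
            warnings.warn(
                f"UDS sock-file LEAK detected: {sorted(leaked)}",
                UserWarning,
            )
```

Yielding the (initially empty) set lets the caller inspect what leaked
after the block exits.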

Companion CLI extension: scripts/tractor-reap --uds / --uds-only
(committed in 0996a836) for post-mortem cleanup when a session
crashed.

Fix

tractor.spawn._reap.unlink_uds_bind_addrs() — invoked from
hard_kill unconditionally post-SIGKILL. Two cleanup paths in order:

  • Explicit bind_addrs — when the parent set the subactor's bind
    addrs at spawn time, unlink each UDS-flavored sockpath directly.
  • Self-assigned reconstruction — when bind_addrs is empty (the
    common case: subactor picked its own random sock via
    UDSAddress.get_random()), reconstruct the path from
    (subactor.aid.name, proc.pid) using the same <name>@<pid>.sock
    convention. Works because the subactor uses its own os.getpid()
    at bind time, which equals proc.pid from the parent's view.

Idempotent: FileNotFoundError (graceful exit already-unlinked, sock
never bound under early-spawn cancel, or transport wasn't UDS this
run) is silenced; other OSErrors log a warning but never raise.
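
A rough sketch of the two-path, idempotent unlink described above. It is
not the actual unlink_uds_bind_addrs() body: the function name, the
signature and the sockdir parameter are assumptions made so the example
is self-contained, and real bind addrs are address objects rather than
path strings.

```python
import logging
import os
from pathlib import Path

log = logging.getLogger(__name__)


def unlink_uds_sockfiles_sketch(
    name: str,
    pid: int,
    bind_paths: list[str],
    sockdir: str,
) -> list[Path]:
    """Best-effort unlink of a dead subactor's sock-file(s); never raises."""
    if bind_paths:
        # Path 1: the parent assigned the bind addrs, so use them directly.
        candidates = [Path(p) for p in bind_paths]
    else:
        # Path 2: rebuild the self-assigned path from <name>@<pid>.sock.
        candidates = [Path(sockdir) / f"{name}@{pid}.sock"]

    removed: list[Path] = []
    for path in candidates:
        try:
            os.unlink(path)
        except FileNotFoundError:
            continue  # already unlinked, never bound, or not UDS this run
        except OSError as exc:
            log.warning("could not unlink %s: %r", path, exc)
        else:
            removed.append(path)
    return removed
```

Calling it twice for the same subactor is harmless: the second call
finds nothing and returns an empty list.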

Future work — authoritative bind-addr tracking

The second cleanup path above (self-assigned reconstruction) hardcodes
the <name>@<pid>.sock naming from tractor.ipc._uds.UDSAddress. If that
convention ever changes, or the subactor binds to a non-default
bindspace/filedir, the unlink will silently miss the sock-file.

A more authoritative approach:

  • Subactors register their bound UDS sockpaths in a per-process
    registry inside tractor.ipc._uds at start_listener() time.
  • The subactor reports its bound sockpath(s) back to the parent over
    IPC immediately post-bind (extension to SpawnSpec reply / a new
    handshake msg).
  • Parent caches the subactor's authoritative sockpaths.
  • unlink_uds_bind_addrs() checks the cache FIRST, falls back to
    convention-reconstruction if the subactor died before reporting.
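
The parent-side half of that proposal (cache first, convention as
fallback) could look roughly like this. Everything here is hypothetical:
the class, its method names and the idea of keying by pid are
illustration only, and the IPC report step itself is not shown.

```python
from pathlib import Path


class ReportedSockpaths:
    """Parent-side cache of sockpaths reported by subactors (hypothetical)."""

    def __init__(self, sockdir: str) -> None:
        self._sockdir = Path(sockdir)
        self._by_pid: dict[int, list[Path]] = {}

    def report(self, pid: int, sockpaths: list[str]) -> None:
        # Would be called when the subactor's post-bind report arrives.
        self._by_pid[pid] = [Path(p) for p in sockpaths]

    def resolve(self, name: str, pid: int) -> list[Path]:
        # Authoritative paths first; drop the entry since the pid is dead.
        reported = self._by_pid.pop(pid, None)
        if reported:
            return reported
        # Subactor died before reporting: fall back to <name>@<pid>.sock.
        return [self._sockdir / f"{name}@{pid}.sock"]
```

resolve() always returns at least one candidate, so the unlink step can
stay a single loop regardless of which source the paths came from.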

Documented as a TODO in tractor.spawn._reap's module docstring;
tracking via a follow-up issue if needed.

Labels: bug (Something isn't working), cancellation (SC teardown
semantics and anti-zombie semantics), lo IPC (localhost IPC primitives
and APIs), supervision
