Summary
Under fork-spawn backends (especially main_thread_forkserver), when
tractor.spawn._spawn.hard_kill falls through to proc.kill()
(SIGKILL) — which it does whenever the graceful cancel exceeds
terminate_after=1.6s — the killed subactor's UDS sock-file at
${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock accumulates on disk
indefinitely.
Distinct from the discovery-client CLOSE_WAIT TCP fd leak in #452 —
different layer (spawn vs discovery), different transport (UDS vs
TCP), different lifecycle. Filing separately so each can be tracked
independently.
Root cause
The subactor's IPC server unlink lives in
tractor.ipc._server::_serve_ipc_eps's finally: block, which calls
tractor.ipc._uds.close_listener → os.unlink(addr.sockpath). SIGKILL
bypasses ALL Python execution → no finally blocks fire → sock-file
remains on disk forever.
$ ls $XDG_RUNTIME_DIR/tractor/
sleeper@492837.sock # binder pid 492837 is dead
sync_blocking_sub@492901.sock
namesucka@491847.sock
... (accumulates over a test session)
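The effect can be reproduced outside tractor with the standard library alone. The sketch below is a stand-in, not tractor code: the child script, the helper name sockfile_survives_sigkill and the sleeper@demo.sock filename are all made up for illustration. The child binds a UDS listener and unlinks the path in a finally: block; the parent SIGKILLs it and checks whether the sock-file is still on disk.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Hypothetical child program: bind a UDS listener and unlink the sock-file
# in `finally:`, mimicking (not reproducing) the cleanup described above.
CHILD_SRC = """
import os, socket, sys, time
path = sys.argv[1]
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.bind(path)
sock.listen(1)
try:
    time.sleep(60)  # stand-in for a subactor stuck in a sync sleep
finally:
    sock.close()
    os.unlink(path)
"""


def sockfile_survives_sigkill() -> bool:
    """Return True if the child's sock-file is still on disk after SIGKILL."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sleeper@demo.sock")
        proc = subprocess.Popen([sys.executable, "-c", CHILD_SRC, path])

        # Wait (bounded) until the child has bound its socket.
        deadline = time.monotonic() + 10
        while not os.path.exists(path) and time.monotonic() < deadline:
            time.sleep(0.01)
        was_bound = os.path.exists(path)

        # SIGKILL cannot be caught, so the child's `finally:` never runs.
        proc.send_signal(signal.SIGKILL)
        proc.wait()
        return was_bound and os.path.exists(path)
```

On a POSIX system the helper is expected to return True, which is the same orphaned-file state shown in the listing above.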
Reproducer
tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep
under --tpt-proto=uds --spawn-backend=trio reliably leaks
sleeper@<pid>.sock. Mechanism:
- Parent calls Portal.cancel_actor() → IPC cancel-req msg sent to
sleeper.
- sleeper is blocked in sync time.sleep(3) → trio scheduler can't
deliver the Cancelled until the sleep returns.
- hard_kill's move_on_after(1.6s) deadline fires.
- proc.kill() → SIGKILL → no Python cleanup.
- sleeper@<pid>.sock orphaned in $XDG_RUNTIME_DIR/tractor/.
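The "sync sleep blocks cancel delivery" step can be illustrated with a small analogue. Note the swap: this sketch uses asyncio from the standard library rather than trio, and the function names and timings are made up. The point it shows is the same: a deadline that expires while the loop is stuck in time.sleep() can only take effect at the next checkpoint, after the sleep has returned.

```python
import asyncio
import time
from typing import Tuple


async def child_with_sync_sleep(block_for: float) -> None:
    time.sleep(block_for)      # blocks the whole event loop, no checkpoint
    await asyncio.sleep(0.01)  # first point where a cancel can be delivered


async def cancel_latency(block_for: float, deadline: float) -> Tuple[bool, float]:
    """Apply a deadline to the child and report how long it really took."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(child_with_sync_sleep(block_for), timeout=deadline)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return timed_out, time.monotonic() - start


def measure(block_for: float = 0.3, deadline: float = 0.05) -> Tuple[bool, float]:
    return asyncio.run(cancel_latency(block_for, deadline))
```

With the defaults, the deadline is 0.05s but the call cannot return before the 0.3s sync sleep finishes. In the scenario above the parent does not wait that long: its own 1.6s deadline fires first and it escalates to SIGKILL.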
Tests that consistently leak under --tpt-proto=uds --spawn-backend=trio:
- test_cancel_via_SIGINT_other_task[trio]: leaks 3
namesucka@<pid>.sock files
- test_cancel_while_childs_child_in_sync_sleep: leaks
sleeper@<pid>.sock (and sync_blocking_sub@<pid>.sock in the
True variant)
- test_fast_graceful_cancel_when_spawn_task_in_soft_proc_wait_for_daemon[trio]:
leaks fast_boi@<pid>.sock
Side effects
- fd-table pressure across long pytest sessions: eventually
EMFILE.
- Test-suite flakiness amplifier: under --tpt-proto=uds, a
single hard-killed subactor leaves a sock file that a sibling
test's wait_for_actor/find_actor discovery probes can
accidentally hit (FileExistsError on rebind, or epoll_register
on a half-closed peer-FIN'd fd).
- Kernel inode accumulation: though tractor uses
XDG_RUNTIME_DIR (tmpfs on most distros), sock inodes still
consume kernel resources until the filesystem is unmounted.
Detection (autouse fixture)
tractor._testing._reap._track_orphaned_uds_per_test (committed in
1cdc7fb3) snapshots $XDG_RUNTIME_DIR/tractor/ before and after each
test and emits
UserWarning: UDS sock-file LEAK detected from test (reaping)
when new orphaned sock-files appear. Per-test scoping makes
blame obvious, unlike a session-end blanket sweep.
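The before/after snapshot idea can be sketched as a plain context manager. This is not the fixture from 1cdc7fb3: the names list_sockfiles and track_new_sockfiles are hypothetical, and the sketch simplifies by treating every new *.sock entry as a leak without checking whether the owning pid is dead and without reaping anything.

```python
import contextlib
import os
import warnings
from typing import Iterator, Set


def list_sockfiles(sockdir: str) -> Set[str]:
    """Names of *.sock entries in sockdir (empty set if the dir is missing)."""
    try:
        return {name for name in os.listdir(sockdir) if name.endswith(".sock")}
    except FileNotFoundError:
        return set()


@contextlib.contextmanager
def track_new_sockfiles(sockdir: str, test_name: str) -> Iterator[None]:
    """Warn about sock-files that appear while the wrapped block runs."""
    before = list_sockfiles(sockdir)
    try:
        yield
    finally:
        leaked = sorted(list_sockfiles(sockdir) - before)
        if leaked:
            warnings.warn(
                f"UDS sock-file LEAK detected from {test_name}: {leaked}",
                UserWarning,
            )
```

Wrapping each test body (for example from an autouse pytest fixture that yields inside the with block) attributes every new file to the test that produced it.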
Companion CLI extension: scripts/tractor-reap --uds / --uds-only
(committed in 0996a836) for post-mortem cleanup when a session
crashed.
Fix
tractor.spawn._reap.unlink_uds_bind_addrs() — invoked from
hard_kill unconditionally post-SIGKILL. Two cleanup paths in order:
- Explicit bind_addrs: when the parent set the subactor's bind
addrs at spawn time, unlink each UDS-flavored sockpath directly.
- Self-assigned reconstruction: when bind_addrs is empty (the
common case: subactor picked its own random sock via
UDSAddress.get_random()), reconstruct the path from
(subactor.aid.name, proc.pid) using the same <name>@<pid>.sock
convention. Works because the subactor uses its own os.getpid()
at bind time, which equals proc.pid from the parent's view.
Idempotent: FileNotFoundError (graceful exit already-unlinked, sock
never bound under early-spawn cancel, or transport wasn't UDS this
run) is silenced; other OSErrors log a warning but never raise.
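A minimal sketch of the two-path, idempotent cleanup described above. It is not the body of unlink_uds_bind_addrs(): the function name unlink_uds_sockfiles, its signature, and the plain-string paths are assumptions made for illustration, and the real code works with tractor's address types.

```python
import logging
import os
from typing import Iterable, List

log = logging.getLogger("uds_reap_sketch")


def unlink_uds_sockfiles(
    bind_sockpaths: Iterable[str],
    name: str,
    pid: int,
    sockdir: str,
) -> List[str]:
    """Best-effort unlink of a dead subactor's sock-files; returns paths removed."""
    # Path 1: explicit sockpaths chosen by the parent at spawn time.
    candidates = list(bind_sockpaths)
    if not candidates:
        # Path 2: rebuild the self-assigned path from <name>@<pid>.sock.
        candidates = [os.path.join(sockdir, f"{name}@{pid}.sock")]

    removed = []
    for path in candidates:
        try:
            os.unlink(path)
        except FileNotFoundError:
            # Already unlinked, never bound, or the transport was not UDS.
            continue
        except OSError as err:
            # Log and move on: cleanup must not mask the original failure.
            log.warning("could not unlink %s: %r", path, err)
            continue
        removed.append(path)
    return removed
```

Calling it twice for the same subactor is harmless: the second call finds nothing to remove and returns an empty list.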
Future work — authoritative bind-addr tracking
The convention-based path (2) above hardcodes the <name>@<pid>.sock
convention from tractor.ipc._uds.UDSAddress. If that convention
ever changes — or the subactor binds to a non-default
bindspace/filedir — we'll silently fail to unlink.
A more authoritative approach:
- Subactors register their bound UDS sockpaths in a per-process
registry inside tractor.ipc._uds at start_listener() time.
- The subactor reports its bound sockpath(s) back to the parent over
IPC immediately post-bind (extension to SpawnSpec reply / a new
handshake msg).
- Parent caches the subactor's authoritative sockpaths.
unlink_uds_bind_addrs() checks the cache FIRST, falls back to
convention-reconstruction if the subactor died before reporting.
Documented as a TODO in tractor.spawn._reap's module docstring;
tracking via a follow-up issue if needed.
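The cache-first lookup with a convention fallback could look roughly like this. Everything here is hypothetical: the resolve_sockpaths name, the pid-keyed dict standing in for the parent-side cache, and the plain-string paths. How the subactor would actually report its paths over IPC is left open, as in the text above.

```python
import os
from typing import Dict, List


def resolve_sockpaths(
    reported: Dict[int, List[str]],
    name: str,
    pid: int,
    sockdir: str,
) -> List[str]:
    """Prefer sockpaths the subactor reported; else fall back to the convention."""
    cached = reported.get(pid)
    if cached:
        # Authoritative: the subactor told us exactly what it bound.
        return list(cached)
    # Subactor died before reporting: reconstruct <name>@<pid>.sock.
    return [os.path.join(sockdir, f"{name}@{pid}.sock")]
```

The fallback keeps today's behaviour for subactors that are killed before their first report, so the cache only ever adds accuracy.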
Related
- CLOSE_WAIT fds when registrar server-side closes #452: the
discovery-client CLOSE_WAIT TCP fd leak. Different bug class
(TCP/discovery layer vs UDS/spawn layer) but same broader
theme of "fork-spawn unmasked latent cleanup gaps".
- ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md:
different bug entirely (upstream trio WakeupSocketpair.drain() EOF
busy-loop), but the patch for THAT one is what made these tests
reliable enough to observe the UDS leaks consistently in CI.
- fork() can be hacked now?' #379: subint umbrella tracking issue.