Skip to content

fix: replace fast_channel with tokio::sync::mpsc to eliminate spin-deadlock#3960

Merged
sanity merged 4 commits intomainfrom
fix-3959-fast-channel-spin-cap
Apr 27, 2026
Merged

fix: replace fast_channel with tokio::sync::mpsc to eliminate spin-deadlock#3960
sanity merged 4 commits intomainfrom
fix-3959-fast-channel-spin-cap

Conversation

@sanity
Copy link
Copy Markdown
Collaborator

@sanity sanity commented Apr 26, 2026

Problem

A live freenet client peer (v0.2.49) on Linux went unresponsive after running for several hours: the HTTP API on 127.0.0.1:7509 stopped accepting requests, no log output for 4+ hours, but the process kept running with two tokio worker threads pegged near 100% CPU. strace -p <tid> -c over 5 s showed 100% sched_yield, ~30 kHz, on each thread. gdb caught both threads inside crossbeam-channel:

  • TX: UdpPacketsListener::listenConnectionStateManager::try_send_establishedcrossbeam_channel::Sender::try_sendsched_yield
  • RX: PeerConnection::recvfast_channel::FastReceiver::recv_asynccrossbeam_channel::Receiver::try_recvsched_yield

fast_channel.rs wraps crossbeam's bounded array channel. That channel uses a slot-reservation protocol — a sender CAS-bumps tail, then writes the slot's value, then updates the slot's stamp. Receivers do the symmetric dance on head. When either side encounters a slot whose stamp hasn't been finalized yet, they call Backoff::snooze() (→ sched_yield) — and Backoff::snooze has no upper bound. Once a slot is left in a permanent in-flight state, every subsequent try_send/try_recv that lands on it spins forever.

The listener calls try_send_established synchronously from its event loop. A single wedged per-peer channel therefore freezes the listener for every peer plus the entire runtime, including the Axum HTTP server task.

fast_channel (introduced in #2263) has been patched four times for related correctness/perf issues: #2336, #3081, #3208, #3215. This is the fifth instance of the same family.

Approach

Delete fast_channel and migrate every call site to tokio::sync::mpsc directly:

  • tokio's mpsc has no slot-reservation/commit split, so it cannot wedge in the way crossbeam's array channel can. try_send returns Full/Closed immediately under all conditions.
  • recv().await integrates with the runtime via parker/notify primitives owned by the channel, so the Notify wrapper in fast_channel is unnecessary.
  • The "2 × microbenchmark advantage" of crossbeam over tokio mpsc cited in the original fast_channel doc never translated to production throughput — per-channel throughput is bounded by network speed and by per-packet processing in PeerConnection::recv, not by channel ops/sec. Trading the microbench delta for elimination of node-wide-hang tail risk is the right call.
  • The crossbeam workspace dependency is removed (no remaining users).

API shifts:

  • FastSender<T>mpsc::Sender<T>
  • FastReceiver<T>mpsc::Receiver<T> (now needs &mut self for recv()/try_recv())
  • fast_channel::bounded(n)mpsc::channel(n)
  • recv_async()recv().await (returns Option<T>, not Result<T, RecvError>; matches updated accordingly)
  • send_async(x).awaitsend(x).await
  • Sync .send(x) for blocking test sockets → .blocking_send(x)

Testing

  • New regression test test_try_send_established_never_blocks_when_full exercises the listener forwarding path: fills a per-peer channel with no consumer, calls try_send_established 1000 times, asserts the whole loop completes within 1 s. Under crossbeam-backed fast_channel, the wedge required unlucky thread interleaving inside crossbeam internals and was not triggerable in a unit test, but the test pins the invariant the fix establishes — try_send from the listener path must never block, regardless of receiver state.
  • Existing 530 transport unit tests: all pass.
  • Existing 2453 lib tests: all pass.
  • cargo clippy --all-targets -- -D warnings: clean.
  • Drive-by: added a missing // SAFETY: comment to a pre-existing unsafe block in service.rs that was tripping clippy::undocumented_unsafe_blocks even on main.

Why didn't CI catch this?

The wedge requires a specific cross-thread interleaving inside crossbeam-channel-array — a slot reservation that completes its CAS but doesn't update the stamp before the next sender or receiver lands on it. Reproducing this synthetically would require either a custom-built test that hooks into crossbeam internals or a fault-injection layer above it. Neither was practical to add for one library's internal protocol.

The right CI gap to close is the architectural one: the listener's forwarding path must be guaranteed non-blocking. The new test pins exactly that invariant — and tokio mpsc's API guarantees try_send returns immediately by construction, so it cannot regress without changing channels again.

Closes #3959

[AI-assisted - Claude]

…adlock

## Problem

A live freenet node (v0.2.49) on a Linux client peer went unresponsive
after several hours: the HTTP API stopped accepting requests, no log
output for 4+ hours, but the process kept running with two tokio worker
threads pegged near 100% CPU.

`strace -p <tid> -c` over 5s on the two hot threads showed 100%
`sched_yield` (~30 kHz on each thread). gdb backtraces caught both
threads inside crossbeam-channel internals:

  TX: UdpPacketsListener::listen
       → ConnectionStateManager::try_send_established
       → crossbeam_channel::Sender::try_send
       → sched_yield

  RX: PeerConnection::recv
       → fast_channel::FastReceiver::recv_async
       → crossbeam_channel::Receiver::try_recv
       → sched_yield

## Root cause

`fast_channel` wraps crossbeam's bounded array channel. That channel
uses a slot-reservation protocol: a sender CAS-bumps `tail`, then writes
the slot's value, then updates the slot's stamp. Receivers do the
symmetric dance on `head`. Both sides call `Backoff::snooze()`
(→ `sched_yield`) when the slot they want to operate on is in an
in-flight state — and `Backoff::snooze` has no upper bound. If a slot
ever ends up permanently in the in-flight state, every future
`try_send`/`try_recv` that lands on it spins forever.

The listener calls `try_send_established` synchronously from its event
loop. Once a single per-peer inbound channel wedges, the listener
blocks indefinitely on that one channel's `try_send`, starving every
other peer plus the entire runtime — including the Axum HTTP server.

`fast_channel` (introduced in #2263) has been patched four times for
related correctness/perf issues (#2336, #3081, #3208, #3215). This
bug is the fifth instance of the same family.

## Solution

Delete `fast_channel` and migrate every call site to `tokio::sync::mpsc`
directly:

- tokio's mpsc uses a different internal data structure with no
  slot-reservation/commit split, so it cannot wedge in the way
  crossbeam's array channel can.
- `try_send` returns `Full`/`Closed` immediately under all conditions.
- `recv().await` integrates with the runtime via parker/notify wakeup
  primitives owned by the channel, so the `Notify` wrapper around
  crossbeam in `fast_channel` becomes unnecessary.

The "2x microbenchmark advantage" of crossbeam over tokio mpsc cited in
the original `fast_channel` doc comment never translated to production
throughput — per-channel throughput is bounded by network speed and by
packet processing in `PeerConnection::recv`, not by channel ops/sec.
The microbench delta is not worth the tail-risk of node-wide hangs.

The `crossbeam` workspace dependency is removed (no remaining users).

## Testing

- New regression test `test_try_send_established_never_blocks_when_full`
  exercises the listener forwarding path: fills a per-peer channel
  with no consumer, calls `try_send_established` 1000 times, and
  asserts the whole loop completes within 1 second. Under the old
  crossbeam-backed `fast_channel`, this same scenario could not be
  used to reproduce the wedge directly (the wedge requires unlucky
  thread-interleaving inside crossbeam), but the test pins the
  invariant the fix establishes: `try_send` from the listener must
  never block, regardless of receiver state.
- Existing 530 transport unit tests: all pass.
- Existing 2453 lib tests: all pass.
- `cargo clippy --all-targets -- -D warnings` clean.
- Drive-by: added a missing `// SAFETY:` comment on a pre-existing
  `unsafe` block in `service.rs` that was tripping clippy.

Closes #3959

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 26, 2026

Now I have everything needed for a complete review. Let me compile my findings.


Rule Review: send_nat_traversal uses .send().await in listener loop after migration

Rules checked: channel-safety.md, code-style.md, testing.md, transport.md, git-workflow.md
Files reviewed: 9 (Cargo.lock, Cargo.toml, crates/core/Cargo.toml, benches/transport/blackbox.rs, bin/commands/service.rs, transport.rs, transport/connection_handler.rs, transport/fast_channel.rs [deleted])


Warnings

  • crates/core/src/transport/connection_handler.rs:790send_nat_traversal now uses packet_sender.send(packet).await (replacing the equivalent fast_channel::send_async().await). The call site at line 1371 is inside the UDP listener loop (ConnectionState::NatTraversal match arm). Per channel-safety.md, .send().await in event loops is explicitly banned — it has caused 4+ production deadlocks — and the channel-safety audit grep now directly catches this call. If the NAT traversal task stalls or the channel fills (capacity 100), the entire listener loop blocks, halting packet delivery for all peers. The old fast_channel::send_async() had identical semantics so this is not a regression, but the PR migrates to the standard primitive without adding a timeout, and the channel-safety audit will flag it going forward. Fix: replace with try_send() + drop-and-log-on-Full, or wrap with tokio::time::timeout. (rule: channel-safety.md)

Info

None.


Summary: The core fix (removing the crossbeam spin-deadlock via fast_channel deletion and migrating to tokio::sync::mpsc) is correct and well-tested. The regression test suite for try_send_established and try_send_gateway_handshake is thorough. The one WARNING is that the async send_nat_traversal path carries the same indefinite-block behavior into a standard mpsc::send().await that is now directly catchable by the project's own channel-safety audit grep, and should be fixed before merge.


Rule review against .claude/rules/. WARNING findings block merge. ⚠️ 1 warning(s) — fix or add review-override label

Sweep up the leftover mentions of the now-deleted `fast_channel`
module in code comments and architecture/benchmarking docs:

- benches/transport/blackbox.rs module-doc list
- tests/simulation_integration.rs background comment on
  test_thundering_herd_connect_storm
- DOCUMENTATION_GAPS.md "Fast Channel Implementation" entry,
  follow-up TODO, and code-reference summary row
- benchmarking/methodology.md `transport/fast_channel/*` row
  (the bench group never actually existed) and the parenthetical
  in the "Key insight" callout

The historical 2025-12 performance-analysis report intentionally
keeps its references — it documents a decision made at that
time, not current code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pe clarity

Multi-model review (code-first, testing, skeptical, big-picture, codex)
of #3960 raised the following points; addressing here:

- Add a parallel regression test
  `test_try_send_gateway_handshake_never_blocks_when_full` for the
  gateway-handshake forwarding path. It is invoked synchronously from
  the listener loop (`connection_handler.rs:1352`) and was equally
  vulnerable to the crossbeam wedge. Same scaffolding as the
  established-path test.

- Convert both regression tests from `#[test]` to `#[tokio::test]` so
  the channel construction lives inside a runtime context. `mpsc::channel`
  itself does not require one, but keeping the tests in a runtime keeps
  them forward-compatible if anyone extends `ConnectionStateManager` to
  use async primitives later.

- Soften the regression-test docstring claim that the test reproduces
  the original wedge — it does not (the wedge required cross-thread
  interleaving inside crossbeam internals, not synthetically reproducible
  from a unit test). The test pins the structural contract the fix
  establishes, which is the appropriate forward-looking guard rail.

- Add a "Superseded by #3960" header to
  `docs/architecture/transport/benchmarking/2025-12-performance-analysis-01.md`,
  which still recommended migrating *to* `fast_channel`. Without the
  header, that doc would actively contradict the codebase and could
  mislead a future agent into reintroducing the wedge.

The two pre-existing `.send().await` call sites flagged by reviewers
(`send_nat_traversal` at conn_handler.rs:790 and `inbound_streams`
fragment forwarding at peer_connection.rs:1483) are tracked as #3961
rather than fixed in this PR — converting them to `try_send` requires
real design decisions about backpressure semantics for two distinct
subsystems, not a mechanical rename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanity
Copy link
Copy Markdown
Collaborator Author

sanity commented Apr 27, 2026

Review feedback summary

Ran multi-model review (4 internal Claude agents — code-first, testing, skeptical, big-picture — plus Codex). Codex found no regressions. The other four converged on three actionable categories:

Addressed in this PR (commits 9ed6319, c154da2)

  • Stale fast_channel references in docs/comments — code-simplifier swept 7 sites in benches/transport/blackbox.rs, tests/simulation_integration.rs, docs/architecture/transport/DOCUMENTATION_GAPS.md, and docs/architecture/transport/benchmarking/methodology.md. The historical perf analysis doc (2025-12-performance-analysis-01.md) got a "Superseded by fix: replace fast_channel with tokio::sync::mpsc to eliminate spin-deadlock #3960" header rather than a rewrite, since it captures decisions made at that time.
  • Add a parallel regression test for try_send_gateway_handshake — same forwarding path from the listener loop, equally vulnerable to the original wedge. Added test_try_send_gateway_handshake_never_blocks_when_full alongside the existing test.
  • Convert regression tests to #[tokio::test] — keeps them forward-compatible if ConnectionStateManager later uses async primitives.
  • Soften the regression-test docstring claim — the tests do not reproduce the original wedge (which required cross-thread interleaving inside crossbeam internals); they pin the structural invariant the fix establishes. That distinction is now explicit in the docstring.

Tracked as follow-up (#3961), not addressed in this PR

Reviewers flagged two pre-existing .send().await call sites in transport hot paths that violate .claude/rules/channel-safety.md:

  • send_nat_traversal at connection_handler.rs:790 (called synchronously from the listener event loop)
  • inbound_streams fragment forwarding at peer_connection.rs:1483 (legacy stream path inside PeerConnection::recv)

Both predate this PR (they were .send_async().await under crossbeam — same blocking semantics, just renamed by the migration). Converting them to try_send requires real design decisions about backpressure semantics for two distinct subsystems (drop the packet vs tear down the handshake; drop the fragment vs close the stream), not a mechanical rename. Tracked in #3961 with the suggested approach.

Acknowledged but not changed

  • Service.rs SAFETY-comment drive-by is a one-line clippy fix on a pre-existing unsafe block that was failing cargo clippy --all-targets -- -D warnings even on main. Disclosed in the original PR description.
  • No bench numbers backing the "2x microbench delta is irrelevant" claim — softened the language; the load-bearing argument is "per-channel throughput is bounded by network speed and per-packet processing in PeerConnection::recv," not the microbench delta itself.
  • Pre-existing race between gateway-handshake completion and packet drain (skeptical reviewer) — the same race exists with crossbeam, unchanged by this PR.

Test status

  • 2455 lib tests pass (was 2453; +2 for the new regression tests)
  • cargo fmt --check clean
  • cargo clippy --all-targets -- -D warnings clean

[AI-assisted - Claude]

… invariants

Two issues caught by the CI Rule Lint job:

1. **`tokio::net::UdpSocket` used in `crates/core/`** (banned per
   `.claude/rules/testing.md`). The two new regression tests added
   `use tokio::net::UdpSocket;` only to satisfy the type parameter
   `S` in `ConnectionStateManager<S, T>` — it was never used as a
   real socket. Switched to `super::mock_transport::MockSocket`,
   which already exists in this file and implements the `Socket`
   trait.

2. **20 tests removed, 2 added**: the deleted `fast_channel.rs` had
   20 unit/property/stress tests of the wrapper's behavior. They
   were testing things the transport layer relies on from its
   channel primitive (basic send/recv, backpressure, sender clone,
   disconnect detection, no-busy-loop wakeup, burst drain,
   multi-producer concurrency, stress-under-contention, slow
   receiver, sender drop while receiver parked). Retargeted them
   to `tokio::sync::mpsc` directly in a new `mpsc_invariants` test
   module under `connection_handler.rs`. Twenty new test functions
   replace the twenty deleted, with one extra (`capacity_is_exact`)
   to pin a property the legacy tests did not cover explicitly.

   These tests document the contract the transport layer expects
   from its channel primitive. If anyone reintroduces a wrapper
   that wedges in spin-loops or silently drops messages under
   backpressure, these tests trip a clear failure.

All 22 newly-added test functions in this PR pass. Existing 530
transport unit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanity sanity enabled auto-merge April 27, 2026 00:18
@sanity sanity added this pull request to the merge queue Apr 27, 2026
Merged via the queue into main with commit 3d71df9 Apr 27, 2026
16 of 17 checks passed
@sanity sanity deleted the fix-3959-fast-channel-spin-cap branch April 27, 2026 00:41
sanity added a commit that referenced this pull request Apr 27, 2026
…channel

Multi-model review (Codex P1, skeptical reviewer M2, testing reviewer)
unanimously flagged the `peer_connection.rs` change as wrong: turning
`Full` into `TransportError::ConnectionClosed` for the per-stream
`inbound_streams` channel introduces a new failure mode where
otherwise-healthy large transfers abort whenever transient executor
pressure backs up more than 64 fragments before the spawned
`recv_stream` reassembly task drains them.

The receiver here is freenet-spawned, internal, same-runtime, and
has no upstream dependency on the producing recv loop — so the
cascading-backpressure deadlock pattern that `channel-safety.md`
exists to prevent cannot form. The previous `.send().await` provided
the correct backpressure semantics. Reverted that file to use
`.send().await` again, with a load-bearing comment citing the new
"Exception: same-runtime internal consumers" section in the rule.

The NAT-traversal change in `connection_handler.rs` is kept (that
path IS subject to cross-peer cascade via the listener event loop).

Other review feedback addressed:

- **Skeptical M1 / Codex docstring accuracy**: the doc-comment on
  `try_send_nat_traversal` claimed "10 retries over ~3 s" — that's
  the cfg(test) constant. Production sets `NAT_TRAVERSAL_MAX_ATTEMPTS = 40`
  capped by a 3 s `overall_deadline` at a 200 ms cadence, so ~15
  attempts. Fixed both the rust doc-comment and `transport.md:92` to
  cite the precise constants and explain the production vs test split.

- **Skeptical M3**: `try_send_nat_traversal` collapsed `Full` and
  `Closed` into a single `Ok(false)` return, and the listener logged
  them with one indistinguishable message. Promoted the return type
  to a dedicated `TrySendNatTraversalOutcome { Sent, Full, Closed }`
  enum. The listener now logs them separately:
    - `Full`: "channel full, dropping packet (peer will retry)"
    - `Closed`: "handshake task exited; dropping packet"
  These are diagnostically different and worth distinguishing in
  production logs.

- **Big-picture (low)**: documented the #3959#3960#3961 listener-
  forwarding incident family in `channel-safety.md`, and added the
  "Exception: same-runtime internal consumers" subsection that the
  reverted peer_connection.rs comment references. Without this,
  the rule was prima-facie violated by the `.send().await` we just
  put back, with no documented justification.

Test results unchanged: 553 transport unit tests pass. Test
`test_try_send_nat_traversal_never_blocks_when_full` updated for
the new enum return type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fast_channel deadlocks node in sched_yield spin loop on wedged crossbeam slot

1 participant