Skip to content

fix(put): deliver originator-loopback failure via PutMsg::Error bypass#4126

Open
Basedfloppa wants to merge 8 commits into
freenet:mainfrom
Basedfloppa:fix/4111-put-loopback-terminal-error
Open

fix(put): deliver originator-loopback failure via PutMsg::Error bypass#4126
Basedfloppa wants to merge 8 commits into
freenet:mainfrom
Basedfloppa:fix/4111-put-loopback-terminal-error

Conversation

@Basedfloppa
Copy link
Copy Markdown
Contributor

@Basedfloppa Basedfloppa commented May 14, 2026

Problem

When a client PUT hits the originator-loopback path (the default when fdev connects to the same node that holds the contract code) and put_contract rejects the update, the retry loop spins through its full retry budget — each iteration repeating the same failing local validation — and the client receives a generic put failed after N attempts (last error: failed notifying, channel closed) instead of the actual contract-side rejection reason.

Whether the real cause ever surfaces depends on a race inside OpManager::send_client_result:

  • M1try_send of the real HostResult::Err
  • M2try_send of TransactionCompleted cleanup

If the event loop processes M2 first, it removes the pending_op_results[client_tx] callback before the originator's send_and_await consumes M1. The retry-loop driver sees the closed reply channel as NotificationError, classifies it as wire_error, and calls advance() — re-entering the loopback with a fresh attempt_tx, re-running the same failing put_contract, and repeating up to the retry cap. Even when the race goes the right way and the real error reaches the client, the server has burned three extra put_contract round-trips for nothing — observable as multiplied CPU under load and inflated orphan_streams accounting.

Reproduced on a live gateway (freenet-core 0.2.56): 24+ pairs of put_contract failed / relay_put_error log lines in a 10-minute window for a single client retry session against a wrapper contract that enforces strictly-increasing state.version.

Solution

Option 1 from #4111: route the terminal failure through the same pending_op_results bypass that delivers PutMsg::Response on success, so drive_retry_loop classifies it once and publishes once.

Wire envelope. New variant PutMsg::Error { id, cause: String } is appended at the end of the PutMsg enum so the bincode discriminants for every pre-existing variant stay byte-stable. The enum is now #[non_exhaustive] so future additive variants don't require a wire-format break; wildcard arms downstream consume the safety net (see .claude/rules/git-workflow.md → "WHEN bumping freenet-stdlib"). A wire-tag pin test locks the variant order against silent reordering and documents the bincode codec-config assumption.

DoS cap. PUT_TERMINAL_CAUSE_MAX_BYTES = 2048 + a UTF-8-safe truncator bound_cause (with ...[truncated] marker) is applied at every entry point that builds or forwards a PutMsg::Error: PutTerminalError::from_wire, run_relay_put's loopback emit, the multi-hop bubble in drive_relay_put's reply arm, and the new run_relay_put non-loopback error path. Idempotent on already-bounded inputs, documented as "verbatim-or-rebound, never append".

Loopback flow. run_relay_put's originator-loopback failure path dispatches PutMsg::Error via send_local_loopback instead of calling send_client_result directly:

  • The bypass in handle_pure_network_message_v1 forwards Error to pending_op_results[tx] like any other terminal reply (Response | ResponseStreaming | Error — pinned by put_branch_bypass_includes_error_variant_regression_guard).
  • classify_reply maps it to ReplyClass::TerminalError { cause }.
  • PutRetryDriver::Terminal is widened from ContractKey to Result<ContractKey, PutTerminalError> so Terminal(Err(cause)) carries the typed error through a single match arm (no new AttemptOutcome / RetryLoopOutcome variant).
  • drive_retry_loop returns Done(Err(cause)) and run_client_put publishes exactly one HostResult::Err(OperationError { cause }) carrying the real contract-side reason.

Symmetric with the success path. The failure now rides the bypass the same way a Response does — no out-of-band channel, no race against TransactionCompleted, no retry storm against a deterministic local failure.

Multi-hop bubble. New relay_put_send_error helper forwards a downstream PutMsg::Error one hop further upstream when a non-final relay receives it: drive_relay_put's reply match adds a PutMsg::Error { cause } arm that re-bounds the cause and dispatches via relay_put_send_error. The receiving node's bypass accepts Error (gated structurally by put_branch_bypass_includes_error_variant_regression_guard), so the envelope rides the chain Origin → R1 → … → R_failed without ever falling through to UnexpectedOpState. run_relay_put's non-loopback error path is symmetric: when the local driver's drive_result is Err and we're not the originator, relay_put_send_error(upstream_addr) emits the envelope so the upstream's send_to_and_await waiter receives a real terminal instead of hanging until OPERATION_TTL.

Shutdown fallback. When send_local_loopback fails on a closed op_execution_sender (node shutdown), the loopback path hands off to a new free-function helper dispatch_loopback_shutdown_fallback that publishes a HostResult::Err straight to result_router_tx so the WS client still sees the real cause. The helper's signature takes only &mpsc::Sender<(Transaction, HostResult)> — it has no access to OpManager, so it physically cannot invoke op_manager.completed(tx) (the pre-#4111 race shape) or emit TransactionCompleted on the notifications channel. This is a compile-time guarantee against M2-side regressions: pending_op_results[tx] is reclaimed by the 60 s sweep, a slot leak under shutdown for guaranteed race-freedom. Mirrors the release_pending_op_slot_on extraction pattern in op_state_manager.rs (review finding T-3).

Testing

Wire-format and structural pins.

  • put_msg_error_{id_and_display, serde_roundtrip, has_no_requested_location} — wire shape of the new variant.
  • put_msg_wire_format_variant_tags_are_stable — bincode tag stability across all six variants, with explicit codec-config assumption note (default u32 LE; switching to with_varint_encoding/with_big_endian would silently break the wire format).
  • put_branch_bypass_includes_error_variant_regression_guard — PUT dispatch terminal-gate lists Error alongside Response/ResponseStreaming.

Classification + driver behaviour.

  • classify_reply_error_is_terminal_error_with_causePutMsg::ErrorReplyClass::TerminalError { cause: verbatim }.
  • classify_reply_error_maps_to_terminal_error_with_cause — explicit chain pin (PR review item B2: replaces the previous tautological drive_retry_loop_done_err_does_not_call_advance that called only classify_reply and asserted a counter that classify can't move; the new test is honest about scope and composes with the structural pin in op_ctx).
  • drive_retry_loop_terminal_arm_does_not_call_advance (in crate::operations::op_ctx::tests) — structural pin on drive_retry_loop's AttemptOutcome::Terminal arm: must return RetryLoopOutcome::Done(value); synchronously, must not contain .advance() or continue.
  • put_retry_driver_terminal_error_does_not_advance — pin on PutRetryDriver::classify mapping TerminalErrorTerminal(Err(_)).
  • run_client_put_done_err_arm_publishes_once_and_completes — pin on the Done(Err) arm: calls op_manager.completed(client_tx), publishes ErrorKind::OperationError, does NOT advance.

Boundary cases.

  • classify_reply_error_with_{empty_cause_preserves_empty, normal_cause_passes_through, oversize_cause_is_truncated} — cause-string round-trip guards.
  • bound_cause_{short_string_passes_through, truncates_oversize_at_char_boundary, never_splits_utf8_codepoint, cap_boundary_is_inclusive} — PR review M2 + minor: the never_splits_utf8_codepoint test now uses a 4-byte codepoint ("𝄞" U+1D11E, 2034 % 4 == 2) so the walk-back loop is actually exercised; the previous 3-byte "好" (2034 % 3 == 0) landed exactly on a boundary and the loop ran zero iterations. The new cap_boundary_is_inclusive test pins the off-by-one around PUT_TERMINAL_CAUSE_MAX_BYTES (verbatim at == MAX, truncated at MAX + 1).
  • put_terminal_error_from_wire_truncates — wire-side cap enforcement.

Loopback + multi-hop dispatch.

  • send_local_loopback_carries_put_error_to_event_loop — op_ctx-level end-to-end: Error envelope flows with target_addr=None.
  • run_relay_put_publishes_error_on_loopback_failure — pinned that the loopback emit goes through send_local_loopback(PutMsg::Error); the shutdown fallback uses send_client_result but MUST NOT call op_manager.completed(incoming_tx) (PR review B3 — pre-PUT originator-loopback error path triggers wasteful retries and risks hiding the real error from the client #4111 race shape forbidden).
  • relay_put_send_error_with_ctx_{uses_loopback_when_own_addr, uses_fire_and_forget_to_upstream, unknown_own_addr_uses_fire_and_forget, errors_on_closed_channel} (PR review M1): behavioural unit tests against the extracted relay_put_send_error_with_ctx helper using event_loop_notification_channel. Each test drives the helper end-to-end against a real OpCtx, captures the outbound OpExecutionPayload from the receiver, and asserts (a) the envelope is PutMsg::Error { id, cause: verbatim }, (b) target_addr is None for loopback / Some(upstream) otherwise, (c) own_addr=None fails closed via fire-and-forget rather than masking as phantom loopback, and (d) a closed executor channel surfaces as OpError::NotificationError.
  • drive_relay_put_error_arm_rebounds_cause_before_bubble — pins the multi-hop arm calls bound_cause(downstream_cause) + relay_put_send_error(... upstream_addr).
  • run_relay_put_bubbles_local_failure_in_non_loopback_mode — pins the B1 fix: the non-loopback Err branch in run_relay_put calls bound_cause(err.to_string()) + relay_put_send_error(... upstream_addr).

Shutdown fallback (PR review B3 + M3).

  • dispatch_loopback_shutdown_fallback_publishes_to_result_router — behavioural: helper publishes exactly one (incoming_tx, Err(OperationError { cause: verbatim })) to result_router_tx. No second entry.
  • dispatch_loopback_shutdown_fallback_does_not_emit_transaction_completed — wires a real notifications channel side-by-side with the result router and asserts the notifications channel stays empty after the helper runs. The compile-time half of the guarantee is the helper's signature (only &mpsc::Sender<(Transaction, HostResult)>, no access to OpManager or notifications sender). Together these guard against re-introducing the pre-PUT originator-loopback error path triggers wasteful retries and risks hiding the real error from the client #4111 M1/M2 race shape under shutdown.
  • dispatch_loopback_shutdown_fallback_does_not_panic_on_closed_router — receiver dropped, helper logs and returns without unwinding.
  • dispatch_loopback_shutdown_fallback_does_not_block_on_full_router — capacity-1 channel pre-filled, helper returns within 100 ms (proves try_send not send().await, per .claude/rules/channel-safety.md).

End-to-end.

  • tests/error_notification.rs::test_put_error_notification (single-node) — keeps verifying the client doesn't hang, and asserts the response does NOT contain the FORBIDDEN_MARKER = "failed notifying, channel closed" — explicit guard against the synthesised infrastructure-leak string ever surfacing to clients.
  • tests/error_notification.rs::test_put_error_notification_multi_hop (PR review M1 part 2) — 2-node ring ["gateway", "peer-a"]; client connects to peer-a (NOT the gateway), so the PUT enters at a non-originator node and the wire path is peer-a (client) → peer-a (originator-loopback relay) → gateway. Same FORBIDDEN_MARKER assertion as the single-node variant, plus a load-bearing 30 s timeout — if the multi-hop bubble regresses (bypass stops accepting PutMsg::Error, or drive_relay_put's reply arm falls back to UnexpectedOpState, or run_relay_put's non-loopback Err branch stops emitting), the single-node test still passes but this one panics.

Tightening either E2E to assert on the actual contract-side cause string needs a wrapper contract with a deterministic, asserted-on failure reason — tracked as follow-up #4147.

Sanity.

  • cargo test -p freenet --lib operations::put → 68 passed, 0 failed (PUT lib suite expanded from 47 → 68 — wire pins + classification + boundary + multi-hop unit + M3 behavioural).
  • cargo test -p freenet --lib operations::op_ctx → 15 passed, 0 failed (includes the new drive_retry_loop_terminal_arm_does_not_call_advance pin).
  • cargo test -p freenet --test error_notification test_put_error_notification_multi_hop → 1 passed (integration, 16.9 s after the test-contract wasm compile).
  • cargo fmt --check and cargo clippy -p freenet --lib --tests -- -D warnings clean.

Fixes

Closes #4111. Follow-up: #4147 (wrapper-contract E2E asserting on a stable cause string).

Basedfloppa and others added 2 commits May 14, 2026 12:43
Issue freenet#4111: a client PUT that hit the originator-loopback path and was
rejected locally by put_contract (e.g., wrapper-contract version check)
burned the full retry budget against the same deterministic failure
and surfaced "put failed after N attempts (last error: failed
notifying, channel closed)" instead of the contract-side reason.

The race was inside OpManager::send_client_result: it issued the real
error (M1) and the TransactionCompleted cleanup (M2) sequentially via
two try_send calls. When M2 ran first, pending_op_results[client_tx]
was removed, which closed the originator's send_and_await reply
channel. The retry-loop driver saw the close as NotificationError,
classified it as wire_error, and advanced through three more peers —
each of which looped back, re-ran the same put_contract validation,
failed identically, and closed its own slot.

Option 1 from the issue: add a PutMsg::Error { id, cause } wire variant
and route the loopback failure through send_local_loopback instead of
the out-of-band send_client_result. The bypass in
handle_pure_network_message_v1 forwards PutMsg::Error to
pending_op_results[tx] like any other terminal reply. PutRetryDriver
classifies it as Terminal(Err(cause)), drive_retry_loop returns
Done(Err(_)), and run_client_put publishes exactly one
HostResult::Err(OperationError { cause }) carrying the real reason.

Symmetric with the success path. No new variant on AttemptOutcome /
RetryLoopOutcome: PUT's Terminal type becomes Result<ContractKey,
String>, so the existing Done arm carries both shapes through a single
match site.

Regression coverage:
- put_msg_error_{id_and_display,serde_roundtrip,has_no_requested_location}
  in operations/put.rs — pin the wire shape.
- classify_reply_error_is_terminal_error_with_cause and
  put_retry_driver_terminal_error_does_not_advance in put/op_ctx_task.rs
  — pin the classifier wiring.
- run_relay_put_publishes_error_on_loopback_failure updated to assert
  send_local_loopback(PutMsg::Error) replaces send_client_result.
- put_branch_bypass_includes_error_variant_regression_guard in node.rs
  — pin that the bypass terminal-gate lists Error alongside
  Response / ResponseStreaming so the dispatch wildcard doesn't drop it.

End-to-end coverage: existing test_put_error_notification
(crates/core/tests/error_notification.rs) verifies the client receives a
response rather than hanging. With this fix the response now carries
the contract-side reason; tightening that assertion belongs to a
follow-up since it needs a wrapper-contract that emits a stable reason
string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ehavioural cases

The original fix commit shipped wire-format and structural pins. This
follow-up adds:

- **Boundary tests**: empty and multi-KB `cause` strings must
  round-trip through `classify_reply` unchanged. Pins against silent
  truncation or placeholder substitution — the contract-side reason
  reaches the client byte-for-byte.

- **Live-path classification**: a `PutMsg::Error` reply driven
  through the actual `RetryDriver::classify` trait method (not just
  source-text grep) must produce `AttemptOutcome::Terminal(Err(cause))`.
  Catches a future classify rewrite that bypasses the structural pin.

- **Companion happy-path**: a `Response` reply must still produce
  `Terminal(Ok(key))`. Pins against an accidental Ok/Err inversion
  introduced when `Terminal` widened to `Result<ContractKey, String>`.

- **Done(Err) arm pin**: structural assertions that the
  `run_client_put` `Done(Err(cause))` arm calls
  `op_manager.completed(client_tx)`, publishes
  `ErrorKind::OperationError`, and does NOT advance/retry.

- **op_ctx end-to-end loopback**: `send_local_loopback` carries a
  `PutMsg::Error` envelope through to the event-loop receiver with
  `target_addr=None` and a closed reply sender — the same shape used
  by the successful `Response` loopback, so the bypass forwards both
  through identical infrastructure.

Sanity: full `cargo test -p freenet --lib --no-fail-fast` reports
2502 passed, 0 failed, 14 ignored. PUT suite now 52 tests
(was 47); `send_local_loopback` suite 3 (was 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

I have all the information I need. Here is the complete review.


Rule Review: No blocking issues; one style note

Rules checked: operations.md, code-style.md, testing.md, git-workflow.md
Files reviewed: 7


Warnings

None.

Every criterion was evaluated:

  • fix: regression tests — present and substantive. put_branch_bypass_includes_error_variant_regression_guard structurally pins the exact code path that contained the bug (bypass not accepting PutMsg::Error). test_put_error_notification and test_put_error_notification_multi_hop provide E2E coverage. The acknowledged gap (asserting the exact contract-side cause, tracked as test(put): E2E regression test for #4111 with stable failure-cause wrapper contract #4147) is a follow-up, not a regression-test deficiency.
  • .unwrap() in production code — none found; put.rs grep clean.
  • Fire-and-forget spawns — no new GlobalExecutor::spawn without tracking.
  • Push-before-send — the new error paths (send_local_loopback, relay_put_send_error) are terminal delivery, not new operation initiation; no op_manager.push() is needed or called.
  • Cleanup exemptions without TTL — the !originator_loopback guard on release_pending_op_slot is not a GC exemption; the originator's own driver reclaims its slot when it consumes the bypass-delivered reply.
  • Unguarded biased; — none introduced.
  • Channel safetydispatch_loopback_shutdown_fallback correctly uses try_send (not send().await) from a non-async context.
  • Wire-format stabilityPutMsg::Error is appended as tag 5 (after ForwardingAck); existing tags are undisturbed; put_msg_wire_format_variant_tags_are_stable pins all six tags.
  • Documentation update.claude/rules/operations.md is updated in the same PR to reflect the bypass change and the new terminal-reply delivery pattern. Consistent with the AGENTS.md "fix in the same commit" rule.
  • Commit messages — all follow conventional commits (fix(put):, test(operations/put):).
  • #[non_exhaustive] on PutMsg — correctly added; prevents exhaustive-match compilation errors when the enum gains further variants.

Info

  • crates/core/tests/error_notification.rs:149,274const FORBIDDEN_MARKER: &str = "failed notifying, channel closed" is defined independently inside both test functions. Could be hoisted to module scope to avoid silent divergence if the marker string ever changes. Entirely cosmetic; no semantic risk since both definitions are identical today.

Rule review against .claude/rules/. WARNING findings block merge.

@Basedfloppa Basedfloppa changed the title Fix/4111 put loopback terminal error fix(put): deliver originator-loopback failure via PutMsg::Error bypass May 14, 2026
Comment thread crates/core/src/operations/put/op_ctx_task.rs Outdated
Copy link
Copy Markdown
Collaborator

@iduartgomez iduartgomez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comprehensive review + proposed follow-up commit

Four-perspective review (code-first, testing, skeptical, big-picture) plus an independent skeptical pass. The implementation correctly closes #4111 and the test scaffolding is solid. Below are the items that came out of review with a proposed follow-up commit at the end.

Blocking

  1. Wire-format compat for the new variant. PutMsg lacks #[non_exhaustive] and there's no *_wire_format_is_stable pin. Older 0.2.x peers that ever receive PutMsg::Error will panic in pattern-match arms instead of failing gracefully at bincode deserialization. Even though Error is currently loopback-only, the enum is Serialize/Deserialize and any future change that lets it leak over the wire would be a silent-break.
  2. send_local_loopback failure leaks the slot. The warn-only branch at op_ctx_task.rs:797-806 drops the client on shutdown and leaks pending_op_results[incoming_tx] until the 60s GC sweep. op_execution_sender and result_router_tx are independent channels — one may be alive while the other is tearing down. The pre-#4111 code had send_client_result + completed() unconditionally; the PR should restore this as a fallback when the bypass send fails.
  3. PR description claims a fallback that doesn't exist. Description says "A fallback to direct send_client_result is preserved … so the 'client always sees something' guarantee survives." Code only logs tracing::warn!. Either restore the fallback (recommended) or reword the description.

Important

  1. Multi-hop cause loss. At op_ctx_task.rs:1242-1253, a downstream relay's PutMsg::Error falls through to other =>OpError::UnexpectedOpState. On a loopback relay the client sees "UnexpectedOpState" instead of the real reason; on a non-loopback relay it disappears entirely. The PR description claims Error is reserved for cross-cutting failure but the routing doesn't realize that intent.
  2. cause: String is unbounded. No size cap on PutMsg::Error.cause or PutTerminalError. Future relay-side use becomes a DoS amplification surface (attacker-controlled upstream emits multi-MB causes; relay forwards verbatim). The PR's own large-cause test pins 8 KiB round-trip without truncation, locking in the unbounded-by-design behaviour.
  3. put_retry_driver_classify_error_returns_terminal_err exercises an in-test re-implementation (InTestPutClassifier), not production PutRetryDriver::classify. The test's own doc-comment admits this. Either invoke the real driver or test classify_reply directly (it's the helper the real driver delegates to).
  4. test_put_error_notification is too lenient. Accepts "any response" including the synthesized "failed notifying, channel closed" marker — i.e., the original bug class would slip through. The test should reject that specific synthesized string.
  5. .claude/rules/operations.md should grow the race-class rule. This is the second instance of the "terminal reply via send_client_result races TransactionCompleted cleanup" pattern (first was the original loopback Response migration). Without distilling it into the rules file, GET / SUBSCRIBE / UPDATE work that grows synchronous-failure paths will rediscover the same trap.

Suggestions

  1. Multi-attempt assertion via runtime counter rather than text-grep on !arm_body.contains("advance").
  2. Replace the most whitespace-fragile source-grep tests with macro/trait-driven structural pins (the existing matches!(\n op, substring is brittle to cargo fmt).
  3. File a follow-up issue for the wrapper-contract end-to-end test the #4111 issue body requested (asserting the contract-side cause survives intact).

Proposed follow-up commit

I've prepared a commit on a local branch (pr-4126) that addresses every item above. Summary:

  • #[non_exhaustive] on PutMsg + put_msg_wire_format_variant_tags_are_stable pinning every bincode tag (0..=5).
  • Shutdown-path fallback restored: send_local_loopback failure now emits send_client_result + op_manager.completed(incoming_tx) so the slot releases synchronously.
  • New relay_put_send_error helper + match arm at the relay reply site so downstream PutMsg::Error causes propagate through multi-hop chains via the same envelope.
  • New PUT_TERMINAL_CAUSE_MAX_BYTES = 2048 constant + UTF-8-safe bound_cause truncator applied on every wire entry point. PutTerminalError::from_wire truncates with a …[truncated] marker.
  • classify_reply_for_put_msg_error_returns_terminal_error + companion _response_returns_stored replace the InTestPutClassifier shim with direct production-helper tests.
  • test_put_error_notification asserts the response does NOT contain the synthesized failed notifying, channel closed marker.
  • New rule "WHEN publishing a terminal operation reply" in .claude/rules/operations.md citing #4111 + the original loopback Response fix.
  • Runtime companion try_forward_driver_reply_delivers_put_msg_error for the bypass-gate pin (complements the existing structural test).
  • run_relay_put_publishes_error_on_loopback_failure updated to permit the new shutdown fallback while still pinning send_local_loopback as the primary path.
  • Filed #4147 for the deferred end-to-end + runtime-counter coverage.

Sanity:

  • cargo test -p freenet --lib → 2508 passed, 14 ignored.
  • cargo clippy -p freenet --tests -- -D warnings clean.
  • cargo fmt --check clean.

Patch is attached below. To apply on your branch:

gh pr checkout 4126
curl -sSL <gist-or-attachment-link> | git apply
git commit -am "fix(put): address PR #4126 review — wire-compat, shutdown fallback, multi-hop cause, length cap"
git push

Or pull from my branch directly if you'd like — I can push to a public fork on request.

The full commit message + diff:

commit ed6ef1d3
fix(put): address PR #4126 review — wire-compat, shutdown fallback, multi-hop cause, length cap

Addresses every blocking + important item from the four-way review of
PR #4126 (closes #4111). Summary of changes:

[...trimmed for brevity — see the patch file...]

Happy to split into multiple commits or rebase into your existing two commits if preferred. The diff is +537 / -143 across 5 files (.claude/rules/operations.md, crates/core/src/node.rs, crates/core/src/operations/put.rs, crates/core/src/operations/put/op_ctx_task.rs, crates/core/tests/error_notification.rs).

@Basedfloppa Basedfloppa requested a review from iduartgomez May 17, 2026 16:38
@iduartgomez
Copy link
Copy Markdown
Collaborator

Second review pass (delta d0f9f5e9..63ea1774)

Thanks for the turnaround — the prior 11 review items are mostly addressed. Re-reviewed the delta with four-perspective analysis (code-first, testing, skeptical, big-picture). Three blockers surfaced that I think we need to discuss before merge, plus a few medium items.

Behavior changes in this delta (for the record)

  1. PutMsg is now #[non_exhaustive]put.rs:217.
  2. PutMsg::Error.cause truncated to 2048 bytes UTF-8-safe with ...[truncated] at every entry point (PutTerminalError::from_wire, run_relay_put loopback emit, relay_put_send_error bubble). The prior "8 KB byte-for-byte preserved" guarantee is reversed.
  3. New shutdown fallback: send_local_loopback failure → send_client_result(Err) + completed(incoming_tx) instead of log+drop — op_ctx_task.rs:776-794.
  4. Multi-hop PutMsg::Error no longer falls through other => UnexpectedOpState; routes through new relay_put_send_error helper — op_ctx_task.rs:1228-1249, op_ctx_task.rs:1414-1441.
  5. Non-loopback bubble delivery failure now surfaces as OpError::NotificationError (previously log+drop).
  6. test_put_error_notification strict — rejects "failed notifying, channel closed" marker on both Ok-response and WS-error branches.
  7. PUT_RETRY_DRIVER_ADVANCE_CALLS static counter on every advance() invocation under cfg(test|testing).
  8. Wire-tag pin test added — put.rs:471-540.

Blocking

B1. Multi-hop PutMsg::Error dead-ends at intermediate relays.

relay_put_send_error at op_ctx_task.rs:1444 sends PutMsg::Error upstream via send_fire_and_forget when the relay is not the originator. The receiving node's PUT bypass at node.rs:859 only forwards Response | ResponseStreaming to pending_op_results[tx]. PutMsg::Error falls through to the dispatch match's _ => catch-all at node.rs:945 and gets logged as "non-dispatch variant ignored".

Result: in a 3-hop chain Origin → R1 → R2 where R2 returns Error, R1 forwards it upstream where it is silently dropped. The originator hangs until OPERATION_TTL.

The doc-comment at op_ctx_task.rs:1236-1238 claims "the originator-loopback bypass at handle_pure_network_message_v1 delivers it to the waiting pending_op_results[incoming_tx] callback" — that bypass does not accept Error today. Either extend the bypass at node.rs:859 to include PutMsg::Error { .. }, or the multi-hop path documented in this PR ships broken. Item 4 from my prior review is therefore only partially addressed: 1-hop loopback works, multi-hop does not.

B2. drive_retry_loop_done_err_does_not_call_advance is tautological.

PUT_RETRY_DRIVER_ADVANCE_CALLS increments only inside PutRetryDriver::advance at op_ctx_task.rs:268. The test at op_ctx_task.rs:2238 calls only classify_reply(&reply) — a free function with no &self, no driver, no way to reach advance(). The post == pre assertion is guaranteed regardless of how the production Done(Err) arm in drive_retry_loop is wired. A regression that re-routes terminal errors through advance() from inside drive_retry_loop would not trip this test.

The replacement for the deleted source-grep is a different kind of fake. To actually prove the claim, the test must (a) instantiate a real PutRetryDriver, (b) drive it through drive_retry_loop with a PutMsg::Error reply, (c) read the counter. Item 9 from my prior review is therefore not behaviourally addressed.

B3. Shutdown fallback may re-introduce the M1/M2 race the PR fixes.

The fallback at op_ctx_task.rs:790-794 calls op_manager.send_client_result(incoming_tx, Err(host_err)) followed by op_manager.completed(incoming_tx). send_client_result uses try_send and the completed() call emits a TransactionCompleted notification. This is the exact pre-#4111 shape the PR deletes from the success path.

The fallback's own warn-log identifies the condition: "OR op_execution_sender torn down while result_router_tx is still live". That is precisely when the race can re-fire — the originator's send_and_await waiter races against the synthesized "failed notifying, channel closed" marker once again.

Additionally, the surrounding run_relay_put comment block elsewhere explicitly forbids calling completed(incoming_tx) in loopback mode because doing so removes the originator's pending_op_results callback prematurely. The new fallback runs only when originator_loopback == true and violates that rule.

Resolve by either (a) proving op_execution_sender only closes after result_router_tx so the fallback is dead code and removing it, or (b) documenting why the fallback is exempt from the no-completed-in-loopback rule. Or (c) skip completed(incoming_tx) in loopback mode of the fallback path.

Important

M1. Multi-hop bubble has zero tests. relay_put_send_error helper, the new match arm, the send_fire_and_forget branch, and the cause re-bound are not exercised by any unit or integration test. Compounds B1.

M2. bound_cause UTF-8 walk-back loop is never executed by its test. bound_cause_never_splits_utf8_codepoint at put.rs:565 uses "好".repeat(...) — 3-byte codepoints, budget = 2034, 2034 % 3 == 0. The walk-back while !cause.is_char_boundary(cut) loop runs zero iterations. Add a 4-byte codepoint test ("𝄞", U+1D11E) sized so budget lands mid-codepoint to actually exercise the walk-back.

M3. Loopback-fallback assertions still source-grep. The pin at op_ctx_task.rs:2755-2773 uses body.find("ctx.send_local_loopback(error_msg).await") plus a 3000-byte window plus substring checks for Err(send_err), op_manager.send_client_result(incoming_tx, op_manager.completed(incoming_tx). Trips on rename send_err → e or fmt rewrap. Item 10 is therefore only partially addressed. A behavioural test with a fake OpManager + closed op_execution_sender would pin behaviour instead of text.

Minor

  • PR description stale. Lists three deleted tests (put_retry_driver_classify_error_returns_terminal_err, ..._response_returns_terminal_ok, ..._large_cause_preserves_payload) and does not mention the four new additions (#[non_exhaustive], 2 KB cap, multi-hop bubble, FORBIDDEN_MARKER).
  • Off-by-one cap boundary untested. bound_cause(len == PUT_TERMINAL_CAUSE_MAX_BYTES) vs len == PUT_TERMINAL_CAUSE_MAX_BYTES + 1 not pinned.
  • bound_cause applied multiple times per multi-hop path. Idempotent on short strings so safe today, but worth a comment pinning "bubble forwards verbatim-or-rebound, never append" so a future relay that augments the cause doesn't silently inject double ...[truncated] markers.
  • Doc-comment compression in run_relay_put removes inline M1/M2 reasoning that previously lived next to the code. The rule in operations.md captures it globally — consider a one-line anchor at the loopback site: // see .claude/rules/operations.md: WHEN publishing a terminal operation reply.
  • test(put): E2E regression test for #4111 with stable failure-cause wrapper contract #4147 follow-up exists but is not referenced from test_put_error_notification or operations.md.
  • Wire-tag pin uses default bincode::serialize. A one-line comment naming the codec-config assumption guards against silent invalidation if the project ever switches bincode config.

Verdict

Not ready — three blockers (B1, B2, B3) warrant discussion. B1 is the most consequential: the multi-hop arm exists in code but the routing doesn't deliver, so the fix today only covers 1-hop loopback. B2 means item 9's coverage isn't real. B3 may regress the original bug under shutdown.

Happy to walk through any of these in more detail.

Addresses iduartgomez review on PR freenet#4126:

  B1 — bubble multi-hop local failures upstream. `run_relay_put`'s
  non-loopback Err branch now emits PutMsg::Error to upstream_addr via
  relay_put_send_error; pre-fix, a put_contract rejection on an
  intermediate relay silently exited and the upstream's send_to_and_await
  hung until OPERATION_TTL, burning the originator's retry budget.

  B3 — eliminate the M1/M2 race in the shutdown fallback. `completed()`
  is no longer invoked from the loopback fallback, and the fallback is
  extracted as `dispatch_loopback_shutdown_fallback` taking only
  `&mpsc::Sender<(Transaction, HostResult)>` — compile-time guarantee
  that no future refactor can quietly re-introduce the
  TransactionCompleted emission. Slot reclamation falls to the 60s
  sweep, a slot leak under shutdown for guaranteed race-freedom.

  B2 — replace tautological retry-loop test. The previous version
  called only `classify_reply` (free fn, no driver, no path to
  advance()) and asserted a counter advance() couldn't move. Split
  into an honest classify pin (operations::put) plus a structural pin
  on drive_retry_loop's Terminal arm (operations::op_ctx) that forbids
  `.advance()` and `continue`.

  M1 — behavioural coverage of the multi-hop bubble. Extracted
  `relay_put_send_error_with_ctx` taking `&mut OpCtx + own_addr` so
  tests drive it directly via `event_loop_notification_channel` without
  a full OpManager. Four behavioural unit tests + two structural pins
  (bound_cause re-bound at the relay reply arm, B1 emission at the
  non-loopback Err branch) + 2-node `test_put_error_notification_multi_hop`
  integration test.

  M2 — `bound_cause_never_splits_utf8_codepoint` now actually exercises
  the walk-back loop. The previous 3-byte codepoint ("好") with budget
  2034 landed exactly on a boundary (`2034 % 3 == 0`) and ran zero
  iterations; replaced with 4-byte "𝄞" (U+1D11E, `2034 % 4 == 2`).
  Sanity-asserts `!is_char_boundary(raw_budget)` so a future cap
  change can't silently re-degrade the test to a no-op.

  M3 — shutdown fallback covered behaviourally (see B3 extraction).
  Four tokio-channel tests pin: exactly-one publish to result_router_tx
  with verbatim cause; notifications channel stays silent (no
  TransactionCompleted from this helper); closed router → log+drop,
  no panic; full router → `try_send`-based, returns within 100ms.

  Minor — off-by-one cap boundary test (`bound_cause_cap_boundary_is_inclusive`);
  bound_cause idempotency invariant ("verbatim-or-rebound, never append")

  PUT lib suite: 47 → 68 tests, all green. clippy + fmt clean.
@Basedfloppa
Copy link
Copy Markdown
Contributor Author

Sorry for the hold up, life got in the way, hope this addresses all blockers and minor request to your satisfaction.

@sanity
Copy link
Copy Markdown
Collaborator

sanity commented May 20, 2026

Multi-model review — fix(put): deliver originator-loopback failure via PutMsg::Error bypass

Full-tier review (PUT protocol / routing / wire format): four-perspective read + Codex (gpt-5.5) external pass, against HEAD 7730b848.

Core design is sound. PutMsg::Error delivered into the originator's pending_op_results waiter via the same bypass as Responsesend_and_await returns Okclassify_replyTerminalErrordrive_retry_loopDone(Err) with no advance(). The retry storm is genuinely stopped for the primary #4111 scenario. New wire variant appended last (tag 5), existing tags byte-stable, pinned by put_msg_wire_format_variant_tags_are_stable.

🔴 Blocking

B1 — drive_relay_put_streaming converts a downstream PutMsg::Error into a false PUT success upstream (crates/core/src/operations/put/op_ctx_task.rs:2078-2089). Independently confirmed by Codex [P2]. This PR makes node.rs:869 forward PutMsg::Error through the bypass, so a streaming relay's downstream waiter can now receive it — but drive_relay_put_streaming's reply match only handles Response/ResponseStreaming; the other => arm calls relay_put_send_response(... key ...), reporting success for a failed PUT. Silent data-correctness regression introduced by this PR. Also run_relay_put_streaming (:1718-1738) has no error-emit path — a streaming relay's local failure just logs and lets the originator hang to OPERATION_TTL, the exact #4111 symptom. Multi-hop large-payload (webapp-wrapper) PUTs are explicitly in #4111's scope and hit this path. Fix: add a PutMsg::Error arm to drive_relay_put_streaming's reply match that bubbles via relay_put_send_error, and add the loopback/non-loopback error emit to run_relay_put_streaming mirroring run_relay_put:773-836.

B2 — the PR's own regression test test_put_error_notification_multi_hop is flaky, failing with the exact #4111 symptom (crates/core/tests/error_notification.rs:292). Observed 1 failure / ~21 runs (fails only under concurrent load): Multi-hop PUT WS error contained the forbidden marker ... put failed after 1 attempts (last error: failed notifying, channel closed). The trace shows a PutMsg::Error was emitted and arrived inbound — yet the client still got the synthesised marker, meaning the multi-hop path still has a race where the pending_op_results callback is gone before the Error lands. Per the flaky-tests rule, a PR's own regression test failing intermittently with the exact bug it claims to fix is a blocker — the multi-hop fix is incomplete. Root-cause it; do not paper over with retries/timeouts.

🟡 Non-blocking

  • N1PUT originator-loopback error path triggers wasteful retries and risks hiding the real error from the client #4111's regression-test spec item 4 explicitly required asserting put_contract called ≤2× via RELAY_PUT_DRIVER_CALL_COUNT. The PR substitutes source-scrape structural pins, which catch obvious regressions but aren't behavioral — the retry-storm elimination has no behavioral test. Recommend adding the call-count assertion.
  • N2 — Pre-fix the M1/M2 race is "winnable", so test_put_error_notification may pass without the fix non-deterministically — it doesn't deterministically fail pre-fix. The call-count assertion (N1) would be deterministic.
  • N3#[non_exhaustive] on PutMsg is a near-no-op (PutMsg is core-internal, not in freenet-stdlib) — harmless, but the description oversells it as wire-format future-proofing.

Nit

  • Double relay_put_send_error attempt possible (drive_relay_put Error arm fails → run_relay_put retries) — benign (send_fire_and_forget enqueues atomically), worth a one-line comment.

Verdict: MERGE-AFTER-FIX

B1 (data-correctness regression) and B2 (incomplete multi-hop fix) must be resolved before merge. Reviews are per-code-content — the full review will be re-run after B1/B2 land. Thanks for the work here @Basedfloppa — the core loopback fix is the right shape; these two are the remaining gaps.

[AI-assisted - Claude]

Review B1 for PR freenet#4126 / issue freenet#4111: the streaming relay PUT driver
silently converted downstream failures into false upstream successes.

Two gaps, both now fixed symmetrically with the non-streaming
`run_relay_put` / `drive_relay_put` pair:

1. `drive_relay_put_streaming`'s reply match had only
   `Response`/`ResponseStreaming` arms plus a catch-all `other =>`
   that called `relay_put_send_response(...)` — so a downstream
   `PutMsg::Error` (now deliverable since node.rs forwards it through
   the bypass) was reported upstream as a STORED success. Added an
   explicit `PutMsg::Error` arm that bubbles the cause upstream via
   `relay_put_send_error`, and the `other =>` arm now returns
   `Err(OpError::UnexpectedOpState)` instead of fabricating a success.

2. `run_relay_put_streaming` discarded `drive_relay_put_streaming`'s
   `Err` after a log line — a streaming relay's local failure
   (orphan-claim, assembly/deserialize, key mismatch, `put_contract`
   rejection) just left the originator's `send_and_await` waiter to
   hang to `OPERATION_TTL`. Added the loopback / non-loopback
   error-emit path mirroring `run_relay_put`: loopback emits
   `PutMsg::Error` via `send_local_loopback` (with the
   `dispatch_loopback_shutdown_fallback` shutdown path), non-loopback
   bubbles via `relay_put_send_error`. The `release_pending_op_slot`
   call is now gated on `!originator_loopback` for the same reason as
   `run_relay_put`.

Adds two structural regression pins:
`drive_relay_put_streaming_bubbles_downstream_error` and
`run_relay_put_streaming_emits_terminal_error_on_local_failure`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanity
Copy link
Copy Markdown
Collaborator

sanity commented May 20, 2026

Update on PR #4126 review blockers — B1 fixed, B2 root-caused

Pushed 24c234bb to the fork branch.

B1 — FIXED (24c234bb)

drive_relay_put_streaming and run_relay_put_streaming now propagate failures symmetrically with the non-streaming pair:

  • drive_relay_put_streaming gained an explicit PutMsg::Error reply arm that bubbles the cause upstream via relay_put_send_error (was: fell through other => to relay_put_send_response — a false PUT success). The other => arm now returns Err(OpError::UnexpectedOpState) instead of fabricating success.
  • run_relay_put_streaming no longer discards drive_relay_put_streaming's Err. It now emits a terminal PutMsg::Error on a local failure — loopback via send_local_loopback (+ dispatch_loopback_shutdown_fallback), non-loopback via relay_put_send_error — mirroring run_relay_put. release_pending_op_slot is gated on !originator_loopback.
  • Two structural regression pins added.

cargo clippy -p freenet -- -D warnings clean, cargo fmt clean, all 71 operations::put unit tests pass, error_notification suite passes.

B2 — ROOT-CAUSED, fix NOT pushed (needs author/maintainer decision)

The flake reproduces in both test_put_error_notification_multi_hop AND single-node test_put_error_notification under concurrent load (~1/12 runs). The symptom is always the immediate put failed after 1 attempts (last error: failed notifying, channel closed) — i.e. the originator's send_and_await response_receiver returned None (its pending_op_results[client_tx] sender was dropped before the terminal reply was forwarded).

Root cause — a #4154#4111 interaction:

  1. On the first PUT attempt, drive_retry_loop uses attempt_tx == client_tx. The originator's send_and_await registers its waiter sender in pending_op_results[client_tx].
  2. The originator-loopback relay (run_relay_putdrive_relay_put, loopback branch) forwards the PUT downstream with ctx.send_fire_and_forget(gateway_addr, ...).
  3. handle_op_execution (p2p_protoc.rs:3858-3887) sees target = Some(gateway_addr). The fire-and-forget callback is closed (skipped insert), but pending_op_results.contains_key(client_tx) is true (it's the originator's waiter) — so it registers client_tx → gateway_addr in live_tx_tracker.add_transaction(...). The inline comment at :3865-3869 documents this as intentional for GET initiated during peer-connection churn stalls silently for 60+s instead of fail-fast / re-route #4154.
  4. Any prune of the gateway_addr connection — including transient-connection churn during the 2-node handshake/promotion dance — calls ring::prune_connectionlive_tx_tracker.prune_transactions_from_peer(gateway_addr)client_tx is returned as orphaned → handle_orphaned_transactionstry_release_pending_op_slot(client_tx) → emits NodeEvent::TransactionCompleted(client_tx) → the event loop does pending_op_results.remove(client_tx)the originator's waiter sender is dropped.
  5. The originator's send_and_await recv() returns NoneOpError::NotificationErrordrive_retry_loop classifies wire_error, advance() exhausts → Exhausted("put failed after 1 attempts (last error: failed notifying, channel closed)").

So handle_orphaned_transactions is exactly the "code that closes pending_op_results[tx] before the driver has consumed its terminal reply" that issue #4111 itself warned about. The #4154 orphan-wake is sound for a relay driver send_to_and_await-ing (on NotificationError it advance()s to another peer), but for the originator-loopback case the pending_op_results[client_tx] entry is the client driver's waiter — and TransactionCompleted is too blunt: it both wakes the driver AND destroys the bypass-delivery channel, converting a pending op into the misleading synthesised error. It also fires on transient prunes, not just genuine disconnects.

Why I did not push a B2 fix: the correct fix touches core event-loop dispatch + the #4154 orphan-wake semantics — a high-risk concurrency surface — and the right shape is a judgement call I'd rather leave to you / a maintainer. Two candidate directions:

  • (a) In handle_op_execution, do not register the live_tx_tracker orphan-wake entry for a closed (fire-and-forget) callback. Cost: a genuine downstream disconnect on a loopback-relayed multi-hop PUT no longer wakes the originator early — but the originator still terminates gracefully via its own attempt_timeout in drive_retry_loop (bounded, not OPERATION_TTL), and with a non-misleading message.
  • (b) Make orphan-wake deliver a proper terminal error through the bypass (e.g. a synthetic PutMsg::Error / op-appropriate terminal reply) instead of TransactionCompleted closing the channel — so drive_retry_loop classifies it as Terminal rather than wire_error.

(a) is the smaller change; (b) is the more principled one and generalises to GET/SUBSCRIBE. Either needs its own regression test for the prune-race (the current test_put_error_notification* pair reproduces it non-deterministically).

[AI-assisted - Claude]

@sanity
Copy link
Copy Markdown
Collaborator

sanity commented May 20, 2026

Disposition — needs author/maintainer to land B2

@Basedfloppa @iduartgomez — summary of where this PR stands after the multi-model review above:

  • B1 (streaming-relay false success) — fixed and pushed to this branch as 24c234bb. drive_relay_put_streaming / run_relay_put_streaming now propagate PutMsg::Error symmetrically with the non-streaming pair.
  • B2 (incomplete multi-hop fix) — root-caused (see the comment above): a #4154 orphan-wake (prune_connectionhandle_orphaned_transactionsTransactionCompleted(client_tx)) destroys the originator's pending_op_results[client_tx] waiter before the terminal reply is delivered, producing the synthesised channel closed error — the exact failure mode PUT originator-loopback error path triggers wasteful retries and risks hiding the real error from the client #4111 set out to fix. It reproduces in both the multi-hop and single-node test_put_error_notification* tests under load.

I deliberately did not push a B2 fix: it touches core event-loop dispatch and the #4154 orphan-wake semantics (a high-risk concurrency surface), and the right shape is a judgement call — two candidate directions are laid out in the comment above. @iduartgomez, your call on which direction fits the #4154 design intent would unblock this.

This PR can't merge until B2 is resolved (its own regression test flakes with the #4111 symptom), so I'm leaving it open for that. The core loopback fix and B1 are in good shape.

[AI-assisted - Claude]

…et#4126 B2)

When handle_op_execution processes a fire-and-forget callback (callback
is closed), the originator's pending_op_results[client_tx] may still
contain the client's waiter. Previously, orphan-wake registration
(live_tx_tracker.add_transaction) was performed regardless of callback
state, so transient connection churn to the downstream peer would
trigger handle_orphaned_transactions → TransactionCompleted(client_tx)
→ pending_op_results.remove(client_tx) — destroying the very slot
through which the bypass-delivered PutMsg::Error needs to arrive.

The originator then receives a synthesised "channel closed" error
instead of the real contract-side failure cause.

Fix: extract callback_closed before the if-else and use
!callback_closed && to guard the orphan-wake registration. When the
callback is closed, skip registration entirely — the originator either
gets the real PutMsg::Error via the bypass path, or times out
gracefully via attempt_timeout.

Fixes: freenet#4126 B2
See: freenet#4111, freenet#4154
@Basedfloppa
Copy link
Copy Markdown
Contributor Author

Given the hold-up in resolving the last blocker, I took the initiative into my own hands. For B2, approach (a) was chosen, as the core idea of this PR was to allow the real issue to surface as a terminal error without infrastructure masking it. I hope my changes will suffice as a fix for this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PUT originator-loopback error path triggers wasteful retries and risks hiding the real error from the client

3 participants