fix(webapp): treat Phase 2 batch-stream retries as idempotent (TRI-9944)#3766
Conversation
When the SDK created a batch and then streamed its items (Phase 2 of the 2-phase batch API), a lost response would trigger the SDK's network-retry path. For small, fast-completing batches the original request had already enqueued every item, sealed the batch, and the runs flipped the batch to PROCESSING or COMPLETED by the time the retry arrived. The retry then failed the pre-loop check at streamBatchItems.server.ts:109 with a 422 — surfacing a customer-visible BatchTriggerError for a batch whose runs had actually succeeded. StreamBatchItemsService.call now returns the standard sealed:true success response (itemsAccepted: 0, itemsDeduplicated: 0, runCount: batch.runCount) when the batch is already sealed or in PROCESSING/COMPLETED, matching the idempotency already applied at the two post-loop race-condition branches in the same file (lines 226 and 306). ABORTED and other unexpected non-PENDING states still throw. Tests: - Rewrote the existing "already sealed" race test from expecting a throw to expecting sealed:true (Phase 2 retry idempotency). - Added a COMPLETED-pre-loop test mirroring the exact customer scenario (single-item batch, status=COMPLETED, sealed=false — tryCompleteBatch sets status without setting sealed). - Added a negative ABORTED test to lock in that terminal-failure states still surface as errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
WalkthroughStreamBatchItemsService.call() now uses an exported isIdempotentRetrySuccess(status, sealed, processingCompletedAt) helper to detect Phase 2 idempotent-success states. When a retry finds a batch already sealed or advanced to PROCESSING, COMPLETED, PARTIAL_FAILED, or has processingCompletedAt set while PENDING, the service returns { sealed: true, itemsAccepted: 0, itemsDeduplicated: 0, runCount } instead of throwing. Fast-path, post-enqueue re-check, and concurrent-seal fallback use this helper; ABORTED still causes a ServiceValidationError. Tests and changelog updated accordingly. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…race Addresses two PR review findings: CodeRabbit: sealed=true + ABORTED would silently succeed under the previous `if (batch.sealed || ...)` check. V2's batch completion callback can set status=ABORTED (failedRunCount > 0 && successfulRunCount === 0) on a batch that streamBatchItems already sealed — leaving sealed=true alongside a terminally-failed batch. A Phase 2 retry of such a batch must surface the error, not mask it. Devin: PARTIAL_FAILED (failedRunCount > 0 with at least one success) is a real V2 completion-callback status, but neither the pre-loop check nor the post-loop race handlers (lines 226 and 306) accepted it as success. A retry whose original stream succeeded would either 422 at the pre-loop or hit "unexpected state" at the post-loop seal- failed branch. Changes: - Pre-loop: replace the broad `sealed || PROCESSING || COMPLETED` check with an `isIdempotentRetrySuccess` boolean that admits PROCESSING, COMPLETED, PARTIAL_FAILED, or (sealed && PENDING) — ABORTED falls through to the throw. - Post-loop count-mismatch (line 226 region): add PARTIAL_FAILED to the success short-circuit alongside sealed and COMPLETED. - Post-loop seal-failed (line 306 region): add PARTIAL_FAILED to the success short-circuit alongside (sealed && PROCESSING) and COMPLETED. Tests (TDD red-then-green): - New: pre-loop sealed=true + ABORTED → throws (CodeRabbit's case). - New: pre-loop PARTIAL_FAILED → returns sealed:true. - New: post-loop seal-failed race with PARTIAL_FAILED → returns sealed:true (uses the same racingPrisma pattern as the existing COMPLETED race test). All 34 tests in streamBatchItems.test.ts pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…branches After settling the operational contract — ABORTED throws because zero TaskRun records exist for the customer to monitor; every other terminal state returns sealed:true because TaskRun records exist (some may be in failed state, but per-run signals reach the customer through run monitoring) — three inconsistencies remained between the pre-loop check and the two post-loop race handlers: 1. Seal-failed branch threw "unexpected state" on sealed=true + PENDING, which is the legitimate post-callback "all runs created" state (V2 batchCompletionCallback resets PROCESSING → PENDING and leaves sealed=true). Pre-loop and count-mismatch both accept this state. 2. Count-mismatch branch admitted sealed=true + ABORTED via the bare `currentBatch?.sealed` clause, returning sealed:true. Pre-loop throws on this state. The count-mismatch outcome would silently hide a batch where zero TaskRuns were created. 3. Count-mismatch branch's fall-through return (sealed:false) implies "retry with missing items", which is wrong for ABORTED — a fresh batch is needed. Extracted the per-status policy into an exported helper: isIdempotentRetrySuccess(status, sealed) returns true for PROCESSING, COMPLETED, PARTIAL_FAILED, or (sealed && PENDING). ABORTED is excluded so the customer's batchTrigger() retry fires. All three branches now call the same helper. The count-mismatch branch additionally throws explicitly on ABORTED before falling through to the sealed:false return. Tests (TDD red-then-green): - New: seal-failed race with sealed=true + PENDING returns sealed:true (was throwing "unexpected state"). Uses racingPrisma to set the exact post-callback shape during the seal updateMany. - New: count-mismatch race with sealed=true + ABORTED throws ServiceValidationError (was returning sealed:false). Uses a call-counter on findFirst to flip the batch state between the pre-loop read and the re-query. All 36 tests in streamBatchItems.test.ts pass; webapp typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erAndWait reports) Another customer hit a related mode of the same Phase 2 stream class of bug: parent batchTriggerAndWait throwing BatchTriggerError despite every child run completing successfully. Pattern: 10 occurrences over 2 days, all 4-item batches, parents fail ~45s after the 200 — exactly five SDK stream-retry attempts exhausting. Trace of the failure mode against the existing code: 1. SDK POST /items sends 4 items. Server enqueues all 4. 2. BatchQueue rushes through them (independent items, fast). All 4 TaskRuns created. 3. batchCompletionCallback fires — sets processingCompletedAt = now(), successfulRunCount = 4, runIds. Status stays PENDING (the callback's "all created" happy path). sealed stays false (callback never touches it). 4. cleanup() runs, deletes the Redis batch metadata. 5. Our service's getBatchEnqueuedCount returns 0. Count-mismatch branch: 0 != 4. 6. Re-query: status=PENDING, sealed=false. Neither (sealed && PENDING) nor any of PROCESSING/COMPLETED/PARTIAL_FAILED matched → fell through to the sealed:false "client should stream more items" return. Server: 200 + sealed:false (matches the customer's "first POST returned 200, 8.1s"). 7. SDK retries the stream. engine.enqueueBatchItem at batch-queue/ index.ts:346 throws `Batch not found or not initialized` because cleanup deleted the metadata. Five retries exhaust → SDK throws BatchTriggerError (~45s after the 200). The discriminator that distinguishes "callback fired, work is done" from "client should stream more items" is processingCompletedAt: it's written exclusively by the V2 batchCompletionCallback (verified by grep across the run-engine and webapp). Nothing else touches it. Extended isIdempotentRetrySuccess to take processingCompletedAt as a third argument: (status === PENDING) && (sealed === true || processingCompletedAt != null) now means "callback fired, every item has a TaskRun, return sealed:true". The same helper is used by all three branches (pre-loop, count-mismatch, seal-failed) so the contract stays uniform. All three findFirst selects add `processingCompletedAt`. ABORTED still excluded everywhere. Test helper createBatch now accepts PARTIAL_FAILED (per CodeRabbit's adjacent nit on the previous commit) and processingCompletedAt. Tests (TDD red-then-green): - New: pre-loop with sealed=false + PENDING + processingCompletedAt set → returns sealed:true. Exercises the path a Phase 2 retry would hit if it arrived after the original count-mismatch returned sealed:false. - New: count-mismatch race with the customer's exact shape (sealed=false + PENDING + processingCompletedAt flipped between pre-loop read and re-query) → returns sealed:true. Uses the findFirst-counter racing pattern to reproduce the production timing. All 38 tests in streamBatchItems.test.ts pass; webapp typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes TRI-9944.
Customers were seeing
BatchTriggerErrorlogs (status 422,Batch ... is not in PENDING status (current: COMPLETED)) even though their runs completed successfully. The SDK was hitting Phase 2 of the 2-phase batch API a second time on a network retry; by the time the retry arrived for a fast-completing single-item batch, the batch had already gonePENDING→PROCESSING→COMPLETED, and the server rejected the retry instead of treating it as a success.StreamBatchItemsService.callnow returns the standardsealed: truesuccess response when the batch is already sealed or has moved pastPENDINGintoPROCESSING/COMPLETED. This mirrors the idempotency already applied at the two post-loop race branches in the same file (lines 226 and 306).ABORTEDand any other unexpected state still throw.Safe because:
sealed: trueto decide success — a no-op success response is exactly what stops the retry loop.Changes
apps/webapp/app/runEngine/services/streamBatchItems.server.ts— replace the pre-loopsealed/non-PENDINGthrows with an idempotent success return forsealed || PROCESSING || COMPLETED.apps/webapp/test/engine/streamBatchItems.test.ts— flip the existing "already sealed" race-condition test from expecting a throw to expectingsealed: true; add a COMPLETED-pre-loop test (the exact customer scenario: single-item batch,status=COMPLETED,sealed=falsesincetryCompleteBatchsets status without sealing); add an ABORTED negative test..server-changes/batch-stream-phase2-retry-idempotency.md.Follow-up commits — scope expanded across all three branches
Review surfaced two adjacent classes of bug in the same file. The original fix only covered the pre-loop check; the same flawed idempotency reasoning also lived in the post-loop count-mismatch and seal-failed handlers. Subsequent commits unified all three branches behind a single helper and added two more "work is done" states the original PR missed:
sealed=true+ABORTEDmust throw, not succeed. The first revision admitted anysealedbatch as success. But the V2batchCompletionCallback(runEngineHandlers.server.ts:946) can setstatus=ABORTED(every run failed) on a batch this endpoint already sealed, leavingsealed=truenext to a terminally-failed batch. Surfacing this as success masks the failure and prevents the customer's own retry from creating a fresh batch. CodeRabbit caught this; commit e8caa1d tightens the check to excludesealed + ABORTEDeverywhere.PARTIAL_FAILEDis also an idempotent-success state. Same V2 callback can setstatus=PARTIAL_FAILEDwhen some run-creation attempts failed but at least one succeeded. The original PR did not list this state, so the pre-loop and the two race branches all rejected it. Devin flagged this; commit e8caa1d adds it to the success list.sealed=false+PENDING+processingCompletedAtset). Reported by a customer on 4-itembatchTriggerAndWait. BatchQueue rushes through every item before the service finishes its loop, the V2 callback fires (writingprocessingCompletedAtand leavingstatus=PENDINGbecause all runs created cleanly),cleanup()deletes the Redis metadata, thengetBatchEnqueuedCountreturns0 ≠ runCount. The count-mismatch branch returnedsealed:falsebecause it couldn't distinguish "callback fired, work is done" from "client should stream more items". The SDK then retried the stream against the cleaned-up batch, the engine threwBatch not found or not initialized, retries exhausted, and the customer sawBatchTriggerErrordespite every child run completing successfully. Commit f28f53f addsprocessingCompletedAtas the discriminator: it is set exclusively by the V2 completion callback (runEngineHandlers.server.ts:968), so(status=PENDING) && (sealed || processingCompletedAt != null)cleanly separates the two cases.Commit 5a25abf extracts the four-line condition into
isIdempotentRetrySuccess(status, sealed, processingCompletedAt)and routes the pre-loop check, the count-mismatch handler, and the seal-failed handler through it, so any future "work is done" state only needs to be added in one place.ABORTEDis explicitly excluded in all three branches: it means zero TaskRun records exist for the customer to monitor (every per-item attempt failed AND the pre-failed-TaskRun fallback also failed), so the trigger call must throw to give theirbatchTrigger()retry the chance to create a fresh batch.Test coverage grew alongside: each of the three branches has positive cases for every admitted state (
PROCESSING,COMPLETED,PARTIAL_FAILED,PENDING+sealed,PENDING+processingCompletedAt) and a negative case that pins down theABORTEDthrow.Test plan
pnpm run test ./test/engine/streamBatchItems.test.ts --run— 31/31 pass, including the 3 new TRI-9944 cases.pnpm run typecheck --filter webapp— passes.🤖 Generated with Claude Code