-
Notifications
You must be signed in to change notification settings - Fork 286
Redrive on transient workflow-server transport failures instead of failing the run #2445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
d1ea0af
5f02af4
0dad802
bd9919f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| --- | ||
| '@workflow/world-vercel': patch | ||
| '@workflow/core': patch | ||
| --- | ||
|
|
||
| Treat transient world-vercel transport failures as retryable, surfacing them as a `TRANSPORT` type `WorkflowWorldError`, to be retried by the queue instead of failing the run. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,7 +23,11 @@ import { | |
| type WorkflowRun, | ||
| type World, | ||
| } from '@workflow/world'; | ||
| import { classifyRunError, isWorldContractError } from './classify-error.js'; | ||
| import { | ||
| classifyRunError, | ||
| isRetryableWorldError, | ||
| isWorldContractError, | ||
| } from './classify-error.js'; | ||
| import { describeError } from './describe-error.js'; | ||
| import { WorkflowSuspension } from './global.js'; | ||
| import { runtimeLogger } from './logger.js'; | ||
|
|
@@ -1354,6 +1358,34 @@ export function workflowEntrypoint( | |
| return; | ||
| } | ||
| } else { | ||
| // Transient infrastructure failures talking to the | ||
| // world (workflow-server) — an exhausted RetryAgent | ||
| // (UND_ERR_REQ_RETRY from a sustained 429/503 storm), | ||
| // a dropped socket, a connect/DNS failure, or a client | ||
| // timeout — must NOT fail the run. Rethrow so the queue | ||
| // redelivers and a fresh invocation retries the replay | ||
| // once the backend recovers. The @vercel/queue handler | ||
| // applies a fast (1s→60s) backoff by delivery count, | ||
| // avoiding the ~5min default visibility-timeout redrive | ||
| // (and never killing the process via run_failed). | ||
| if (isRetryableWorldError(err)) { | ||
| runLogger.warn( | ||
| 'Transient world error during replay; redelivering via queue instead of failing the run', | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. AI Review: Note FYI (not blocking): when a transient failure persists all the way to the delivery cap, the run is failed with
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good call — deferring this one as a follow-up. The |
||
| { | ||
| errorName: | ||
| err instanceof Error | ||
| ? err.name | ||
| : 'UnknownError', | ||
| errorMessage: | ||
| err instanceof Error | ||
| ? err.message | ||
| : String(err), | ||
| deliveryAttempt: metadata.attempt, | ||
| } | ||
| ); | ||
| throw err; | ||
| } | ||
|
|
||
| let terminalError = err; | ||
| if (ReplayDivergenceError.is(err)) { | ||
| const divergenceCount = | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correctness here rests on server-side idempotency of writes — worth stating explicitly.
Redelivering re-runs the replay, which re-issues whatever write was in flight. For a POST that actually succeeded server-side but whose response was lost (e.g.
UND_ERR_SOCKET/ body timeout after the server committed the event), the redrive re-attempts that write. This is only safe because event creation is idempotent (seeevents.ts"is idempotent and safe for the caller to retry").The whole fix depends on that holding for every write reachable from the replay path. A sentence in the PR description confirming idempotency coverage (and any non-idempotent endpoints, if any) would make the safety argument explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, and confirmed: every write reachable from the replay path is idempotent — event creation dedupes server-side by correlation/idempotency key (the
events.ts"idempotent and safe for the caller to retry" note), which is exactly what makes redrive safe for a write whose response was lost (e.g.UND_ERR_SOCKET/ body timeout after the server committed). I've added a sentence to the PR description making this explicit.