[core] Turbo mode: fast-path the first invocation #2526
🦋 Changeset detected. Latest commit: 749c1f5. The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages.
c46f8a7 to 2962fd6
🧪 E2E Test Results
✅ All tests passed
Details by Category:
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other
…urbo mode the first delivery synthesizes `startedAt` from the runtime-local clock while later non-turbo deliveries load the server-canonical `startedAt`, so replay regenerates different correlation IDs and throws `ReplayDivergenceError`.
This commit fixes the issue reported at packages/core/src/runtime.ts:763
## The bug
In `packages/core/src/workflow.ts` (`runWorkflow`), the workflow orchestrator context exposed:
```ts
generateUlid: () => ulid(+startedAt),
```
where `startedAt = workflowRun.startedAt`. Every durable correlation ID is derived from this:
- `step.ts:25` → `step_${ctx.generateUlid()}`
- `workflow/hook.ts:73` → `hook_${ctx.generateUlid()}`
- `workflow/sleep.ts:18` → `wait_${ctx.generateUlid()}`
- `workflow/attribute-dispatcher.ts:20` → `attr_${ctx.generateUlid()}`
The 48-bit time prefix of every correlation ID therefore equals `+startedAt`. For replay to succeed, the value fed to `ulid()` **must be identical on every delivery** — otherwise `EventsConsumer.onUnconsumedEvent` fires and rejects with `ReplayDivergenceError`.
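To make the time-prefix dependency concrete, here is a minimal sketch of how an epoch-ms value becomes the 10-character Crockford base32 prefix of a ULID. `encodeUlidTime` is a hypothetical helper written for illustration, not the `ulid` package's API, and the two timestamps are made-up values standing in for the two clocks.

```typescript
// Crockford base32 alphabet used by the ULID format (no I, L, O, U).
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Hypothetical helper: encode an epoch-ms timestamp as the
// 10-character time prefix of a ULID (most significant digit first).
function encodeUlidTime(epochMs: number): string {
  let remaining = epochMs;
  let out = "";
  for (let i = 0; i < 10; i++) {
    const digit = remaining % 32;
    out = CROCKFORD.charAt(digit) + out;
    remaining = (remaining - digit) / 32;
  }
  return out;
}

// Assumed values: two clocks that disagree by a single millisecond.
const localNow = Date.UTC(2024, 0, 1);
const serverStartedAt = localNow + 1;

const firstDeliveryPrefix = encodeUlidTime(localNow); // "01HK153X00"
const laterDeliveryPrefix = encodeUlidTime(serverStartedAt); // "01HK153X01"

// A 1 ms disagreement is enough to change the prefix, so any ID built
// from it no longer matches the one persisted on the first delivery.
console.log(firstDeliveryPrefix === laterDeliveryPrefix); // false
```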
## Why turbo breaks it
`startedAt` is **not** replay-stable under turbo:
- **Turbo first delivery** (`runtime.ts` ~L753): the run is synthesized locally with `startedAt: now`, where `now = new Date()` is the runtime-local clock. The first delivery's `generateUlid` thus encodes the local `now`, and any `step_started`/`wait`/`hook_created` events persisted in this delivery carry correlation IDs encoding that local `now`.
- **Backend persistence**: the backgrounded `run_started` write records the storage layer's own clock as the canonical `startedAt` (`world-local events-storage` uses `currentRun.startedAt ?? now`), which differs from the runtime's local `now`.
- **Next (non-turbo) delivery**: the run is loaded from the backend with the server-canonical `startedAt`. `generateUlid()` now produces ULIDs with a different time prefix, so the regenerated correlation IDs no longer match the persisted ones → `ReplayDivergenceError`.
The divergence only requires a ≥1 ms difference between the two ms-resolution clocks, so it is intermittent but real — and turbo is on by default.
This was already a known hazard: the RNG `seed` and the VM clock `fixedTimestamp` were *deliberately* decoupled from `startedAt`/`createdAt` (see the comment "Dropping the timestamp means the seed no longer depends on startedAt/createdAt, so it ... can be computed before any server round-trip"). `generateUlid` was simply missed in that refactor.
## The fix
Feed `generateUlid` the same replay-stable value already used for the seed and VM clock:
```ts
generateUlid: () => ulid(fixedTimestamp),
```
where `fixedTimestamp = runIdCreatedAt(workflowRun.runId) ?? +workflowRun.createdAt`. Production run IDs are always `wrun_<ulid>` (minted client-side in `start()`), so `runIdCreatedAt` recovers the same epoch-ms value the instant the queue message arrives — identical on turbo and non-turbo deliveries alike. Correlation IDs become replay-stable in all delivery paths.
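As a rough illustration of why the run-ID-derived value is delivery-independent, the sketch below decodes the time prefix of a `wrun_<ulid>` ID and applies the `?? +createdAt` fallback. The body of `runIdCreatedAt` here is an assumed reimplementation (the real helper may validate differently), and `RunLike` and `fixedTimestampFor` are hypothetical names.

```typescript
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";
const ULID_RE = /^[0-9A-HJKMNP-TV-Z]{26}$/;

// Assumed reimplementation: returns the epoch-ms encoded in a
// `wrun_<ulid>` run ID, or undefined when the suffix is not a ULID.
function runIdCreatedAt(runId: string): number | undefined {
  if (!runId.startsWith("wrun_")) return undefined;
  const body = runId.slice("wrun_".length);
  if (!ULID_RE.test(body)) return undefined;
  let ms = 0;
  for (let i = 0; i < 10; i++) {
    ms = ms * 32 + CROCKFORD.indexOf(body.charAt(i));
  }
  return ms;
}

// Hypothetical shape: only the fields this sketch needs.
interface RunLike {
  runId: string;
  createdAt: Date;
  startedAt: Date;
}

function fixedTimestampFor(run: RunLike): number {
  return runIdCreatedAt(run.runId) ?? +run.createdAt;
}

const runId = "wrun_01HK153X00" + "0".repeat(16);
const createdAt = new Date(Date.UTC(2024, 0, 1));

// Turbo first delivery: startedAt synthesized from the local clock.
const turboRun: RunLike = { runId, createdAt, startedAt: new Date(+createdAt + 40) };
// Later delivery: startedAt loaded from the backend.
const loadedRun: RunLike = { runId, createdAt, startedAt: new Date(+createdAt + 65) };

console.log(fixedTimestampFor(turboRun) === fixedTimestampFor(loadedRun)); // true
console.log(runIdCreatedAt("wrun_test")); // undefined, so the createdAt fallback applies
```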
`workflowStartedAt` (line 296, a user-facing `Date` exposed to workflow code) intentionally keeps using `startedAt` — it is not a correlation ID and is not part of replay matching.
## Test compatibility
The two integration tests that compute expected correlation IDs through the real `runWorkflow` path use non-ULID run IDs (`wrun_stale_wait_replay`, `wrun_test`). For those, `runIdCreatedAt` returns `undefined` and `fixedTimestamp` falls back to `+createdAt`, which equals `+startedAt` in those fixtures — so `ulid(fixedTimestamp)` yields the same IDs as before and the assertions still hold. The unit-test fixtures that hand-build their own `generateUlid: () => ulid(workflowStartedAt)` do not go through `runWorkflow` and are unaffected.
Co-authored-by: Vercel <vercel[bot]@users.noreply.github.com>
Co-authored-by: VaguelySerious <mittgfu@gmail.com>
…ch step_completed
Turbo overlaps start round-trips with step bodies but still awaits each step_completed before advancing. Documents the considered "run-ahead" extension (defer step writes to a background queue, run sequential steps ahead) and why it was not pursued: crash re-execution blast radius, and divergent branches when a step runs against a non-durable result a redelivery can re-decide.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… IDs
The generateUlid fix (correlation IDs keyed on the replay-stable fixedTimestamp = runIdCreatedAt(runId) instead of startedAt) changes the ULID time prefix for fixtures whose ULID runId encodes a different time than their startedAt. This race-replay fixture used a 2025 ULID runId with a 2024 startedAt, so its step_ correlation ID prefixes move from the startedAt-derived 01HK153X00 to the runId-derived 01K75533W5 (suffixes, seed-derived, are unchanged). Realigns the fixture; no behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tered-step barrier, docs)
- forceOptimisticStart now defers to an explicit WORKFLOW_OPTIMISTIC_INLINE_START=0:
turbo still forces optimistic start when the flag is unset, but an operator's
explicit opt-out (the "body runs before start is confirmed" property) wins.
Adds isOptimisticInlineStartExplicitlyDisabled().
- Gate the unregistered-step ("step not found") lazy step_started on
runReadyBarrier so it never precedes the backgrounded run_started under turbo.
- Document that the forced-optimistic first step body's stream/ops writes run
before run_started (stream-safety caveat + WORKFLOW_TURBO=0), and that a run
cancelled/expired before its first delivery still runs the first step body
(reconciled away) under turbo.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
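The precedence between turbo and an explicit opt-out might look roughly like the sketch below. Only the flag names and `isOptimisticInlineStartExplicitlyDisabled` come from the commit message; the `Env` parameter, the `isExplicitOptOut` and `isTurboEnabled` helpers, and the treatment of `false` as an opt-out for `WORKFLOW_OPTIMISTIC_INLINE_START` are assumptions made so the example is self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: "0" and "false" both count as an explicit opt-out;
// an unset variable is never an opt-out.
function isExplicitOptOut(value: string | undefined): boolean {
  if (value === undefined) return false;
  const normalized = value.trim().toLowerCase();
  return normalized === "0" || normalized === "false";
}

// Hypothetical signature: takes the environment as an argument here.
function isOptimisticInlineStartExplicitlyDisabled(env: Env): boolean {
  return isExplicitOptOut(env["WORKFLOW_OPTIMISTIC_INLINE_START"]);
}

// Hypothetical helper: turbo is on unless explicitly disabled.
function isTurboEnabled(env: Env): boolean {
  return !isExplicitOptOut(env["WORKFLOW_TURBO"]);
}

// Turbo forces optimistic start when the flag is unset,
// but an operator's explicit opt-out wins.
function forceOptimisticStart(env: Env, isFirstDelivery: boolean): boolean {
  return (
    isFirstDelivery &&
    isTurboEnabled(env) &&
    !isOptimisticInlineStartExplicitlyDisabled(env)
  );
}

console.log(forceOptimisticStart({}, true)); // true
console.log(forceOptimisticStart({ WORKFLOW_OPTIMISTIC_INLINE_START: "0" }, true)); // false
console.log(forceOptimisticStart({ WORKFLOW_TURBO: "0" }, true)); // false
```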
The two turbo tests that wait for the inline step body to run before releasing the run-ready barrier used vi.waitFor's 1s default, which the full VM replay can exceed on cold Windows CI (intermittent "expected [] to include 'body'"). Bump to 15s, matching the existing queue_dispatch_start waitFor in the same suite.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dge)
Turbo's immediate re-invoke exits returned `{ timeoutSeconds: 0 }`, which makes
the queue reschedule the CURRENT delivery's message. That message carries
`runInput`, and on async queues (graphile-worker / world-vercel) a reschedule
comes back as delivery attempt 1 — so turbo re-engaged, skipped the event-log
load again, replayed against an empty log, never observed the hook/attr event it
had just written, re-suspended, and rescheduled forever. The run wedged (every
hook + experimental_setAttributes e2e test timed out on world-postgres and
world-vercel; world-local's reschedule increments the attempt, so it was unaffected).
Turbo now re-invokes via an explicit continuation that carries NO `runInput`
(`reinvoke()`), so the next delivery is a normal non-turbo load-and-replay that
observes the committed events and makes progress. Applies to the hasHookConflict,
hasAttributeEvents, hasAwaitedHookCreation, and throttle re-invoke exits.
Verified against world-postgres: hook.getConflict() + experimental_setAttributes
workflows that previously wedged now complete.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
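The wedge and its fix can be modelled with a small sketch. It assumes turbo engages whenever a delivery carries `runInput` and is attempt 1 (the commit message implies this but does not spell out the exact check); the `Delivery` shape and the two reschedule helpers are hypothetical stand-ins for the queue behaviours described above.

```typescript
interface QueueMessage {
  runId: string;
  runInput?: { input: unknown };
}

interface Delivery {
  message: QueueMessage;
  attempt: number;
}

// Assumed detection rule: runInput present and first attempt.
function isTurboDelivery(delivery: Delivery): boolean {
  return delivery.message.runInput !== undefined && delivery.attempt === 1;
}

// Async queues: rescheduling the current message comes back as
// attempt 1 with the same payload.
function rescheduleOnAsyncQueue(delivery: Delivery): Delivery {
  return { message: delivery.message, attempt: 1 };
}

// world-local: the reschedule increments the attempt.
function rescheduleOnLocalQueue(delivery: Delivery): Delivery {
  return { message: delivery.message, attempt: delivery.attempt + 1 };
}

// The fix: an explicit continuation that carries no runInput.
function reinvoke(delivery: Delivery): Delivery {
  return { message: { runId: delivery.message.runId }, attempt: 1 };
}

const first: Delivery = {
  message: { runId: "wrun_example", runInput: { input: [] } },
  attempt: 1,
};

console.log(isTurboDelivery(rescheduleOnAsyncQueue(first))); // true: turbo re-engages, run wedges
console.log(isTurboDelivery(rescheduleOnLocalQueue(first))); // false: unaffected
console.log(isTurboDelivery(reinvoke(first))); // false: normal load-and-replay
```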
executionContext: runInput.executionContext,
input: runInput.input,
attributes: runInput.attributes ?? {},
startedAt: now,
Follow-up to the (now-resolved) correlation-ID divergence: the replay-matching fix is solid. `seed`/`fixedTimestamp`/`generateUlid` are all runId-derived now, so step/wait/hook IDs are stable across deliveries. ✅
But one residual remains from this synthesized `startedAt: now`. It still flows into the orchestrator-visible `getWorkflowMetadata().workflowStartedAt` (`workflow.ts:296`, `new Date(+startedAt)` → `WORKFLOW_CONTEXT_SYMBOL`), and `getWorkflowMetadata()` is replayed user code. On the turbo first delivery that value is the local clock; on any later (non-turbo) delivery it's the server-canonical `startedAt`. So a workflow that branches on it (e.g. `if (Date.now() - +meta.workflowStartedAt > THRESHOLD) …`, where `Date.now()` is now `fixedTimestamp`-stable but `workflowStartedAt` is not) can take a different path on resume → `ReplayDivergenceError`.
Much narrower than the original bug (only workflows reading `workflowStartedAt` in replayed control flow), but it means the "replay is fully decoupled from startedAt" framing isn't quite complete: the user-facing value is still delivery-dependent. Worth either deriving the synthesized `workflowStartedAt` from the same replay-stable `runIdCreatedAt(runId)` value, or adding a one-line caveat that `workflowStartedAt` may differ by the start→first-delivery latency on the first invocation and shouldn't drive replayed branching.
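A minimal sketch of the residual hazard, using made-up numbers: the replayed code sees the same stable clock on both deliveries but a different `workflowStartedAt`, so a threshold check flips. `THRESHOLD_MS`, `takesSlowPath`, and all three timestamps are hypothetical and chosen only to show the branch diverging.

```typescript
// Hypothetical threshold used by replayed workflow code.
const THRESHOLD_MS = 50;

// Stand-in for a branch in user workflow code that reads
// getWorkflowMetadata().workflowStartedAt.
function takesSlowPath(vmNowMs: number, workflowStartedAt: Date): boolean {
  return vmNowMs - +workflowStartedAt > THRESHOLD_MS;
}

// Replay-stable clock value seen by the workflow on every delivery.
const vmNow = 1_704_067_200_120;

// Turbo first delivery: startedAt synthesized from the runtime-local clock.
const localStartedAt = new Date(1_704_067_200_060);
// Later non-turbo delivery: server-canonical startedAt loaded from the backend.
const serverStartedAt = new Date(1_704_067_200_085);

const firstDelivery = takesSlowPath(vmNow, localStartedAt); // 60 > 50 → true
const laterDelivery = takesSlowPath(vmNow, serverStartedAt); // 35 > 50 → false

// The two deliveries disagree, so replay would follow a different path.
console.log(firstDelivery !== laterDelivery); // true
```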
Pushed 0cdbdfdb8 to address this so it is not left silently open:
- Doc caveat in the turbo-mode changelog (`### workflowStartedAt reflects the first delivery's clock`): explicitly says to treat it as an approximate human-facing timestamp and not branch replayed control flow on it.
- Regression test in `workflow.test.ts` proving step correlation IDs are regenerated from the run-ID-derived `fixedTimestamp` (not `startedAt`) and stay stable across deliveries. Verified it fails (`ReplayDivergenceError`) when `generateUlid` is reverted to `ulid(+startedAt)`.

Left the deeper code fix (deriving the orchestrator-visible `workflowStartedAt` from `fixedTimestamp` so the value itself is replay-stable rather than just documented) as your call, since it changes the public `getWorkflowMetadata().workflowStartedAt` semantics for non-turbo runs too (it would become run-creation time instead of `run_started` time).
Add a workflow.test.ts regression that replays a recorded step under a startedAt that diverges from createdAt, proving step correlation IDs are regenerated from the run-ID-derived fixedTimestamp (not startedAt) and so stay stable across deliveries. Reverting generateUlid to ulid(+startedAt) fails this test.
Document in the turbo-mode changelog that getWorkflowMetadata().workflowStartedAt reflects the first delivery's clock under turbo (local on the first delivery, server-canonical on later ones) and must not drive replayed control flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
pranaygp left a comment
Approving. Reviewed thoroughly across correctness, testing, and compatibility.
Correctness — Traced the full turbo path (runtime.ts, step-executor.ts, suspension-handler.ts, constants.ts). The critical replay-determinism bug from the first review is fixed (generateUlid now keyed off the run-ID-derived fixedTimestamp, not startedAt). The new reinvoke async-queue-wedge fix is comprehensive — every turbo-reachable reschedule routes through it or queueMessage without runInput; background-step returns can't be turbo. Earlier findings #4 (explicit WORKFLOW_OPTIMISTIC_INLINE_START=0 now honored) and #5 (unregistered-step step_started now barrier-gated) are fixed; #2 (step-body stream writes) and #3 (cancelled-before-delivery) are documented tradeoffs with WORKFLOW_TURBO=0 as the kill switch.
Testing — Unit suites pass; the determinism fix is now guarded by a regression test (verified it fails with ReplayDivergenceError when generateUlid is reverted). Parent commit ddcb5aabe (identical runtime code) was fully green across all 13 E2E Vercel Prod apps + all 3 Benchmark Vercel apps.
Compatibility — Turbo is purely client-side, needs no world/backend change, is detected from existing signals (no new message field), and is opt-out. The correlation-ID derivation change completes the already-merged #2525 decoupling; replay-stable per-run going forward, with only the standard #2525-class in-flight deploy-boundary disruption (beta-accepted, recovers cleanly). Changeset minor is correct.
Two non-blocking notes:
- The current red checks (2× Benchmark Vercel, express E2E Vercel Prod) are the known Vercel-prod flakes — they rotate run-to-run and the identical-runtime parent commit was fully green. Please re-run those jobs for a clean merge gate.
- The `workflowStartedAt` residual (discussion above) is documented + tested; deriving it from `fixedTimestamp` to make the public value itself replay-stable is an optional follow-up at your discretion.
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Adds turbo mode (on by default) to fast-path the very first delivery of a run's first invocation, where time-to-first-step matters most. Stacked on #2516.

On that first delivery the runtime:
- `run_started`: synthesizes the run entity locally from the queued run input so replay begins immediately; the round-trip overlaps replay (reuses the resilient-start create-on-the-fly contract).
- `WORKFLOW_OPTIMISTIC_INLINE_START`: the step body runs immediately; only the `step_started` write waits on the backgrounded `run_started`.

Net effect: the first step body starts after just the in-process replay, with `run_started`/`step_started` happening around it and no `events.list` before it.

## Why it's safe
- `attempt === 1` is the first delivery (plus: not a background-step or recovery replay). No new message field, no world/backend change.
- `step_started` is chained on it (body still runs immediately), the suspension handler awaits it before any eager write, and terminal run writes await it too. The log stays `run_created → run_started → step_created → step_started → step_completed`.
- … `WORKFLOW_TURBO=0`).

## Config
On by default; `WORKFLOW_TURBO=0`/`false` disables it (kill-switch for non-idempotent/stream-unsafe first-step bodies).

## Docs
New changelog page `docs/content/docs/v5/changelog/turbo-mode.md` (+ meta). Preview links to follow once the docs deployment is up.
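The ordering claim in the description (body runs immediately, `step_started` waits on the backgrounded `run_started`) can be sketched with plain promises. This is a toy model, not the runtime's code: `simulateTurboFirstDelivery`, the `timeline` array, and the 5 ms timer standing in for the `run_started` round-trip are assumptions, as is writing `step_created` behind the same barrier.

```typescript
// Toy model of the turbo first delivery. The timeline records both the
// durable writes and the moment the step body runs ("body").
async function simulateTurboFirstDelivery(): Promise<string[]> {
  const timeline: string[] = ["run_created"];
  const write = async (event: string): Promise<void> => {
    timeline.push(event);
  };

  // Backgrounded run_started write: resolves after a simulated round-trip.
  const runReadyBarrier = new Promise<void>((resolve) => {
    setTimeout(() => {
      timeline.push("run_started");
      resolve();
    }, 5);
  });

  // The step body starts right away; it does not wait on the barrier.
  const body = (async () => {
    timeline.push("body");
    return "result";
  })();

  // Assumption: step_created and step_started are chained on the barrier,
  // so neither write can precede run_started in the log.
  const stepStarted = runReadyBarrier
    .then(() => write("step_created"))
    .then(() => write("step_started"));

  await body;
  await stepStarted;
  await write("step_completed");
  return timeline;
}

simulateTurboFirstDelivery().then((timeline) => {
  // run_created, body, run_started, step_created, step_started, step_completed
  console.log(timeline.join(" → "));
});
```

Dropping the `body` entry from the printed timeline leaves the durable log in the order stated in the description, even though the body ran before `run_started` landed.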