[debug] Validation run: combine #2113 SDK + #447 server e6722b2 by TooTallNate · Pull Request #2146 · vercel/workflow

TooTallNate · 2026-05-28T21:51:11Z

NOT FOR MERGE. Draft PR opened only to trigger the tarballs-checks workflow so we get a tarball URL to pin the repro app to.

Purpose

End-to-end validation that the combined fix closes the production-visible defect. Pairs:

@workflow/core from this branch (= top of [core] Optimistic concurrency control for branch-decision event writes #2113 + the existing SDK fixes from e5cd686)
workflow-server preview deployment for vercel/workflow-server#447 at commit e6722b2 (https://workflow-server-git-peter-event-write-cas.vercel.sh)

What this branch adds on top of #2113

WORKFLOW_SERVER_URL_OVERRIDE pinned to the vercel/workflow-server#447 preview URL
x-vercel-protection-bypass header forwarded from WORKFLOW_VERCEL_PROTECTION_BYPASS env var (so the repro app can hit the preview through Vercel Deployment Protection)

Both changes are gated to this branch only and will not be cherry-picked into either real PR.
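
A minimal sketch of how the two branch-only changes could fit together. The helper name `resolveServerRequest`, its signature, and the placeholder URLs are assumptions for illustration; only the two env var names and the header name come from this PR.

```typescript
// Hypothetical helper: not the actual @workflow/core code.
// Only WORKFLOW_SERVER_URL_OVERRIDE, WORKFLOW_VERCEL_PROTECTION_BYPASS and
// the x-vercel-protection-bypass header name are taken from the PR text.
type Env = Record<string, string | undefined>;

interface ServerRequestConfig {
  baseUrl: string;
  headers: Record<string, string>;
}

function resolveServerRequest(env: Env, defaultUrl: string): ServerRequestConfig {
  // Prefer the override so the repro app talks to the pinned preview deployment.
  const baseUrl = env.WORKFLOW_SERVER_URL_OVERRIDE || defaultUrl;

  const headers: Record<string, string> = {};
  const bypass = env.WORKFLOW_VERCEL_PROTECTION_BYPASS;
  if (bypass) {
    // Forward only the bare bypass header. The branch's commit message says
    // x-vercel-set-bypass-cookie is deliberately not sent.
    headers["x-vercel-protection-bypass"] = bypass;
  }
  return { baseUrl, headers };
}
```

With neither variable set, the sketch falls back to the default URL and sends no extra header, which matches the intent that the changes stay inert outside this branch.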

Validation plan

Run the standard stress repro shape against the pinned tarball + preview pair (40 cycles × 200 workflows). Classify outcomes across:

completed
still running at final check
failed: CORRUPTED_EVENT_LOG
failed: USER_ERROR
failed: WORLD_CONTRACT_ERROR
failed: other

Last run before the #447 server fix: ~2/40 cycles surfaced CORRUPTED_EVENT_LOG on stable. With this PR's predecessor against an earlier #447 preview: 0/40, but with 132 stuck-running + 23 USER_ERROR + 4 WORLD_CONTRACT_ERROR left uncategorized.

Goal of this run: confirm not just that CORRUPTED_EVENT_LOG = 0 but also that stuck/USER_ERROR/WORLD_CONTRACT_ERROR are clean, since those would be symptoms of the materialization-before-fence orphan scenarios Peter walked through.
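
The classification step can be sketched roughly as follows. The `RunRecord` shape and the function names are assumptions, since the repro app's real data model is not shown here; only the six bucket labels and the pass criterion come from the plan above.

```typescript
// The six buckets listed in the validation plan.
type Outcome =
  | "completed"
  | "still running at final check"
  | "failed: CORRUPTED_EVENT_LOG"
  | "failed: USER_ERROR"
  | "failed: WORLD_CONTRACT_ERROR"
  | "failed: other";

// Hypothetical record shape for one workflow run at the final check.
interface RunRecord {
  status: "completed" | "running" | "failed";
  errorCode?: string;
}

function classify(run: RunRecord): Outcome {
  if (run.status === "completed") return "completed";
  if (run.status === "running") return "still running at final check";
  switch (run.errorCode) {
    case "CORRUPTED_EVENT_LOG":
      return "failed: CORRUPTED_EVENT_LOG";
    case "USER_ERROR":
      return "failed: USER_ERROR";
    case "WORLD_CONTRACT_ERROR":
      return "failed: WORLD_CONTRACT_ERROR";
    default:
      return "failed: other";
  }
}

function tally(runs: RunRecord[]): Record<Outcome, number> {
  const counts: Record<Outcome, number> = {
    "completed": 0,
    "still running at final check": 0,
    "failed: CORRUPTED_EVENT_LOG": 0,
    "failed: USER_ERROR": 0,
    "failed: WORLD_CONTRACT_ERROR": 0,
    "failed: other": 0,
  };
  for (const run of runs) counts[classify(run)] += 1;
  return counts;
}

// Pass criterion from the stated goal: no corruption, no stuck runs, and no
// USER_ERROR / WORLD_CONTRACT_ERROR. "failed: other" is reported, not gated
// (that choice is an assumption of this sketch).
function isClean(counts: Record<Outcome, number>): boolean {
  return (
    counts["failed: CORRUPTED_EVENT_LOG"] === 0 &&
    counts["still running at final check"] === 0 &&
    counts["failed: USER_ERROR"] === 0 &&
    counts["failed: WORLD_CONTRACT_ERROR"] === 0
  );
}
```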

changeset-bot · 2026-05-28T21:51:16Z

⚠️ No Changeset found

Latest commit: 7613014

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


vercel · 2026-05-28T21:51:18Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
example-nextjs-workflow-turbopack	Ready	Preview, Comment	May 29, 2026 7:53am
example-nextjs-workflow-webpack	Ready	Preview, Comment	May 29, 2026 7:53am
example-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-astro-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-express-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-fastify-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-hono-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-nitro-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-nuxt-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-sveltekit-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-tanstack-start-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workbench-vite-workflow	Ready	Preview, Comment	May 29, 2026 7:53am
workflow-docs	Ready	Preview, Comment, Open in v0	May 29, 2026 7:53am
workflow-swc-playground	Ready	Preview, Comment	May 29, 2026 7:53am
workflow-tarballs	Ready	Preview, Comment	May 29, 2026 7:53am
workflow-web	Ready	Preview, Comment	May 29, 2026 7:53am

github-actions · 2026-05-28T21:51:21Z

🧪 E2E Test Results

❌ Some tests failed

Summary

	Passed	Failed	Skipped	Total
❌ ▲ Vercel Production	1190	32	219	1441
✅ 💻 Local Development	1615	0	219	1834
✅ 📦 Local Production	1615	0	219	1834
✅ 🐘 Local Postgres	1615	0	219	1834
✅ 🪟 Windows	131	0	0	131
❌ 📋 Other	739	2	176	917
Total	6905	34	1052	7991

❌ Failed Tests

▲ Vercel Production (32 failed)

astro (1 failed):

AbortController abortExternalSignalWorkflow: signal passed as workflow input

example (1 failed):

fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability

express (3 failed):

outputStreamWorkflow positive startIndex (skips first chunk)
AbortController abortReasonWorkflow: abort reason preserved across boundaries
AbortController abortVoidSleepTimeoutWorkflow: documented void sleep().then(abort) pattern works

fastify (2 failed):

sleepInLoopWorkflow - sleep inside loop with steps actually delays each iteration | wrun_01KSSC42D9CV29AQTYQ5VEJM1B | 🔍 observability
AbortController abortSurvivesReplayWorkflow: controller state consistent across replay

hono (3 failed):

parallelSleepWorkflow | wrun_01KSSBMQR32XYATD4XZD0KQF1A | 🔍 observability
closureVariableWorkflow - nested step functions with closure variables | wrun_01KSSBYPFQB35J48R1VYGS7Y9E | 🔍 observability
spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step | wrun_01KSSBYRZW4NC1XCQVYX1BGMES | 🔍 observability

nextjs-turbopack (4 failed):

DurableAgent e2e experimental_onStepStart (GAP) completes but callbacks are not called (GAP)
fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability
Calculator.calculate - static workflow method using static step methods from another class | wrun_01KSSC05YKKWQY506Y50YTFD9R | 🔍 observability
errorSubclassRoundTripWorkflow - first-class Error subclasses survive every serialization boundary | wrun_01KSSC1XG3F71GH2WHPDG25YN8 | 🔍 observability

nextjs-webpack (1 failed):

AbortController abortFromStepWorkflow: step abort cancels an in-flight sibling step

nitro (1 failed):

fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability

nuxt (7 failed):

fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability
health check (queue-based) - workflow endpoint responds to health check messages
health check (CLI) - workflow health command reports healthy endpoints
pathsAliasWorkflow - TypeScript path aliases resolve correctly | wrun_01KSSBZYTZQ54CV3EQMZKKVRNT | 🔍 observability
AbortController abortAfterCompletionWorkflow: abort after step completes is a no-op
AbortController abortExternalSignalWorkflow: signal passed as workflow input
AbortController abortAnyInWorkflowWorkflow: AbortSignal.any composes signals inside the workflow VM

sveltekit (4 failed):

DurableAgent e2e core single tool call
DurableAgent e2e experimental_onToolCallStart (GAP) completes but callbacks are not called (GAP)
closureVariableWorkflow - nested step functions with closure variables | wrun_01KSSBYPFQB35J48R1VYGS7Y9E | 🔍 observability
fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability

vite (5 failed):

readableStreamWorkflow | wrun_01KSSBG7SCK2VJYZ4HGC1Q6H1K | 🔍 observability
hookWorkflow is not resumable via public webhook endpoint | wrun_01KSSBH63H4Y656X71066ZKSFY | 🔍 observability
runClassSerializationWorkflow - Run instances serialize across workflow/step boundaries | wrun_01KSSBXVA9MZPQCBQ8CTKQY50Z | 🔍 observability
sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KSSC4D3MHRF84CKJDYMSRCZQ | 🔍 observability
AbortController abortTimeoutWorkflow: timeout cancels long-running step

📋 Other (2 failed)

e2e-vercel-prod-tanstack-start (2 failed):

stepWinsRaceWorkflow | wrun_01KSSBMZWHM1PN5WRWBSRGR530
fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC

Details by Category

❌ ▲ Vercel Production

App	Passed	Failed	Skipped
❌ astro	104	1	26
❌ example	104	1	26
❌ express	102	3	26
❌ fastify	103	2	26
❌ hono	102	3	26
❌ nextjs-turbopack	125	4	2
❌ nextjs-webpack	128	1	2
❌ nitro	104	1	26
❌ nuxt	98	7	26
❌ sveltekit	120	4	7
❌ vite	100	5	26

✅ 💻 Local Development

App	Passed	Skipped
✅ astro-stable	106	25
✅ express-stable	106	25
✅ fastify-stable	106	25
✅ hono-stable	106	25
✅ nextjs-turbopack-canary	112	19
✅ nextjs-turbopack-stable-lazy-discovery-disabled	131	0
✅ nextjs-turbopack-stable-lazy-discovery-enabled	131	0
✅ nextjs-webpack-canary	112	19
✅ nextjs-webpack-stable-lazy-discovery-disabled	131	0
✅ nextjs-webpack-stable-lazy-discovery-enabled	131	0
✅ nitro-stable	106	25
✅ nuxt-stable	106	25
✅ sveltekit-stable	125	6
✅ vite-stable	106	25

✅ 📦 Local Production

App	Passed	Skipped
✅ astro-stable	106	25
✅ express-stable	106	25
✅ fastify-stable	106	25
✅ hono-stable	106	25
✅ nextjs-turbopack-canary	112	19
✅ nextjs-turbopack-stable-lazy-discovery-disabled	131	0
✅ nextjs-turbopack-stable-lazy-discovery-enabled	131	0
✅ nextjs-webpack-canary	112	19
✅ nextjs-webpack-stable-lazy-discovery-disabled	131	0
✅ nextjs-webpack-stable-lazy-discovery-enabled	131	0
✅ nitro-stable	106	25
✅ nuxt-stable	106	25
✅ sveltekit-stable	125	6
✅ vite-stable	106	25

✅ 🐘 Local Postgres

App	Passed	Skipped
✅ astro-stable	106	25
✅ express-stable	106	25
✅ fastify-stable	106	25
✅ hono-stable	106	25
✅ nextjs-turbopack-canary	112	19
✅ nextjs-turbopack-stable-lazy-discovery-disabled	131	0
✅ nextjs-turbopack-stable-lazy-discovery-enabled	131	0
✅ nextjs-webpack-canary	112	19
✅ nextjs-webpack-stable-lazy-discovery-disabled	131	0
✅ nextjs-webpack-stable-lazy-discovery-enabled	131	0
✅ nitro-stable	106	25
✅ nuxt-stable	106	25
✅ sveltekit-stable	125	6
✅ vite-stable	106	25

✅ 🪟 Windows

App	Passed	Failed	Skipped
✅ nextjs-turbopack	131	0	0

❌ 📋 Other

App	Passed	Failed	Skipped
✅ e2e-local-dev-nest-stable	106	0	25
✅ e2e-local-dev-tanstack-start-	106	0	25
✅ e2e-local-postgres-nest-stable	106	0	25
✅ e2e-local-postgres-tanstack-start-	106	0	25
✅ e2e-local-prod-nest-stable	106	0	25
✅ e2e-local-prod-tanstack-start-	106	0	25
❌ e2e-vercel-prod-tanstack-start	103	2	26

📋 View full workflow run

❌ Some E2E test jobs failed:

Vercel Prod: failure
Local Dev: success
Local Prod: success
Local Postgres: success
Windows: success

Check the workflow run for details.

Building on 98c9741's bail-on-fence-conflict, propagate the fence-conflict signal upward as `staleSnapshot: true` so the entire current replay's queue results are abandoned rather than just the individual write skipped.

The narrower 'skip the write, continue the loop' shape from 98c9741 re-introduced CORRUPTED_EVENT_LOG under stress: when two concurrent invocations make divergent VM decisions from different event-log snapshots, the winner's fenced write succeeds and the loser's bails. But the loser's VM had already derived its own queue results from the stale snapshot. If it continues past the conflict and queues them, those queue items can drive subsequent ticks that consume the winner's events as their own, surfacing as the original step_mismatch shape ("step_started for step_X belongs to <name-A> but consumer is <name-B>").

The right behavior is the one Pranay sketched on Slack: 'if new events have been introduced to the log after a concurrent replay has started, the invocation queue results must be abandoned. That replay is invalid.'

Implementation:

- `FencedWriteResult` now carries a `staleSnapshot` boolean so callers can distinguish 'fence conflict: abandon entire replay' from 'entity already exists: skip this write but keep going'.
- `handleSuspension` short-circuits and returns `{ staleSnapshot: true, pendingSteps: [] }` the moment any fenced write rejects with a fence conflict. Subsequent step/wait writes from that replay never run.
- Runtime tick detects `staleSnapshot: true` and `return`s cleanly (no `run_failed` event). The canonical invocation is left to make progress; the run stays `running`.

The elapsed-wait scan (`wait_completed`) deliberately keeps its continue-on-conflict shape: the work it derives is purely timer-based (which waits have elapsed), not a VM branch decision, so a stale snapshot doesn't change the set of waits to complete. Only the suspension handler's writes are guarded by the abandon-the-tick semantic.

Tests: 1018 core tests pass.
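
A simplified sketch of the short-circuit described in this commit message. The names `FencedWriteResult`, `staleSnapshot`, `handleSuspension`, and `pendingSteps` come from the text above; the `written` field, the `FencedWrite` callback, and the synchronous signature are assumptions made to keep the example small, and the real implementation may differ.

```typescript
interface FencedWriteResult {
  // Assumed field: false means the entity already existed and the write was skipped.
  written: boolean;
  // True means a fence conflict: the replay ran against a stale snapshot.
  staleSnapshot: boolean;
}

interface SuspensionResult {
  staleSnapshot: boolean;
  pendingSteps: string[];
}

// Hypothetical stand-in for the fenced event write (synchronous for brevity).
type FencedWrite = (correlationId: string) => FencedWriteResult;

function handleSuspension(stepIds: string[], write: FencedWrite): SuspensionResult {
  const pendingSteps: string[] = [];
  for (const id of stepIds) {
    const result = write(id);
    if (result.staleSnapshot) {
      // Fence conflict: abandon the entire replay. Later writes never run.
      return { staleSnapshot: true, pendingSteps: [] };
    }
    if (result.written) {
      pendingSteps.push(id);
    }
    // Otherwise the entity already exists: skip this write but keep going.
  }
  return { staleSnapshot: false, pendingSteps };
}
```

The point of the two-flag result is that 'already exists' and 'fence conflict' need different handling: only the latter invalidates everything the replay derived.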

The abandon-tick change (fbaa2bf) correctly stops a stale-snapshot replay from queueing divergent work, but it returned without re-enqueueing. Under a hook burst, every tick that would consume the late-arriving hook_received events could race and abandon, leaving the run 'running' with pending hooks and no tick scheduled to advance it. Stress testing showed ~28/40 runs stalled this way (valid fence, real events, just no continuation).

Return { timeoutSeconds: 0 } on stale-snapshot abandon instead of a bare return: the same immediate re-enqueue idiom the hook-conflict path already uses. This guarantees a fresh tick re-runs against the canonical event log.

This is bounded (one re-enqueue per abandoned tick) and converges: paired with the server-side atomic fence+event write (no phantom fences), the canonical replay makes forward progress, so the re-enqueued tick advances the log rather than spinning, unlike the original MAX_FENCE_RETRIES storm this design replaced.
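
The difference between the bare return and the re-enqueue can be sketched as below. `runTick`, `TickResult`, and the callback parameters are hypothetical names; only `staleSnapshot`, `pendingSteps`, and the `{ timeoutSeconds: 0 }` return value come from the commit message.

```typescript
interface SuspensionResult {
  staleSnapshot: boolean;
  pendingSteps: string[];
}

// Assumed convention for this sketch: returning an object with timeoutSeconds
// asks the queue to schedule another tick; undefined schedules nothing.
type TickResult = { timeoutSeconds: number } | undefined;

function runTick(
  suspend: () => SuspensionResult,
  queueStep: (correlationId: string) => void
): TickResult {
  const result = suspend();
  if (result.staleSnapshot) {
    // Abandon this replay, but ask for an immediate fresh tick so the run
    // is re-evaluated against the canonical event log instead of stalling.
    return { timeoutSeconds: 0 };
  }
  // Normal path: queue the steps this replay decided on.
  for (const id of result.pendingSteps) {
    queueStep(id);
  }
  return undefined;
}
```

Each abandoned tick requests exactly one follow-up tick, which is the bounded behavior the message describes.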

The orphaned-step-dispatch recovery (re-queue step_created / step_retrying events that never reached step_started) was gated on `metadata.attempt > 1`, i.e. only on queue redeliveries. That misses the stale-snapshot abandon path: when a tick writes a fenced step_created and then abandons on a *later* fenced write (returning staleSnapshot + re-enqueuing), it never reaches the step-queueing code. The re-enqueue produces a *fresh* queue message (attempt 1), not a redelivery, so the attempt-gated recovery never fired, leaving the run stalled with a valid fence and an orphaned step_created that no one dispatches.

Run the recovery scan on every invocation. It is safe unconditionally: step dispatch is queued with `idempotencyKey: step.correlationId`, so re-queueing an already-dispatched step is deduped by the queue. Steps this tick created are still queued via `createdStepCorrelationIds` and selected for inline execution via `ownedPendingSteps` (unchanged); recovery only adds orphans this tick did not create, which are queued (never inline-executed). That is correct, since their creating tick abandoned.

Observed in stress testing: with the atomic-fence server fix eliminating phantom fences, a residual set of runs stalled with a real fence + a step_created that never started. This closes that gap.

Tests: 1018 core tests pass.

…ion" This reverts commit a42c1c1.

…shot replay

The previous attempt (unconditional orphaned-step recovery scan, reverted in d441126) re-queued every pending step_created on every invocation. That violated the single-owner-per-step invariant: a non-owner tick could re-dispatch a step another tick was already running, producing a duplicate step_started and a CORRUPTED_EVENT_LOG ("Unconsumed event in event log: eventType=step_started"). 2/40 runs hit this in stress.

Safer approach: when handleSuspension abandons on a stale snapshot, it returns the steps it ALREADY wrote a fenced step_created for (the ones in createdStepCorrelationIds) as pendingSteps. Those writes succeeded against a matching fence inside the atomic transaction, so they're canonical and owned by exactly this tick. The runtime's staleSnapshot branch dispatches just those owned steps (with idempotencyKey: correlationId) before re-enqueuing, so:

- no orphaned step_created (the step that this tick created always gets an owner to dispatch it), and
- no double-dispatch (only the single owning tick queues each step; other ticks that abandon before writing the step_created never claim ownership of it).

This pairs ownership with dispatch instead of blindly recovering, which is what made the unconditional scan unsafe.

Tests: 1018 core tests pass.
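
A rough sketch of the ownership-paired dispatch described above. `createdStepCorrelationIds`, `pendingSteps`, `staleSnapshot`, and `idempotencyKey` are named in the commit message; the function names, the `written` field, the `QueueMessage` shape, and the synchronous callbacks are assumptions for illustration.

```typescript
interface FencedWriteResult {
  written: boolean; // assumed field: the fenced step_created write succeeded
  staleSnapshot: boolean; // fence conflict
}

interface SuspensionResult {
  staleSnapshot: boolean;
  pendingSteps: string[];
}

interface QueueMessage {
  stepId: string;
  idempotencyKey: string;
}

function handleSuspensionOwned(
  stepIds: string[],
  write: (correlationId: string) => FencedWriteResult
): SuspensionResult {
  const createdStepCorrelationIds: string[] = [];
  for (const id of stepIds) {
    const result = write(id);
    if (result.staleSnapshot) {
      // Abandon, but hand back the steps this tick already created:
      // those writes passed the fence, so this tick owns them.
      return { staleSnapshot: true, pendingSteps: createdStepCorrelationIds };
    }
    if (result.written) createdStepCorrelationIds.push(id);
  }
  return { staleSnapshot: false, pendingSteps: createdStepCorrelationIds };
}

function runTickOwned(
  stepIds: string[],
  write: (correlationId: string) => FencedWriteResult,
  enqueue: (message: QueueMessage) => void
): { timeoutSeconds: number } | undefined {
  const result = handleSuspensionOwned(stepIds, write);
  // Dispatch only owned steps, keyed by correlationId so the queue can dedupe.
  for (const id of result.pendingSteps) {
    enqueue({ stepId: id, idempotencyKey: id });
  }
  // On abandon, re-enqueue immediately after dispatching the owned steps.
  return result.staleSnapshot ? { timeoutSeconds: 0 } : undefined;
}
```

In this sketch a step is dispatched only by the tick whose fenced write created it, so a tick that abandons before writing a step never queues that step. The non-stale path here simply enqueues; inline execution of owned steps is left out.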

NOT FOR MERGE. This branch ('debug/validate-occ-fix-20260528') exists to run end-to-end stress validation of the combined fix:

- @workflow/core: #2113 (e5cd686, top of branch)
- workflow-server: vercel/workflow-server#447 (e6722b2)

WORKFLOW_SERVER_URL_OVERRIDE is pinned to the workflow-server preview deployment for #447 so the repro app exercises both PRs together. The WORKFLOW_VERCEL_PROTECTION_BYPASS env var is forwarded as the bare 'x-vercel-protection-bypass' header to bypass the preview's Vercel Deployment Protection. Setting 'x-vercel-set-bypass-cookie: true' is deliberately NOT done: it triggers a 307 redirect loop on Node undici.


TooTallNate force-pushed the debug/validate-occ-fix-20260528 branch from c0bd797 to df40f8b Compare May 28, 2026 22:18


TooTallNate force-pushed the debug/validate-occ-fix-20260528 branch from 455bda9 to 0c7eb75 Compare May 28, 2026 23:26


TooTallNate added 6 commits May 28, 2026 23:38

Revert "core: recover orphaned step_created dispatch on every invocat…

d441126

…ion" This reverts commit a42c1c1.

debug: repoint to workflow-server atomic-txn preview (e1d0ea7)

7613014

TooTallNate wants to merge 7 commits into peter/sdk-event-write-cas from debug/validate-occ-fix-20260528
