[debug] Validation run: combine #2113 SDK + #447 server e6722b2 #2146

Draft
TooTallNate wants to merge 7 commits into peter/sdk-event-write-cas from debug/validate-occ-fix-20260528

@TooTallNate
Member

NOT FOR MERGE. Draft PR opened only to trigger the tarballs-checks workflow so we get a tarball URL to pin the repro app to.

Purpose

End-to-end validation that the combined fix closes the production-visible defect. It pairs the SDK change in #2113 with the server change in #447 (e6722b2).

What this branch adds on top of #2113

  • WORKFLOW_SERVER_URL_OVERRIDE pinned to the vercel/workflow-server#447 preview URL
  • x-vercel-protection-bypass header forwarded from WORKFLOW_VERCEL_PROTECTION_BYPASS env var (so the repro app can hit the preview through Vercel Deployment Protection)

Both changes are gated to this branch only and will not be cherry-picked into either real PR.

Validation plan

Run the standard stress repro shape against the pinned tarball + preview pair (40 cycles × 200 workflows). Classify outcomes across:

  • completed
  • still running at final check
  • failed: CORRUPTED_EVENT_LOG
  • failed: USER_ERROR
  • failed: WORLD_CONTRACT_ERROR
  • failed: other
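
The classification above can be sketched as a small tally function. This is a minimal sketch, not the repro app's actual harness: the `RunOutcome` shape, the `classify` and `tally` helpers, and the `errorCode` field are all hypothetical names introduced for illustration. Only the six bucket labels come from the list above.

```typescript
// Hypothetical final state of one workflow run as seen by the stress harness.
interface RunOutcome {
  status: "completed" | "running" | "failed";
  errorCode?: string; // assumed to be present only when status === "failed"
}

// The six buckets named in the validation plan.
type Bucket =
  | "completed"
  | "still running at final check"
  | "failed: CORRUPTED_EVENT_LOG"
  | "failed: USER_ERROR"
  | "failed: WORLD_CONTRACT_ERROR"
  | "failed: other";

const KNOWN_ERROR_CODES = new Set([
  "CORRUPTED_EVENT_LOG",
  "USER_ERROR",
  "WORLD_CONTRACT_ERROR",
]);

// Map one run to exactly one bucket; unknown or missing codes fall into "failed: other".
function classify(run: RunOutcome): Bucket {
  if (run.status === "completed") return "completed";
  if (run.status === "running") return "still running at final check";
  if (run.errorCode !== undefined && KNOWN_ERROR_CODES.has(run.errorCode)) {
    return `failed: ${run.errorCode}` as Bucket;
  }
  return "failed: other";
}

// Count runs per bucket, starting every bucket at zero so clean categories are visible.
function tally(runs: RunOutcome[]): Record<Bucket, number> {
  const counts: Record<Bucket, number> = {
    "completed": 0,
    "still running at final check": 0,
    "failed: CORRUPTED_EVENT_LOG": 0,
    "failed: USER_ERROR": 0,
    "failed: WORLD_CONTRACT_ERROR": 0,
    "failed: other": 0,
  };
  for (const run of runs) counts[classify(run)] += 1;
  return counts;
}
```

Starting every bucket at zero means a clean run reports an explicit 0 for each failure category rather than omitting it.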

Previous runs, before the #447 server fix: on stable, ~2/40 cycles surfaced CORRUPTED_EVENT_LOG. With this PR's predecessor against an earlier #447 preview, 0/40 cycles surfaced CORRUPTED_EVENT_LOG, but 132 stuck-running, 23 USER_ERROR, and 4 WORLD_CONTRACT_ERROR outcomes were left uncategorized.

Goal of this run: confirm not just that CORRUPTED_EVENT_LOG = 0 but also that the stuck, USER_ERROR, and WORLD_CONTRACT_ERROR counts are clean, since those would be the symptom of the materialization-before-fence orphan scenarios Peter walked through.

@changeset-bot

changeset-bot Bot commented May 28, 2026

⚠️ No Changeset found

Latest commit: 7613014

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@vercel
Contributor

vercel Bot commented May 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| example-nextjs-workflow-turbopack | Ready | Preview, Comment | May 29, 2026 7:53am |
| example-nextjs-workflow-webpack | Ready | Preview, Comment | May 29, 2026 7:53am |
| example-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-astro-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-express-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-fastify-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-hono-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-nitro-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-nuxt-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-sveltekit-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-tanstack-start-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workbench-vite-workflow | Ready | Preview, Comment | May 29, 2026 7:53am |
| workflow-docs | Ready | Preview, Comment, Open in v0 | May 29, 2026 7:53am |
| workflow-swc-playground | Ready | Preview, Comment | May 29, 2026 7:53am |
| workflow-tarballs | Ready | Preview, Comment | May 29, 2026 7:53am |
| workflow-web | Ready | Preview, Comment | May 29, 2026 7:53am |

@github-actions
Contributor

github-actions Bot commented May 28, 2026

🧪 E2E Test Results

Some tests failed

Summary

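| Category | Passed | Failed | Skipped | Total |
| --- | --- | --- | --- | --- |
| ❌ ▲ Vercel Production | 1190 | 32 | 219 | 1441 |
| ✅ 💻 Local Development | 1615 | 0 | 219 | 1834 |
| ✅ 📦 Local Production | 1615 | 0 | 219 | 1834 |
| ✅ 🐘 Local Postgres | 1615 | 0 | 219 | 1834 |
| ✅ 🪟 Windows | 131 | 0 | 0 | 131 |
| ❌ 📋 Other | 739 | 2 | 176 | 917 |
| Total | 6905 | 34 | 1052 | 7991 |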

❌ Failed Tests

▲ Vercel Production (32 failed)

astro (1 failed):

  • AbortController abortExternalSignalWorkflow: signal passed as workflow input

example (1 failed):

  • fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability

express (3 failed):

  • outputStreamWorkflow positive startIndex (skips first chunk)
  • AbortController abortReasonWorkflow: abort reason preserved across boundaries
  • AbortController abortVoidSleepTimeoutWorkflow: documented void sleep().then(abort) pattern works

fastify (2 failed):

  • sleepInLoopWorkflow - sleep inside loop with steps actually delays each iteration | wrun_01KSSC42D9CV29AQTYQ5VEJM1B | 🔍 observability
  • AbortController abortSurvivesReplayWorkflow: controller state consistent across replay

hono (3 failed):

  • parallelSleepWorkflow | wrun_01KSSBMQR32XYATD4XZD0KQF1A | 🔍 observability
  • closureVariableWorkflow - nested step functions with closure variables | wrun_01KSSBYPFQB35J48R1VYGS7Y9E | 🔍 observability
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step | wrun_01KSSBYRZW4NC1XCQVYX1BGMES | 🔍 observability

nextjs-turbopack (4 failed):

  • DurableAgent e2e experimental_onStepStart (GAP) completes but callbacks are not called (GAP)
  • fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability
  • Calculator.calculate - static workflow method using static step methods from another class | wrun_01KSSC05YKKWQY506Y50YTFD9R | 🔍 observability
  • errorSubclassRoundTripWorkflow - first-class Error subclasses survive every serialization boundary | wrun_01KSSC1XG3F71GH2WHPDG25YN8 | 🔍 observability

nextjs-webpack (1 failed):

  • AbortController abortFromStepWorkflow: step abort cancels an in-flight sibling step

nitro (1 failed):

  • fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability

nuxt (7 failed):

  • fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability
  • health check (queue-based) - workflow endpoint responds to health check messages
  • health check (CLI) - workflow health command reports healthy endpoints
  • pathsAliasWorkflow - TypeScript path aliases resolve correctly | wrun_01KSSBZYTZQ54CV3EQMZKKVRNT | 🔍 observability
  • AbortController abortAfterCompletionWorkflow: abort after step completes is a no-op
  • AbortController abortExternalSignalWorkflow: signal passed as workflow input
  • AbortController abortAnyInWorkflowWorkflow: AbortSignal.any composes signals inside the workflow VM

sveltekit (4 failed):

  • DurableAgent e2e core single tool call
  • DurableAgent e2e experimental_onToolCallStart (GAP) completes but callbacks are not called (GAP)
  • closureVariableWorkflow - nested step functions with closure variables | wrun_01KSSBYPFQB35J48R1VYGS7Y9E | 🔍 observability
  • fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC | 🔍 observability

vite (5 failed):

  • readableStreamWorkflow | wrun_01KSSBG7SCK2VJYZ4HGC1Q6H1K | 🔍 observability
  • hookWorkflow is not resumable via public webhook endpoint | wrun_01KSSBH63H4Y656X71066ZKSFY | 🔍 observability
  • runClassSerializationWorkflow - Run instances serialize across workflow/step boundaries | wrun_01KSSBXVA9MZPQCBQ8CTKQY50Z | 🔍 observability
  • sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KSSC4D3MHRF84CKJDYMSRCZQ | 🔍 observability
  • AbortController abortTimeoutWorkflow: timeout cancels long-running step

📋 Other (2 failed)

e2e-vercel-prod-tanstack-start (2 failed):

  • stepWinsRaceWorkflow | wrun_01KSSBMZWHM1PN5WRWBSRGR530
  • fibonacciWorkflow - recursive workflow composition via start() | wrun_01KSSBZ965W97AFDHKER08Y5HC

Details by Category

❌ ▲ Vercel Production

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ❌ astro | 104 | 1 | 26 |
| ❌ example | 104 | 1 | 26 |
| ❌ express | 102 | 3 | 26 |
| ❌ fastify | 103 | 2 | 26 |
| ❌ hono | 102 | 3 | 26 |
| ❌ nextjs-turbopack | 125 | 4 | 2 |
| ❌ nextjs-webpack | 128 | 1 | 2 |
| ❌ nitro | 104 | 1 | 26 |
| ❌ nuxt | 98 | 7 | 26 |
| ❌ sveltekit | 120 | 4 | 7 |
| ❌ vite | 100 | 5 | 26 |

✅ 💻 Local Development

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ astro-stable | 106 | 0 | 25 |
| ✅ express-stable | 106 | 0 | 25 |
| ✅ fastify-stable | 106 | 0 | 25 |
| ✅ hono-stable | 106 | 0 | 25 |
| ✅ nextjs-turbopack-canary | 112 | 0 | 19 |
| ✅ nextjs-turbopack-stable-lazy-discovery-disabled | 131 | 0 | 0 |
| ✅ nextjs-turbopack-stable-lazy-discovery-enabled | 131 | 0 | 0 |
| ✅ nextjs-webpack-canary | 112 | 0 | 19 |
| ✅ nextjs-webpack-stable-lazy-discovery-disabled | 131 | 0 | 0 |
| ✅ nextjs-webpack-stable-lazy-discovery-enabled | 131 | 0 | 0 |
| ✅ nitro-stable | 106 | 0 | 25 |
| ✅ nuxt-stable | 106 | 0 | 25 |
| ✅ sveltekit-stable | 125 | 0 | 6 |
| ✅ vite-stable | 106 | 0 | 25 |

✅ 📦 Local Production

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ astro-stable | 106 | 0 | 25 |
| ✅ express-stable | 106 | 0 | 25 |
| ✅ fastify-stable | 106 | 0 | 25 |
| ✅ hono-stable | 106 | 0 | 25 |
| ✅ nextjs-turbopack-canary | 112 | 0 | 19 |
| ✅ nextjs-turbopack-stable-lazy-discovery-disabled | 131 | 0 | 0 |
| ✅ nextjs-turbopack-stable-lazy-discovery-enabled | 131 | 0 | 0 |
| ✅ nextjs-webpack-canary | 112 | 0 | 19 |
| ✅ nextjs-webpack-stable-lazy-discovery-disabled | 131 | 0 | 0 |
| ✅ nextjs-webpack-stable-lazy-discovery-enabled | 131 | 0 | 0 |
| ✅ nitro-stable | 106 | 0 | 25 |
| ✅ nuxt-stable | 106 | 0 | 25 |
| ✅ sveltekit-stable | 125 | 0 | 6 |
| ✅ vite-stable | 106 | 0 | 25 |

✅ 🐘 Local Postgres

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ astro-stable | 106 | 0 | 25 |
| ✅ express-stable | 106 | 0 | 25 |
| ✅ fastify-stable | 106 | 0 | 25 |
| ✅ hono-stable | 106 | 0 | 25 |
| ✅ nextjs-turbopack-canary | 112 | 0 | 19 |
| ✅ nextjs-turbopack-stable-lazy-discovery-disabled | 131 | 0 | 0 |
| ✅ nextjs-turbopack-stable-lazy-discovery-enabled | 131 | 0 | 0 |
| ✅ nextjs-webpack-canary | 112 | 0 | 19 |
| ✅ nextjs-webpack-stable-lazy-discovery-disabled | 131 | 0 | 0 |
| ✅ nextjs-webpack-stable-lazy-discovery-enabled | 131 | 0 | 0 |
| ✅ nitro-stable | 106 | 0 | 25 |
| ✅ nuxt-stable | 106 | 0 | 25 |
| ✅ sveltekit-stable | 125 | 0 | 6 |
| ✅ vite-stable | 106 | 0 | 25 |

✅ 🪟 Windows

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ nextjs-turbopack | 131 | 0 | 0 |

❌ 📋 Other

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ e2e-local-dev-nest-stable | 106 | 0 | 25 |
| ✅ e2e-local-dev-tanstack-start- | 106 | 0 | 25 |
| ✅ e2e-local-postgres-nest-stable | 106 | 0 | 25 |
| ✅ e2e-local-postgres-tanstack-start- | 106 | 0 | 25 |
| ✅ e2e-local-prod-nest-stable | 106 | 0 | 25 |
| ✅ e2e-local-prod-tanstack-start- | 106 | 0 | 25 |
| ❌ e2e-vercel-prod-tanstack-start | 103 | 2 | 26 |

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: success
  • Local Prod: success
  • Local Postgres: success
  • Windows: success

Check the workflow run for details.

TooTallNate force-pushed the debug/validate-occ-fix-20260528 branch from c0bd797 to df40f8b on May 28, 2026 22:18

Building on 98c9741's bail-on-fence-conflict, propagate the
fence-conflict signal upward as `staleSnapshot: true` so the entire
current replay's queue results are abandoned rather than just the
individual write skipped.

The narrower 'skip the write, continue the loop' shape from 98c9741
re-introduced CORRUPTED_EVENT_LOG under stress: when two concurrent
invocations make divergent VM decisions from different event-log
snapshots, the winner's fenced write succeeds and the loser's bails.
But the loser's VM had already derived its own queue results from the
stale snapshot — if it continues past the conflict and queues them,
those queue items can drive subsequent ticks that consume the
winner's events as their own, surfacing as the original step_mismatch
shape ("step_started for step_X belongs to <name-A> but consumer is
<name-B>").

The right behavior is the one Pranay sketched on Slack: 'if new events
have been introduced to the log after a concurrent replay has started,
the invocation queue results must be abandoned. That replay is invalid.'

Implementation:

- `FencedWriteResult` now carries a `staleSnapshot` boolean so callers
  can distinguish 'fence conflict — abandon entire replay' from
  'entity already exists — skip this write but keep going'.
- `handleSuspension` short-circuits and returns `{ staleSnapshot: true,
  pendingSteps: [] }` the moment any fenced write rejects with a
  fence conflict. Subsequent step/wait writes from that replay never
  run.
- Runtime tick detects `staleSnapshot: true` and `return`s cleanly
  (no `run_failed` event). The canonical invocation is left to make
  progress; the run stays `running`.
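
The short-circuit described in the list above can be sketched as follows. This is a minimal sketch under stated assumptions: the commit only says that `FencedWriteResult` carries a `staleSnapshot` boolean and that `handleSuspension` returns `{ staleSnapshot: true, pendingSteps: [] }` on a fence conflict. The `written` field, the `FencedWrite` callback type, the synchronous signature, and the use of plain string IDs are hypothetical simplifications.

```typescript
// Assumed result shape: only `staleSnapshot` is confirmed by the commit message.
interface FencedWriteResult {
  written: boolean; // hypothetical: false when the entity already existed
  staleSnapshot: boolean; // true when the write was rejected by a fence conflict
}

interface SuspensionResult {
  staleSnapshot: boolean;
  pendingSteps: string[];
}

// Hypothetical stand-in for the fenced event write, keyed by step correlation ID.
type FencedWrite = (correlationId: string) => FencedWriteResult;

function handleSuspension(stepIds: string[], write: FencedWrite): SuspensionResult {
  const pendingSteps: string[] = [];
  for (const id of stepIds) {
    const result = write(id);
    if (result.staleSnapshot) {
      // Fence conflict: the whole replay is invalid, so stop immediately.
      // Writes for the remaining steps are never attempted.
      return { staleSnapshot: true, pendingSteps: [] };
    }
    // "Entity already exists": skip this write but keep going.
    if (result.written) pendingSteps.push(id);
  }
  return { staleSnapshot: false, pendingSteps };
}
```

The point of the separate boolean is that the caller can tell the two rejection reasons apart: one abandons the replay, the other only skips a single write.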

The elapsed-wait scan (`wait_completed`) deliberately keeps its
continue-on-conflict shape: the work it derives is purely
timer-based (which waits have elapsed), not a VM branch decision, so
a stale snapshot doesn't change the set of waits to complete. Only
the suspension handler's writes are guarded by the abandon-the-tick
semantic.

Tests: 1018 core tests pass.

TooTallNate force-pushed the debug/validate-occ-fix-20260528 branch from 455bda9 to 0c7eb75 on May 28, 2026 23:26

The abandon-tick change (fbaa2bf) correctly stops a stale-snapshot
replay from queueing divergent work, but it returned without
re-enqueueing. Under a hook burst, every tick that would consume the
late-arriving hook_received events could race and abandon, leaving the
run 'running' with pending hooks and no tick scheduled to advance it.
Stress testing showed ~28/40 runs stalled this way (valid fence, real
events, just no continuation).

Return { timeoutSeconds: 0 } on stale-snapshot abandon instead of a
bare return — the same immediate re-enqueue idiom the hook-conflict
path already uses. This guarantees a fresh tick re-runs against the
canonical event log.
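
A minimal sketch of the re-enqueue idiom, assuming a tick handler whose return value tells the queue whether to schedule another tick. Only the `{ timeoutSeconds: 0 }` return value comes from the commit message. The `TickResult` type, `finishTick`, and the `drive` loop that stands in for the queue are hypothetical.

```typescript
// Assumed contract: returning { timeoutSeconds: 0 } asks for an immediate re-run,
// while returning undefined means no further tick is requested.
type TickResult = { timeoutSeconds: number } | undefined;

function finishTick(suspension: { staleSnapshot: boolean }): TickResult {
  if (suspension.staleSnapshot) {
    // Abandon this replay's results but guarantee a fresh tick against the canonical log.
    return { timeoutSeconds: 0 };
  }
  return undefined;
}

// Hypothetical stand-in for the queue: re-runs the tick while it keeps asking to be re-enqueued.
function drive(tick: (attempt: number) => TickResult, maxTicks: number): number {
  let attempt = 0;
  while (attempt < maxTicks) {
    attempt += 1;
    const result = tick(attempt);
    if (result === undefined) return attempt; // tick finished without requesting a re-run
  }
  return attempt; // gave up after maxTicks
}
```

With a bare `return` in place of `{ timeoutSeconds: 0 }`, the loop above would stop after the first abandoned tick, which matches the stall the commit describes.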

This is bounded (one re-enqueue per abandoned tick) and converges:
paired with the server-side atomic fence+event write (no phantom
fences), the canonical replay makes forward progress, so the
re-enqueued tick advances the log rather than spinning — unlike the
original MAX_FENCE_RETRIES storm this design replaced.

The orphaned-step-dispatch recovery (re-queue step_created /
step_retrying events that never reached step_started) was gated on
`metadata.attempt > 1`, i.e. only on queue redeliveries. That misses
the stale-snapshot abandon path: when a tick writes a fenced
step_created and then abandons on a *later* fenced write (returning
staleSnapshot + re-enqueuing), it never reaches the step-queueing
code. The re-enqueue produces a *fresh* queue message (attempt 1), not
a redelivery, so the attempt-gated recovery never fired — leaving the
run stalled with a valid fence and an orphaned step_created that no
one dispatches.

Run the recovery scan on every invocation. It is safe unconditionally:
step dispatch is queued with `idempotencyKey: step.correlationId`, so
re-queueing an already-dispatched step is deduped by the queue. Steps
this tick created are still queued via `createdStepCorrelationIds` and
selected for inline execution via `ownedPendingSteps` (unchanged);
recovery only adds orphans this tick did not create, which are queued
(never inline-executed) — correct, since their creating tick abandoned.
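
The dedupe property this commit relies on can be sketched with a toy queue. This is an illustration of the assumed `idempotencyKey` semantics only, not the real queue's API: `DedupingQueue` and its `enqueue` signature are hypothetical. Note that the next commit message in this PR reports that the unconditional scan still produced duplicate `step_started` events under stress, so queue-level dedupe alone should not be read as sufficient.

```typescript
// Toy in-memory queue that drops messages whose idempotency key was already seen.
class DedupingQueue {
  private readonly seenKeys = new Set<string>();
  readonly delivered: string[] = [];

  // Returns true if the message was accepted, false if it was deduped.
  enqueue(payload: string, opts: { idempotencyKey: string }): boolean {
    if (this.seenKeys.has(opts.idempotencyKey)) return false;
    this.seenKeys.add(opts.idempotencyKey);
    this.delivered.push(payload);
    return true;
  }
}

// Re-queue every step that has a step_created but no step_started (hypothetical helper).
function recoverOrphans(queue: DedupingQueue, orphanCorrelationIds: string[]): number {
  let accepted = 0;
  for (const correlationId of orphanCorrelationIds) {
    if (queue.enqueue(correlationId, { idempotencyKey: correlationId })) accepted += 1;
  }
  return accepted;
}
```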

Observed in stress testing: with the atomic-fence server fix
eliminating phantom fences, a residual set of runs stalled with a real
fence + a step_created that never started. This closes that gap.

Tests: 1018 core tests pass.

…shot replay

The previous attempt (unconditional orphaned-step recovery scan,
reverted in d441126) re-queued every pending step_created on every
invocation. That violated the single-owner-per-step invariant: a
non-owner tick could re-dispatch a step another tick was already
running, producing a duplicate step_started and a
CORRUPTED_EVENT_LOG ("Unconsumed event in event log:
eventType=step_started"). 2/40 runs hit this in stress.

Safer approach: when handleSuspension abandons on a stale snapshot, it
returns the steps it ALREADY wrote a fenced step_created for (the ones
in createdStepCorrelationIds) as pendingSteps. Those writes succeeded
against a matching fence inside the atomic transaction, so they're
canonical and owned by exactly this tick. The runtime's
staleSnapshot branch dispatches just those owned steps (with
idempotencyKey: correlationId) before re-enqueuing, so:

- no orphaned step_created (the step that this tick created always gets
  an owner to dispatch it), and
- no double-dispatch (only the single owning tick queues each step;
  other ticks that abandon before writing the step_created never claim
  ownership of it).

This pairs ownership with dispatch instead of blindly recovering, which
is what made the unconditional scan unsafe.
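
A minimal sketch of pairing ownership with dispatch, assuming simplified synchronous signatures. The names `handleSuspension`, `createdStepCorrelationIds`, `pendingSteps`, `staleSnapshot`, `idempotencyKey`, and `timeoutSeconds` come from the commit messages. The `WriteOutcome` type, the `runTick` wrapper, the `enqueueStep` callback, and the handling of the non-stale path are hypothetical.

```typescript
type WriteOutcome = "written" | "fenceConflict";

interface SuspensionResult {
  staleSnapshot: boolean;
  pendingSteps: string[];
}

function handleSuspension(
  stepIds: string[],
  writeStepCreated: (correlationId: string) => WriteOutcome
): SuspensionResult {
  // Steps whose fenced step_created write succeeded in this tick: this tick owns them.
  const createdStepCorrelationIds: string[] = [];
  for (const id of stepIds) {
    if (writeStepCreated(id) === "fenceConflict") {
      // Abandon the replay, but hand back the steps already written so they get dispatched.
      return { staleSnapshot: true, pendingSteps: [...createdStepCorrelationIds] };
    }
    createdStepCorrelationIds.push(id);
  }
  return { staleSnapshot: false, pendingSteps: createdStepCorrelationIds };
}

function runTick(
  stepIds: string[],
  writeStepCreated: (correlationId: string) => WriteOutcome,
  enqueueStep: (correlationId: string, opts: { idempotencyKey: string }) => void
): { timeoutSeconds: number } | undefined {
  const result = handleSuspension(stepIds, writeStepCreated);
  // In this sketch both paths queue the owned steps; the real non-stale path also
  // selects steps for inline execution, which is omitted here.
  for (const id of result.pendingSteps) enqueueStep(id, { idempotencyKey: id });
  // On a stale snapshot, re-enqueue immediately so a fresh tick replays the canonical log.
  return result.staleSnapshot ? { timeoutSeconds: 0 } : undefined;
}
```

A tick that hits the conflict on its first write returns an empty `pendingSteps`, so it never dispatches a step it did not create.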

Tests: 1018 core tests pass.

NOT FOR MERGE. This branch ('debug/validate-occ-fix-20260528') exists
to run end-to-end stress validation of the combined fix:

- @workflow/core: #2113 (e5cd686, top of branch)
- workflow-server: vercel/workflow-server#447 (e6722b2)

WORKFLOW_SERVER_URL_OVERRIDE is pinned to the workflow-server preview
deployment for #447 so the repro app exercises both PRs together. The
WORKFLOW_VERCEL_PROTECTION_BYPASS env var is forwarded as the bare
'x-vercel-protection-bypass' header to bypass the preview's Vercel
Deployment Protection. Setting 'x-vercel-set-bypass-cookie: true' is
deliberately NOT done — it triggers a 307 redirect loop on Node undici.
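
A minimal sketch of the override and header forwarding described above, assuming both values are read from an environment-like object. The environment variable names and the `x-vercel-protection-bypass` header come from the commit message. The `buildServerRequest` helper, its parameters, and the idea of a fallback default URL are hypothetical.

```typescript
interface ServerRequestConfig {
  baseUrl: string;
  headers: Record<string, string>;
}

// `env` is passed in explicitly so the sketch does not depend on Node's `process` global.
function buildServerRequest(
  env: Record<string, string | undefined>,
  defaultUrl: string
): ServerRequestConfig {
  // Prefer the pinned preview deployment when the override is set.
  const baseUrl = env.WORKFLOW_SERVER_URL_OVERRIDE ?? defaultUrl;

  const headers: Record<string, string> = {};
  const bypass = env.WORKFLOW_VERCEL_PROTECTION_BYPASS;
  if (bypass) {
    // Forward the secret as the bare bypass header.
    headers["x-vercel-protection-bypass"] = bypass;
  }
  // 'x-vercel-set-bypass-cookie' is intentionally never set here; the commit message
  // reports that it triggers a 307 redirect loop on Node undici.
  return { baseUrl, headers };
}
```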