
[core] Optimistic concurrency control for branch-decision event writes#2113

Open
VaguelySerious wants to merge 10 commits into main from peter/sdk-event-write-cas


@VaguelySerious
Member

@VaguelySerious VaguelySerious commented May 26, 2026

Summary

Adds optimistic-concurrency fencing to the event writes that determine workflow branching, closing the hook/sleep race that produces CORRUPTED_EVENT_LOG on production runs.

Every write that emits a branch-decision side effect based on a potentially stale snapshot is now fenced against the canonical log:

  • wait_completed — elapsed-wait scan snapshots the loaded events' tail eventId and passes it as lastKnownEventId on each write. If a concurrent resumeHook has advanced the log, the server's CAS rejects.
  • step_created, hook_created, hook_disposed, wait_created — suspension-time writes fence against the same snapshot, via a new shared fencedEventCreate helper (packages/core/src/runtime/fenced-write.ts).
  • run_completed, run_failed — terminal writes fence against the snapshot, with idempotency-via-reload so a concurrent terminal write doesn't re-fail the tick.

On a fence-conflict EntityConflictError, the runtime retries in place rather than throwing the whole tick away: it reloads events from the cursor, refreshes the fence, and tries again (up to 5 times, with backoff). Falling back to queue redelivery turned out to cause a thundering herd: every redelivery spawns another concurrent tick, which hits a fence conflict again, and workflows stall in running. If a concurrent writer committed the work between attempts, we observe it in the reloaded log and skip the write entirely (idempotency).
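The retry-in-place behavior described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `InMemoryLog`, `writeWithFenceRetry`, the `correlationId` field, and the backoff schedule are all hypothetical; only `EntityConflictError`, `lastKnownEventId`, the five-attempt cap, and the event type names come from the description.

```typescript
// Illustrative only: names are hypothetical unless they appear in the PR text.
type WorkflowEvent = { eventId: string; eventType: string; correlationId: string };

class EntityConflictError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "EntityConflictError";
    Object.setPrototypeOf(this, EntityConflictError.prototype);
  }
}

// Toy in-memory stand-in for the canonical event log and its CAS check.
class InMemoryLog {
  private events: WorkflowEvent[] = [];

  load(): WorkflowEvent[] {
    return [...this.events];
  }

  // When a fence is supplied, reject the write if the log tail has moved past it.
  append(eventType: string, correlationId: string, lastKnownEventId?: string): void {
    const tail = this.events[this.events.length - 1]?.eventId;
    if (lastKnownEventId !== undefined && lastKnownEventId !== tail) {
      throw new EntityConflictError(`fence conflict: log tail is ${tail}`);
    }
    this.events.push({ eventId: `evt_${this.events.length + 1}`, eventType, correlationId });
  }
}

const MAX_ATTEMPTS = 5;
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function writeWithFenceRetry(
  log: InMemoryLog,
  snapshot: WorkflowEvent[],
  eventType: string,
  correlationId: string,
  baseDelayMs = 1,
): Promise<"written" | "skipped"> {
  let events = snapshot;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    // Idempotency: if the loaded log already contains this write, skip it.
    if (events.some((e) => e.eventType === eventType && e.correlationId === correlationId)) {
      return "skipped";
    }
    const fence = events[events.length - 1]?.eventId;
    try {
      log.append(eventType, correlationId, fence);
      return "written";
    } catch (err) {
      if (!(err instanceof EntityConflictError) || attempt === MAX_ATTEMPTS) throw err;
      await delay(baseDelayMs * 2 ** (attempt - 1)); // assumed exponential backoff
      events = log.load(); // reload and refresh the fence
    }
  }
  throw new Error("unreachable");
}
```

The point of the sketch is that a conflict is resolved inside the same tick: the reload either shows the work already committed (skip) or yields a fresh tail to fence against, so no extra tick is queued.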

resumeHook appends hook_received unfenced. ULID ordering already places this write after anything committed before us, and applying CAS would only ever reject the hook in favor of an unrelated concurrent write (losing the user's signal). Stale-snapshot protection lives on the tick writes that consume hooks, not on the write that delivers them.

Fence-conflict detection is anchored on HTTP status, not error wording: in world-vercel, HTTP 412 responses are always surfaced as EntityConflictError with a fence conflict: prefix added client-side based on HTTP status (so the marker is present even when the response body fails to parse). The runtime's isFenceConflict() check (EntityConflictError + /fence conflict/i) therefore cannot silently regress against server wording changes.
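A minimal sketch of that detection scheme, assuming a hypothetical `errorFromResponse` helper on the client side (the real world-vercel code may be structured differently); only `EntityConflictError`, the `fence conflict:` prefix, the 412 status, and the `isFenceConflict()` predicate come from the description above:

```typescript
class EntityConflictError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "EntityConflictError";
    Object.setPrototypeOf(this, EntityConflictError.prototype);
  }
}

// Hypothetical client-side mapping: the marker is keyed on status 412, not on the body.
function errorFromResponse(status: number, rawBody: string): Error {
  let detail = "";
  try {
    const parsed = JSON.parse(rawBody) as { message?: unknown };
    if (typeof parsed.message === "string") detail = parsed.message;
  } catch {
    // Unparseable body: the prefix below is still added.
  }
  if (status === 412) {
    return new EntityConflictError(`fence conflict: ${detail || "precondition failed"}`);
  }
  return new Error(`HTTP ${status}: ${detail || "request failed"}`);
}

// Mirrors the check described above: error class plus the client-added marker.
function isFenceConflict(err: unknown): boolean {
  return err instanceof EntityConflictError && /fence conflict/i.test(err.message);
}
```

Because the prefix is attached from the status code, a reworded or malformed server body does not change what `isFenceConflict` returns.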

CreateEventParams on @workflow/world grows lastKnownEventId and asOfTimestamp (both optional). Worlds that don't implement OCC can pass them through or ignore them.

Pairs with backend PR vercel/workflow-server#447 which materializes run.lastKnownEventId and gates event writes on it. The server's CAS is explicit opt-in — unfenced writers (most paths) still atomically advance the materialized value so fenced writers can chain off it, but they don't reject on contention.
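The opt-in semantics can be pictured with a toy gate like the one below. It is a sketch under assumptions: `FenceParams`, `RunRecord`, and `gateEventWrite` are hypothetical names, `asOfTimestamp` is assumed to be epoch milliseconds and is not handled here, and the real server would have to perform the compare and the advance atomically.

```typescript
// Hypothetical shapes; only the two optional field names come from the PR text.
interface FenceParams {
  lastKnownEventId?: string;
  asOfTimestamp?: number; // assumed epoch milliseconds; unused in this sketch
}

interface RunRecord {
  lastKnownEventId: string | null; // materialized tail of the run's event log
}

// Returns true if the write is accepted. Only writers that send a fence can be rejected.
function gateEventWrite(run: RunRecord, newEventId: string, fence: FenceParams = {}): boolean {
  const { lastKnownEventId } = fence;
  if (lastKnownEventId !== undefined && lastKnownEventId !== run.lastKnownEventId) {
    return false; // stale fence: a real server would answer with a conflict
  }
  run.lastKnownEventId = newEventId; // accepted writes, fenced or not, advance the tail
  return true;
}
```

An unfenced write always succeeds and moves the tail, which is what lets a later fenced writer detect that its snapshot is out of date.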

Test plan

  • All 1014 core unit tests passing
  • Typecheck clean
  • Changeset included
  • End-to-end repro against a Vercel preview deployment of this branch + the matching backend preview

Stress reproduction

The original CORRUPTED_EVENT_LOG bug reproduces on stable at ~0.1–0.4% of runs under the following shape: Promise.race([hook, sleep]) with sleepBranchWaitCount parallel sleeps when sleep wins, 10 hook payloads per token at fireAfterMs=3000.

Stress runs (REPRO_COUNT=180 REPRO_LOOPS=80 REPRO_CONCURRENCY=50 × 8 parallel cycles, 40 cycles total against this branch + the paired backend preview) show 0/40 cycles surfacing CORRUPTED_EVENT_LOG. Baseline on stable surfaces the failure in ~2/40 cycles under the same load.

The earlier residual pattern (sleep-branch waits with a single un-completed wait_created) is now closed by extending the fence to wait_created itself.

🤖 Generated with Claude Code

The elapsed-wait scan now snapshots the loaded events' tail eventId and
passes it as `lastKnownEventId` on each `wait_completed` write, so a
concurrent `resumeHook` that has already advanced the canonical log is
detected — the server's CAS rejects the write, we surface it as the
existing `EntityConflictError`, and the next iteration re-replays
against the fresh event list (mirroring the duplicate-wait fall-through
that was already there).

`resumeHook` sends `asOfTimestamp` (Date.now() at call time) so the
server resolves the fence to the highest eventId strictly before
resume time — no client-side event pre-read needed.
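As a rough illustration of the `asOfTimestamp` resolution this commit message describes, the lookup might resemble the sketch below. `StoredEvent`, its `createdAt` field, and `resolveFence` are hypothetical, and event IDs are assumed to sort lexicographically (as ULIDs do).

```typescript
// Hypothetical event shape: createdAt is assumed to be epoch milliseconds.
type StoredEvent = { eventId: string; createdAt: number };

// Pick the highest eventId among events created strictly before asOfTimestamp.
function resolveFence(events: StoredEvent[], asOfTimestamp: number): string | undefined {
  let best: string | undefined;
  for (const event of events) {
    if (event.createdAt < asOfTimestamp && (best === undefined || event.eventId > best)) {
      best = event.eventId;
    }
  }
  return best;
}
```

With this shape the caller only needs a clock reading, which matches the stated goal of avoiding a client-side pre-read of the event log.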

Plumbed through `CreateEventParams` on `@workflow/world` so future
worlds can forward as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@changeset-bot

changeset-bot Bot commented May 26, 2026

🦋 Changeset detected

Latest commit: 98c9741

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 20 packages
Name Type
@workflow/core Patch
@workflow/world Patch
@workflow/world-vercel Patch
@workflow/world-local Patch
@workflow/builders Patch
@workflow/cli Patch
@workflow/next Patch
@workflow/nitro Patch
@workflow/vitest Patch
@workflow/web-shared Patch
@workflow/web Patch
workflow Patch
@workflow/world-testing Patch
@workflow/world-postgres Patch
@workflow/astro Patch
@workflow/nest Patch
@workflow/rollup Patch
@workflow/sveltekit Patch
@workflow/vite Patch
@workflow/nuxt Patch


@vercel
Contributor

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
example-nextjs-workflow-turbopack Ready Ready Preview, Comment May 28, 2026 11:10pm
example-nextjs-workflow-webpack Ready Ready Preview, Comment May 28, 2026 11:10pm
example-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-astro-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-express-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-fastify-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-hono-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-nitro-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-nuxt-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-sveltekit-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-tanstack-start-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workbench-vite-workflow Ready Ready Preview, Comment May 28, 2026 11:10pm
workflow-docs Ready Ready Preview, Comment, Open in v0 May 28, 2026 11:10pm
workflow-swc-playground Ready Ready Preview, Comment May 28, 2026 11:10pm
workflow-tarballs Ready Ready Preview, Comment May 28, 2026 11:10pm
workflow-web Ready Ready Preview, Comment May 28, 2026 11:10pm

@github-actions
Contributor

github-actions Bot commented May 26, 2026

🧪 E2E Test Results

Some tests failed

Summary

Passed Failed Skipped Total
✅ ▲ Vercel Production 1222 0 219 1441
❌ 💻 Local Development 1614 1 219 1834
✅ 📦 Local Production 1615 0 219 1834
✅ 🐘 Local Postgres 1615 0 219 1834
✅ 🪟 Windows 131 0 0 131
✅ 📋 Other 741 0 176 917
Total 6938 1 1052 7991

❌ Failed Tests

💻 Local Development (1 failed)

fastify-stable (1 failed):

  • addTenWorkflow | wrun_01KSRDG1M69VYM3QCBRG4CGJ57

Details by Category

✅ ▲ Vercel Production
App Passed Failed Skipped
✅ astro 105 0 26
✅ example 105 0 26
✅ express 105 0 26
✅ fastify 105 0 26
✅ hono 105 0 26
✅ nextjs-turbopack 129 0 2
✅ nextjs-webpack 129 0 2
✅ nitro 105 0 26
✅ nuxt 105 0 26
✅ sveltekit 124 0 7
✅ vite 105 0 26
❌ 💻 Local Development
App Passed Failed Skipped
✅ astro-stable 106 0 25
✅ express-stable 106 0 25
❌ fastify-stable 105 1 25
✅ hono-stable 106 0 25
✅ nextjs-turbopack-canary 112 0 19
✅ nextjs-turbopack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-turbopack-stable-lazy-discovery-enabled 131 0 0
✅ nextjs-webpack-canary 112 0 19
✅ nextjs-webpack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-webpack-stable-lazy-discovery-enabled 131 0 0
✅ nitro-stable 106 0 25
✅ nuxt-stable 106 0 25
✅ sveltekit-stable 125 0 6
✅ vite-stable 106 0 25
✅ 📦 Local Production
App Passed Failed Skipped
✅ astro-stable 106 0 25
✅ express-stable 106 0 25
✅ fastify-stable 106 0 25
✅ hono-stable 106 0 25
✅ nextjs-turbopack-canary 112 0 19
✅ nextjs-turbopack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-turbopack-stable-lazy-discovery-enabled 131 0 0
✅ nextjs-webpack-canary 112 0 19
✅ nextjs-webpack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-webpack-stable-lazy-discovery-enabled 131 0 0
✅ nitro-stable 106 0 25
✅ nuxt-stable 106 0 25
✅ sveltekit-stable 125 0 6
✅ vite-stable 106 0 25
✅ 🐘 Local Postgres
App Passed Failed Skipped
✅ astro-stable 106 0 25
✅ express-stable 106 0 25
✅ fastify-stable 106 0 25
✅ hono-stable 106 0 25
✅ nextjs-turbopack-canary 112 0 19
✅ nextjs-turbopack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-turbopack-stable-lazy-discovery-enabled 131 0 0
✅ nextjs-webpack-canary 112 0 19
✅ nextjs-webpack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-webpack-stable-lazy-discovery-enabled 131 0 0
✅ nitro-stable 106 0 25
✅ nuxt-stable 106 0 25
✅ sveltekit-stable 125 0 6
✅ vite-stable 106 0 25
✅ 🪟 Windows
App Passed Failed Skipped
✅ nextjs-turbopack 131 0 0
✅ 📋 Other
App Passed Failed Skipped
✅ e2e-local-dev-nest-stable 106 0 25
✅ e2e-local-dev-tanstack-start- 106 0 25
✅ e2e-local-postgres-nest-stable 106 0 25
✅ e2e-local-postgres-tanstack-start- 106 0 25
✅ e2e-local-prod-nest-stable 106 0 25
✅ e2e-local-prod-tanstack-start- 106 0 25
✅ e2e-vercel-prod-tanstack-start 105 0 26

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: success
  • Local Dev: failure
  • Local Prod: success
  • Local Postgres: success
  • Windows: success

Check the workflow run for details.

@github-actions
Contributor

github-actions Bot commented May 26, 2026

📊 Benchmark Results

📈 Comparing against baseline from main branch. Green 🟢 = faster, Red 🔺 = slower.

workflow with no steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 0.035s (-21.7% 🟢) 1.005s (~) 0.970s 10 1.00x
💻 Local Nitro 0.048s (+10.2% 🔺) 1.006s (~) 0.959s 10 1.37x
💻 Local Next.js (Turbopack) 0.061s 1.005s 0.944s 10 1.76x
🐘 Postgres Nitro 0.061s (-35.7% 🟢) 1.012s (-2.9%) 0.951s 10 1.76x
🐘 Postgres Express 0.065s (+11.4% 🔺) 1.012s (~) 0.948s 10 1.86x
🐘 Postgres Next.js (Turbopack) 0.068s 1.011s 0.944s 10 1.95x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 0.390s (+55.0% 🔺) 2.735s (+17.2% 🔺) 2.345s 10 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 1 step

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 1.076s (-4.4%) 2.005s (~) 0.929s 10 1.00x
🐘 Postgres Express 1.103s (-3.8%) 2.009s (~) 0.906s 10 1.02x
🐘 Postgres Nitro 1.103s (-3.2%) 2.010s (~) 0.907s 10 1.03x
💻 Local Nitro 1.104s (-2.4%) 2.007s (~) 0.903s 10 1.03x
💻 Local Next.js (Turbopack) 1.130s 2.006s 0.876s 10 1.05x
🐘 Postgres Next.js (Turbopack) 1.132s 2.008s 0.875s 10 1.05x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 1.607s (-21.0% 🟢) 3.630s (-5.2% 🟢) 2.023s 10 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 10 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 10.408s (-4.7%) 11.021s (~) 0.613s 3 1.00x
🐘 Postgres Express 10.513s (-4.1%) 11.017s (~) 0.504s 3 1.01x
🐘 Postgres Nitro 10.517s (-3.3%) 11.018s (~) 0.501s 3 1.01x
💻 Local Nitro 10.539s (-3.7%) 11.023s (~) 0.484s 3 1.01x
💻 Local Next.js (Turbopack) 10.738s 11.022s 0.284s 3 1.03x
🐘 Postgres Next.js (Turbopack) 10.838s 11.020s 0.182s 3 1.04x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 13.544s (-21.8% 🟢) 15.820s (-18.5% 🟢) 2.276s 2 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 25 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 13.475s (-10.0% 🟢) 14.025s (-6.7% 🟢) 0.550s 5 1.00x
💻 Local Nitro 13.731s (-8.8% 🟢) 14.027s (-12.5% 🟢) 0.295s 5 1.02x
🐘 Postgres Express 13.747s (-5.7% 🟢) 14.019s (-6.7% 🟢) 0.272s 5 1.02x
🐘 Postgres Nitro 13.865s (-5.0% 🟢) 14.020s (-6.7% 🟢) 0.155s 5 1.03x
💻 Local Next.js (Turbopack) 14.383s 15.029s 0.645s 4 1.07x
🐘 Postgres Next.js (Turbopack) 14.513s 15.013s 0.500s 4 1.08x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 21.018s (-60.0% 🟢) 23.084s (-57.7% 🟢) 2.066s 3 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 50 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 11.956s (-28.0% 🟢) 12.396s (-27.2% 🟢) 0.441s 8 1.00x
🐘 Postgres Express 12.379s (-11.6% 🟢) 13.017s (-10.8% 🟢) 0.638s 7 1.04x
🐘 Postgres Nitro 12.470s (-10.7% 🟢) 13.017s (-9.0% 🟢) 0.547s 7 1.04x
💻 Local Nitro 12.500s (-25.5% 🟢) 13.025s (-23.5% 🟢) 0.525s 7 1.05x
💻 Local Next.js (Turbopack) 13.538s 14.026s 0.488s 7 1.13x
🐘 Postgres Next.js (Turbopack) 13.589s 14.017s 0.428s 7 1.14x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 32.124s (-91.8% 🟢) 35.293s (-91.1% 🟢) 3.169s 3 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.all with 10 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 1.160s (-22.1% 🟢) 2.005s (~) 0.845s 15 1.00x
🐘 Postgres Express 1.214s (-3.7%) 2.007s (~) 0.793s 15 1.05x
🐘 Postgres Nitro 1.228s (-3.6%) 2.007s (~) 0.779s 15 1.06x
🐘 Postgres Next.js (Turbopack) 1.303s 2.008s 0.705s 15 1.12x
💻 Local Next.js (Turbopack) 1.363s 2.006s 0.643s 15 1.17x
💻 Local Nitro 1.368s (-16.2% 🟢) 2.006s (-3.3%) 0.638s 15 1.18x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 2.965s (-12.8% 🟢) 5.201s (+5.4% 🔺) 2.236s 6 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.all with 25 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.299s (-45.0% 🟢) 2.008s (-33.3% 🟢) 0.709s 15 1.00x
🐘 Postgres Nitro 1.393s (-40.8% 🟢) 2.074s (-31.1% 🟢) 0.681s 15 1.07x
🐘 Postgres Next.js (Turbopack) 1.461s 2.006s 0.545s 15 1.13x
💻 Local Express 1.590s (-46.1% 🟢) 2.005s (-41.9% 🟢) 0.415s 15 1.22x
💻 Local Nitro 1.907s (-39.3% 🟢) 2.151s (-44.6% 🟢) 0.244s 14 1.47x
💻 Local Next.js (Turbopack) 1.989s 2.591s 0.602s 12 1.53x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 5.060s (-28.7% 🟢) 6.693s (-24.8% 🟢) 1.633s 5 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.all with 50 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 1.434s (-58.8% 🟢) 2.008s (-49.9% 🟢) 0.574s 15 1.00x
🐘 Postgres Express 1.447s (-58.5% 🟢) 2.007s (-49.9% 🟢) 0.560s 15 1.01x
🐘 Postgres Next.js (Turbopack) 1.844s 2.315s 0.471s 13 1.29x
💻 Local Express 4.169s (-50.0% 🟢) 4.582s (-49.2% 🟢) 0.413s 7 2.91x
💻 Local Next.js (Turbopack) 5.047s 5.514s 0.467s 6 3.52x
💻 Local Nitro 5.794s (-30.6% 🟢) 6.414s (-28.9% 🟢) 0.621s 5 4.04x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 8.192s (-8.1% 🟢) 10.022s (-8.6% 🟢) 1.830s 3 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.race with 10 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.208s (-3.9%) 2.009s (~) 0.800s 15 1.00x
💻 Local Express 1.216s (-35.8% 🟢) 2.005s (-15.2% 🟢) 0.789s 15 1.01x
🐘 Postgres Nitro 1.223s (-2.7%) 2.009s (~) 0.786s 15 1.01x
🐘 Postgres Next.js (Turbopack) 1.268s 2.007s 0.739s 15 1.05x
💻 Local Nitro 1.271s (-31.8% 🟢) 2.006s (-14.3% 🟢) 0.734s 15 1.05x
💻 Local Next.js (Turbopack) 1.511s 2.006s 0.495s 15 1.25x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 3.295s (+12.4% 🔺) 5.451s (+17.4% 🔺) 2.157s 6 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.race with 25 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 1.285s (-45.1% 🟢) 2.008s (-33.3% 🟢) 0.723s 15 1.00x
🐘 Postgres Express 1.285s (-45.1% 🟢) 2.009s (-33.3% 🟢) 0.724s 15 1.00x
🐘 Postgres Next.js (Turbopack) 1.436s 2.007s 0.570s 15 1.12x
💻 Local Express 1.827s (-41.7% 🟢) 2.072s (-44.9% 🟢) 0.245s 15 1.42x
💻 Local Nitro 1.995s (-34.9% 🟢) 2.507s (-35.5% 🟢) 0.512s 12 1.55x
💻 Local Next.js (Turbopack) 1.999s 2.391s 0.392s 13 1.56x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 5.494s (+74.8% 🔺) 7.206s (+59.3% 🔺) 1.711s 5 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.race with 50 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.422s (-59.4% 🟢) 2.008s (-49.9% 🟢) 0.586s 15 1.00x
🐘 Postgres Nitro 1.450s (-58.3% 🟢) 2.008s (-49.9% 🟢) 0.558s 15 1.02x
🐘 Postgres Next.js (Turbopack) 1.846s 2.314s 0.469s 13 1.30x
💻 Local Express 4.902s (-44.3% 🟢) 5.512s (-40.6% 🟢) 0.611s 6 3.45x
💻 Local Next.js (Turbopack) 5.655s 6.012s 0.357s 5 3.98x
💻 Local Nitro 6.234s (-31.8% 🟢) 6.617s (-34.0% 🟢) 0.383s 5 4.38x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 9.819s (+45.3% 🔺) 12.421s (+45.4% 🔺) 2.602s 3 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 10 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 0.556s (-43.5% 🟢) 1.021s (-5.2% 🟢) 0.465s 59 1.00x
🐘 Postgres Express 0.561s (-33.1% 🟢) 1.007s (-1.6%) 0.446s 60 1.01x
🐘 Postgres Nitro 0.563s (-31.3% 🟢) 1.006s (~) 0.443s 60 1.01x
💻 Local Nitro 0.626s (-36.2% 🟢) 1.022s (-6.6% 🟢) 0.396s 59 1.13x
🐘 Postgres Next.js (Turbopack) 0.803s 1.041s 0.238s 58 1.44x
💻 Local Next.js (Turbopack) 0.829s 1.004s 0.175s 60 1.49x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 5.595s (-61.4% 🟢) 7.643s (-52.5% 🟢) 2.048s 8 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 25 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 1.240s (-58.9% 🟢) 2.005s (-44.1% 🟢) 0.766s 45 1.00x
🐘 Postgres Nitro 1.348s (-30.1% 🟢) 2.030s (-3.4%) 0.682s 45 1.09x
🐘 Postgres Express 1.366s (-30.9% 🟢) 2.053s (-9.1% 🟢) 0.686s 44 1.10x
💻 Local Nitro 1.541s (-49.2% 🟢) 2.006s (-46.6% 🟢) 0.465s 45 1.24x
🐘 Postgres Next.js (Turbopack) 1.876s 2.030s 0.155s 45 1.51x
💻 Local Next.js (Turbopack) 2.265s 3.075s 0.810s 30 1.83x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 15.419s (-69.0% 🟢) 17.465s (-66.2% 🟢) 2.046s 6 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 50 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 2.624s (-34.2% 🟢) 3.034s (-30.6% 🟢) 0.410s 40 1.00x
🐘 Postgres Nitro 2.659s (-35.2% 🟢) 3.058s (-33.6% 🟢) 0.399s 40 1.01x
💻 Local Express 2.817s (-69.4% 🟢) 3.218s (-67.9% 🟢) 0.401s 38 1.07x
💻 Local Nitro 3.455s (-62.8% 🟢) 4.183s (-58.3% 🟢) 0.727s 29 1.32x
🐘 Postgres Next.js (Turbopack) 3.706s 4.010s 0.304s 30 1.41x
💻 Local Next.js (Turbopack) 4.281s 5.010s 0.729s 24 1.63x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 32.528s (-69.6% 🟢) 35.374s (-67.5% 🟢) 2.847s 4 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 10 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 0.215s (-23.9% 🟢) 1.006s (~) 0.791s 60 1.00x
🐘 Postgres Nitro 0.233s (-17.7% 🟢) 1.006s (~) 0.773s 60 1.08x
🐘 Postgres Next.js (Turbopack) 0.297s 1.006s 0.709s 60 1.38x
💻 Local Express 0.550s (-1.8%) 1.004s (~) 0.454s 60 2.56x
💻 Local Next.js (Turbopack) 0.668s 1.021s 0.353s 59 3.11x
💻 Local Nitro 0.684s (+13.1% 🔺) 1.075s (+5.3% 🔺) 0.391s 57 3.18x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 4.190s (+107.2% 🔺) 6.135s (+61.7% 🔺) 1.945s 10 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 25 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 0.356s (-30.2% 🟢) 1.006s (~) 0.650s 90 1.00x
🐘 Postgres Nitro 0.359s (-27.7% 🟢) 1.006s (~) 0.647s 90 1.01x
🐘 Postgres Next.js (Turbopack) 0.512s 1.006s 0.494s 90 1.44x
💻 Local Express 2.100s (-16.4% 🟢) 2.765s (-8.1% 🟢) 0.665s 33 5.90x
💻 Local Nitro 2.493s (-1.8%) 3.224s (+7.1% 🔺) 0.731s 28 7.01x
💻 Local Next.js (Turbopack) 2.660s 3.333s 0.672s 28 7.48x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 8.660s (+145.0% 🔺) 10.867s (+109.3% 🔺) 2.207s 9 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 50 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 0.712s (-9.9% 🟢) 1.006s (~) 0.294s 120 1.00x
🐘 Postgres Express 0.715s (-12.7% 🟢) 1.006s (-1.1%) 0.291s 120 1.00x
🐘 Postgres Next.js (Turbopack) 1.021s 1.825s 0.804s 66 1.43x
💻 Local Express 8.036s (-28.2% 🟢) 8.664s (-27.4% 🟢) 0.628s 14 11.29x
💻 Local Nitro 9.750s (-12.9% 🟢) 10.363s (-11.2% 🟢) 0.613s 12 13.70x
💻 Local Next.js (Turbopack) 10.510s 11.395s 0.885s 11 14.77x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 29.101s (+181.8% 🔺) 31.550s (+156.8% 🔺) 2.448s 5 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Stream Benchmarks (includes TTFB metrics)

workflow with stream

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Express 1.137s (+471.0% 🔺) 2.004s (+99.5% 🔺) 0.008s (-30.6% 🟢) 2.014s (+97.9% 🔺) 0.878s 10 1.00x
🐘 Postgres Express 1.165s (+468.2% 🔺) 1.999s (+100.2% 🔺) 0.001s (-25.0% 🟢) 2.011s (+98.8% 🔺) 0.845s 10 1.03x
💻 Local Nitro 1.169s (+447.0% 🔺) 2.005s (+99.6% 🔺) 0.013s (~) 2.020s (+98.3% 🔺) 0.851s 10 1.03x
🐘 Postgres Nitro 1.171s (+471.1% 🔺) 2.002s (+100.3% 🔺) 0.001s (-33.3% 🟢) 2.011s (+98.9% 🔺) 0.840s 10 1.03x
💻 Local Next.js (Turbopack) 1.201s 2.004s 0.011s 2.018s 0.817s 10 1.06x
🐘 Postgres Next.js (Turbopack) 1.215s 2.001s 0.001s 2.009s 0.795s 10 1.07x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 2.231s (-67.4% 🟢) 3.547s (-59.0% 🟢) 1.527s (+141.6% 🔺) 5.627s (-42.5% 🟢) 3.396s 10 1.00x
▲ Vercel Express ⚠️ missing - - - - -
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Next.js (Turbopack)

stream pipeline with 5 transform steps (1MB)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 1.576s (+152.5% 🔺) 2.004s (+99.0% 🔺) 0.004s (-5.7% 🟢) 2.027s (+98.3% 🔺) 0.451s 30 1.00x
🐘 Postgres Express 1.577s (+150.3% 🔺) 2.007s (+99.4% 🔺) 0.004s (+3.6%) 2.025s (+97.9% 🔺) 0.447s 30 1.00x
💻 Local Express 1.672s (+120.8% 🔺) 2.008s (+95.2% 🔺) 0.007s (-20.7% 🟢) 2.197s (+111.2% 🔺) 0.525s 28 1.06x
🐘 Postgres Next.js (Turbopack) 1.725s 2.043s 0.004s 2.057s 0.332s 30 1.09x
💻 Local Next.js (Turbopack) 1.906s 2.010s 0.008s 2.200s 0.294s 28 1.21x
💻 Local Nitro 2.020s (+140.8% 🔺) 2.010s (+98.6% 🔺) 0.009s (-2.7%) 2.422s (+117.1% 🔺) 0.403s 25 1.28x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 6.435s (-62.0% 🟢) 7.907s (-56.6% 🟢) 0.491s (+132.4% 🔺) 8.880s (-53.1% 🟢) 2.445s 7 1.00x
▲ Vercel Express ⚠️ missing - - - - -
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Next.js (Turbopack)

10 parallel streams (1MB each)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 0.649s (-32.5% 🟢) 1.031s (-19.3% 🟢) 0.000s (+256.9% 🔺) 1.051s (-19.6% 🟢) 0.402s 58 1.00x
🐘 Postgres Nitro 0.659s (-31.9% 🟢) 1.052s (-15.7% 🟢) 0.000s (-58.6% 🟢) 1.068s (-15.1% 🟢) 0.408s 58 1.02x
🐘 Postgres Next.js (Turbopack) 0.764s 1.054s 0.000s 1.060s 0.296s 57 1.18x
💻 Local Express 1.174s (-4.1%) 2.012s (~) 0.000s (-20.0% 🟢) 2.014s (~) 0.840s 30 1.81x
💻 Local Next.js (Turbopack) 1.389s 2.013s 0.000s 2.016s 0.627s 30 2.14x
💻 Local Nitro 1.529s (+25.1% 🔺) 2.015s (~) 0.001s (+435.7% 🔺) 2.196s (+8.6% 🔺) 0.667s 28 2.36x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 4.728s (-53.6% 🟢) 6.495s (-43.6% 🟢) 0.000s (NaN%) 7.195s (-40.3% 🟢) 2.467s 9 1.00x
▲ Vercel Express ⚠️ missing - - - - -
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Next.js (Turbopack)

fan-out fan-in 10 streams (1MB each)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.275s (-28.1% 🟢) 2.032s (-6.7% 🟢) 0.000s (+Infinity% 🔺) 2.075s (-5.6% 🟢) 0.800s 29 1.00x
🐘 Postgres Nitro 1.291s (-28.0% 🟢) 2.026s (-5.4% 🟢) 0.000s (-3.4%) 2.088s (-4.0%) 0.798s 29 1.01x
🐘 Postgres Next.js (Turbopack) 1.460s 2.072s 0.000s 2.108s 0.649s 29 1.14x
💻 Local Express 2.422s (-30.2% 🟢) 2.975s (-26.2% 🟢) 0.000s (-58.3% 🟢) 2.978s (-26.2% 🟢) 0.556s 21 1.90x
💻 Local Nitro 2.571s (-24.1% 🟢) 3.125s (-22.5% 🟢) 0.000s (-53.1% 🟢) 3.127s (-22.5% 🟢) 0.557s 20 2.02x
💻 Local Next.js (Turbopack) 2.629s 3.078s 0.001s 3.084s 0.455s 20 2.06x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 7.214s (+28.4% 🔺) 8.635s (+23.7% 🔺) 0.000s (+14.3% 🔺) 9.115s (+20.9% 🔺) 1.901s 7 1.00x
▲ Vercel Express ⚠️ missing - - - - -
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Next.js (Turbopack)

Summary

Fastest Framework by World

Winner determined by most benchmark wins

World 🥇 Fastest Framework Wins
💻 Local Express 21/21
🐘 Postgres Express 15/21
▲ Vercel Next.js (Turbopack) 21/21

Fastest World by Framework

Winner determined by most benchmark wins

Framework 🥇 Fastest World Wins
Express 🐘 Postgres 12/21
Next.js (Turbopack) 🐘 Postgres 15/21
Nitro 🐘 Postgres 18/21

Column Definitions
  • Workflow Time: Runtime reported by workflow (completedAt - createdAt) - primary metric
  • TTFB: Time to First Byte - time from workflow start until first stream byte received (stream benchmarks only)
  • Slurp: Time from first byte to complete stream consumption (stream benchmarks only)
  • Wall Time: Total testbench time (trigger workflow + poll for result)
  • Overhead: Testbench overhead (Wall Time - Workflow Time)
  • Samples: Number of benchmark iterations run
  • vs Fastest: How much slower compared to the fastest configuration for this benchmark

Worlds:

  • 💻 Local: In-memory filesystem world (local development)
  • 🐘 Postgres: PostgreSQL database world (local development)
  • ▲ Vercel: Vercel production/preview deployment
  • 🌐 Turso: Community world (local development)
  • 🌐 MongoDB: Community world (local development)
  • 🌐 Redis: Community world (local development)
  • 🌐 Jazz: Community world (local development)
  • 🌐 Redis + BullMQ: Community world (local development)
  • 🌐 Cloudflare: Community world (local development)
  • 🌐 MySQL: Community world (local development)
  • 🌐 Azure: Community world (local development)
  • 🌐 NATS JetStream: Community world (local development)
  • 🌐 Upstash: Community world (local development)

📋 View full workflow run


Some benchmark jobs failed:

  • Local: success
  • Postgres: success
  • Vercel: failure

Check the workflow run for details.

⚠️ Community world benchmarks failed (non-blocking):

  • Community Worlds: failure

Check the workflow run for details.

Contributor

@vercel vercel Bot left a comment


Additional Suggestion:

OCC fence parameters (lastKnownEventId, asOfTimestamp) are silently dropped for wait_completed and hook_received events because the lazy branch of createWorkflowRunEventInner doesn't forward them.


The lazy-refs branch of createWorkflowRunEventInner forgot to thread
`lastKnownEventId` and `asOfTimestamp` into the request body, so the
fence was silently dropped for any event whose type went through
the lazy path (i.e., not in `eventsNeedingResolve`). The resolve
branch already had the forwarding. Caught by Vercel Agent Review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@VaguelySerious
Member Author

Vercel Agent review acknowledged + addressed in 1e69c82 — the lazy branch of createWorkflowRunEventInner now forwards lastKnownEventId and asOfTimestamp alongside the resolve branch. Good catch — without this the fence was silently dropped for any event whose type didn't appear in eventsNeedingResolve (including wait_completed and hook_received).
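One way to make this class of omission harder to reintroduce is to attach the fence fields once, after the branch decision, rather than inside each branch. The sketch below is only an illustration of that idea: `buildEventRequestBody`, the `mode` field, and the set's contents are hypothetical and do not reflect the actual body of `createWorkflowRunEventInner`.

```typescript
type FenceParams = { lastKnownEventId?: string; asOfTimestamp?: number };
type EventRequestBody = FenceParams & { eventType: string; mode: "lazy" | "resolve" };

// Stand-in for the real set; the member below is a placeholder, not a real event type.
const eventsNeedingResolve = new Set<string>(["some_resolved_event"]);

function buildEventRequestBody(eventType: string, fence: FenceParams): EventRequestBody {
  const mode = eventsNeedingResolve.has(eventType) ? "resolve" : "lazy";
  // Fence fields are spread in after the branch decision, so neither path can omit them.
  return { eventType, mode, ...fence };
}
```

A unit test that asserts the fence fields survive on a lazy-path event type such as `wait_completed` would catch a regression of the bug fixed in 1e69c82.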

@VaguelySerious
Member Author

Status after 1e69c82b:

  • All Local E2E (Dev / Prod / Postgres / Windows) green.
  • Vercel Prod E2E: 11/12 apps green (astro, example, express, fastify build, hono, nextjs-turbopack, nextjs-webpack, nitro, nuxt, sveltekit, tanstack-start, vite), including hono and vite, which were red on the previous push.
  • Vercel Prod fastify — single failure: abortAnyInStepWorkflow: AbortSignal.any inside a step composes deserialized signals (130/131 passed). The assertion is expect(returnValue.c2Aborted).toBe(true) — the workflow-side c2 controller didn't observe the abort by the time the workflow returned.

I don't think this is from anything in this PR:

  • The failing workflow has no sleep()/wait calls, so the elapsed-wait-scan fence path I touched never executes for it.
  • The abort signal here propagates from step → workflow via the controller's backing stream, not via hook_received. resumeHook is not in the path either.
  • Both stepResult.saw === true and stepResult.via === 'listener' passed on the same run, so the step-side composition worked correctly. Only the workflow-side controller hadn't caught up by the time the workflow returned.
  • The same test passes on Local fastify (and on every other Vercel Prod app on this PR).

Reads like a long-standing abort-stream-propagation timing flake that just happened to fire on fastify Vercel Prod this run. Will rerun the job once the workflow run is no longer in-progress; flagging here in case a reviewer hits it before I get back to it.

Comment thread .changeset/event-write-occ-fence.md Outdated
Comment thread packages/core/src/runtime/resume-hook.ts Outdated
Comment thread packages/world/src/events.ts Outdated
Comment thread packages/world/src/events.ts Outdated
Comment thread packages/core/src/runtime.ts Outdated
Comment thread packages/core/src/runtime.ts Outdated
Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
@VaguelySerious
Member Author

Good catch — the answer is "yes, conceptually they can race the same way." Here's the breakdown of where we currently fence vs. don't, and what's at risk:

Where the SDK currently sends a fence

Only one site: the elapsed-wait scan in runtime.ts, when it writes wait_completed. The fence is the loaded events' tail eventId; on a fence conflict the tick retries in-place with a fresh fence (the loop I added).

Where it doesn't, but could race in the same shape

Any write that the workflow runtime makes based on a branch decision driven by the loaded events array can race in the same way the production hook/sleep bug did. Concretely:

suspension-handler.ts — these are exactly the writes the workflow VM emits when its replay decides to allocate a new entity. If that decision was made on a stale snapshot, the write is "stale-branch":

| Write | Currently fenced? | Can race? |
| --- | --- | --- |
| hook_created | no | yes: workflow decided to allocate a hook based on its branch |
| hook_disposed | no | yes: workflow decided to dispose |
| step_created | no | yes: workflow decided to invoke a step |
| wait_created | no | yes: workflow decided to sleep() |

runtime.ts — terminal-state writes after a successful replay:

| Write | Currently fenced? | Can race? |
| --- | --- | --- |
| run_completed (line 974) | no | yes: workflow returned based on its branch |
| run_failed (catch path) | no | yes: workflow threw based on its branch |

What doesn't need a fence (and why)

  • step_started, step_completed, step_failed, step_retrying (step-executor.ts, step-handler.ts): these record facts about a step that's already in the log via step_created. They're not making a new branch decision; if the step was allocated, finishing it is just bookkeeping.
  • run_created (start.ts): no prior events for the run; nothing to be stale against.
  • run_started (initial tick): same — first thing the runtime writes.
  • run_failed from MAX_DELIVERIES_EXCEEDED / replay-budget exhaustion (replay-budget.ts): terminal escape hatch; needs to land regardless.
  • hook_received from resumeHook: deliberately unfenced — fencing would lose the user's signal.
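
The classification in this comment could be encoded roughly as follows. This is a sketch only: `needsFence`, `WriteContext`, and the `escapeHatch` flag are hypothetical names introduced here to model the MAX_DELIVERIES_EXCEEDED / replay-budget case, not identifiers from the codebase.

```typescript
// Branch-decision writes: wait_completed plus the six writes in the tables above.
const FENCED_EVENT_TYPES: ReadonlySet<string> = new Set([
  "wait_completed",
  "hook_created",
  "hook_disposed",
  "step_created",
  "wait_created",
  "run_completed",
  "run_failed",
]);

interface WriteContext {
  eventType: string;
  // Hypothetical flag: true for run_failed written by the terminal
  // escape hatch, which has to land regardless of the snapshot.
  escapeHatch?: boolean;
}

// Returns true when the write records a branch decision made from a
// loaded snapshot and should therefore carry a fence.
function needsFence(write: WriteContext): boolean {
  if (write.eventType === "run_failed" && write.escapeHatch === true) {
    return false;
  }
  // Bookkeeping writes (step_started, step_completed, ...), run_created,
  // run_started, and hook_received are not in the set, so they stay unfenced.
  return FENCED_EVENT_TYPES.has(write.eventType);
}
```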

Why the wait_completed fence alone catches the production bug today

The hook/sleep race needs: (1) tick takes sleep branch with stale events, (2) tick writes wait_completed, and (3) tick writes any sleep-branch follow-up events (more sleeps, the next iteration's allocations). My current fence trips at step (2): the wait_completed CAS fails, the retry loop reloads events, sees hook_received, and the next replay picks the hook branch — so step (3)'s sleep-branch writes never happen. That's why the repro went from corrupted to clean.

But that's specific to the hook-vs-sleep shape because wait_completed is on the critical path of the sleep branch. A race that doesn't go through wait_completed (e.g., a branch decision around Promise.race([hook, someStep]) where the workflow allocates a different step depending on which side wins) wouldn't be caught — the stale step_created/hook_created would land without a fence check.

Recommendation

For full coverage of the general "stale-snapshot branch decision" race, fence all 6 sites in the table above. Implementation cost is modest: the suspension-handler already has the events array available (the runtime invokes it with the loaded events); we'd thread the tail eventId in the same way I did for the elapsed-wait scan, and add the same retry-on-fence-conflict pattern to each catch block. run_completed/run_failed need the same plus distinguishing fence-conflict from "run already finished" (mirror of the workflow-server-side fix).

Want me to do it in this PR or keep it as a follow-up? My slight lean is follow-up: the current PR cleanly resolves the production bug and it's already a non-trivial behavioral change; the broader fencing is mostly defensive against shapes we haven't seen in the wild. But the risk of leaving them unfenced is real if we're betting that no other workflow patterns trigger the same race.

@VaguelySerious
Member Author

You're right, I was sloppy. Let me walk through it.

Your example: Promise.race([someStep, someHook])

Timeline that breaks:

  1. T+0: workflow starts, runtime writes step_created for someStep. Suspends.
  2. T+1: step finishes elsewhere. Step executor writes step_completed (eventId timestamp ≈ T+1).
  3. T+3: resumeHook writes hook_received (eventId timestamp ≈ T+3).
  4. T+3.5: tick A is re-invoked, loads events.

If at T+3.5 the eventually-consistent read returns [..., step_created, hook_received] but misses step_completed, then tick A's Promise.race resolves with the hook (because the step's resolution event isn't in its events array). Tick A writes hook-branch follow-up events (say step_created for the post-hook step).

But canonically: step_completed (T+1) < hook_received (T+3). A future replay walks events in eventId order:

  • consumes step_completed → step subscriber resolves first via promiseQueue
  • then hook_received → hook subscriber resolves second

The step wins the race, not the hook. Tick A's hook-branch writes are orphaned. A future replay tries to take the step branch, hits unconsumed events, and fires CORRUPTED_EVENT_LOG.

So yes, the race exists and step_completed is on the critical path.
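
A toy model of that timeline, assuming the race winner is simply whichever resolution event comes first in eventId order. `raceWinner` and the short `E0`/`E1`/`E3` ids are illustrative stand-ins, not the runtime's replay code:

```typescript
interface LogEvent {
  eventId: string;
  eventType: string;
}

// Walks events in eventId order and reports which side of
// Promise.race([someStep, someHook]) resolves first.
function raceWinner(events: LogEvent[]): "step" | "hook" | "pending" {
  const ordered = [...events].sort((a, b) =>
    a.eventId < b.eventId ? -1 : a.eventId > b.eventId ? 1 : 0,
  );
  for (const event of ordered) {
    if (event.eventType === "step_completed") return "step";
    if (event.eventType === "hook_received") return "hook";
  }
  return "pending";
}

// Canonical log: step_completed (T+1) precedes hook_received (T+3).
const canonicalLog: LogEvent[] = [
  { eventId: "E0", eventType: "step_created" },
  { eventId: "E1", eventType: "step_completed" },
  { eventId: "E3", eventType: "hook_received" },
];

// Tick A's eventually-consistent read misses step_completed.
const staleView = canonicalLog.filter((e) => e.eventType !== "step_completed");

const tickADecision = raceWinner(staleView);
const futureReplayDecision = raceWinner(canonicalLog);
```

Here `tickADecision` is `"hook"` while `futureReplayDecision` is `"step"`, which is the divergence that leaves the hook-branch writes orphaned.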

And my fence doesn't catch this one

This is the embarrassing part. The fence compares run.lastKnownEventId (latest written) against the tick's claim (its loaded tail). At T+3.5:

  • server run.lastKnownEventId = E_hook (because hook_received was the most-recently-written event, and my server .set() is unconditional, not max())
  • tick A's fence = E_hook (its loaded tail)
  • CAS: E_hook == E_hook → passes

The tick wrote based on a stale view of "everything ≤ E_hook" but the server's fence only tracks "latest write," which the tick did see. The fence catches "missed events that came after my tail" (the hook/sleep production shape) but not "missed events that came before my tail."

Why the production fix still held

The hook/sleep production race has the opposite shape: the tick's tail is from T+0 (a wait_created written at workflow start), and the missed event (hook_received) lands at T+5. That missed event has a higher eventId than the tick's tail, so the server's lastKnownEventId advances past it and CAS fails. That's the case the current fence covers.

The shape you're describing — missed event has a lower eventId than what the tick saw — slips past a single-attribute fence.
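
Both shapes can be reproduced against a toy single-attribute fence. This is a sketch under the assumptions stated in this comment (the server stores only the most recently written eventId and sets it unconditionally); `FencedLog` and `tailFence` are hypothetical names, and the ids are shortened for readability:

```typescript
// Toy server: remembers only the most recently written eventId.
class FencedLog {
  private lastKnownEventId: string | undefined;
  readonly events: string[] = [];

  // Unfenced append (e.g. hook_received, step_completed).
  // The set is unconditional, not max().
  append(eventId: string): void {
    this.events.push(eventId);
    this.lastKnownEventId = eventId;
  }

  // Fenced append: rejected unless the caller's fence matches.
  appendFenced(eventId: string, fence: string | undefined): boolean {
    if (fence !== this.lastKnownEventId) return false;
    this.append(eventId);
    return true;
  }
}

// The tick's fence is the tail of the events it loaded.
function tailFence(view: string[]): string | undefined {
  return view.length > 0 ? view[view.length - 1] : undefined;
}

// Hook/sleep shape: the missed event lands AFTER the tick's tail.
const hookSleep = new FencedLog();
hookSleep.append("E0"); // wait_created
const hookSleepView = [...hookSleep.events]; // tick loads [E0]
hookSleep.append("E5"); // hook_received arrives after the load
const hookSleepRejected = !hookSleep.appendFenced("E6", tailFence(hookSleepView));

// Step/hook shape: the missed event sits BEFORE the tick's tail.
const stepHook = new FencedLog();
stepHook.append("E0"); // step_created
stepHook.append("E1"); // step_completed (missed by the read)
stepHook.append("E3"); // hook_received
const stepHookView = ["E0", "E3"]; // stale view without E1
const stepHookAccepted = stepHook.appendFenced("E4", tailFence(stepHookView));
```

`hookSleepRejected` is true (the fence trips), while `stepHookAccepted` is also true (the stale write lands), matching the asymmetry described above.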

What would actually catch it

Three options I can think of, none free:

  1. Strongly-consistent reads on every event load instead of the split-read with the 1.5 s freshness window. Catches it definitively. Reads get more expensive; existing perf decisions go out the window.
  2. Server-side gap detection: server checks that the events table has no eventId in the range (tick.fence_value, tick.now()) other than the events the tick has acknowledged. Requires either sending the full set of seen eventIds or a per-run counter the tick increments.
  3. Reject any event whose eventId is < run.lastKnownEventId: makes eventId monotonic per run (no "stragglers"). Eliminates the failure mode but rejects legitimate concurrent writers whose clocks lag — turns into spurious 409s under any clock skew between SDK processes.

(2) is the cleanest but requires server-side work that isn't in this PR's scope. (3) is cheap but trades correctness for skew-tolerance. (1) is correct but a perf regression.

So how should we think about this PR

Honestly: the current fence is a partial fix. It closes the hook/sleep production shape but the general "stale-snapshot branch decision" hazard still exists for races where the missed event is canonically earlier than what the tick saw. The Promise.race([someStep, someHook]) shape you've described is a real instance of that.

I think we should:

  • Land this PR for the production bug as-is, but be honest in the description that its coverage is asymmetric, as laid out above.
  • File a follow-up to design proper gap-detection (probably option 2) — that requires server-side schema/logic that we shouldn't tack onto this PR.

Want me to update the PR description with this caveat, or take a swing at extending the fence in this PR? If the latter, I'd want to whiteboard the design with you first — none of the three options above are obviously right.

Comment thread packages/core/src/runtime.ts Outdated
Comment thread packages/core/src/runtime.ts Outdated
Comment thread packages/core/src/runtime.ts Outdated
Comment thread packages/core/src/runtime.ts Outdated
Comment thread packages/world/src/events.ts Outdated
Member

@TooTallNate TooTallNate left a comment

Reviewed against a fresh failing run (wrun_01KSNVVXH82P5GM40D6F9152JF, deployed against the matching workflow-server preview). The fence catches the original hook/sleep production shape, but the run still produced CORRUPTED_EVENT_LOG because the corruption path doesn't run through wait_completed.

TL;DR

The SDK side of this fix only fences one write site — the elapsed-wait scan's wait_completed. Every other branch-decision write from suspension-handler.ts (step_created, hook_created, wait_created, hook_disposed) plus the terminal writes (run_completed, run_failed) still go through unfenced. That leaves the very gap you already articulated in the 2026-05-27T11:02:45Z comment: "A race that doesn't go through wait_completed (e.g., a branch decision around Promise.race([hook, someStep]) where the workflow allocates a different step depending on which side wins) wouldn't be caught — the stale step_created/hook_created would land without a fence check."

Our failing run is exactly that case.

Evidence from wrun_01KSNVVXH82P5GM40D6F9152JF

Reproduced with the WF_TRACE instrumentation from #2127. Critical sequence (eventIds and write timestamps from DynamoDB):

| eventId | type | corr | write time |
| --- | --- | --- | --- |
| …3CFWJ | step_completed | step_…PMW (sync iter 3) | 23:21:17.967 |
| …3ZNRF | wait_completed | wait_…PMP | 23:21:18.581 |
| …41F5T | step_created (drain) | step_…PMX | 23:21:18.639 |

What happened:

  1. Two replays (inv 269, 327) loaded events at ec=205 — last event was step_completed PMW. No wait_completed yet. Same digest 94e1262add726a3e in both.
  2. Both replays' iter-3 race resolved with wake (hook payloads were buffered from the resume hammer), so the workflow called drain and the suspension handler wrote step_created PMX (drain) unfenced.
  3. Concurrently, some other invocation's elapsed-wait scan wrote wait_completed PMP at 23:21:18.581 — 58 ms before the drain write at 23:21:18.639.
  4. 52 subsequent replays loaded the same ec=208 / digest=e8d1f86e7a43de3d event log and every one of them failed with step_mismatch on PMX (their iter-3 race resolved with sleep because wait_completed is now in the log, so iter 4's useStep('sync') got allocated the correlationId that the log records as drain).

The critical point: the missed event (wait_completed PMP) has a higher eventId than the tick's tail (step_completed PMW), so this is precisely the symmetric-case race the current fence design covers — if the write being fenced is the step_created PMX (drain) write. But it isn't, because this PR doesn't fence step writes.

With the current PR applied to this run:

  • The hypothetical concurrent wait_completed write would be fenced (and either land first or fail CAS and retry).
  • But step_created PMX (drain) from inv 269/327 is still unfenced, so it still lands, still corrupts the log, and future replays still fail.

Requesting changes

Per your own follow-up table in the 2026-05-27T11:02:45Z comment — the writes that need fences are:

  • suspension-handler.ts: step_created, hook_created, wait_created, hook_disposed
  • runtime.ts terminal-state writes: run_completed, run_failed

Fencing each to the tick's loaded events tail (with a parallel retry-on-conflict loop modeled after the one this PR adds for wait_completed) should close the failure mode we're seeing.

The deeper edge case you describe at the end of that comment (eventually-consistent reads + missed event with a lower eventId than the tick's tail) is a separate concern that the current fence design genuinely cannot catch — that's fine to defer. Our run is unambiguously the symmetric case the fence already handles, just at a write site that isn't fenced yet.

What I'd like to see before merge

Either:

  1. Extend the fence to the writes in your table (preferred — completes the design), or
  2. Land this PR as-is but with a KNOWN_ISSUES note documenting which races it does and doesn't catch, plus a tracking issue for the follow-up.

The production hook/sleep shape this PR validated against does demonstrably stop reproducing — that's solid. But shipping it as the fix for CORRUPTED_EVENT_LOG without the rest of the table is going to be misleading once a Promise.race([hook, step])-shaped run hits prod.

Happy to provide the full WF_TRACE export and DynamoDB dump for wrun_01KSNVVXH82P5GM40D6F9152JF if helpful for designing the extension.

Comment thread packages/core/src/runtime.ts
Comment thread packages/core/src/runtime.ts Outdated
Comment thread packages/core/src/runtime/resume-hook.ts
* [DEBUG] Trace replay event log and step/hook/sleep assignments

Temporary diagnostic instrumentation for investigating intermittent
CorruptedEventLogError 'step consumer mismatch' failures.

Emits console.log lines tagged 'WF_TRACE' at four points:
- runWorkflow start: dumps the full event array the replay will consume
  (eventIds, types, correlationIds, stepNames) plus a sha256 digest
- step/hook/sleep subscribe: per-replay correlationId -> name assignment
- step consumer mismatch: structured record of the failure including the
  event index in the SDK's view of the log
- runWorkflow end: completed | failed | suspended

Used to diff successive replays of the same runId and confirm whether
the SDK actually sees the same event array each time.

* [DEBUG] Extend OCC fence to all branch-decision writes

Peter's PR #2113 fences `wait_completed` writes from the elapsed-wait
scan. This commit extends the fence to every other write whose outcome
depends on a branch decision the workflow VM made from its loaded event
log — per the table @VaguelySerious himself laid out in his PR comment:

  suspension-handler.ts:
    - step_created      (the smoking gun on wrun_01KSPS7XEGHF4A6WYF4DB03D40)
    - hook_created
    - hook_disposed
    - wait_created

  runtime.ts terminal writes:
    - run_completed
    - run_failed

`hook_received` is deliberately NOT fenced (Peter's reasoning preserved
verbatim: fencing the user's signal would drop it on contention; stale-
snapshot protection belongs on the writes that consume hooks, not the
ones that deliver them).

The fence value is the load-time tail of the events array passed into
`runWorkflow`. `suspension-handler` receives the fence + cursor from
the runtime and reloads on conflict; the runtime's terminal writes read
the cursor directly.

The new `__fenced-write.ts` helper encapsulates the retry loop so we
don't have to copy/paste Peter's pattern six times. It's named with the
leading-underscore convention to flag it as throwaway diagnostic code,
matching `__debug-replay-trace.ts`.

* [DEBUG] Point at Peter's workflow-server PR 447 preview + map HTTP 412

Two changes both needed for the extended-fence test loop to actually
exercise the OCC code path on the server:

1. Hardcode WORKFLOW_SERVER_URL_OVERRIDE to
   https://workflow-server-83nn57dvc.vercel.sh (preview deployment of
   workflow-server PR 447, branch alias
   workflow-server-git-peter-event-write-cas.vercel.sh). The previous
   preview at workflow-server-7pxaxn4d4.vercel.sh was Pranay's monotonic-
   append PR 456: a different fix that doesn't implement the CAS the SDK
   side now sends.

2. Map HTTP 412 → EntityConflictError in the world-vercel error mapper.
   workflow-server PR 447 returns 412 with a 'fence conflict' message
   for EventLogFenceConflictError; the SDK's existing fence-retry loops
   (Peter's wait_completed scan + the new ones in suspension-handler
   and runtime terminal writes) match on /fence conflict/i against the
   message of an EntityConflictError. Without this mapping the 412 falls
   through to WorkflowWorldError and the regex match never fires.

* fix(core): chain fences for replay-created events

* test: clear workflow server debug override

* fix(core): scope step dispatch to owners

* fix(core): recover wait-raced step dispatch

* Remove diagnostic instrumentation, rename fenced-write helper

Strip the WF_TRACE replay tracing that was used to diagnose the
CORRUPTED_EVENT_LOG race; it has served its purpose now that the fix
is in. Specifically:

- Delete packages/core/src/__debug-replay-trace.ts and its 8 call sites
  in workflow.ts, step.ts, workflow/hook.ts, workflow/sleep.ts.
- Drop the matching [DEBUG] inline narrative comments at each call site.
- Rename packages/core/src/runtime/__fenced-write.ts → fenced-write.ts
  (the leading-underscore convention marked it as throwaway diagnostic
  code; the helper is intended to stay).
- Trim the file header on fenced-write.ts and the related narrative
  comment in suspension-handler.ts to drop the failing-runId / PR-number
  references that only made sense in the debug context.

No behavioral change. typecheck clean (0 errors); 1014/1014 unit tests
pass (same as parent commit 77f057a).

* world-vercel: always preserve fence-conflict marker on HTTP 412

Address Copilot review on PR #2132.

The fence-retry loop in runtime/fenced-write.ts detects OCC conflicts
via /fence conflict/i.test(err.message). The 412 branch was relying on
the server's JSON body to populate that message via errorData.message,
but parseResponseBody().catch(() => ({})) swallows JSON parse failures
silently — so any non-JSON 412 response (CDN HTML, gateway timeout
page, intermediate proxy error) would surface as
EntityConflictError("<METHOD> /endpoint -> HTTP 412: Precondition
Failed"), the regex would miss it, and the retry loop would
mis-classify the conflict as terminal.

Prefix the message with `fence conflict:` whenever the parsed body
didn't already carry the marker, so the retry detection is robust to
response-body parse failures.

Tests: world-vercel 69/69 pass.
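
A rough sketch of the mapping described in this commit message, using local stand-ins for the error classes. `mapHttpError` and its parameters are hypothetical and the real mapper in world-vercel may be structured differently; only the `fence conflict:` prefix and the `/fence conflict/i` check come from the text above:

```typescript
// Local stand-ins for the SDK's error classes.
class EntityConflictError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "EntityConflictError";
  }
}

class WorkflowWorldError extends Error {
  status: number;
  constructor(message: string, status: number) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "WorkflowWorldError";
    this.status = status;
  }
}

const FENCE_CONFLICT_RE = /fence conflict/i;

// bodyMessage is undefined when the response body could not be parsed
// as JSON (e.g. an HTML page from a proxy).
function mapHttpError(
  status: number,
  requestLine: string,
  bodyMessage?: string,
): Error {
  const base = bodyMessage ?? `${requestLine} -> HTTP ${status}`;
  if (status === 412) {
    // Anchor on the status code and make sure the marker is present.
    const message = FENCE_CONFLICT_RE.test(base) ? base : `fence conflict: ${base}`;
    return new EntityConflictError(message);
  }
  if (status === 409) return new EntityConflictError(base);
  return new WorkflowWorldError(base, status);
}

// Detection used by the fenced-write path.
function isFenceConflict(err: unknown): boolean {
  return err instanceof EntityConflictError && FENCE_CONFLICT_RE.test(err.message);
}
```

With this shape, a 412 carrying a non-JSON body is still recognized as a fence conflict, while a plain 409 is not.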

---------

Co-authored-by: Peter Wielander <peter.wielander@vercel.com>
Member

@TooTallNate TooTallNate left a comment

Re-reviewed after b5c567c (extended OCC fence to all branch-decision writes, #2132) and a7efa5a (always preserve fence-conflict marker on HTTP 412).

All three of my earlier inline comments are addressed:

  1. Fence coverage (runtime.ts:800) — now applied to step_created, hook_created, hook_disposed, wait_created, run_completed, run_failed via the shared fencedEventCreate helper, in addition to wait_completed. All branch-decision writes are fenced.

  2. Brittle string match (runtime.ts:842) — packages/world-vercel/src/utils.ts now anchors detection on HTTP 412 and prefixes the message with fence conflict: client-side, so the /fence conflict/i regex in fenced-write.ts cannot regress against server wording changes.

  3. hook_received unfenced (resume-hook.ts:159) — confirmed deliberate, preserved.

Stress reproduction with this branch + the paired backend preview: 0/40 cycles surface CORRUPTED_EVENT_LOG (baseline on stable: ~2/40).

LGTM — clearing my CHANGES_REQUESTED block.

- Log each fence-conflict retry at info level (datadog visibility for
  the retry path between the first conflict and the give-up warning).
- Keep prior fence when a create response is missing `event`; emit a
  warn instead of silently advancing to a value we didn't observe on
  the wire. The schema marks `event` as optional for legacy compat;
  in practice creates always return it, but the type permits drift.
- Replace `events.some(…)` inside the reload-merge loop with a
  Set-based dedup so the retry path is O(n + m) instead of O(n × m).
- Drop `asOfTimestamp` from CreateEventParams. The original motivation
  was `resumeHook`-style writes; the runtime keeps `hook_received`
  unfenced (fencing the user signal would drop it on contention) so
  nothing in this PR exercises the param. Reintroduce when a real
  caller appears.

Addresses inline review comments on #2113.
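
For reference, the Set-based dedup mentioned above can look roughly like this. It is a sketch with a hypothetical `mergeReloaded` helper, not the code in runtime.ts:

```typescript
interface LogEvent {
  eventId: string;
}

// Appends reloaded events that are not already present, in O(n + m):
// one pass to index the existing ids, one pass over the reloaded page.
function mergeReloaded<T extends LogEvent>(events: T[], reloaded: T[]): T[] {
  const seen = new Set<string>(events.map((e) => e.eventId));
  for (const event of reloaded) {
    if (!seen.has(event.eventId)) {
      seen.add(event.eventId);
      events.push(event);
    }
  }
  return events;
}
```

The previous `events.some(…)` form rescanned the whole local array for every reloaded event, which is where the O(n × m) cost came from.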
The four `WorkflowWorldError` subclasses surfaced by the runtime's own
calls into the world layer represent infrastructure-level conditions,
not user-code failures:

- `EntityConflictError`: CAS rejection on event writes (409 / 412),
  including OCC fence conflicts that exhausted the in-place retry
  budget in `fenced-write.ts`.
- `RunExpiredError`: 410 — run was cleaned up or already terminal.
- `TooEarlyError`: 425 — retry-after timestamp not yet reached.
- `ThrottleError`: 429 — rate limited by the workflow backend.

When any of these reaches `classifyRunError`, the runtime's own retry
logic has already been exhausted (otherwise the error would have been
swallowed upstream). The truthful classification is `RUNTIME_ERROR`,
not `USER_ERROR`. Same shape as the existing entries
(`WorkflowRuntimeError`, `WorkflowNotRegisteredError`,
`StepNotRegisteredError`).

The bare `WorkflowWorldError` parent stays out of the runtime list:
it can also surface from user-code `fetch` calls into the workflow
API, where `USER_ERROR` is the correct attribution (see the existing
"WorkflowWorldError with status 500" test).

Caught during the stress validation of #2113 + workflow-server#447
end-to-end: fence-conflict retries exhausting under the 180-way hook
race were surfacing as `USER_ERROR`, masking an obvious infra
condition as user code. Tests added for each of the four subclasses.
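
A simplified model of that classification, with local stand-in classes. The real `classifyRunError` handles more cases than shown here, and the two-value return type below is an assumption made for the sketch:

```typescript
// Stand-ins mirroring the hierarchy described above.
class WorkflowWorldError extends Error {
  constructor(message?: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
  }
}
class EntityConflictError extends WorkflowWorldError {}
class RunExpiredError extends WorkflowWorldError {}
class TooEarlyError extends WorkflowWorldError {}
class ThrottleError extends WorkflowWorldError {}
class WorkflowRuntimeError extends Error {
  constructor(message?: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

type RunErrorClass = "RUNTIME_ERROR" | "USER_ERROR";

// The bare WorkflowWorldError parent is deliberately absent: it can
// also come from user-code calls into the workflow API.
const RUNTIME_ERROR_TYPES: ReadonlyArray<new (message?: string) => Error> = [
  EntityConflictError,
  RunExpiredError,
  TooEarlyError,
  ThrottleError,
  WorkflowRuntimeError,
];

function classifyRunError(err: unknown): RunErrorClass {
  return RUNTIME_ERROR_TYPES.some((type) => err instanceof type)
    ? "RUNTIME_ERROR"
    : "USER_ERROR";
}
```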
Contributor

@karthikscale3 karthikscale3 left a comment

Reviewed the OCC fencing approach end-to-end — the design is sound and unusually well-documented, and the repro evidence (0/40 vs ~2/40 baseline) is convincing. Three things I'd want addressed before merge, called out inline:

  1. Duplication — the wait_completed loop in runtime.ts reimplements fencedEventCreate's entire fence/retry/reload/backoff pattern (incl. a second MAX_FENCE_RETRIES and a duplicate /fence conflict/i regex). These will drift.
  2. Owner-scoped dispatch is the riskiest behavioral change and the least directly tested.
  3. fenced-write.ts has no direct unit tests despite being the most subtle new code.

Minor (non-blocking): deterministic 25 * attempt backoff has no jitter (can resync conflicting writers under the exact storm this targets); attempts > MAX_FENCE_RETRIES yields 6 attempts, not the stated 5; please confirm the world-local completedMessages cache can't suppress a legitimate step re-dispatch that reuses a completed idempotency key; and there's stray whitespace in the new resume-hook.ts comment plus two unrelated blank-line additions in step.ts / workflow/hook.ts.
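
To make the two minor points concrete, here is a small sketch. `countAttempts` models a loop that conflicts on every attempt, and `jitteredBackoffMs` is one possible "full jitter" variant of the 25 * attempt delay; both are hypothetical helpers, not code from this PR:

```typescript
const MAX_FENCE_RETRIES = 5;

// Counts write attempts made by a loop that always conflicts,
// for a given give-up check evaluated after each attempt.
function countAttempts(shouldGiveUp: (attempts: number) => boolean): number {
  let attempts = 0;
  for (;;) {
    attempts++; // one write attempt
    if (shouldGiveUp(attempts)) return attempts;
  }
}

// `attempts > MAX_FENCE_RETRIES` allows one attempt more than the stated budget.
const attemptsWithGreaterThan = countAttempts((n) => n > MAX_FENCE_RETRIES);
const attemptsWithGreaterOrEqual = countAttempts((n) => n >= MAX_FENCE_RETRIES);

// Full jitter: a uniform delay in [0, baseMs * attempt) instead of a
// deterministic baseMs * attempt, so conflicting writers spread out.
function jitteredBackoffMs(
  attempt: number,
  baseMs: number = 25,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * baseMs * attempt);
}
```

Here `attemptsWithGreaterThan` is 6 and `attemptsWithGreaterOrEqual` is 5. The injectable `random` parameter keeps the backoff deterministic in tests.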

Comment thread packages/core/src/runtime.ts Outdated
events.length > 0
? events[events.length - 1].eventId
: undefined;
const MAX_FENCE_RETRIES = 5;
Contributor

[Emphasis #1 — duplication] This inline wait_completed loop reimplements the exact fence → 412 → reload → idempotency-check → retry-with-backoff pattern that fenced-write.ts was created to centralize — including a second copy of MAX_FENCE_RETRIES = 5 and a duplicated /fence conflict/i regex.

This is the biggest maintainability concern in the PR: two copies of the same protocol will drift, and a future fix to one won't reach the other. The loop's extra requirements (chaining the fence across multiple waits, merging reloaded events into the local events array) look expressible by calling fencedEventCreate per wait with an onConflictRefresh closure that does the merge. If full unification isn't feasible now, please at least share the MAX_FENCE_RETRIES constant and the isFenceConflict helper from fenced-write.ts rather than re-declaring them here.

}
}

const queueablePendingSteps =
Contributor

[Emphasis #2 — owner-scoped dispatch is the riskiest change] This flips a core invariant: from "queue every pending step (minus inline)" to "queue only owned + recoverable steps, unless a wait is pending." The safety argument holds only if the winning owner always finishes its tick, or a redelivery (metadata.attempt > 1) triggers the recovery set.

The gap I'd want covered by a test: a step whose step_created was written by a concurrent handler that then crashed, observed here on a fresh attempt === 1 delivery with no pending wait — it's not in createdStepCorrelationIds, not in recoverablePendingStepCorrelationIds (empty when attempt === 1), and not the inline step, so neither this handler nor the dead owner dispatches it until a later redelivery bumps attempt. That's the intended crash window, but it's a stall risk that the hook/sleep repro doesn't exercise. Can we add a targeted stress/integration case for owner-crash-after-step_created-before-dispatch (both with and without a pending wait)?

* the abort-vs-rethrow decision (preserves the existing
* "EntityConflictError → log and skip" behavior for callers that want it).
*/
export async function fencedEventCreate(
Contributor

[Emphasis #3 — needs direct unit tests] This helper carries the most subtle logic in the PR (retry budget + off-by-one boundary, abort-vs-rethrow on non-fence EntityConflictError, the missing-result.event path that keeps the prior fence, and fence advancement on success) yet has no dedicated unit test — only suspension-handler.test.ts exercises the happy-path chaining, and the rest is covered solely by the e2e stress run.

A table-driven test here would lock in the contract cheaply: (a) success advances the fence and returns the event; (b) N fence conflicts then success; (c) exceeding MAX_FENCE_RETRIES rethrows; (d) onConflictRefresh returning abort yields written:false; (e) non-fence EntityConflictError honors onEntityConflict abort vs rethrow; (f) missing result.event keeps the prior fence and logs. This is the seam most likely to regress silently.

When a fenced event write rejects with EventLogFenceConflictError, the
SDK previously retried up to MAX_FENCE_RETRIES = 5 times against a
freshly-loaded tail, with linear backoff. Under stress this behaved
poorly in two ways:

1. The retry loop spins against an ever-changing tail. Under high
   contention (e.g. a hook flood triggering many concurrent ticks),
   exhausting the budget throws an EntityConflictError, which surfaces
   as run_failed — a transient infra condition mis-classified as a
   terminal failure.

2. The retries amplified the server-side stuck-fence pattern (run.lastKnownEventId
   advancing past a non-existent eventId due to the patch-then-PUT
   non-atomicity documented in c06d6ce of workflow-server). Every
   retry that hit the same stale fence wasted compute and prolonged
   the affected window.

Switch to bail-on-conflict: on fence conflict, fencedEventCreate
returns {written: false} immediately. No retry, no throw, no
re-enqueue. A fence conflict means another invocation has the
canonical view of the event log — the canonical invocation is
responsible for whatever progress the workflow needs, and the
losing invocation just exits cleanly.

This matches the existing workflow-server comment ('the @workflow/core
suspension handler swallows it') and the original design intent.
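
In outline, the bail-on-conflict behavior looks something like the sketch below. `fencedEventCreateSketch` is a deliberately reduced stand-in: the real fencedEventCreate takes more parameters (this thread mentions onConflictRefresh and onEntityConflict callbacks), and the `create` callback signature here is an assumption:

```typescript
// Local stand-in for the SDK error class.
class EntityConflictError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "EntityConflictError";
  }
}

interface CreatedEvent {
  eventId: string;
}

type FencedResult =
  | { written: true; event: CreatedEvent }
  | { written: false };

function isFenceConflict(err: unknown): boolean {
  return err instanceof EntityConflictError && /fence conflict/i.test(err.message);
}

// One attempt only: a fence conflict means another invocation holds the
// canonical view, so the loser reports written:false and exits cleanly.
async function fencedEventCreateSketch(
  create: (lastKnownEventId: string | undefined) => Promise<CreatedEvent>,
  fence: string | undefined,
): Promise<FencedResult> {
  try {
    const event = await create(fence);
    return { written: true, event };
  } catch (err) {
    if (isFenceConflict(err)) return { written: false };
    throw err; // anything else is not a lost race and still propagates
  }
}
```

A caller would check `result.written` and stop its tick when it is false, rather than reloading and retrying.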

Net change: ~250 lines removed from runtime.ts + fenced-write.ts +
suspension-handler.ts. The custom retry loop in the wait_completed
elapsed-wait scan is also folded into fencedEventCreate.

Behavioral effects:

- USER_ERROR / RUNTIME_ERROR run failures from exhausted fence retries
  are eliminated. Fence conflicts no longer mark runs as failed.

- Higher hook-payload throughput under stress: the 180-way race is no
  longer amplified by 5x retries per losing lambda.

- The server-side stuck-fence window (patch-then-PUT non-atomicity)
  is unchanged — that needs to be addressed in workflow-server, not
  here. But the SDK no longer makes it worse by spinning.

Tests: 1018 core tests pass. The previously-tested 'fence-conflict
retries and reloads' behavior is removed; the replacement behavior
('fence-conflict returns {written:false} once') is exercised
implicitly via the suspension-handler integration tests.

3 participants