[DEBUG] Trace replay event log and step/hook/sleep assignments #2127

Draft
TooTallNate wants to merge 4 commits into stable from debug/event-log-replay-trace

@TooTallNate
Member

Summary

Temporary diagnostic instrumentation to investigate intermittent CorruptedEventLogError "step consumer mismatch" failures. Not for merging. Branched off stable so the CI tarball can be deployed against the repro app.

What it does

Emits console.log lines tagged WF_TRACE at four points so we can diff successive replays of the same runId:

  • runWorkflow start (packages/core/src/workflow.ts) — dumps the full event array the replay will consume: eventId, eventType, correlationId, eventData.stepName/resumeAt, plus a sha256 digest for quick equality checks.
  • step/hook/sleep subscribe (packages/core/src/step.ts, workflow/hook.ts, workflow/sleep.ts) — per-replay assignment of correlationId → stepName/token/resumeAt, with a monotonic per-invocation seq.
  • Step consumer mismatch (packages/core/src/step.ts) — structured record of the failure including the offending event's index in the SDK's view of the log.
  • runWorkflow end: completed | failed | suspended.

All logging is routed through a single throwaway helper, packages/core/src/__debug-replay-trace.ts, so the diff is easy to revert.
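A minimal sketch of what the replay-start digest could look like, assuming the helper hashes a fixed projection of each event. The `TraceEvent` shape, the function names, and the `WF_TRACE` line layout are hypothetical and are not taken from the actual helper file.

```typescript
import { createHash } from "node:crypto";

// Hypothetical event shape, limited to the fields named in the description above.
interface TraceEvent {
  eventId: string;
  eventType: string;
  correlationId?: string;
  eventData?: { stepName?: string; resumeAt?: string };
}

// Project every event onto a fixed field order so the digest depends only on
// the listed fields and their position in the array, not on object key order.
function digestEvents(events: TraceEvent[]): string {
  const projected = events.map((e) => [
    e.eventId,
    e.eventType,
    e.correlationId ?? null,
    e.eventData?.stepName ?? null,
    e.eventData?.resumeAt ?? null,
  ]);
  return createHash("sha256").update(JSON.stringify(projected)).digest("hex");
}

// Build one tagged line per replay start (the record layout is an assumption).
function formatReplayStart(runId: string, inv: string, events: TraceEvent[]): string {
  const record = {
    kind: "replay_start",
    runId,
    inv,
    count: events.length,
    digest: digestEvents(events),
  };
  return "WF_TRACE " + JSON.stringify(record);
}
```

Hashing a projection rather than the raw objects keeps the digest insensitive to fields that are not part of the comparison, at the cost of missing differences in any field left out of the projection.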

How to use

Trigger the repro app, grep Vercel runtime logs for WF_TRACE, then group lines by runId + inv to compare what each replay invocation saw. If digest differs between two replay_start lines for the same runId, the event array is unstable across replays — root cause is on the server side. If digest matches but the step subscribe seq+stepName mapping differs, the SDK is the source of non-determinism.
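The grouping step can be sketched as a small script. It assumes each trace line carries a JSON record after the `WF_TRACE ` tag with `kind`, `runId`, `inv`, and `digest` fields. That layout and the function names are hypothetical, so adjust the parsing to whatever the helper actually emits.

```typescript
// Hypothetical record extracted from a replay_start trace line.
interface ReplayStart {
  runId: string;
  inv: string;
  digest: string;
}

// Pull replay_start records out of raw log text, skipping unrelated lines.
function parseReplayStarts(logText: string): ReplayStart[] {
  const tag = "WF_TRACE ";
  const out: ReplayStart[] = [];
  for (const line of logText.split("\n")) {
    const at = line.indexOf(tag);
    if (at === -1) continue;
    try {
      const rec = JSON.parse(line.slice(at + tag.length));
      if (rec.kind === "replay_start") {
        out.push({ runId: rec.runId, inv: rec.inv, digest: rec.digest });
      }
    } catch {
      // Not valid JSON after the tag: ignore the line.
    }
  }
  return out;
}

// Return the runIds whose replay invocations reported more than one digest.
function findUnstableRuns(starts: ReplayStart[]): string[] {
  const digestsByRun = new Map<string, Set<string>>();
  for (const s of starts) {
    const seen = digestsByRun.get(s.runId) ?? new Set<string>();
    seen.add(s.digest);
    digestsByRun.set(s.runId, seen);
  }
  return Array.from(digestsByRun.entries())
    .filter((entry) => entry[1].size > 1)
    .map((entry) => entry[0]);
}
```

If the log legitimately grows between invocations, a digest over the full array will differ as well, so comparing event counts alongside digests may help separate growth from reordering.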

Notes

  • No existing tests broken (all 622 unit tests in @workflow/core pass).
  • No new TypeScript errors (baseline 22 pre-existing on stable, still 22 after).
  • Adds a non-trivial amount of log volume per replay; intended to be reverted once the bug is diagnosed.

Temporary diagnostic instrumentation for investigating intermittent
CorruptedEventLogError 'step consumer mismatch' failures.

Emits console.log lines tagged 'WF_TRACE' at four points:
- runWorkflow start: dumps the full event array the replay will consume
  (eventIds, types, correlationIds, stepNames) plus a sha256 digest
- step/hook/sleep subscribe: per-replay correlationId -> name assignment
- step consumer mismatch: structured record of the failure including the
  event index in the SDK's view of the log
- runWorkflow end: completed | failed | suspended

Used to diff successive replays of the same runId and confirm whether
the SDK actually sees the same event array each time.
@changeset-bot

changeset-bot Bot commented May 27, 2026

⚠️ No Changeset found

Latest commit: 3da0595

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@vercel
Contributor

vercel Bot commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| example-nextjs-workflow-turbopack | Ready | Preview, Comment | May 28, 2026 7:42am |
| example-nextjs-workflow-webpack | Ready | Preview, Comment | May 28, 2026 7:42am |
| example-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-astro-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-express-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-fastify-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-hono-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-nitro-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-nuxt-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-sveltekit-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-tanstack-start-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workbench-vite-workflow | Ready | Preview, Comment | May 28, 2026 7:42am |
| workflow-swc-playground | Ready | Preview, Comment | May 28, 2026 7:42am |
| workflow-tarballs | Ready | Preview, Comment | May 28, 2026 7:42am |
| workflow-web | Ready | Preview, Comment | May 28, 2026 7:42am |

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| workflow-docs | Skipped | | May 28, 2026 7:42am |

@github-actions
Contributor

github-actions Bot commented May 27, 2026

🧪 E2E Test Results

Some tests failed

Summary

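| Category | Passed | Failed | Skipped | Total |
| --- | --- | --- | --- | --- |
| ❌ ▲ Vercel Production | 899 | 2 | 67 | 968 |
| ✅ 💻 Local Development | 970 | 0 | 86 | 1056 |
| ✅ 📦 Local Production | 970 | 0 | 86 | 1056 |
| ✅ 🐘 Local Postgres | 970 | 0 | 86 | 1056 |
| ✅ 🪟 Windows | 88 | 0 | 0 | 88 |
| ❌ 🌍 Community Worlds | 15 | 69 | 0 | 84 |
| ✅ 📋 Other | 492 | 0 | 36 | 528 |
| Total | 4404 | 71 | 361 | 4836 |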

❌ Failed Tests

▲ Vercel Production (2 failed)

express (1 failed):

  • DurableAgent e2e multimodal tool results passes through LanguageModelV3ToolResultOutput from tools

nextjs-webpack (1 failed):

  • DurableAgent e2e tool approval (GAP) completes but needsApproval is not checked (GAP)

🌍 Community Worlds (69 failed)

mongodb-dev (1 failed):

  • dev e2e should rebuild on imported step dependency change

redis-dev (1 failed):

  • dev e2e should rebuild on imported step dependency change

turso-dev (1 failed):

  • dev e2e should rebuild on imported step dependency change

turso (66 failed):

  • addTenWorkflow | wrun_01KSPRDTTCB0E5JXA30BX5CQN3
  • addTenWorkflow | wrun_01KSPRDTTCB0E5JXA30BX5CQN3
  • wellKnownAgentWorkflow (.well-known/agent) | wrun_01KSPRF03Z0GHMRR4Z1XBMR8GH
  • should work with react rendering in step
  • promiseAllWorkflow | wrun_01KSPRE2DC2936JDEZZFT1AXDF
  • promiseRaceWorkflow | wrun_01KSPRE6PCZE9T9VVW8N17QA4J
  • promiseAnyWorkflow | wrun_01KSPRE8ZSV8WM1PMPE9MA1GJB
  • importedStepOnlyWorkflow | wrun_01KSPRFAY50EYBZSAS95775KA2
  • readableStreamWorkflow | wrun_01KSPRECH2GD8E2332X7VE4DM4
  • hookWorkflow | wrun_01KSPRESBMWHSKPQDDQ3V61RRP
  • hookWorkflow is not resumable via public webhook endpoint | wrun_01KSPREZEXEZ4R95RM7TXZGBZ0
  • webhookWorkflow | wrun_01KSPRF3MC7G97K8QX5Z7302YX
  • sleepingWorkflow | wrun_01KSPRF9Z7RCST2R37TMC6G32Q
  • parallelSleepWorkflow | wrun_01KSPRFRW5YQZZE9NEVPQN56QN
  • nullByteWorkflow | wrun_01KSPRFX1SYFQET6ZGEEHR282G
  • workflowAndStepMetadataWorkflow | wrun_01KSPRG09APXDXKQ3HM5HNPKTD
  • outputStreamWorkflow no startIndex (reads all chunks)
  • outputStreamWorkflow positive startIndex (skips first chunk)
  • outputStreamWorkflow negative startIndex (reads from end)
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
  • outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
  • outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KSPRJESG995F00KC2YC26SE6
  • fetchWorkflow | wrun_01KSPRJX06F06772AT18KDXB4J
  • promiseRaceStressTestWorkflow | wrun_01KSPRK0B2S6C3NNR86GZMY3BS
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation workflow errors cross-file imports preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior regular Error retries until success
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • error handling retry behavior maxRetries=0 disables retries
  • error handling catchability FatalError can be caught and detected with FatalError.is()
  • error handling not registered WorkflowNotRegisteredError fails the run when workflow does not exist
  • error handling not registered StepNotRegisteredError fails the step but workflow can catch it
  • error handling not registered StepNotRegisteredError fails the run when not caught in workflow
  • hookCleanupTestWorkflow - hook token reuse after workflow completion | wrun_01KSPRP2VC877M7Z1M2H2BXWA5
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KSPRPCTZ8H16Y3A3XKGWE1BN
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running | wrun_01KSPRPSX1CDCGNG1SPM3W1ZC4
  • stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars) | wrun_01KSPRQ7XBDWTK0WPTAHCAHR31
  • stepFunctionWithClosureWorkflow - step function with closure variables passed as argument | wrun_01KSPRQFMYFYN4YERC3CB1A525
  • closureVariableWorkflow - nested step functions with closure variables | wrun_01KSPRQMHFDS3W0CWGRE142PKN
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step | wrun_01KSPRQPME39ENG4KGKGWE2Z4Z
  • health check (queue-based) - workflow and step endpoints respond to health check messages
  • health check (CLI) - workflow health command reports healthy endpoints
  • pathsAliasWorkflow - TypeScript path aliases resolve correctly | wrun_01KSPRR4ATKPJ1N4244SZDNHXB
  • Calculator.calculate - static workflow method using static step methods from another class | wrun_01KSPRR9D7WRFZ2A4Q9AFFDRK6
  • AllInOneService.processNumber - static workflow method using sibling static step methods | wrun_01KSPRRF8A70ZRY49EJFGMG48Y
  • ChainableService.processWithThis - static step methods using this to reference the class | wrun_01KSPRRN7VEDSY2T1BSVYGN51P
  • thisSerializationWorkflow - step function invoked with .call() and .apply() | wrun_01KSPRRV7ZNB1GP2CP4T0Z5NST
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE | wrun_01KSPRS2Q4D2ZXV7V0SADM116Y
  • instanceMethodStepWorkflow - instance methods with "use step" directive | wrun_01KSPRS9TV1JZVF8BF65CRZT1Z
  • crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context | wrun_01KSPRSNXK302P1NF2KV2TBAAJ
  • stepFunctionAsStartArgWorkflow - step function reference passed as start() argument | wrun_01KSPRSY9E963BJGH21P28H031
  • cancelRun - cancelling a running workflow | wrun_01KSPRT5BXFFT23DYJ5MY08VFA
  • cancelRun via CLI - cancelling a running workflow | wrun_01KSPRTE13CNSQZ313E48KQ6WP
  • pages router addTenWorkflow via pages router
  • pages router promiseAllWorkflow via pages router
  • pages router sleepingWorkflow via pages router
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep | wrun_01KSPRTRYJ8T6XWG60WK99QF9P
  • sleepInLoopWorkflow - sleep inside loop with steps actually delays each iteration | wrun_01KSPRV7T4ERS0WBE2FHFQ5N2P
  • sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KSPRVJM9CXBKHJ66C84NNTW8
  • importMetaUrlWorkflow - import.meta.url is available in step bundles | wrun_01KSPRVRYWRH1N9BZ518GTAKJJ
  • metadataFromHelperWorkflow - getWorkflowMetadata/getStepMetadata work from module-level helper (#1577) | wrun_01KSPRVVXV9VA0AFG6G4P7F8DR
  • resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KSPRVXSAS5K3VAVYE7GXZ8QK

Details by Category

❌ ▲ Vercel Production
App Passed Failed Skipped
✅ astro 81 0 7
✅ example 81 0 7
❌ express 80 1 7
✅ fastify 81 0 7
✅ hono 81 0 7
✅ nextjs-turbopack 86 0 2
❌ nextjs-webpack 85 1 2
✅ nitro 81 0 7
✅ nuxt 81 0 7
✅ sveltekit 81 0 7
✅ vite 81 0 7
✅ 💻 Local Development
App Passed Failed Skipped
✅ astro-stable 82 0 6
✅ express-stable 82 0 6
✅ fastify-stable 82 0 6
✅ hono-stable 82 0 6
✅ nextjs-turbopack-canary 69 0 19
✅ nextjs-turbopack-stable 88 0 0
✅ nextjs-webpack-canary 69 0 19
✅ nextjs-webpack-stable 88 0 0
✅ nitro-stable 82 0 6
✅ nuxt-stable 82 0 6
✅ sveltekit-stable 82 0 6
✅ vite-stable 82 0 6
✅ 📦 Local Production
App Passed Failed Skipped
✅ astro-stable 82 0 6
✅ express-stable 82 0 6
✅ fastify-stable 82 0 6
✅ hono-stable 82 0 6
✅ nextjs-turbopack-canary 69 0 19
✅ nextjs-turbopack-stable 88 0 0
✅ nextjs-webpack-canary 69 0 19
✅ nextjs-webpack-stable 88 0 0
✅ nitro-stable 82 0 6
✅ nuxt-stable 82 0 6
✅ sveltekit-stable 82 0 6
✅ vite-stable 82 0 6
✅ 🐘 Local Postgres
App Passed Failed Skipped
✅ astro-stable 82 0 6
✅ express-stable 82 0 6
✅ fastify-stable 82 0 6
✅ hono-stable 82 0 6
✅ nextjs-turbopack-canary 69 0 19
✅ nextjs-turbopack-stable 88 0 0
✅ nextjs-webpack-canary 69 0 19
✅ nextjs-webpack-stable 88 0 0
✅ nitro-stable 82 0 6
✅ nuxt-stable 82 0 6
✅ sveltekit-stable 82 0 6
✅ vite-stable 82 0 6
✅ 🪟 Windows
App Passed Failed Skipped
✅ nextjs-turbopack 88 0 0
❌ 🌍 Community Worlds
App Passed Failed Skipped
❌ mongodb-dev 4 1 0
❌ redis-dev 4 1 0
❌ turso-dev 4 1 0
❌ turso 3 66 0
✅ 📋 Other
App Passed Failed Skipped
✅ e2e-local-dev-nest-stable 82 0 6
✅ e2e-local-dev-tanstack-start-stable 82 0 6
✅ e2e-local-postgres-nest-stable 82 0 6
✅ e2e-local-postgres-tanstack-start-stable 82 0 6
✅ e2e-local-prod-nest-stable 82 0 6
✅ e2e-local-prod-tanstack-start-stable 82 0 6

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: success
  • Local Prod: success
  • Local Postgres: success
  • Windows: success

Check the workflow run for details.

Set WORKFLOW_SERVER_URL_OVERRIDE to https://workflow-server-7pxaxn4d4.vercel.sh
to validate Pranay's monotonic-append fix (workflow-server#456) against the
hook/sleep stress repro.

If the preview server correctly reorders events so eventIds reflect commit
order (instead of letting a slow hook_received commit with an early eventId
behind a wait_completed that committed first), the corrupted-event-log
failures should disappear without any SDK-side fencing changes.
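One plausible reading of the race described above, as a toy model: if readers see the log sorted by eventId (an assumption), then an event that commits late with an early eventId changes the prefix an earlier replay already consumed. The `LogEvent` type, the numeric ids, and the helper names are hypothetical. Only the two event type names come from the commit message.

```typescript
// Hypothetical minimal event: a numeric id plus a type label.
interface LogEvent {
  eventId: number;
  eventType: string;
}

// What a reader sees after the first n commits, assuming reads are sorted by eventId.
function viewAfter(commitOrder: LogEvent[], n: number): LogEvent[] {
  return commitOrder.slice(0, n).sort((a, b) => a.eventId - b.eventId);
}

// A replay stays deterministic only if every earlier view is a prefix of later views.
function isPrefix(earlier: LogEvent[], later: LogEvent[]): boolean {
  return (
    earlier.length <= later.length &&
    earlier.every((e, i) => e.eventId === later[i].eventId)
  );
}

// Racy case: wait_completed commits first, the slow hook_received lands behind
// it but carries the earlier eventId.
const racy: LogEvent[] = [
  { eventId: 2, eventType: "wait_completed" },
  { eventId: 1, eventType: "hook_received" },
];

// Monotonic append: eventIds are assigned in commit order.
const monotonic: LogEvent[] = [
  { eventId: 1, eventType: "wait_completed" },
  { eventId: 2, eventType: "hook_received" },
];

const racyStable = isPrefix(viewAfter(racy, 1), viewAfter(racy, 2));
const monotonicStable = isPrefix(viewAfter(monotonic, 1), viewAfter(monotonic, 2));
console.log({ racyStable, monotonicStable });
```

In the racy case the first replay consumes wait_completed at index 0, while the next replay finds hook_received there instead, which is the kind of shift a consumer mismatch check would flag.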

The set-bypass-cookie flow triggers a 307 redirect-and-set-cookie response
intended for browsers. Node's fetch in the SDK doesn't follow the cookie
dance, so it loops on the 307. Vercel's docs are explicit that for
API-client usage the bare x-vercel-protection-bypass header should
authenticate each request directly, without set-bypass-cookie.
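A minimal sketch of the header-only approach, assuming the bypass secret is already available to the caller. The function name is hypothetical, and `redirect: "manual"` is an optional extra so that an unexpected 307 surfaces as a response instead of being followed.

```typescript
// Attach the bypass header to each request and drop the cookie-setting header,
// so no redirect-and-set-cookie round trip is requested.
function withProtectionBypass(init: RequestInit, secret: string): RequestInit {
  const headers = new Headers(init.headers);
  headers.set("x-vercel-protection-bypass", secret);
  headers.delete("x-vercel-set-bypass-cookie");
  return { ...init, headers, redirect: "manual" };
}

// Example usage (URL and secret are placeholders):
// const res = await fetch("https://example.invalid/api", withProtectionBypass({ method: "GET" }, secret));
```

Building the options in one place keeps every SDK request authenticated the same way, rather than relying on state carried between requests.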

@vercel vercel Bot temporarily deployed to Preview – workflow-docs May 28, 2026 07:38 Inactive