[debug] Validation run: combine #2113 SDK + #447 server e6722b2#2146
[debug] Validation run: combine #2113 SDK + #447 server e6722b2#2146TooTallNate wants to merge 7 commits into
Conversation
|
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (32 failed)astro (1 failed):
example (1 failed):
express (3 failed):
fastify (2 failed):
hono (3 failed):
nextjs-turbopack (4 failed):
nextjs-webpack (1 failed):
nitro (1 failed):
nuxt (7 failed):
sveltekit (4 failed):
vite (5 failed):
📋 Other (2 failed)e2e-vercel-prod-tanstack-start (2 failed):
Details by Category❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
c0bd797 to
df40f8b
Compare
Building on 98c9741's bail-on-fence-conflict, propagate the fence-conflict signal upward as `staleSnapshot: true` so the entire current replay's queue results are abandoned rather than just the individual write skipped. The narrower 'skip the write, continue the loop' shape from 98c9741 re-introduced CORRUPTED_EVENT_LOG under stress: when two concurrent invocations make divergent VM decisions from different event-log snapshots, the winner's fenced write succeeds and the loser's bails. But the loser's VM had already derived its own queue results from the stale snapshot — if it continues past the conflict and queues them, those queue items can drive subsequent ticks that consume the winner's events as their own, surfacing as the original step_mismatch shape ("step_started for step_X belongs to <name-A> but consumer is <name-B>"). The right behavior is the one Pranay sketched on Slack: 'if new events have been introduced to the log after a concurrent replay has started, the invocation queue results must be abandoned. That replay is invalid.' Implementation: - `FencedWriteResult` now carries a `staleSnapshot` boolean so callers can distinguish 'fence conflict — abandon entire replay' from 'entity already exists — skip this write but keep going'. - `handleSuspension` short-circuits and returns `{ staleSnapshot: true, pendingSteps: [] }` the moment any fenced write rejects with a fence conflict. Subsequent step/wait writes from that replay never run. - Runtime tick detects `staleSnapshot: true` and `return`s cleanly (no `run_failed` event). The canonical invocation is left to make progress; the run stays `running`. The elapsed-wait scan (`wait_completed`) deliberately keeps its continue-on-conflict shape: the work it derives is purely timer-based (which waits have elapsed), not a VM branch decision, so a stale snapshot doesn't change the set of waits to complete. Only the suspension handler's writes are guarded by the abandon-the-tick semantic. Tests: 1018 core tests pass.
455bda9 to
0c7eb75
Compare
The abandon-tick change (fbaa2bf) correctly stops a stale-snapshot replay from queueing divergent work, but it returned without re-enqueueing. Under a hook burst, every tick that would consume the late-arriving hook_received events could race and abandon, leaving the run 'running' with pending hooks and no tick scheduled to advance it. Stress testing showed ~28/40 runs stalled this way (valid fence, real events, just no continuation). Return { timeoutSeconds: 0 } on stale-snapshot abandon instead of a bare return — the same immediate re-enqueue idiom the hook-conflict path already uses. This guarantees a fresh tick re-runs against the canonical event log. This is bounded (one re-enqueue per abandoned tick) and converges: paired with the server-side atomic fence+event write (no phantom fences), the canonical replay makes forward progress, so the re-enqueued tick advances the log rather than spinning — unlike the original MAX_FENCE_RETRIES storm this design replaced.
The orphaned-step-dispatch recovery (re-queue step_created / step_retrying events that never reached step_started) was gated on `metadata.attempt > 1`, i.e. only on queue redeliveries. That misses the stale-snapshot abandon path: when a tick writes a fenced step_created and then abandons on a *later* fenced write (returning staleSnapshot + re-enqueuing), it never reaches the step-queueing code. The re-enqueue produces a *fresh* queue message (attempt 1), not a redelivery, so the attempt-gated recovery never fired — leaving the run stalled with a valid fence and an orphaned step_created that no one dispatches. Run the recovery scan on every invocation. It is safe unconditionally: step dispatch is queued with `idempotencyKey: step.correlationId`, so re-queueing an already-dispatched step is deduped by the queue. Steps this tick created are still queued via `createdStepCorrelationIds` and selected for inline execution via `ownedPendingSteps` (unchanged); recovery only adds orphans this tick did not create, which are queued (never inline-executed) — correct, since their creating tick abandoned. Observed in stress testing: with the atomic-fence server fix eliminating phantom fences, a residual set of runs stalled with a real fence + a step_created that never started. This closes that gap. Tests: 1018 core tests pass.
…ion" This reverts commit a42c1c1.
…shot replay The previous attempt (unconditional orphaned-step recovery scan, reverted in d441126) re-queued every pending step_created on every invocation. That violated the single-owner-per-step invariant: a non-owner tick could re-dispatch a step another tick was already running, producing a duplicate step_started and a CORRUPTED_EVENT_LOG ("Unconsumed event in event log: eventType=step_started"). 2/40 runs hit this in stress. Safer approach: when handleSuspension abandons on a stale snapshot, it returns the steps it ALREADY wrote a fenced step_created for (the ones in createdStepCorrelationIds) as pendingSteps. Those writes succeeded against a matching fence inside the atomic transaction, so they're canonical and owned by exactly this tick. The runtime's staleSnapshot branch dispatches just those owned steps (with idempotencyKey: correlationId) before re-enqueuing, so: - no orphaned step_created (the step that this tick created always gets an owner to dispatch it), and - no double-dispatch (only the single owning tick queues each step; other ticks that abandon before writing the step_created never claim ownership of it). This pairs ownership with dispatch instead of blindly recovering, which is what made the unconditional scan unsafe. Tests: 1018 core tests pass.
NOT FOR MERGE. This branch ('debug/validate-occ-fix-20260528') exists
to run end-to-end stress validation of the combined fix:
- @workflow/core: #2113 (e5cd686, top of branch)
- workflow-server: vercel/workflow-server#447 (e6722b2)
WORKFLOW_SERVER_URL_OVERRIDE is pinned to the workflow-server preview
deployment for #447 so the repro app exercises both PRs together. The
WORKFLOW_VERCEL_PROTECTION_BYPASS env var is forwarded as the bare
'x-vercel-protection-bypass' header to bypass the preview's Vercel
Deployment Protection. Setting 'x-vercel-set-bypass-cookie: true' is
deliberately NOT done — it triggers a 307 redirect loop on Node undici.
NOT FOR MERGE. Draft PR opened only to trigger the tarballs-checks workflow so we get a tarball URL to pin the repro app to.
Purpose
End-to-end validation that the combined fix closes the production-visible defect end to end. Pairs:
@workflow/corefrom this branch (= top of [core] Optimistic concurrency control for branch-decision event writes #2113 + the existing SDK fixes from e5cd686)https://workflow-server-git-peter-event-write-cas.vercel.sh)What this branch adds on top of #2113
WORKFLOW_SERVER_URL_OVERRIDEpinned to the Version Packages (beta) #447 preview URLx-vercel-protection-bypassheader forwarded fromWORKFLOW_VERCEL_PROTECTION_BYPASSenv var (so the repro app can hit the preview through Vercel Deployment Protection)Both changes are gated to this branch only and will not be cherry-picked into either real PR.
Validation plan
Run the standard stress repro shape against the pinned tarball + preview pair (40 cycles × 200 workflows). Classify outcomes across:
completedrunningat final checkfailed: CORRUPTED_EVENT_LOGfailed: USER_ERRORfailed: WORLD_CONTRACT_ERRORfailed: otherLast run pre-#447-server-fix: ~2/40 cycles surfaced
CORRUPTED_EVENT_LOGon stable; 0/40 with this PR's predecessor against an earlier #447 preview but with 132 stuck-running + 23 USER_ERROR + 4 WORLD_CONTRACT_ERROR uncategorized.Goal of this run: confirm not just
CORRUPTED_EVENT_LOG = 0but alsostuck/USER_ERROR/WORLD_CONTRACT_ERRORare clean, since those would be the symptom of the materialization-before-fence orphan scenarios Peter walked.