[e2e] Add label-gated event log race repro #2159
📊 Benchmark Results

Each benchmark ran in 💻 Local Development and ▲ Production (Vercel), with 🔍 Observability links for Express and Next.js (Turbopack). (Result tables omitted.)

- workflow with no steps
- workflow with 1 step
- workflow with 10 sequential steps
- workflow with 25 sequential steps
- workflow with 50 sequential steps
- Promise.all with 10 concurrent steps
- Promise.all with 25 concurrent steps
- Promise.all with 50 concurrent steps
- Promise.race with 10 concurrent steps
- Promise.race with 25 concurrent steps
- Promise.race with 50 concurrent steps
- workflow with 10 sequential data payload steps (10KB)
- workflow with 25 sequential data payload steps (10KB)
- workflow with 50 sequential data payload steps (10KB)
- workflow with 10 concurrent data payload steps (10KB)
- workflow with 25 concurrent data payload steps (10KB)
- workflow with 50 concurrent data payload steps (10KB)

Stream Benchmarks (includes TTFB metrics)

- workflow with stream
- stream pipeline with 5 transform steps (1MB)
- 10 parallel streams (1MB each)
- fan-out fan-in 10 streams (1MB each)

Summary

- Fastest Framework by World: winner determined by most benchmark wins
- Fastest World by Framework: winner determined by most benchmark wins
❌ Some benchmark jobs failed. Check the workflow run for details.
🦋 Changeset detected

Latest commit: ebde477

The changes in this PR will be included in the next version bump. This PR includes changesets to release 0 packages.
🧪 E2E Test Results

✅ All tests passed

Details by Category

- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other
Event Log Race Repro

1 of 2000 latest repro runs did not complete cleanly.

Sections: Run History, Latest Scenario Breakdown, Latest Non-Completed Runs (tables omitted)
TooTallNate left a comment
Approve — well-built diagnostic harness, doing exactly what it claims
This is a real and useful piece of infrastructure. The harness has caught the ~0.2% baseline CORRUPTED_EVENT_LOG rate consistently across 4 separate CI runs (1, 2, 3, 1 races detected over 1500-2000 attempts), which validates both the hypothesis and the harness itself.
Architecture is clean
- Label-gated (`event-log-race-repro` label OR manual dispatch), so it won't run on every PR
- All scenarios in one test with shared deployment, sequential execution per scenario, parallel within
- Three scenario classes stressing different replay races:
  - hook-sleep: `Promise.race([hook, sleep])` (classic)
  - step-fanout with parallel-step replays acting on stale logs
  - step-sleep-race biased both ways (step wins / sleep wins) to exercise branch-decision determinism
- Sticky comment with history: last N runs visible at a glance, scenario breakdown for the latest run
- `--check` mode in the renderer cleanly separates "render markdown" from "fail the CI step", so the same script powers both paths
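The render/check split described in the last bullet could look roughly like the sketch below. This is an illustration only: `ReproSummary`, `renderMarkdown`, `checkExitCode`, and `run` are hypothetical names, and the real renderer's artifact schema is not shown in this PR.

```typescript
// Hypothetical summary shape; the real JSON artifact schema is not shown here.
interface ReproSummary {
  total: number;
  completed: number;
}

// Pure rendering: always returns markdown and never decides pass/fail.
function renderMarkdown(summary: ReproSummary): string {
  const nonCompleted = summary.total - summary.completed;
  return `${nonCompleted} of ${summary.total} latest repro runs did not complete cleanly.`;
}

// Check mode: maps the same summary to an exit code (0 = clean, 1 = failed).
function checkExitCode(summary: ReproSummary): number {
  const nonCompleted = summary.total - summary.completed;
  return nonCompleted > 0 ? 1 : 0;
}

// Dispatch on a `--check` flag so one script can serve both CI paths.
function run(
  args: string[],
  summary: ReproSummary,
): { output: string; exitCode: number } {
  const output = renderMarkdown(summary);
  const exitCode = args.includes("--check") ? checkExitCode(summary) : 0;
  return { output, exitCode };
}
```

Keeping the exit-code decision out of the rendering function means the comment-posting step can always succeed, while a separate step with `--check` is the one that turns the job red.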
Verified
- All non-repro CI jobs pass (109/2 failures, only `Benchmark Vercel (nitro-v3)` flake and the expected `Event Log Race Repro` "failure")
- The repro job runs ~24 min as expected for 2000 attempts at concurrency 50, plus the step-heavy scenarios
- Sticky comment renders the history table correctly across 4 runs
- The `--check` exit code logic correctly fails when `nonCompleted > 0`
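For context on "2000 attempts at concurrency 50", a bounded-concurrency runner can be sketched as below. This is not the harness's actual code: `runWithConcurrency` and the `attempt` callback are hypothetical stand-ins for whatever starts one workflow run and waits for its outcome.

```typescript
// Runs `total` attempts with at most `concurrency` in flight at once.
// `attempt` is a hypothetical callback standing in for one workflow run.
async function runWithConcurrency<T>(
  total: number,
  concurrency: number,
  attempt: (index: number) => Promise<T>,
): Promise<T[]> {
  const results: T[] = new Array<T>(total);
  let next = 0;

  // Each worker claims the next unclaimed index until none remain.
  // Claiming is safe without locks because JavaScript runs this
  // synchronous section on a single thread.
  async function worker(): Promise<void> {
    while (next < total) {
      const index = next++;
      results[index] = await attempt(index);
    }
  }

  const workers = Array.from(
    { length: Math.min(concurrency, total) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

A call such as `runWithConcurrency(1500, 50, startOneRun)` would keep 50 attempts in flight until all 1500 finish, with `startOneRun` supplied by the caller.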
Things to keep in mind (non-blockers)
- Drift risk: This is 881+372+405 lines of test + fixture + renderer scaffolding that depends on SDK surfaces (`createHook`, `start`, `getHookByToken`, `resumeHook`). If those APIs change, this harness needs updating. Worth a comment block at the top of the test file (or a `README` next to it) calling it out as "diagnostic-only, update when SDK race shapes change."
- Single-adapter coverage: Only runs against `example-nextjs-workflow-turbopack`. The races being targeted are SDK-level (event log replay), not adapter-specific, and the world-vercel backend is shared across adapters, so this is probably fine. But if races emerge that are adapter-specific (e.g. specific to webpack bundling), this won't catch them. Worth keeping in mind.
- Workflow file numbering: `101_hook_sleep_repro.ts` follows the `1_`, `2_`, ..., `100_` convention. Good.
- Defensive env parsing: `envNumber`/`envBoolean` correctly fall back on invalid input. ✓
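As an illustration of the fall-back behaviour noted in the last item, helpers of this kind are often written as below. The names `envNumber` and `envBoolean` come from the PR, but the signatures and accepted spellings here are assumptions; the PR's actual implementation is not shown in this conversation.

```typescript
// Parses a numeric setting, falling back when the value is missing,
// blank, or not a finite number. Signature is an assumption.
function envNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Parses a boolean setting, accepting "true"/"1" and "false"/"0"
// (case-insensitive) and falling back on anything else.
function envBoolean(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined) return fallback;
  const normalized = raw.trim().toLowerCase();
  if (normalized === "true" || normalized === "1") return true;
  if (normalized === "false" || normalized === "0") return false;
  return fallback;
}
```

Taking the raw string as a parameter, rather than reading the environment inside the helper, keeps the parsing logic easy to test in isolation.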
Sticky-comment design observations
The comment structure is well thought out:
- `<!-- event-log-race-repro-history JSON event-log-race-repro-history -->`: hidden JSON for next run to read and append. Standard "sticky comment with state" pattern, clean.
- Truncated to most recent N runs implicitly (the table doesn't grow unboundedly because old runs eventually scroll off in the markdown).
One micro-suggestion: if the history grows past 10-ish runs, the table could get unwieldy in the GitHub UI. Worth capping history.length in the renderer (history.slice(-10)) so old data ages out naturally. Not a blocker — easy to add later if it ever matters.
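The "sticky comment with state" pattern plus the suggested cap can be sketched as follows. Only the marker string and the `history.slice(-10)` idea come from this review; `HistoryEntry`, `readHistory`, `appendRun`, and `embedHistory` are hypothetical names, and the entry fields are invented for illustration.

```typescript
const MARKER = "event-log-race-repro-history";

// Hypothetical entry shape; the real history schema is not shown in this PR.
interface HistoryEntry {
  runId: string;
  total: number;
  nonCompleted: number;
}

// Extracts the hidden JSON from an existing comment body.
// Returns an empty history when the marker is absent or the JSON is invalid.
function readHistory(commentBody: string): HistoryEntry[] {
  const pattern = new RegExp(`<!-- ${MARKER} ([\\s\\S]*?) ${MARKER} -->`);
  const match = pattern.exec(commentBody);
  if (!match) return [];
  try {
    const parsed: unknown = JSON.parse(match[1] ?? "");
    return Array.isArray(parsed) ? (parsed as HistoryEntry[]) : [];
  } catch {
    return [];
  }
}

// Appends the latest run and keeps only the most recent `maxRuns` entries.
function appendRun(
  history: HistoryEntry[],
  entry: HistoryEntry,
  maxRuns = 10,
): HistoryEntry[] {
  return [...history, entry].slice(-maxRuns);
}

// Serializes the history back into the hidden HTML comment.
// Assumes no entry contains the sequence "-->", which would end the comment.
function embedHistory(history: HistoryEntry[]): string {
  return `<!-- ${MARKER} ${JSON.stringify(history)} ${MARKER} -->`;
}
```

With an explicit cap, the visible table and the hidden JSON age out together instead of relying on old rows scrolling out of view.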
Bottom line
This is the right shape of tool for the problem (random races at 0.2% require a high-iteration harness to surface reliably). The label gating keeps it cheap by default. The sticky-comment history makes it actionable for tracking fixes over time.
Approving.
Backport PR opened against |
(AI)
Adds a label-gated CI job for reproducing event-log race conditions.

The new `event-log-race-repro` label runs a single Next.js Turbopack preview deployment through repeated race attempts, writes a JSON artifact, and posts one sticky PR comment with a running history table. Each CI run is a column with timestamped links to the workflow logs and tested deployment, plus the outcome distribution for completed, USER_ERROR, CORRUPTED_EVENT_LOG, RUNTIME_ERROR, stuck, and other runs.

The default run keeps 1500 hook/sleep workflow runs at concurrency 50, then adds 250 parallel-step fanout runs and 250 step/sleep race runs. The step-heavy scenarios are meant to catch concurrent replays acting on stale event logs around `step_created`, `step_completed`, `wait_completed`, and branch decisions, not just hook/sleep interactions. The sticky comment also includes a per-scenario breakdown for the latest run.

This should still catch at least one failure on average for the current main/stable baseline, where `CORRUPTED_EVENT_LOG` is expected around 0.2% of runs.

This draft is expected to fail on the current unfixed baseline when the label is present; the failure output is the point of the harness. The sticky comment on this PR is the source of truth for the current run history and observed distribution.
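
The "at least one failure on average" claim can be checked with a quick calculation. The sketch below assumes every run fails independently at the same 0.2% rate across all three scenarios, which is a simplification (the per-scenario rates are not stated here); the function names are illustrative.

```typescript
// Expected number of failing runs, assuming a uniform per-run failure rate.
function expectedFailures(runs: number, rate: number): number {
  return runs * rate;
}

// Probability that at least one run fails, assuming independent runs.
function probabilityOfAtLeastOne(runs: number, rate: number): number {
  return 1 - Math.pow(1 - rate, runs);
}

// Default run size from the description: 1500 + 250 + 250 attempts.
const totalRuns = 1500 + 250 + 250;
const baselineRate = 0.002; // ~0.2% CORRUPTED_EVENT_LOG baseline

const expected = expectedFailures(totalRuns, baselineRate);
const detectionChance = probabilityOfAtLeastOne(totalRuns, baselineRate);
```

Under these assumptions the default run expects about 4 failures and has roughly a 98% chance of seeing at least one, so a clean run is still possible on an unfixed baseline, just unlikely.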