feat(core): bounded queue-redelivery retry on decryption failure during replay #2166
TooTallNate wants to merge 1 commit
…ng replay Follow-up to #2145. An AES-GCM auth failure (RuntimeDecryptionError) is terminal for the current attempt's bytes, but often transient at the run level: when the ciphertext came from a truncated/corrupted read of remotely-persisted data (e.g. a partial /refs response), a fresh queue delivery re-fetches the event log + ref payloads and can succeed. Mirror the replay-timeout bounded-redelivery precedent: on managed worlds (processExitTriggersQueueRedelivery), exit the process to trigger queue redelivery for up to DECRYPTION_FAILURE_MAX_RETRIES attempts, then commit run_failed as RUNTIME_ERROR. In-process worlds fail immediately (no queue to re-fetch from, and exiting would kill the host).
🦋 Changeset detected
Latest commit: fe5d9ca
The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages.
📊 Benchmark Results
[Benchmark tables not preserved. Each scenario was run on 💻 Local Development and ▲ Production (Vercel), with observability for Next.js (Turbopack). Scenarios: workflow with no steps; workflow with 1 step; workflow with 10, 25, and 50 sequential steps; Promise.all with 10, 25, and 50 concurrent steps; Promise.race with 10, 25, and 50 concurrent steps; workflow with 10, 25, and 50 sequential data payload steps (10KB); workflow with 10, 25, and 50 concurrent data payload steps (10KB); stream benchmarks including TTFB metrics (workflow with stream; stream pipeline with 5 transform steps (1MB); 10 parallel streams (1MB each); fan-out fan-in 10 streams (1MB each)).]
❌ Some benchmark jobs failed. Check the workflow run for details.
🧪 E2E Test Results
❌ Some tests failed
Failed Tests
- ▲ Vercel Production (4 failed): astro (1 failed), fastify (1 failed), nextjs-turbopack (1 failed), nextjs-webpack (1 failed)
- 📦 Local Production (1 failed): express-stable (1 failed)
Details by Category
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ❌ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 📋 Other
❌ Some E2E test jobs failed. Check the workflow run for details.
Pull request overview
This PR adds bounded queue-redelivery behavior for replay-time RuntimeDecryptionErrors in @workflow/core, allowing managed worlds to retry transient persisted-data read corruption before marking the run failed.
Changes:
- Adds a decryption-failure retry decision helper and retry budget constant.
- Integrates the helper into the workflow runtime error path before writing run_failed.
- Adds unit and runtime coverage plus a patch changeset.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/core/src/runtime/decryption-failure.ts | Adds redelivery decision logic and logging for replay decryption failures. |
| packages/core/src/runtime/decryption-failure.test.ts | Covers managed/in-process world behavior and retry-budget boundaries. |
| packages/core/src/runtime/constants.ts | Defines the decryption-failure retry budget. |
| packages/core/src/runtime.ts | Redrives managed-world runs on eligible RuntimeDecryptionErrors before failing the run. |
| packages/core/src/runtime.test.ts | Adds end-to-end key-mismatch tests for redelivery and failure behavior. |
| .changeset/requeue-on-decryption-failure.md | Adds a patch changeset for @workflow/core. |
VaguelySerious left a comment
AI review: blocking issues found
      return false;
    }

    if (attempt <= DECRYPTION_FAILURE_MAX_RETRIES) {
AI Review: Blocking
RuntimeDecryptionError also covers encrypt-side AES-GCM failures (context.operation === "encrypt"), but this helper redrives every RuntimeDecryptionError on managed worlds. A fresh queue delivery can only help when decrypting persisted bytes, not when serializing/encrypting a new workflow payload. This can re-run workflow code up to 3 times before failing. Please gate redelivery on error.context?.operation === "decrypt".
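A minimal sketch of the gate this comment asks for, not the PR's actual code. The `RuntimeDecryptionError` class, the `WorldLike` interface, and the helper's parameter list are hypothetical stand-ins written so the example is self-contained; only the names `shouldRedriveOnDecryptionFailure`, `processExitTriggersQueueRedelivery`, `context.operation`, and the budget of 3 come from this thread.

```typescript
// Hypothetical shapes for illustration; the real types in @workflow/core may differ.
type CryptoOperation = "encrypt" | "decrypt";

class RuntimeDecryptionError extends Error {
  readonly context?: { operation?: CryptoOperation };

  constructor(message: string, context?: { operation?: CryptoOperation }) {
    super(message);
    this.name = "RuntimeDecryptionError";
    this.context = context;
  }
}

interface WorldLike {
  processExitTriggersQueueRedelivery?: boolean;
}

// The PR description states a budget of 3.
const DECRYPTION_FAILURE_MAX_RETRIES = 3;

// Assumed signature: returns true when the caller should exit for redelivery.
function shouldRedriveOnDecryptionFailure(
  world: WorldLike,
  error: RuntimeDecryptionError,
  attempt: number,
): boolean {
  // In-process worlds have no queue to redeliver from.
  if (world.processExitTriggersQueueRedelivery !== true) {
    return false;
  }
  // A fresh delivery only helps when decrypting persisted bytes, so
  // encrypt-side failures (or failures with no operation recorded) fail fast.
  if (error.context?.operation !== "decrypt") {
    return false;
  }
  return attempt <= DECRYPTION_FAILURE_MAX_RETRIES;
}

const managedWorld: WorldLike = { processExitTriggersQueueRedelivery: true };
const decryptError = new RuntimeDecryptionError("auth tag mismatch", { operation: "decrypt" });
const encryptError = new RuntimeDecryptionError("encrypt failed", { operation: "encrypt" });

const redriveOnDecrypt = shouldRedriveOnDecryptionFailure(managedWorld, decryptError, 1);
const redriveOnEncrypt = shouldRedriveOnDecryptionFailure(managedWorld, encryptError, 1);
```

Under these assumptions `redriveOnDecrypt` is `true` and `redriveOnEncrypt` is `false`. Whether an error with no recorded operation should redrive is a design choice; this sketch treats it as not retryable.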
    // queue delivery re-fetches the event log + ref
    // payloads and can succeed. On managed Worlds, exit the
    // process (which the platform turns into a redelivery)
    // for a bounded number of attempts before failing the
AI Review: Note
This changes the user-facing behavior for runtime decryption failures on managed worlds: they now redeliver before the terminal run_failed. The existing runtime-decryption-failed docs still say the run fails immediately, so those docs should be updated as part of this behavior change.
    // Past DECRYPTION_FAILURE_MAX_RETRIES (3) but under MAX_QUEUE_DELIVERIES
    // (48), so we exercise the decryption-failure budget, not the queue
    // max-deliveries guard.
    attempt: 10,
AI Review: Nit
This test duplicates the retry/delivery constants in its comments and hard-codes attempt: 10 instead of deriving the value from the constant under test. Please import the constants and use DECRYPTION_FAILURE_MAX_RETRIES + 1, with an assertion that the value stays below MAX_QUEUE_DELIVERIES, so that future changes to either constant cannot silently make the test stale.
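One way the suggestion could look, as a sketch only. The two constants are redeclared here with the values quoted in the test comment (3 and 48) so the example runs on its own; in the real test they would be imported, and whether `MAX_QUEUE_DELIVERIES` is exported from the same module is an assumption. `assertBelowQueueLimit` is a hypothetical helper standing in for a test-framework assertion.

```typescript
// Values quoted in the test comment under review. In the real test these would
// be imported rather than redeclared (import path not shown in this thread).
const DECRYPTION_FAILURE_MAX_RETRIES = 3;
const MAX_QUEUE_DELIVERIES = 48;

// First attempt number that is past the decryption-failure retry budget.
const attemptPastBudget = DECRYPTION_FAILURE_MAX_RETRIES + 1;

// Hypothetical guard: the attempt must stay below the queue max-deliveries
// limit, otherwise the test would exercise that guard instead of the
// decryption-failure budget.
function assertBelowQueueLimit(attempt: number, limit: number): void {
  if (attempt >= limit) {
    throw new Error(`attempt ${attempt} must be below MAX_QUEUE_DELIVERIES (${limit})`);
  }
}

assertBelowQueueLimit(attemptPastBudget, MAX_QUEUE_DELIVERIES);
```

With this shape, raising the retry budget moves the tested attempt with it, and the guard fails loudly if the two limits ever meet.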
Summary
Follow-up to #2145 (the `RuntimeDecryptionError` attribution fix), implementing the bounded-redelivery behavior @pranaygp described in this comment.

An AES-GCM authentication failure is terminal for the bytes/key of the current attempt: we must never continue executing the workflow on top of data we couldn't decrypt. But the failure is often not terminal for the run. When the ciphertext came from a transiently truncated or corrupted read of remotely-persisted data (a partial `/refs` response, an edge-cache miss returning a partial 200, a proxy drop during streaming), a fresh queue delivery re-fetches the event log and ref payloads from scratch and can succeed.

Previously we committed `run_failed` immediately, turning a potentially recoverable read failure into a terminal workflow failure.

What changed
Mirrors the existing replay-timeout bounded-redelivery precedent (`handleReplayBudgetExhausted`):
- Managed worlds (`processExitTriggersQueueRedelivery === true`, e.g. `world-vercel`): on attempts ≤ `DECRYPTION_FAILURE_MAX_RETRIES` (3), the run handler exits the process, which the platform turns into a queue redelivery, so replay restarts from freshly-fetched persisted data. Once the retry budget is exhausted, it commits `run_failed` with `RUNTIME_ERROR`.
- In-process worlds (`world-local`, dev servers, custom in-process worlds): there is no queue to re-fetch from, and `process.exit()` would kill the user's host, so the run fails immediately with `RUNTIME_ERROR`.

The new `shouldRedriveOnDecryptionFailure()` helper (`runtime/decryption-failure.ts`) is pure (logging only) and returns whether the caller should redrive, keeping the `process.exit` / `run_failed` decision at the single existing call site in the run handler.

Test coverage
- `runtime/decryption-failure.test.ts`: unit tests for the helper across managed/in-process worlds and the retry-budget boundary.
- `runtime.test.ts`: end-to-end tests driving a real key mismatch (input encrypted with key A, run-key resolves to key B → auth-tag failure during input hydration):
  - Managed world, attempt within the retry budget: `process.exit(1)`, no `run_failed`
  - Managed world, retry budget exhausted: `run_failed` with `RUNTIME_ERROR`
  - In-process world: `run_failed` immediately, no exit

All `@workflow/core` tests pass (1073), full repo typecheck passes (40/40).

Notes
The longer-term improvement @pranaygp mentioned, detecting response truncation / integrity failure at the `/refs` transport boundary to classify the retryable case more directly, is out of scope here and would be a separate change on the world / workflow-server side.
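The split between a pure decision helper and a single call site that owns the side effects can be sketched as follows. This is an illustration under assumptions, not the run handler's real code: `RunHandlerEffects` and `onDecryptionFailure` are hypothetical names, the effects are injected so the sketch can run without exiting the process, and treating `RUNTIME_ERROR` as a plain string is also an assumption.

```typescript
// Hypothetical bundle of the two side effects the call site chooses between.
interface RunHandlerEffects {
  exitProcess: (code: number) => void;
  commitRunFailed: (errorCode: string) => void;
}

// Hypothetical call site: the boolean comes from the pure decision helper,
// and this is the only place that exits or commits run_failed.
function onDecryptionFailure(shouldRedrive: boolean, effects: RunHandlerEffects): void {
  if (shouldRedrive) {
    // On a managed world the platform turns the exit into a queue redelivery.
    effects.exitProcess(1);
    return;
  }
  // Budget exhausted, or an in-process world: fail the run.
  effects.commitRunFailed("RUNTIME_ERROR");
}

// Recording effects stand in for process.exit and the event-log write.
const recorded: string[] = [];
const recordingEffects: RunHandlerEffects = {
  exitProcess: (code) => {
    recorded.push(`exit(${code})`);
  },
  commitRunFailed: (errorCode) => {
    recorded.push(`run_failed:${errorCode}`);
  },
};

onDecryptionFailure(true, recordingEffects); // within the retry budget
onDecryptionFailure(false, recordingEffects); // budget exhausted or in-process
```

Keeping the helper free of side effects is what lets the unit tests cover the budget boundary without stubbing `process.exit`.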