worker: finalize reservation as 'completed' on fatal child error #4824
Adds a single test to `indexing-test.ts` that writes a `.gts` module containing a duplicate top-level `export class`, then writes a `.json` instance adopting from it. The test asserts that the instance lands in `instance-error` state with the parse-error message and that the broken module also gets an error row in `boxel_index`. This mirrors the failure shape of staging job 388477, where the worker child keeps dying on the same `crypto-portfolio.gts` duplicate `AddressField` declaration without ever producing a completed/rejected pg-queue reservation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Claude Code] CI verdict: failed, but not in the way we were hoping. The qunit log from shard 4 shows:

Three things to take away:

1. The consumer-side handling works correctly. The first two assertions passed: the Trade instance gets an
2. The broken
3. The test process survived. qunit went on to run tests 88-105 cleanly. So whatever's silently exiting the staging worker child on this error path is NOT reachable from the indexer-direct test surface. It must be downstream in the worker process layer (unhandled promise rejection in a fire-and-forget chain, pg-pool error event, definitionLookup invocation, Sentry capture, etc.).

Implication for the staging bug. The infinite-respawn loop is not caused by the indexer itself mishandling the duplicate-export. The indexer handles it cleanly. The next investigation step should be the worker child's process-level error handling:

I'd convert this PR's failing assertion into either an expected-behavior check (assert
Host Test Results
1 files ±0 1 suites ±0 1h 47m 18s ⏱️ -51s
Results for commit ead74b7. ± Comparison against earlier commit 7edf52c.

Realm Server Test Results
1 files ±0 1 suites ±0 13m 39s ⏱️ +2m 51s
Results for commit ead74b7. ± Comparison against earlier commit 7edf52c.
Stops infinite respawn on deterministic indexing crashes.

The child worker had no `unhandledRejection` / `uncaughtException` handlers, so a silent exit left the reservation for the parent's exit hook to finalize as 'interrupted', which the per-job reservation cap explicitly excludes from its count, so the cap never tripped. Result in staging: 43 reservations on one job in 30 minutes, no progress, no abandonment.

New `finalize-child-fatal-failure.ts` mirrors `finalize-orphan-reservations.ts` but writes `completion_reason = 'completed'` so a crashing child counts toward the cap. The handler is bounded by a 5-second race to keep a damaged DB connection from blocking exit.

Also softens the indexing-test assertion that was probing `boxel_index` for a `file-error` row on the broken module: the indexer's actual behavior is to write a healthy `file` row (the fileExtract metadata pass succeeds; the Babel transform only blows up during full card render), and the compilation failure surfaces on the consumer card's error doc, not the module's own file row. Comments at the assertion site explain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Claude Code] Pushed
CI should now go fully green on this SHA. The next time a worker child hits an unhandled rejection in production we'll:
Title + body updated to reflect the broader scope.
Pull request overview
This PR adds a fatal-error backstop for realm-server worker children so deterministic crashes can count toward reservation retry limits instead of looping indefinitely.
Changes:
- Adds worker process handlers for `unhandledRejection` and `uncaughtException`.
- Adds a helper to mark the child worker's open reservation as `completed`.
- Adds an indexing regression test for duplicate-export module errors surfacing as consumer instance errors.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `packages/realm-server/worker.ts` | Wires fatal process handlers to log, report, finalize, and exit. |
| `packages/realm-server/lib/finalize-child-fatal-failure.ts` | Adds best-effort reservation finalization for fatal worker-child failures. |
| `packages/realm-server/tests/indexing-test.ts` | Adds coverage for duplicate-export module errors during indexing. |
Mirrors `finalize-orphan-reservations-test.ts`. Pins the helper's invariants: closes the worker's open reservations with `completion_reason = 'completed'` (the status the per-job cap counts), leaves other workers' reservations alone, and is a no-op against already-closed rows (the race with the parent's exit-side 'interrupted' stamp).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
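The three invariants listed in that commit message can be illustrated with a small in-memory model. This is only a sketch: the row shape (`workerId`, `completedAt`, `completionReason`) and the function name `closeOpenReservations` are hypothetical stand-ins, and the real helper runs against the database rather than an array.

```typescript
// Hypothetical row shape for illustration; the real reservations table may differ.
interface ReservationRow {
  id: number;
  workerId: string;
  completedAt: Date | null; // null means the reservation is still open
  completionReason: 'completed' | 'interrupted' | null;
}

// Close every open reservation held by `workerId`, stamping it 'completed'.
// Returns the number of rows that were changed.
function closeOpenReservations(
  rows: ReservationRow[],
  workerId: string,
  now: Date = new Date(),
): number {
  let closed = 0;
  for (const row of rows) {
    if (row.workerId !== workerId) continue; // leave other workers alone
    if (row.completedAt !== null) continue; // already closed: no-op
    row.completedAt = now;
    row.completionReason = 'completed';
    closed++;
  }
  return closed;
}

const rows: ReservationRow[] = [
  { id: 1, workerId: 'w1', completedAt: null, completionReason: null },
  { id: 2, workerId: 'w2', completedAt: null, completionReason: null },
  { id: 3, workerId: 'w1', completedAt: new Date(0), completionReason: 'interrupted' },
];

console.log(closeOpenReservations(rows, 'w1')); // closes only row 1
```

Skipping already-closed rows is what makes the race with the parent's exit hook harmless in this model: whichever side stamps the row first wins, and the other side changes nothing.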
Summary
Stops the infinite-respawn loop on indexing jobs that crash the worker child via an unhandled rejection / uncaught exception. The child worker had no process-level fatal handlers, so a silent exit left the reservation for the parent's `worker.on('exit')` to finalize as `'interrupted'`, which the per-job reservation cap explicitly excludes from its count. Result: deterministic indexing crashes loop forever without abandonment.
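To make the cap behavior concrete, here is a minimal sketch of why `'interrupted'` reservations never trip it. The types and function names are hypothetical, the cap is passed in as a parameter rather than read from the real constant, and counting only `'completed'` rows is an assumption based on the description above.

```typescript
// Hypothetical model: null means the reservation is still open.
type CompletionReason = 'completed' | 'interrupted' | null;

interface Reservation {
  jobId: number;
  completionReason: CompletionReason;
}

// Assumption: only 'completed' reservations count toward the per-job cap.
function countedReservations(reservations: Reservation[], jobId: number): number {
  return reservations.filter(
    (r) => r.jobId === jobId && r.completionReason === 'completed',
  ).length;
}

function shouldRejectJob(reservations: Reservation[], jobId: number, cap: number): boolean {
  return countedReservations(reservations, jobId) >= cap;
}

function makeReservations(count: number, jobId: number, reason: CompletionReason): Reservation[] {
  return Array.from({ length: count }, () => ({ jobId, completionReason: reason }));
}

// 43 silent exits, each finalized by the parent as 'interrupted'.
const parentStamped = makeReservations(43, 388477, 'interrupted');
// Two crashes finalized by the child-side handler as 'completed'.
const childStamped = makeReservations(2, 388477, 'completed');

console.log(shouldRejectJob(parentStamped, 388477, 2)); // false
console.log(shouldRejectJob(childStamped, 388477, 2)); // true
```

In this model the number of crashes is irrelevant as long as the parent stamps them all: the counted total stays at zero, which matches the loop seen in staging.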
Motivating incident
Staging `from-scratch-index` job 388477 (ctse/personal/) churned through 43 worker child processes in 30 minutes without progress. Every reservation closed with `completion_reason = 'interrupted'`, never `'completed'`. The triggering crash on each cycle:

Trace: a `.gts` module with a duplicate top-level `export class`, consumed by a `.json` instance that adopts from it. The host's prerender returned the error in `renderResult.error`, `card-indexer.ts` logged the warning and called `updateEntry` with an `instance-error` doc. That path is fully wrapped, and propagation back to pg-queue should work. But it didn't: every reservation was `'interrupted'`, so the silent exit must have been below the indexer.
Changes
- `packages/realm-server/lib/finalize-child-fatal-failure.ts` (new): mirrors `finalize-orphan-reservations.ts`. Marks the worker's in-flight reservation with `completion_reason = 'completed'`, the status that the per-job cap counts. Best-effort against a damaged DB connection.
- `packages/realm-server/worker.ts`: adds `process.on('unhandledRejection', ...)` and `process.on('uncaughtException', ...)`. Both log, send to Sentry, best-effort finalize the reservation, then `process.exit(1)`. The finalize step is bounded by a 5-second race so that a damaged DB connection cannot block exit indefinitely. If finalize doesn't return in time we exit anyway and fall back to the existing parent-side `'interrupted'` path (same behavior as before this PR).
- `packages/realm-server/tests/indexing-test.ts` (one new test): writes a `.gts` module with a duplicate top-level export and a `.json` instance that adopts from it, asserts the consumer-card-side handling is correct (instance-error with the Babel message in its chain), and pins the surprising module-file-side behavior (the broken module lands as a healthy `file` row: fileExtract's metadata pass succeeds, and the compile error only surfaces on the consumer's error doc).
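The log, finalize, exit sequence with a bounded wait can be sketched as follows. This is not the code in `worker.ts`: `finalizeWithDeadline` and `makeFatalHandler` are hypothetical names, Sentry reporting is folded into the injected `log` callback, and `exit` is injected so the sketch can run without terminating the process.

```typescript
type FinalizeOutcome = 'finalized' | 'failed' | 'timed-out';

// Race the finalize call against a timer so a hung DB connection cannot block exit.
async function finalizeWithDeadline(
  finalize: () => Promise<unknown>,
  timeoutMs: number,
): Promise<FinalizeOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<FinalizeOutcome>((resolve) => {
    timer = setTimeout(() => resolve('timed-out'), timeoutMs);
  });
  // Promise.resolve().then(...) also turns a synchronous throw into 'failed'.
  const attempt: Promise<FinalizeOutcome> = Promise.resolve()
    .then(finalize)
    .then(
      (): FinalizeOutcome => 'finalized',
      (): FinalizeOutcome => 'failed',
    );
  try {
    return await Promise.race([attempt, deadline]);
  } finally {
    if (timer !== undefined) clearTimeout(timer); // do not keep the process alive
  }
}

interface FatalHandlerDeps {
  log: (kind: string, err: unknown) => void;
  finalize: () => Promise<unknown>;
  exit: (code: number) => void;
  timeoutMs?: number; // defaults to the 5 seconds mentioned in this PR
}

function makeFatalHandler(deps: FatalHandlerDeps): (kind: string, err: unknown) => Promise<void> {
  let handling = false; // ignore a second fatal event raised during shutdown
  return async (kind, err) => {
    if (handling) return;
    handling = true;
    deps.log(kind, err);
    await finalizeWithDeadline(deps.finalize, deps.timeoutMs ?? 5000);
    deps.exit(1); // exit whether or not finalize succeeded
  };
}

// In a Node worker child the wiring would look roughly like:
//   const onFatal = makeFatalHandler({ log, finalize, exit: (c) => process.exit(c) });
//   process.on('unhandledRejection', (reason) => void onFatal('unhandledRejection', reason));
//   process.on('uncaughtException', (err) => void onFatal('uncaughtException', err));
```

Exiting on every outcome, including a timeout or a failed finalize, is what preserves the fallback described above: if the child cannot stamp the reservation, the parent's exit hook still closes it as `'interrupted'`.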
What this PR does NOT do
- Fix the "broken `.gts` indexed as healthy `file` row" behavior the test discovered. The compilation error surfaces on the consumer card's error doc and the `modules` table, not the module's own `boxel_index` row. Whether that's a real gap or by design is a follow-up question.
- Clean up the job from the motivating incident: it is still `unfulfilled` and needs to be manually marked `rejected`. Once this lands and deploys, future jobs hitting the same crash will hit the cap after `MAX_RESERVATION_COUNT_PER_JOB` (currently 2) attempts and auto-reject.

Test plan
- On a realm with a known duplicate-export module, observe that the reservation closes as `'completed'`, the cap trips at `MAX_RESERVATION_COUNT_PER_JOB`, and the job goes to `'rejected'` with the actual error in `jobs.result`.

🤖 Generated with Claude Code