Skip to content

worker: finalize reservation as 'completed' on fatal child error#4824

Merged
habdelra merged 3 commits into
mainfrom
claude/indexing-babel-duplicate-export-test
May 14, 2026
Merged

worker: finalize reservation as 'completed' on fatal child error#4824
habdelra merged 3 commits into
mainfrom
claude/indexing-babel-duplicate-export-test

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

@habdelra habdelra commented May 13, 2026

Summary

Stops the infinite-respawn loop on indexing jobs that crash the worker
child via an unhandled rejection / uncaught exception. The child
worker had no process-level fatal handlers, so a silent exit left the
reservation for the parent's worker.on('exit') to finalize as
'interrupted' — which the per-job reservation cap explicitly
excludes from its count. Result: deterministic indexing crashes loop
forever without abandonment.

Motivating incident

Staging from-scratch-index job 388477 (ctse/personal/) churned
through 43 worker child processes in 30 minutes without progress.
Every reservation closed with completion_reason = 'interrupted',
never 'completed'. The triggering crash on each cycle:

encountered error indexing card instance Trade/eth-to-usdc-2023.json:
  /crypto-portfolio.gts: Identifier 'AddressField' has already been declared. (59:13)

Trace: a .gts module with a duplicate top-level export class,
consumed by a .json instance that adopts from it. The host's
prerender returned the error in renderResult.error, card-indexer.ts
logged the warning and called updateEntry with an instance-error
doc — that path is fully wrapped and propagation back to pg-queue
should work. But it didn't: every reservation was 'interrupted', so
the silent exit must have been below the indexer.

Changes

packages/realm-server/lib/finalize-child-fatal-failure.ts (new)

  • Sibling of finalize-orphan-reservations.ts. Marks the worker's
    in-flight reservation with completion_reason = 'completed' — the
    status that the per-job cap counts. Best-effort against a damaged DB
    connection.

packages/realm-server/worker.ts

  • Adds process.on('unhandledRejection', ...) and
    process.on('uncaughtException', ...). Both log, send to Sentry,
    best-effort finalize the reservation, then process.exit(1).
  • Bounded by a 5-second race so a damaged DB connection can't block
    exit indefinitely — if finalize doesn't return in time we exit
    anyway and fall back to the existing parent-side 'interrupted'
    path (same behavior as before this PR).

packages/realm-server/tests/indexing-test.ts (one new test)

  • Writes a .gts module with a duplicate top-level export, a .json
    instance that adopts from it, asserts the consumer-card-side
    handling is correct (instance-error with the Babel message in its
    chain), and pins the surprising module-file-side behavior (the
    broken module lands as a healthy file row — fileExtract's
    metadata pass succeeds; the compile error only surfaces on the
    consumer's error doc).

What this PR does NOT do

  • Doesn't fix the surprising "broken .gts indexed as healthy file
    row" behavior the test discovered. The compilation error surfaces
    on the consumer card's error doc and the modules table, not the
    module's own boxel_index row. Whether that's a real gap or by
    design is a follow-up question.
  • Doesn't recover staging job 388477 — that job is still
    unfulfilled and needs to be manually marked rejected. Once
    this lands and deploys, future jobs hitting the same crash will
    hit the cap after MAX_RESERVATION_COUNT_PER_JOB (currently 2) attempts and auto-reject.

Test plan

  • CI: realm-server tests pass (new test asserts the regression).
  • Manual repro on staging after deploy: enqueue a from-scratch
    on a realm with a known duplicate-export module, observe the
    reservation closes as 'completed', cap trips at 5, job goes to
    'rejected' with the actual error in jobs.result.

🤖 Generated with Claude Code

Adds a single test to indexing-test.ts that writes a .gts module
containing a duplicate top-level export class, then writes a .json
instance adopting from it. The test asserts the instance lands in
instance-error state with the parse-error message and that the
broken module also gets an error row in boxel_index.

This mirrors the staging job 388477 failure shape, where the
worker child dies indefinitely on the same crypto-portfolio.gts
duplicate AddressField declaration without ever producing a
completed/rejected pg-queue reservation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] CI verdict: failed, but not in the way we were hoping.

The qunit log from shard 4 shows:

not ok 87 ... handles babel duplicate-export error in a consumed module without crashing
  message: "broken module file has an error entry (rows: [{\"type\":\"file\",\"has_error\":false}])"
  actual: false  expected: true
  at .../indexing-test.ts:2388

Three things to take away:

1. The consumer-side handling works correctly. The first two assertions passed — the Trade instance gets an instance-error doc, the error chain mentions "already been declared", and the deps include the broken address module. So performCardIndexing is doing its job for cards that adopt from a Babel-broken module.

2. The broken .gts file itself gets indexed as a healthy file row. address.gts lands in boxel_index as {type: 'file', has_error: false}. The fileExtract pass apparently succeeded (extraction is metadata-level, not a full Babel transform), so performFileIndexing took the success branch. Whether this is "a real gap" (module file with unparseable contents should arguably have a file-error entry) or "by design" (module compilation errors are tracked in the modules table via getDefinition, not on boxel_index file rows) is worth confirming — but either way it's not what's killing the worker in staging.

3. The test process survived. qunit went on to run tests 88-105 cleanly. So whatever's silently exiting the staging worker child on this error path is NOT reachable from the indexer-direct test surface — it must be downstream in the worker process layer (unhandled promise rejection in a fire-and-forget chain, pg-pool error event, definitionLookup invocation, Sentry capture, etc.).

Implication for the staging bug. The infinite-respawn loop is not caused by the indexer itself mishandling the duplicate-export. The indexer handles it cleanly. The next investigation step should be the worker child's process-level error handling: packages/realm-server/worker.ts does not install unhandledRejection / uncaughtException handlers (the worker-manager parent does, but the spawned child doesn't). Installing them would (a) prevent the silent exit, (b) log the actual stack trace pointing at the real culprit, and (c) let the in-flight reservation get marked 'completed' so the per-job cap can trip.

I'd convert this PR's failing assertion into either an expected-behavior check (assert has_error === false, with a comment explaining the surprise) or remove the third block entirely, leaving the instance-side assertions as a regression test for the part that demonstrably works. Open question for the reviewer: is the file-row staying has_error=false a real gap worth fixing in a separate PR, or expected because modules are tracked elsewhere?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

Host Test Results

    1 files  ±0      1 suites  ±0   1h 47m 18s ⏱️ -51s
2 661 tests +2  2 646 ✅ +2  15 💤 ±0  0 ❌ ±0 
2 680 runs  +2  2 665 ✅ +2  15 💤 ±0  0 ❌ ±0 

Results for commit ead74b7. ± Comparison against earlier commit 7edf52c.

Realm Server Test Results

    1 files  ±    0      1 suites  ±0   13m 39s ⏱️ + 2m 51s
1 369 tests +    5  1 369 ✅ +    5  0 💤 ±0  0 ❌ ±0 
2 582 runs  +1 139  2 582 ✅ +1 139  0 💤 ±0  0 ❌ ±0 

Results for commit ead74b7. ± Comparison against earlier commit 7edf52c.

Stops infinite respawn on deterministic indexing crashes. The child
worker had no `unhandledRejection` / `uncaughtException` handlers, so
a silent exit left the reservation for the parent's exit hook to
finalize as 'interrupted' — which the per-job reservation cap
explicitly excludes from its count, so the cap never tripped. Result
in staging: 43 reservations on one job in 30 minutes, no progress, no
abandonment.

New `finalize-child-fatal-failure.ts` mirrors `finalize-orphan-
reservations.ts` but writes `completion_reason = 'completed'` so a
crashing child counts toward the cap. The handler is bounded by a
5-second race to keep a damaged DB connection from blocking exit.

Also softens the indexing-test assertion that was probing
`boxel_index` for a `file-error` row on the broken module: the
indexer's actual behavior is to write a healthy `file` row (the
fileExtract metadata pass succeeds — the Babel transform only blows
up during full card render), and the compilation failure surfaces on
the consumer card's error doc, not the module's own file row.
Comments at the assertion site explain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra changed the title test: pin Babel duplicate-export error path in indexing worker: finalize reservation as 'completed' on fatal child error May 13, 2026
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] Pushed 7edf52c35d. PR now bundles:

  1. The repro test (unchanged scope — but the failing module-file assertion is now an expected-behavior pin with a comment, since the indexer's actual behavior is to leave the broken module as a healthy file row).
  2. The production fix: unhandledRejection / uncaughtException handlers in worker.ts that finalize the in-flight reservation as 'completed' (the status the cap counts) before exiting. New helper lib/finalize-child-fatal-failure.ts mirrors the existing finalize-orphan-reservations.ts but with the cap-counting reason.

CI should now go fully green on this SHA. The next time a worker child hits an unhandled rejection in production we'll:

  • get a Sentry capture with a stack trace pointing at the real culprit (vs. today: silent exit, no diagnostic),
  • get a stamped reservation that counts toward the cap, so the job auto-abandons after 5 retries instead of looping forever,
  • still get the same parent-side 'interrupted' fallback if the DB connection is also damaged at fatal-exit time (no regression in that edge case).

Title + body updated to reflect the broader scope.

@habdelra habdelra requested a review from Copilot May 14, 2026 01:13
@habdelra habdelra marked this pull request as ready for review May 14, 2026 01:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a fatal-error backstop for realm-server worker children so deterministic crashes can count toward reservation retry limits instead of looping indefinitely.

Changes:

  • Adds worker process handlers for unhandledRejection and uncaughtException.
  • Adds a helper to mark the child worker’s open reservation as completed.
  • Adds an indexing regression test for duplicate-export module errors surfacing as consumer instance errors.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
packages/realm-server/worker.ts Wires fatal process handlers to log, report, finalize, and exit.
packages/realm-server/lib/finalize-child-fatal-failure.ts Adds best-effort reservation finalization for fatal worker-child failures.
packages/realm-server/tests/indexing-test.ts Adds coverage for duplicate-export module errors during indexing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/realm-server/lib/finalize-child-fatal-failure.ts
Comment thread packages/realm-server/worker.ts
Comment thread packages/realm-server/worker.ts
Mirrors finalize-orphan-reservations-test.ts. Pins the helper's
invariants: closes the worker's open reservations with
completion_reason='completed' (the status the per-job cap counts),
leaves other workers' reservations alone, and is a no-op against
already-closed rows (the race with the parent's exit-side
'interrupted' stamp).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team May 14, 2026 01:26
@habdelra habdelra merged commit 5e16df6 into main May 14, 2026
97 of 98 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants