
fix: MCP daemon hangs on init when a parse-worker grammar load fails #567

Open
arttttt wants to merge 5 commits into colbymchenry:main from arttttt:fix/daemon-grammar-load-hang

Conversation

arttttt (Contributor) commented May 29, 2026

Problem

The MCP server can hang on startup / first use and never initialize — on any project, large or small. The shared daemon stops responding and every editor session that connects to it hangs waiting to initialize.

Root cause

ExtractionOrchestrator spins up a parse worker and awaits a grammar-load handshake:

// src/extraction/index.ts (ensureWorker, before)
await new Promise<void>((resolve, reject) => {
  parseWorker!.once('message', (msg) => {
    if (msg.type === 'grammars-loaded') resolve();
    else reject(...);
  });
  parseWorker!.postMessage({ type: 'load-grammars', languages: neededLanguages });
});

The bare once('message') only ever resolves on grammars-loaded. The worker's load-grammars handler had no try/catch, and attachWorkerHandlers' 'error'/'exit' handlers reject only entries in pendingParses (empty during the grammar-load handshake). So if the worker dies while loading grammars (a tree-sitter WASM abort), 'grammars-loaded' never arrives, nothing rejects the promise, and await ensureWorker() hangs forever.

ensureWorker runs inside cg.index() / cg.sync(), which run inside indexMutex.withLock(...). A hung await means withLock's finally release never runs, so the in-process index mutex is held forever. In the shared daemon (#411 — one process across all MCP clients) the background catch-up sync is that hung op, so its gate promise never settles: the first tool call — and every client that connects afterwards — hangs on init, regardless of project size. (The unbounded await predates the daemon — it's from the move to worker-thread parsing — but the shared daemon turned a per-client transient into a permanent, daemon-wide wedge.)
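The lock-holding behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `SimpleMutex` and `tryLock` are hypothetical stand-ins for `indexMutex` and its callers, written only to show that a `finally` release never runs while the awaited operation stays pending.

```typescript
// Hypothetical stand-in for indexMutex; the real implementation is not shown in this PR.
class SimpleMutex {
  private tail: Promise<void> = Promise.resolve();

  async withLock<T>(fn: () => Promise<T>): Promise<T> {
    const previous = this.tail;
    let release!: () => void;
    // Queue ourselves behind the current holder before waiting on it.
    this.tail = new Promise<void>((resolve) => {
      release = resolve;
    });
    await previous;
    try {
      return await fn();
    } finally {
      // Only reached once fn() settles. A promise that never settles
      // means the lock is never released.
      release();
    }
  }
}

// Resolves to "acquired" if the lock is obtained within `ms`, otherwise "blocked".
async function tryLock(
  mutex: SimpleMutex,
  ms: number,
): Promise<"acquired" | "blocked"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"blocked">((resolve) => {
    timer = setTimeout(() => resolve("blocked"), ms);
  });
  const attempt = mutex.withLock(async () => "acquired" as const);
  try {
    return await Promise.race([attempt, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

If the first caller passes `() => new Promise<void>(() => {})` (a promise that never settles, like the unbounded handshake), every later `tryLock` reports `"blocked"`, which is the daemon-wide wedge in miniature.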

Fix

  1. parse-worker.ts — wrap load-grammars in try/catch; report failures as grammars-load-failed instead of dying silently.
  2. extraction/index.ts — extract awaitWorkerGrammarLoad() that settles on the first of grammars-loaded / grammars-load-failed / worker error / worker exit / timeout (always cleans up its listeners). On failure the worker is torn down and parsing degrades to in-process (slower but correct) for the rest of the run.
  3. mcp/tools.ts (defense in depth) — bound the first-call wait on the catch-up gate (default 120s, override CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS) so a sync that never settles for any reason can't wedge the first call.
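The "settle on the first of several outcomes" idea in item 2 can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual `awaitWorkerGrammarLoad`: the function name `awaitGrammarLoadSketch`, the `GrammarLoadResult` shape, the `error` field on the failure message, and the choice to ignore unrelated messages are all hypothetical. It accepts any `EventEmitter`, which is also the base class of a `worker_threads` `Worker`.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical result shape; the real helper may reject instead of returning a value.
type GrammarLoadResult = { ok: true } | { ok: false; reason: string };

function awaitGrammarLoadSketch(
  worker: EventEmitter,
  timeoutMs: number,
): Promise<GrammarLoadResult> {
  return new Promise((resolve) => {
    // Single exit point: clear the timer and detach every listener before resolving.
    const settle = (result: GrammarLoadResult) => {
      clearTimeout(timer);
      worker.off("message", onMessage);
      worker.off("error", onError);
      worker.off("exit", onExit);
      resolve(result);
    };

    const onMessage = (msg: { type?: string; error?: string }) => {
      if (msg?.type === "grammars-loaded") {
        settle({ ok: true });
      } else if (msg?.type === "grammars-load-failed") {
        // The `error` field is an assumption about the message payload.
        settle({ ok: false, reason: msg.error ?? "grammar load failed" });
      }
      // Assumption: any other message is ignored and the wait continues.
    };
    const onError = (err: Error) =>
      settle({ ok: false, reason: `worker error: ${err.message}` });
    const onExit = (code: number) =>
      settle({ ok: false, reason: `worker exited with code ${code}` });

    const timer = setTimeout(
      () => settle({ ok: false, reason: `timed out after ${timeoutMs}ms` }),
      timeoutMs,
    );

    worker.on("message", onMessage);
    worker.on("error", onError);
    worker.on("exit", onExit);
  });
}
```

Because every path goes through `settle`, the promise cannot stay pending past `timeoutMs`, and no listener outlives the handshake. A caller could then tear down the worker and fall back to in-process parsing whenever `ok` is `false`.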

Tests

  • __tests__/parse-worker-grammar-load.test.ts: awaitWorkerGrammarLoad settles (never hangs) on every outcome and removes its listeners. (The worker path isn't exercised under vitest because parse-worker.js isn't next to the .ts, so useWorker=false. That is why this never surfaced in CI, and why the helper was extracted to be directly testable.)
  • __tests__/mcp-catchup-gate.test.ts: a gate that never settles no longer hangs the first tool call.

Not included (deliberate follow-up)

A proxy↔daemon JSON-RPC timeout (so a wedged daemon falls back to direct mode) was considered but skipped: the daemon sends hello synchronously and answers initialize without the gate, so after this fix a daemon can no longer go silent on JSON-RPC from any known cause — the timeout would guard a near-impossible case at the cost of reworking the transparent proxy pipe that every session depends on.

🤖 Generated with Claude Code

arttttt and others added 5 commits May 30, 2026 00:17
…o load grammars

The parse worker's grammar-load handshake was awaited with a bare
`once('message')` that only resolved on `grammars-loaded`. If the worker died
while loading grammars (a tree-sitter WASM abort), that message never arrived,
so the await — and the in-process `indexMutex` it runs under — hung forever. In
the shared daemon (one process across all MCP clients), this wedged the
background catch-up sync, so the first tool call and every client that connected
afterwards hung on init with no fallback, regardless of project size.

`ensureWorker` now awaits via the new exported `awaitWorkerGrammarLoad`, which
settles on the first of grammars-loaded / grammars-load-failed / worker error /
worker exit / timeout. On failure the worker is torn down and parsing degrades
to in-process (slower but correct) for the rest of the run. The worker's
`load-grammars` handler also reports JS-level load failures instead of dying
silently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
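
The worker-side half of this commit (reporting JS-level load failures instead of dying silently) might look roughly like the sketch below. It is an illustration only: `handleLoadGrammars`, the injected `loadGrammar` loader, the `post` callback, and the `error` field are hypothetical names, not the contents of parse-worker.ts.

```typescript
type LoadReply =
  | { type: "grammars-loaded" }
  | { type: "grammars-load-failed"; error: string };

// `loadGrammar` stands in for whatever loads one tree-sitter grammar;
// `post` stands in for the worker's reply channel (for example parentPort.postMessage).
async function handleLoadGrammars(
  languages: string[],
  loadGrammar: (language: string) => Promise<void>,
  post: (reply: LoadReply) => void,
): Promise<void> {
  try {
    for (const language of languages) {
      await loadGrammar(language);
    }
    post({ type: "grammars-loaded" });
  } catch (err) {
    // A thrown JS error becomes an explicit reply rather than silence.
    post({
      type: "grammars-load-failed",
      error: err instanceof Error ? err.message : String(err),
    });
  }
}
```

A try/catch only covers failures that surface as JavaScript exceptions. An abort that kills the worker outright produces no reply at all, which is why the main-thread side also has to watch for worker error, exit, and a timeout.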
Asserts awaitWorkerGrammarLoad settles (never hangs) on every outcome —
grammars-loaded, grammars-load-failed, worker exit, worker error, and timeout —
and always removes its listeners. Regression guard for the daemon init-hang.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st call

Defense in depth on top of the parse-worker grammar-load fix. The first tool
call blocks on the engine's post-open catch-up sync (the gate) so it never
serves rows for files deleted while no server was running. It awaited the gate
unbounded — a sync that never settles for any reason would hang the first call
and, in the shared daemon, every client that connected after.

The wait is now bounded (default 120s, override CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS);
on timeout the call proceeds best-effort over possibly-stale data — the same
outcome the existing rejection-swallow already accepts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
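
The bounded wait described in this commit can be sketched with `Promise.race`. This is a minimal illustration, not the code in mcp/tools.ts: `waitForGate` and `resolveGateTimeoutMs` are hypothetical names, and falling back to the default on missing or invalid values is an assumption. Only the 120s default and the `CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS` variable name come from the PR.

```typescript
const DEFAULT_GATE_TIMEOUT_MS = 120_000; // default of 120s stated in the PR

// Assumption: missing, non-numeric, or non-positive values fall back to the default.
function resolveGateTimeoutMs(env: Record<string, string | undefined>): number {
  const parsed = Number(env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_GATE_TIMEOUT_MS;
}

// Waits for the gate, but never longer than timeoutMs. A rejected gate counts as
// settled, mirroring the rejection-swallow behaviour the commit message mentions.
async function waitForGate(
  gate: Promise<unknown>,
  timeoutMs: number,
): Promise<"settled" | "timed-out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  const settled = gate.then(
    () => "settled" as const,
    () => "settled" as const,
  );
  try {
    return await Promise.race([settled, timeout]);
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after an early settle
  }
}
```

A caller would pass `resolveGateTimeoutMs(process.env)` as the timeout and then serve the request either way, treating `"timed-out"` as permission to proceed over possibly-stale data.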
Asserts the first tool call still serves (best-effort) when the catch-up gate
never settles, instead of hanging — using a tiny CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS
override.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>