Skip to content

PagePool: don't block reused-tab callers on standby refill (CS-11139)#4822

Merged
habdelra merged 6 commits into
mainfrom
cs-11139-pagepool-standby-refill-dedup
May 14, 2026
Merged

PagePool: don't block reused-tab callers on standby refill (CS-11139)#4822
habdelra merged 6 commits into
mainfrom
cs-11139-pagepool-standby-refill-dedup

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

Summary

  • Decouples PagePool#getPage from the dedup'd #ensureStandbyPool promise so a slow refill no longer stalls reused-tab callers across the server.
  • Moves the standby-pool wait inside #selectEntryForAffinity's last-resort branch, returning per-caller tabStartupMs so the diagnostic captures only the wait actually paid.
  • Adds a regression test suite for the refill-doesn't-block-reuse contract.

Closes CS-11139.

What was wrong

getPage did await this.#ensureStandbyPool() synchronously before #selectEntryForAffinity. Concurrent callers shared the same #ensuringStandbys promise, so when standby creation hung — for example, when #evictLRUAffinity waited on page.close() of a Glimmer-tracking-loop render — every caller on that server paid the full hang, even ones who would have landed on a warm reused tab and never touched a standby.

The on-call fingerprint: a 2026-05-13 staging incident where a single hung render on prerender container 5f5f63e1622841ed9e239f176203a512 produced a fleet-wide 90-minute degradation, with bursts of slow tabStartupMs events that included rows with tabReused: true (130 s of "tab startup" for callers that ended up on a warm tab) and 6 indexing jobs aborted by the 150 s upstream timeout across multiple realms — all routed to the same wedged container.

What changed

packages/realm-server/prerender/page-pool.ts

  • getPage (around line 942): the synchronous await this.#ensureStandbyPool() is replaced with a fire-and-forget kick (void this.#ensureStandbyPool().catch(log)). Standby refill still runs in the background to keep the pool warm for the next caller.
  • #selectEntryForAffinity now reports tabStartupMs on its return shape. The internal awaitStandbyPool helper accumulates only the time this specific caller spent inside #ensureStandbyPool. The fast paths (warm idle tab / orphan claim / dormant-tab commandeer / cross-affinity steal) all return tabStartupMs: 0.
  • A new last-resort branch runs at the bottom of #selectEntryForAffinity: if every fast path missed, we now await the refill once and retry #tryClaimOrphanContext / #commandeerDormantTab before throwing 'No standby page available for prerender'. The throw is now genuinely the "refill couldn't produce a tab" case, which used to be unreachable in practice because of the upstream synchronous await.
  • tabQueueMs accounting in getPage is adjusted to subtract any standby-refill wait that #selectEntryForAffinity performed, so the wait categories don't double-count.

.claude/skills/indexing-diagnostics/SKILL.md — updates the inline tabStartupMs comment to reflect the per-caller semantics; a tabReused: true row should now have tabStartupMs ≈ 0, and a non-trivial value on a reused-tab row is the CS-11139 dedup-amplification signature.

packages/realm-server/tests/page-pool-standby-refill-test.ts — three new tests pinning the contract:

  1. Reused-tab caller returns quickly with tabStartupMs === 0 while the standby refill is hung in the background.
  2. Caller that genuinely needs a fresh standby still waits and reports the actual wait in tabStartupMs.
  3. Concurrent reused-tab callers on different affinities both return quickly despite a stalled refill — the dedup'd promise no longer leaks waits across callers.

Test plan

  • pnpm exec qunit … tests/page-pool-standby-refill-test.ts — 3/3 pass
  • pnpm exec qunit … tests/page-pool-expansion-test.ts tests/page-pool-priority-test.ts tests/page-pool-standby-refill-test.ts — 28/28 pass
  • pnpm exec qunit … tests/prerender-deadlock-test.ts tests/prerender-cancellation-test.ts — 20/20 pass
  • pnpm lint:js clean on touched files (existing lint:types failures in ../base/*.gts from unbuilt sibling packages are pre-existing and unrelated)
  • Soak in staging: tabStartupMs p99 falls for tabReused: true rows; ≥60s bucket empties for reused-tab rows
  • Follow-up: CS-11140 for the underlying eviction-can't-free-stuck-pages issue this exposes

🤖 Generated with Claude Code

`getPage` synchronously awaited `#ensureStandbyPool()` before
selecting an entry. That promise is deduplicated via
`#ensuringStandbys` so every concurrent caller paid the worst-case
refill time — including callers that would have landed on a warm
reused tab and never needed a standby. A single slow eviction
(stuck `page.close()` on a Glimmer-tracking-loop render) was enough
to stall every render on the server, and the tax showed up against
every caller as `tabStartupMs` regardless of whether they actually
warmed a tab.

The refill is now fire-and-forget. `#selectEntryForAffinity`
awaits `#ensureStandbyPool` only in the last-resort path where
warm-tab / orphan / commandeer / cross-affinity-steal all missed,
and reports the wait it actually paid through a new `tabStartupMs`
field on its return shape. `getPage`'s public `waits.tabStartupMs`
now attributes the per-caller wait instead of dedup-amplified time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46f8177ee3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/realm-server/prerender/page-pool.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the realm-server prerender PagePool so that slow/hung standby-tab refills no longer stall callers that can reuse an already-warm tab, and refines the tabStartupMs diagnostic to reflect only the standby-refill wait actually paid by an individual caller.

Changes:

  • Make getPage() kick off standby refills asynchronously (no longer awaiting the deduped refill promise up front).
  • Move standby-refill waiting into #selectEntryForAffinity()’s true last-resort path and return per-caller tabStartupMs for accurate attribution.
  • Add a regression test suite covering “refill doesn’t block reuse” scenarios and register it in the test index.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
packages/realm-server/prerender/page-pool.ts Decouples standby refill from the critical path for warm-tab reuse; introduces per-caller tabStartupMs and last-resort refill retry logic.
packages/realm-server/tests/page-pool-standby-refill-test.ts Adds regression tests ensuring reused-tab callers are not blocked by stalled refills and that true standby-needing callers report the wait.
packages/realm-server/tests/index.ts Registers the new test module in the realm-server QUnit suite.
.claude/skills/indexing-diagnostics/SKILL.md Updates operational documentation to reflect the new per-caller tabStartupMs semantics.
Comments suppressed due to low confidence (1)

packages/realm-server/tests/page-pool-standby-refill-test.ts:222

  • The elapsed < 200 assertion is timing-sensitive and may be flaky under slow CI or a busy event loop. Consider making this deterministic (assert the promise resolves before unblocking, or use a much more forgiving threshold) so the test doesn’t fail spuriously.
    assert.true(
      elapsed < 200,
      `both reused-tab callers returned in ${elapsed}ms despite blocked standby refill`,
    );

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/realm-server/prerender/page-pool.ts Outdated
Comment thread packages/realm-server/tests/page-pool-standby-refill-test.ts Outdated
Comment thread packages/realm-server/tests/page-pool-standby-refill-test.ts Outdated
habdelra and others added 2 commits May 13, 2026 18:08
…teal

When a brand-new affinity arrives with no warm tabs of its own,
queueing behind another realm's in-flight render serializes the
caller against unrelated work. Pre-CS-11139, the upfront
`await #ensureStandbyPool()` in `getPage` made the right ordering
implicit — by the time `#selectEntryForAffinity` ran, a fresh
standby was usually available for `#commandeerDormantTab` to claim,
so the cross-affinity steal never fired in practice.

The fire-and-forget refill broke that ordering: a brand-new affinity
could now reach the cross-affinity branch while the refill was
still mid-flight, and queue on a busy donor tab instead of waiting
the shorter time for the in-flight standby to land.

Make the ordering explicit. In the `entryList.length === 0` branch
(brand-new affinity), `await` the refill once and retry the
orphan-claim / standby-commandeer paths before falling back to
cross-affinity steal. Cross-affinity remains the final safety net
for when the refill itself can't produce a tab.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Adds a `#kickStandbyRefill(reason)` helper that consistently
  attaches `.catch` to every fire-and-forget `#ensureStandbyPool()`
  call so an unhandled rejection (e.g. browser unreachable) can't
  crash the process under strict unhandled-rejection settings.
  Routes the three call sites (`getPage` pre-acquire, `getPage`
  post-acquire, `disposeAffinity` post-cleanup) through it.

* Tightens the standby-refill tests: drops timing-based
  `elapsed < 200ms` assertions in favour of the deterministic
  `tabStartupMs === 0` contract proof (CI-stable), and replaces
  the vestigial `unblockAt > 0` check with a real `unblockFired`
  boolean that asserts the timer actually fired before getPage
  resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

Host Test Results

    1 files  ±    0      1 suites  ±0   3h 21m 36s ⏱️ + 1h 33m 23s
2 661 tests +    2  2 646 ✅ +    2  15 💤 ± 0  0 ❌ ±0 
5 278 runs  +2 600  5 249 ✅ +2 586  29 💤 +14  0 ❌ ±0 

Results for commit a24f09b. ± Comparison against earlier commit a099f21.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   14m 30s ⏱️ +41s
1 368 tests +2  1 368 ✅ +3  0 💤 ±0  0 ❌  - 1 
1 447 runs  +2  1 447 ✅ +3  0 💤 ±0  0 ❌  - 1 

Results for commit a24f09b. ± Comparison against earlier commit a099f21.

`#selectEntryForAffinity`'s `awaitStandbyPool` helper unconditionally
yielded a microtask on `await this.#ensureStandbyPool()` — even when
the refill loop was a guaranteed no-op (e.g. `disableStandbyRefill`
is on with an active tab, or the pool is already at desired capacity).

That extra microtask hop changed the relative ordering against a
sibling caller hitting the simpler same-affinity `least-pending`
branch above. The sibling reached `TabQueue.acquire` one microtask
earlier, inflating `pendingCount` past the cross-affinity scan's
`> 1` filter — leaving the brand-new-affinity caller without a
stealable candidate and throwing
`'No standby page available for prerender'`.

Fix: skip the await entirely when `#currentStandbyCount` already meets
`#desiredStandbyCount`. With the fast path the helper is purely
synchronous in the no-op case, restoring the relative microtask
ordering the cross-affinity scan was designed around. Regression
seen by the `prerendering-test.ts > queues same-realm request when
tab is transitioning` test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] CI failure on prerendering-test.ts > queues same-realm request when tab is transitioning (and the related does not reuse across different realms) traced to a microtask-ordering regression my fix introduced. The new awaitStandbyPool helper in #selectEntryForAffinity always yielded a microtask via await this.#ensureStandbyPool(), even when the refill was a guaranteed no-op (disableStandbyRefill: true with an active tab). That extra hop let a sibling same-affinity caller queue on the busy tab one microtask earlier, inflating its TabQueue's pendingCount past the cross-affinity scan's > 1 filter — and the brand-new-affinity caller threw 'No standby page available for prerender' despite a valid stealable candidate existing.

Fixed in commit e7a467b: awaitStandbyPool now skips the await entirely when #currentStandbyCount >= #desiredStandbyCount (no refill work to do). The helper is synchronous in the no-op case, restoring the relative microtask ordering the cross-affinity scan was designed around. Local sweep of page-pool-* + prerender-deadlock + prerender-cancellation tests: 48/48 pass; the CI re-run should clear both failures.

@habdelra habdelra requested a review from a team May 13, 2026 22:49
The previous attempt at a fast-path skip in `awaitStandbyPool` still
yielded a microtask via `await asyncHelper()` — even when the
helper returned synchronously (the resolved Promise still rounds
through the microtask queue). That extra hop kept the relative
microtask order broken: the cross-realm caller paid one more
yield than the same-realm caller, the same-realm caller reached
`entry.queue.acquire` first, `pendingCount` inflated past the
cross-affinity scan's `> 1` filter, and the cross-realm caller
threw `'No standby page available for prerender'`.

Inline the `if (current < desired) { await ... }` check at the
two call sites in `#selectEntryForAffinity`. The no-op path is
now strictly synchronous (no Promise round-trip), restoring the
relative microtask ordering the cross-affinity scan was designed
around. `prerendering-test.ts > queues same-realm request when
tab is transitioning` is the regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] Previous CI fix wasn't enough — the fast-path check was inside an async helper, and await asyncHelper() still yields one microtask on the resolved Promise (the very thing the fast-path was trying to avoid). The cross-realm caller paid that extra hop and lost the microtask race against the same-realm caller, same failure mode as before.

Inlined the check at both call sites in a099f21if (currentStandbyCount < desiredStandbyCount) { await … } directly, no helper. The no-op path is now strictly synchronous, restoring the pre-CS-11139 relative microtask ordering. Verified locally: 48/48 page-pool + prerender tests pass.

@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] Status update on commit a099f21:

  • Tests 79 (does not reuse across different realms) and 100 (queues same-realm request when tab is transitioning) — the original regression my fix targeted — now both PASS. The inline microtask check is correct.
  • A different test failed: 46 file prerender returns extracted metadata with "deps include base file-api module" expected true, got false. This is in file-extract dep tracking, unrelated to my changes (which only touch #selectEntryForAffinity standby-refill ordering). Reads like a flake; the workflow auto-retries on push so we'll see what the next run shows.

The eager-commandeer path in `#selectEntryForAffinity`'s step 2
was preempting fresh-standby creation by stealing an idle tab from
ANOTHER realm — and the stolen tab's runtime state (cached
modules, deps tracking, localStorage) would leak into this
caller's prerender. Pre-CS-11139, the upfront synchronous
`await #ensureStandbyPool()` in `getPage` made the right ordering
implicit; by the time `commandeerDormantTab` ran, a fresh standby
was always available so the cross-affinity-steal branch never
fired in step 2.

After CS-11139 removed that upfront await, the refill is fire-and-
forget, so step 2's commandeer can hit the cross-affinity branch
when no standby is ready yet. That broke
`prerendering-test.ts > file prerender returns extracted metadata`
("deps include base file-api module") because the file-extract
landed on a tab where file-api was already loaded — so the host's
deps tracker for this prerender didn't re-track it.

Add a `standbyOnly: true` opt on `#commandeerDormantTab`. All
step-2 / fallback paths pass it (they should never steal across
realms — only adopt a fresh standby). Cross-affinity steal is
reserved for the explicit `crossAffinityEntries` scan in the
`entryList === 0` last-resort branch, which already runs after
`await this.#ensureStandbyPool()` so a fresh standby is preferred
over a busy cross-realm tab anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] Found the actual cause of test 46 (file prerender returns extracted metadata / "deps include base file-api module"). It's a real regression from CS-11139, not a flake.

#commandeerDormantTab does both standby-adoption AND cross-affinity-steal. Pre-CS-11139, getPage's upfront await #ensureStandbyPool() always produced a fresh standby BEFORE commandeerDormantTab ran in step 2 of #selectEntryForAffinity — so the cross-affinity branch never fired there in practice. After CS-11139 removed the upfront await, the fire-and-forget refill can lag behind, and step 2's commandeer falls through to cross-affinity steal. The stolen tab's runtime has file-api cached from previous tests in other realms, so the host's per-prerender deps tracker doesn't re-track it.

Fixed in a24f09b: add a standbyOnly: true opt on #commandeerDormantTab. All eager step-2 / fallback paths pass it (they should never preemptively steal across realms — only adopt a fresh standby). Cross-affinity steal is reserved for the explicit crossAffinityEntries scan in the entryList===0 last-resort branch, which already awaits the refill so a fresh standby is preferred over a busy cross-realm tab.

48/48 page-pool + prerender unit tests pass locally.

@habdelra habdelra merged commit f572c6c into main May 14, 2026
96 of 99 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants