PagePool: don't block reused-tab callers on standby refill (CS-11139)#4822
Conversation
`getPage` synchronously awaited `#ensureStandbyPool()` before selecting an entry. That promise is deduplicated via `#ensuringStandbys` so every concurrent caller paid the worst-case refill time — including callers that would have landed on a warm reused tab and never needed a standby. A single slow eviction (stuck `page.close()` on a Glimmer-tracking-loop render) was enough to stall every render on the server, and the tax showed up against every caller as `tabStartupMs` regardless of whether they actually warmed a tab. The refill is now fire-and-forget. `#selectEntryForAffinity` awaits `#ensureStandbyPool` only in the last-resort path where warm-tab / orphan / commandeer / cross-affinity-steal all missed, and reports the wait it actually paid through a new `tabStartupMs` field on its return shape. `getPage`'s public `waits.tabStartupMs` now attributes the per-caller wait instead of dedup-amplified time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 46f8177ee3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR updates the realm-server prerender PagePool so that slow/hung standby-tab refills no longer stall callers that can reuse an already-warm tab, and refines the tabStartupMs diagnostic to reflect only the standby-refill wait actually paid by an individual caller.
Changes:
- Make
getPage()kick off standby refills asynchronously (no longer awaiting the deduped refill promise up front). - Move standby-refill waiting into
#selectEntryForAffinity()’s true last-resort path and return per-callertabStartupMsfor accurate attribution. - Add a regression test suite covering “refill doesn’t block reuse” scenarios and register it in the test index.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
packages/realm-server/prerender/page-pool.ts |
Decouples standby refill from the critical path for warm-tab reuse; introduces per-caller tabStartupMs and last-resort refill retry logic. |
packages/realm-server/tests/page-pool-standby-refill-test.ts |
Adds regression tests ensuring reused-tab callers are not blocked by stalled refills and that true standby-needing callers report the wait. |
packages/realm-server/tests/index.ts |
Registers the new test module in the realm-server QUnit suite. |
.claude/skills/indexing-diagnostics/SKILL.md |
Updates operational documentation to reflect the new per-caller tabStartupMs semantics. |
Comments suppressed due to low confidence (1)
packages/realm-server/tests/page-pool-standby-refill-test.ts:222
- The
elapsed < 200assertion is timing-sensitive and may be flaky under slow CI or a busy event loop. Consider making this deterministic (assert the promise resolves before unblocking, or use a much more forgiving threshold) so the test doesn’t fail spuriously.
assert.true(
elapsed < 200,
`both reused-tab callers returned in ${elapsed}ms despite blocked standby refill`,
);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…teal When a brand-new affinity arrives with no warm tabs of its own, queueing behind another realm's in-flight render serializes the caller against unrelated work. Pre-CS-11139, the upfront `await #ensureStandbyPool()` in `getPage` made the right ordering implicit — by the time `#selectEntryForAffinity` ran, a fresh standby was usually available for `#commandeerDormantTab` to claim, so the cross-affinity steal never fired in practice. The fire-and-forget refill broke that ordering: a brand-new affinity could now reach the cross-affinity branch while the refill was still mid-flight, and queue on a busy donor tab instead of waiting the shorter time for the in-flight standby to land. Make the ordering explicit. In the `entryList.length === 0` branch (brand-new affinity), `await` the refill once and retry the orphan-claim / standby-commandeer paths before falling back to cross-affinity steal. Cross-affinity remains the final safety net for when the refill itself can't produce a tab. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Adds a `#kickStandbyRefill(reason)` helper that consistently attaches `.catch` to every fire-and-forget `#ensureStandbyPool()` call so an unhandled rejection (e.g. browser unreachable) can't crash the process under strict unhandled-rejection settings. Routes the three call sites (`getPage` pre-acquire, `getPage` post-acquire, `disposeAffinity` post-cleanup) through it. * Tightens the standby-refill tests: drops timing-based `elapsed < 200ms` assertions in favour of the deterministic `tabStartupMs === 0` contract proof (CI-stable), and replaces the vestigial `unblockAt > 0` check with a real `unblockFired` boolean that asserts the timer actually fired before getPage resolved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Host Test Results 1 files ± 0 1 suites ±0 3h 21m 36s ⏱️ + 1h 33m 23s Results for commit a24f09b. ± Comparison against earlier commit a099f21. Realm Server Test Results 1 files ±0 1 suites ±0 14m 30s ⏱️ +41s Results for commit a24f09b. ± Comparison against earlier commit a099f21. |
`#selectEntryForAffinity`'s `awaitStandbyPool` helper unconditionally yielded a microtask on `await this.#ensureStandbyPool()` — even when the refill loop was a guaranteed no-op (e.g. `disableStandbyRefill` is on with an active tab, or the pool is already at desired capacity). That extra microtask hop changed the relative ordering against a sibling caller hitting the simpler same-affinity `least-pending` branch above. The sibling reached `TabQueue.acquire` one microtask earlier, inflating `pendingCount` past the cross-affinity scan's `> 1` filter — leaving the brand-new-affinity caller without a stealable candidate and throwing `'No standby page available for prerender'`. Fix: skip the await entirely when `#currentStandbyCount` already meets `#desiredStandbyCount`. With the fast path the helper is purely synchronous in the no-op case, restoring the relative microtask ordering the cross-affinity scan was designed around. Regression seen by the `prerendering-test.ts > queues same-realm request when tab is transitioning` test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[Claude Code] CI failure on Fixed in commit e7a467b: |
The previous attempt at a fast-path skip in `awaitStandbyPool` still
yielded a microtask via `await asyncHelper()` — even when the
helper returned synchronously (the resolved Promise still rounds
through the microtask queue). That extra hop kept the relative
microtask order broken: the cross-realm caller paid one more
yield than the same-realm caller, the same-realm caller reached
`entry.queue.acquire` first, `pendingCount` inflated past the
cross-affinity scan's `> 1` filter, and the cross-realm caller
threw `'No standby page available for prerender'`.
Inline the `if (current < desired) { await ... }` check at the
two call sites in `#selectEntryForAffinity`. The no-op path is
now strictly synchronous (no Promise round-trip), restoring the
relative microtask ordering the cross-affinity scan was designed
around. `prerendering-test.ts > queues same-realm request when
tab is transitioning` is the regression test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[Claude Code] Previous CI fix wasn't enough — the fast-path check was inside an Inlined the check at both call sites in a099f21 — |
|
[Claude Code] Status update on commit a099f21:
|
The eager-commandeer path in `#selectEntryForAffinity`'s step 2
was preempting fresh-standby creation by stealing an idle tab from
ANOTHER realm — and the stolen tab's runtime state (cached
modules, deps tracking, localStorage) would leak into this
caller's prerender. Pre-CS-11139, the upfront synchronous
`await #ensureStandbyPool()` in `getPage` made the right ordering
implicit; by the time `commandeerDormantTab` ran, a fresh standby
was always available so the cross-affinity-steal branch never
fired in step 2.
After CS-11139 removed that upfront await, the refill is fire-and-
forget, so step 2's commandeer can hit the cross-affinity branch
when no standby is ready yet. That broke
`prerendering-test.ts > file prerender returns extracted metadata`
("deps include base file-api module") because the file-extract
landed on a tab where file-api was already loaded — so the host's
deps tracker for this prerender didn't re-track it.
Add a `standbyOnly: true` opt on `#commandeerDormantTab`. All
step-2 / fallback paths pass it (they should never steal across
realms — only adopt a fresh standby). Cross-affinity steal is
reserved for the explicit `crossAffinityEntries` scan in the
`entryList === 0` last-resort branch, which already runs after
`await this.#ensureStandbyPool()` so a fresh standby is preferred
over a busy cross-realm tab anyway.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[Claude Code] Found the actual cause of test 46 (
Fixed in a24f09b: add a 48/48 page-pool + prerender unit tests pass locally. |
Summary
PagePool#getPagefrom the dedup'd#ensureStandbyPoolpromise so a slow refill no longer stalls reused-tab callers across the server.#selectEntryForAffinity's last-resort branch, returning per-callertabStartupMsso the diagnostic captures only the wait actually paid.Closes CS-11139.
What was wrong
getPagedidawait this.#ensureStandbyPool()synchronously before#selectEntryForAffinity. Concurrent callers shared the same#ensuringStandbyspromise, so when standby creation hung — for example, when#evictLRUAffinitywaited onpage.close()of a Glimmer-tracking-loop render — every caller on that server paid the full hang, even ones who would have landed on a warm reused tab and never touched a standby.The on-call fingerprint: a 2026-05-13 staging incident where a single hung render on prerender container
5f5f63e1622841ed9e239f176203a512produced a fleet-wide 90-minute degradation, with bursts of slowtabStartupMsevents that included rows withtabReused: true(130 s of "tab startup" for callers that ended up on a warm tab) and 6 indexing jobs aborted by the 150 s upstream timeout across multiple realms — all routed to the same wedged container.What changed
packages/realm-server/prerender/page-pool.tsgetPage(around line 942): the synchronousawait this.#ensureStandbyPool()is replaced with a fire-and-forget kick (void this.#ensureStandbyPool().catch(log)). Standby refill still runs in the background to keep the pool warm for the next caller.#selectEntryForAffinitynow reportstabStartupMson its return shape. The internalawaitStandbyPoolhelper accumulates only the time this specific caller spent inside#ensureStandbyPool. The fast paths (warm idle tab / orphan claim / dormant-tab commandeer / cross-affinity steal) all returntabStartupMs: 0.#selectEntryForAffinity: if every fast path missed, we now await the refill once and retry#tryClaimOrphanContext/#commandeerDormantTabbefore throwing'No standby page available for prerender'. The throw is now genuinely the "refill couldn't produce a tab" case, which used to be unreachable in practice because of the upstream synchronous await.tabQueueMsaccounting ingetPageis adjusted to subtract any standby-refill wait that#selectEntryForAffinityperformed, so the wait categories don't double-count..claude/skills/indexing-diagnostics/SKILL.md— updates the inlinetabStartupMscomment to reflect the per-caller semantics; atabReused: truerow should now havetabStartupMs ≈ 0, and a non-trivial value on a reused-tab row is the CS-11139 dedup-amplification signature.packages/realm-server/tests/page-pool-standby-refill-test.ts— three new tests pinning the contract:tabStartupMs === 0while the standby refill is hung in the background.tabStartupMs.Test plan
pnpm exec qunit … tests/page-pool-standby-refill-test.ts— 3/3 passpnpm exec qunit … tests/page-pool-expansion-test.ts tests/page-pool-priority-test.ts tests/page-pool-standby-refill-test.ts— 28/28 passpnpm exec qunit … tests/prerender-deadlock-test.ts tests/prerender-cancellation-test.ts— 20/20 passpnpm lint:jsclean on touched files (existinglint:typesfailures in../base/*.gtsfrom unbuilt sibling packages are pre-existing and unrelated)tabStartupMsp99 falls fortabReused: truerows; ≥60s bucket empties for reused-tab rows🤖 Generated with Claude Code