Fix: Browser context leak causes Chrome crash and cascading test failures#50
Open
georgephillips wants to merge 2 commits intomainfrom
Open
Fix: Browser context leak causes Chrome crash and cascading test failures#50georgephillips wants to merge 2 commits intomainfrom
georgephillips wants to merge 2 commits intomainfrom
Conversation
…d enhanced error logging. Updated BrowserWindow structure to include context ID and browser reference.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When running a large test suite (~240 tests) at concurrency 4, Chrome becomes unresponsive after ~50 tests, causing
Failed to close browser window: Timeoutpanics that cascade into mass test failures. The root cause is a browser context leak — each test creates a CDP browser context but never disposes it, causing Chrome to accumulate dozens of zombie contexts until it crashes.Root cause
In
get_window()(definitions/browser/mod.rs), each test creates a new browser context viacreate_browser_context(), but the context ID is stored in a local variable and dropped after the page is created. Only thePageis returned asBrowserWindow::Chrome(page).At shutdown,
Civilization::shutdown()callspage.close()(CDPTarget.closeTarget), which closes the tab but does not dispose the browser context. Thedispose_on_detach: trueflag only fires when the CDP WebSocket session disconnects (at process exit), not when the page closes.After ~50 tests, Chrome carries ~50 empty but alive browser contexts with their associated memory/renderer processes. Chrome hits a resource limit, the WebSocket stalls, and all concurrent workers' CDP commands timeout simultaneously.
Observed behavior
thread 'tokio-runtime-worker' panickedmessages appear atcivilization.rswithFailed to close browser window: TimeoutStep timed out after 30son the browser load stepmain.rswithFailed to await all testsFix (three changes)
1. Fix the context leak
Changed
BrowserWindow::Chrometo store theBrowserContextIdand anArc<Browser>alongside the page. Updatedshutdown()to explicitly callbrowser.dispose_browser_context(context_id)after closing the page.2. Graceful error handling on window close
Replaced
.expect("Failed to close browser window")inCivilization::shutdown()with error logging and a 5-second timeout guard. Browser teardown failures are now warnings, not panics.3. Graceful JoinError handling in main
Replaced
panic!("Failed to await all tests: {e}")with afilter_mapthat logs panicked tasks to stderr and continues collecting results.Files changed
toolproof/src/definitions/browser/mod.rs—BrowserWindow::Chromenow holds{ page, context_id, browser }; all match arms updatedtoolproof/src/civilization.rs— shutdown disposes context, with timeout + error handlingtoolproof/src/main.rs— JoinError logged instead of panickingVerification
Tested against a ~240 test suite that reliably reproduced the bug:
Failed to close browser window: Timeout)Step timed out after 30s)The 18 remaining failures after the fix are real test-level issues (snapshot changes, selector timeouts), not infrastructure cascades.
Environment