fix(code-reviews): retry transient sandbox prepareSession failures#3315
Open
imanolmzd-svg wants to merge 2 commits into
Open
fix(code-reviews): retry transient sandbox prepareSession failures#3315imanolmzd-svg wants to merge 2 commits into
imanolmzd-svg wants to merge 2 commits into
Conversation
Three additional cloud-agent-next prepareSession 5xx error bodies are now classified as retryable, so a fresh-session retry can recover from transient sandbox-side failures that the new strict classifier in #3290 left in 'other_5xx' / 'unclassified_5xx': - mkdir operation failed with exit code NaN/null (sandbox exec returned no real exit code; typically a stuck Cloudflare Sandbox DO) - Network connection lost (Cloudflare DO connection drop mid-prepare) - CommandError + Command timeout after Nms (e.g. git checkout 30s timeout); kept distinct from repo_clone_or_checkout_failure which remains hard-not-retryable for LFS / missing-object cases Observed live in the 2026-05-18 code-reviews/up error_spike incident where multiple bot reviewer sandboxes accumulated permanent failures because the orchestrator no longer retried these transient cases. With this change the orchestrator allocates a fresh sandbox on the next attempt, which is the recovery path the previous classifier reached implicitly via hasSandboxSignal && hasInternalServerSignal. The new categories surface in the 'Fresh session retry skipped' / 'Retrying with a fresh session' logs so post-deploy effectiveness can be measured per failure mode.
Contributor
Code Review SummaryStatus: No Issues Found | Recommendation: Merge Executive SummaryThe previously flagged inaccurate comment (claiming Files Reviewed (2 files)
Reviewed by claude-sonnet-4.6 · 131,807 tokens Review guidance: REVIEW.md from base branch |
…sification The explanatory comments above each retry condition in classifyCloudAgentNextFreshSessionRetry were removed. The function name, classification helper, and reason strings already convey sufficient context for each case.
emilieschario
approved these changes
May 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extend
classifyCloudAgentNextFreshSessionRetryso three transient sandbox-sideprepareSessionfailure bodies are classified retryable instead of falling through toother_5xx/unclassified_5xx. Each new branch is opt-in (literal substring match) and lands after the existing deterministic not-retryable rules, so behavior for billing, cancelled, configured-session-lookup and hard repo/checkout failures is unchanged.New retryable categories:
sandbox_exec_invalid_exit_code— matchesmkdir operation failed with exit code(coversNaNandnull). The sandboxexecreturned no numeric exit code, typically a stuck Cloudflare Sandbox DO.sandbox_network_connection_lost— matchesnetwork connection lost. Cloudflare DO connection drop mid-prepareSession.sandbox_command_timeout— matchescommanderror+command timeout after. Generic in-sandbox command timeout (e.g.git checkout30s). Distinct fromrepo_clone_or_checkout_failure, which stays hard-not-retryable for LFS / missing-object cases.Why
Three observed failure bodies post-#3290 classifier:
other_5xxmkdir operation failed with exit code NaNinternal_server_error/500but nosandbox/container/cloudflare→ fallback&&failsNetwork connection lost.CommandError: ... Command timeout after 30000mscommand timeoutnot matched;failed to checkout pull refnot in bodyIn production this means a single stuck bot reviewer sandbox accumulates permanent failures until the user gives up, instead of being recycled by the orchestrator's fresh-session retry. The pre-#3290
hasSandboxSignal && hasInternalServerSignalclause retried some of these implicitly; this PR makes the intent explicit per failure mode so it can be measured.Safety
repo_clone_or_checkout_failure, so hard checkout failures (LFS, missing objects, deterministic checkout errors) stay not-retryable. Thesandbox_command_timeoutbranch only triggers on the literalcommand timeout aftersubstring.sandboxRetryAttempted === true(orchestrator state) caps retries at one per review (code-review-orchestrator.ts:382). Worst case for a genuinely-broken sandbox is one extra retry.Fresh session retry skipped/Retrying with a fresh session after sandbox 500log fields (failureCategory,retryClassificationReason) for post-deploy measurement per mode.Tests
Three new integration tests in
code-review-orchestrator.test.ts, mirroring the existingretries prepareSession once after wrapper waitForPort readiness timeoutpattern:retries prepareSession once after sandbox mkdir returns invalid exit coderetries prepareSession once after sandbox network connection lostretries prepareSession once after sandbox command timeout during workspace setupEach asserts a single fresh-session retry and a successful
initiateFromKilocodeSessionV2call.Refs