Skip to content

fix(code-reviews): retry transient sandbox prepareSession failures#3315

Open
imanolmzd-svg wants to merge 2 commits into
mainfrom
imanol/add-retryable-errors
Open

fix(code-reviews): retry transient sandbox prepareSession failures#3315
imanolmzd-svg wants to merge 2 commits into
mainfrom
imanol/add-retryable-errors

Conversation

@imanolmzd-svg
Copy link
Copy Markdown
Contributor

@imanolmzd-svg imanolmzd-svg commented May 18, 2026

Summary

Extend classifyCloudAgentNextFreshSessionRetry so three transient sandbox-side prepareSession failure bodies are classified retryable instead of falling through to other_5xx / unclassified_5xx. Each new branch is opt-in (literal substring match) and lands after the existing deterministic not-retryable rules, so behavior for billing, cancelled, configured-session-lookup and hard repo/checkout failures is unchanged.

New retryable categories:

  • sandbox_exec_invalid_exit_code — matches mkdir operation failed with exit code (covers NaN and null). The sandbox exec returned no numeric exit code, typically a stuck Cloudflare Sandbox DO.
  • sandbox_network_connection_lost — matches network connection lost. Cloudflare DO connection drop mid-prepareSession.
  • sandbox_command_timeout — matches commanderror + command timeout after. Generic in-sandbox command timeout (e.g. git checkout 30s). Distinct from repo_clone_or_checkout_failure, which stays hard-not-retryable for LFS / missing-object cases.

Why

Three observed failure bodies post-#3290 classifier:

Body Why it landed in other_5xx
mkdir operation failed with exit code NaN has internal_server_error/500 but no sandbox/container/cloudflare → fallback && fails
Network connection lost. matches no branch
CommandError: ... Command timeout after 30000ms command timeout not matched; failed to checkout pull ref not in body

In production this means a single stuck bot reviewer sandbox accumulates permanent failures until the user gives up, instead of being recycled by the orchestrator's fresh-session retry. The pre-#3290 hasSandboxSignal && hasInternalServerSignal clause retried some of these implicitly; this PR makes the intent explicit per failure mode so it can be measured.

Safety

  • Strictly additive: only new branches added; no existing classifier output changes.
  • Each new branch lands after repo_clone_or_checkout_failure, so hard checkout failures (LFS, missing objects, deterministic checkout errors) stay not-retryable. The sandbox_command_timeout branch only triggers on the literal command timeout after substring.
  • Bounded blast: sandboxRetryAttempted === true (orchestrator state) caps retries at one per review (code-review-orchestrator.ts:382). Worst case for a genuinely-broken sandbox is one extra retry.
  • New categories surface in the existing Fresh session retry skipped / Retrying with a fresh session after sandbox 500 log fields (failureCategory, retryClassificationReason) for post-deploy measurement per mode.

Tests

Three new integration tests in code-review-orchestrator.test.ts, mirroring the existing retries prepareSession once after wrapper waitForPort readiness timeout pattern:

  • retries prepareSession once after sandbox mkdir returns invalid exit code
  • retries prepareSession once after sandbox network connection lost
  • retries prepareSession once after sandbox command timeout during workspace setup

Each asserts a single fresh-session retry and a successful initiateFromKilocodeSessionV2 call.

Refs

Three additional cloud-agent-next prepareSession 5xx error bodies are
now classified as retryable, so a fresh-session retry can recover from
transient sandbox-side failures that the new strict classifier in #3290
left in 'other_5xx' / 'unclassified_5xx':

- mkdir operation failed with exit code NaN/null (sandbox exec returned
  no real exit code; typically a stuck Cloudflare Sandbox DO)
- Network connection lost (Cloudflare DO connection drop mid-prepare)
- CommandError + Command timeout after Nms (e.g. git checkout 30s
  timeout); kept distinct from repo_clone_or_checkout_failure which
  remains hard-not-retryable for LFS / missing-object cases

Observed live in the 2026-05-18 code-reviews/up error_spike incident
where multiple bot reviewer sandboxes accumulated permanent failures
because the orchestrator no longer retried these transient cases. With
this change the orchestrator allocates a fresh sandbox on the next
attempt, which is the recovery path the previous classifier reached
implicitly via hasSandboxSignal && hasInternalServerSignal.

The new categories surface in the 'Fresh session retry skipped' /
'Retrying with a fresh session' logs so post-deploy effectiveness can
be measured per failure mode.
Comment thread services/code-review-infra/src/code-review-orchestrator.ts Outdated
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 18, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

The previously flagged inaccurate comment (claiming df pre-flight exitCode: null was covered by the mkdir operation failed with exit code matcher) has been removed along with the other explanatory comments on the surrounding if blocks. No new issues introduced.

Files Reviewed (2 files)
  • services/code-review-infra/src/code-review-orchestrator.ts — previous suggestion resolved (comment removed)
  • services/code-review-infra/test/integration/code-review-orchestrator.test.ts — no changes

Reviewed by claude-sonnet-4.6 · 131,807 tokens

Review guidance: REVIEW.md from base branch main

…sification

The explanatory comments above each retry condition in
classifyCloudAgentNextFreshSessionRetry were removed. The function name,
classification helper, and reason strings already convey sufficient
context for each case.
@imanolmzd-svg imanolmzd-svg enabled auto-merge (squash) May 18, 2026 22:10
@imanolmzd-svg imanolmzd-svg disabled auto-merge May 18, 2026 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants