Claude harness: avoid invalid --continue retries after SIGTERM/no-deferred-marker#31194
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
--continue retries after SIGTERM/no-deferred-marker
There was a problem hiding this comment.
Pull request overview
Fixes Claude harness retry behavior to avoid invalid --continue retries after signal-style termination and after No deferred tool marker found, falling back to fresh retries when appropriate.
Changes:
- Added
isSignalTerminationExitCode()andshouldRetryWithContinue()to centralize/strengthen--continueretry eligibility. - Changed no-deferred-marker handling to disable
--continuepermanently and fall back to fresh retries when budget remains. - Added unit tests covering the new retry-decision matrix helpers.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/claude_harness.cjs | Adds new helper functions and hardens retry logic to avoid invalid --continue paths and improve diagnostics. |
| actions/setup/js/claude_harness.test.cjs | Adds unit tests for the new helper functions and retry eligibility conditions. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 2
| @@ -79,6 +79,7 @@ const MAX_TURNS_EXIT_PATTERN = /"subtype"\s*:\s*"error_max_turns"/; | |||
| // window. Retrying with --continue will always produce the same instant failure, so | |||
| // this is a deterministic terminal condition that must not be retried. | |||
| log(`attempt ${attempt + 1}: no deferred tool marker — not retriable via --continue`); | ||
| if (attempt < MAX_RETRIES && result.hasOutput) { | ||
| useContinueOnRetry = false; | ||
| continueDisabledPermanently = true; |
|
✅ Necromancer fortified this PR with fresh regression coverage. |
🧪 Test Quality Sentinel ReportTest Quality Score: 88/100✅ Excellent test quality
Test Classification DetailsView all 5 tests
Flagged Tests — Requires ReviewNo tests flagged. All tests verify observable behavioral contracts. Score Breakdown
Language SupportTests analyzed:
Verdict
📖 Understanding Test ClassificationsDesign Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators. References: §25600541860
|
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd — this is a targeted bug fix addressing invalid --continue retry paths after signal-style termination.
Key Themes
-
Dead parameter in
shouldRetryWithContinue: TheisNoDeferredMarkerparameter (and its guard inside the function) is unreachable at the only call site inmain, because theisNoDeferredMarkerearly-exit block alwayscontinues orbreaks before reaching the retry section. This makes the parameter misleading and its test case moot — see inline comment. -
Test coverage gaps on
shouldRetryWithContinue's early-guard compound condition: The firstif(attempt >= maxRetries || !hasOutput || continueDisabledPermanently) has three sub-conditions, none of which are individually tested — see inline comment.
Positive Highlights
- ✅ Extracting
shouldRetryWithContinueinto a pure function is exactly the right move — it's easily testable in isolation and makes the retry logic readable. - ✅
continueDisabledPermanentlyas a sticky flag prevents re-entering a broken continue-path on subsequent attempts — clean state machine design. - ✅ Explicit
failure_reason=log tokens are a nice addition for post-hoc diagnostics and log-based alerting. - ✅
SIGNAL_TERMINATION_EXIT_CODESas aSetis efficient and self-documenting.
Verdict
Commentary only — the core fix is sound and well-structured. The two issues flagged are a correctness concern (dead parameter that implies a reachable code path that doesn't exist) and a test-completeness concern. Neither blocks the fix, but addressing them would make the retry logic more trustworthy long-term.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 4.7M
| * }} input | ||
| * @returns {boolean} | ||
| */ | ||
| function shouldRetryWithContinue({ attempt, maxRetries, exitCode, hasOutput, isNoDeferredMarker, continueDisabledPermanently }) { |
There was a problem hiding this comment.
[/diagnose] The isNoDeferredMarker guard inside shouldRetryWithContinue is dead code at the call site.
The early-exit block in main always either continues or breaks when isNoDeferredMarker is true, so by the time shouldRetryWithContinue is invoked, isNoDeferredMarker is invariably false. The test exercising isNoDeferredMarker: true therefore covers a path that cannot occur in production.
Consider removing the parameter and its guard from shouldRetryWithContinue, so the function contract matches its actual call sites.
| hasOutput: true, | ||
| isNoDeferredMarker: true, | ||
| continueDisabledPermanently: false, | ||
| }); |
There was a problem hiding this comment.
[/tdd] The three guards in shouldRetryWithContinue that fire before the signal-termination check are untested:
attempt >= maxRetries→ should returnfalse!hasOutput→ should returnfalsecontinueDisabledPermanently: true→ should returnfalse
Suggested additions:
it('returns false when attempt has reached maxRetries', () => {
expect(shouldRetryWithContinue({ attempt: 3, maxRetries: 3, exitCode: 1, hasOutput: true, isNoDeferredMarker: false, continueDisabledPermanently: false })).toBe(false);
});
it('returns false when hasOutput is false', () => {
expect(shouldRetryWithContinue({ attempt: 0, maxRetries: 3, exitCode: 1, hasOutput: false, isNoDeferredMarker: false, continueDisabledPermanently: false })).toBe(false);
});
it('returns false when continueDisabledPermanently is true', () => {
expect(shouldRetryWithContinue({ attempt: 0, maxRetries: 3, exitCode: 1, hasOutput: true, isNoDeferredMarker: false, continueDisabledPermanently: true })).toBe(false);
});Without these, a future refactor of the early-guard line could silently regress retry behaviour.
|
@copilot review all comments |
|
Commit pushed:
|
|
Necromancer regression pass complete. Linked issue investigated: #31177, covering Claude harness retries that incorrectly used Regression risk found: the PR had helper-level coverage, but the retry loop itself could still regress by re-entering Tests added: updated Validation run:
Changes pushed to PR branch:
|
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Addressed the review feedback in commit
Validated with |
Bug Fix
What was the bug?
claude_harness.cjstreated failed attempts with output as resumable via--continue. After SIGTERM/SIGKILL-style termination (143/137) orNo deferred tool marker found, retry flow could enter an invalid continue path and fail prematurely. It could also accidentally keep a caller-supplied--continueon “fresh” retries.How did you fix it?
Retry path hardening
isSignalTerminationExitCode()for137/143.shouldRetryWithContinue()to centralize continue eligibility checks.Fresh-retry fallback instead of terminal failure
isNoDeferredMarkerError, the harness now falls through to a fresh retry (when budget remains) instead of stopping immediately.--continueis permanently disabled after invalid continue-path detection.--continueso non-continue retries are truly fresh.Failure reason clarity
failure_reason=harness_retry_path_invalidfailure_reason=cancelled_or_timed_outUnit coverage for retry matrix
exitCode=143⇒shouldRetryWithContinue(...) === falseisNoDeferredMarker=true⇒shouldRetryWithContinue(...) === false--continueis removed on fresh retry after invalid continue-path detectionTesting
Targeted harness unit tests were updated and run to validate the retry-decision matrix, fresh-retry fallback behavior, and user-supplied
--continuestripping on fresh retries.