-
Notifications
You must be signed in to change notification settings - Fork 1.5k
fix(background-agent): prevent false failure notifications on network errors #837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
fix(background-agent): prevent false failure notifications on network errors #837
Conversation
|
Thank you for your contribution! Before we can merge this PR, we need you to sign our Contributor License Agreement (CLA). To sign the CLA, please comment on this PR with: This is a one-time requirement. Once signed, all your future contributions will be automatically accepted. I have read the CLA Document and I hereby sign the CLA 1 out of 2 committers have signed the CLA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8057c03cc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| errorMessage.includes("agent.name") || | ||
| errorMessage.includes("undefined") || | ||
| errorMessage.includes("not found") || | ||
| errorMessage.includes("not registered") || | ||
| errorMessage.includes("does not exist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Treat other hard prompt failures as fatal
The new fatal filter only considers a handful of substrings (agent.name, undefined, not found, etc.) as fatal, so a prompt failure due to other hard errors (e.g., auth/permission errors, invalid request payloads) will be treated as non‑fatal and the task will stay running without releasing concurrency or notifying the parent. Since a failed prompt means the background session may never start, polling can’t advance (status/messages calls will error), leaving the task stuck until the 30‑minute TTL. Consider handling other non‑transient failures as fatal or explicitly checking for auth/permission errors before returning.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 issue found across 1 file
Confidence score: 3/5
- Prompt exceptions in
src/features/background-agent/manager.tsare swallowed, so failed tasks linger and keep concurrency slots occupied until TTL cleanup, which can block new work. - The issue looks confined to the background agent manager, so once the error propagation is fixed the rest of the PR should behave as intended.
- Pay close attention to
src/features/background-agent/manager.ts– ensure prompt errors propagate so stuck tasks release concurrency immediately.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/background-agent/manager.ts">
<violation number="1" location="src/features/background-agent/manager.ts:194">
P1: Non-fatal prompt errors are swallowed, leaving tasks stuck running with concurrency held and no parent notification until TTL cleanup.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Ask questions if you need clarification on any suggestion
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
|
@vmlinuzx please sign the cli and resolve the cubic ai conversations. |
…sient only Address PR code-yeongyu#837 feedback from cubic AI and codex: - INVERTED LOGIC: Treat errors as fatal by default instead of only catching specific 'not found' patterns. This fixes the P1 issue where auth/permission/payload errors were swallowed, leaving tasks stuck until TTL cleanup. - Only skip notification for known TRANSIENT network errors: - TimeoutError, AbortError, FetchError (error names) - ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND (Node.js errors) - EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH (additional network) - socket hang up, network request failed, connection/request timed out - Empty error objects ({} or '') from SDK issues - Fixed resume() to also release concurrency on fatal errors (was missing) - Stricter agent-not-found detection: requires 'agent' AND ('not found' OR 'not registered' OR 'does not exist') - prevents false positives from generic 'undefined' errors - Removed overly broad patterns ('network', 'timeout') that could match non-transient errors like 'invalid timeout value' Ultraworked with Sisyphus Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Addressed FeedbackThanks for the thorough review! I've pushed a fix that addresses the cubic AI and codex feedback: Changes Made
Verification
@cubic-dev-ai Please re-review! |
@vmlinuzx I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 issue found across 1 file
Confidence score: 3/5
- Empty SDK errors in
src/features/background-agent/manager.tsslip past the transient-error handling, meaning what should be retried is treated as fatal and the agent may stop unnecessarily. - Given the medium severity and likelihood of real user impact, merging now carries moderate risk unless the transient-error detection is updated.
- Pay close attention to
src/features/background-agent/manager.ts- transient SDK errors aren’t classified correctly.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/background-agent/manager.ts">
<violation number="1" location="src/features/background-agent/manager.ts:208">
P2: Empty SDK error objects won’t match the transient-error check because `String(error)` produces `[object Object]`, so these cases remain fatal.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
|
I have read the CLA Document and I hereby sign the CLA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/background-agent/manager.ts">
<violation number="1" location="src/features/background-agent/manager.ts:207">
P2: Empty-object SDK/network errors are still treated as fatal because `String(error)` produces `[object Object]`, so the transient check for `{}` never matches, causing false failure notifications.</violation>
<violation number="2" location="src/features/background-agent/manager.ts:420">
P1: Concurrency key is released twice on resume errors, over-releasing the concurrency limit.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
… errors
When API providers are under heavy load (e.g., Antigravity during the
Anthropic OAuth migration), session.prompt() may throw network errors
(timeouts, connection drops) even when the background session is still
running successfully.
Previously, ALL errors from session.prompt() would immediately mark the
task as 'error' and notify the parent agent of failure. This caused
agents to falsely believe their background tasks failed, triggering
unnecessary fallback behavior ('Background agents failed. Pivoting to
direct parallel searches').
Now, only fatal errors (agent not found, not registered, etc.) are
treated as task failures. Non-fatal errors are logged and the task
continues - polling will detect actual completion/failure.
Fixes false-positive failure notifications during high API load.
f1ae594 to
de838fc
Compare
…sient only Address PR code-yeongyu#837 feedback from cubic AI and codex: - INVERTED LOGIC: Treat errors as fatal by default instead of only catching specific 'not found' patterns. This fixes the P1 issue where auth/permission/payload errors were swallowed, leaving tasks stuck until TTL cleanup. - Only skip notification for known TRANSIENT network errors: - TimeoutError, AbortError, FetchError (error names) - ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND (Node.js errors) - EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH (additional network) - socket hang up, network request failed, connection/request timed out - Empty error objects ({} or '') from SDK issues - Fixed resume() to also release concurrency on fatal errors (was missing) - Stricter agent-not-found detection: requires 'agent' AND ('not found' OR 'not registered' OR 'does not exist') - prevents false positives from generic 'undefined' errors - Removed overly broad patterns ('network', 'timeout') that could match non-transient errors like 'invalid timeout value' Ultraworked with Sisyphus Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…sient only Address PR code-yeongyu#837 feedback from cubic AI and codex: - INVERTED LOGIC: Treat errors as fatal by default instead of only catching specific 'not found' patterns. This fixes the P1 issue where auth/permission/payload errors were swallowed, leaving tasks stuck until TTL cleanup. - Only skip notification for known TRANSIENT network errors: - TimeoutError, AbortError, FetchError (error names) - ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND (Node.js errors) - EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH (additional network) - socket hang up, network request failed, connection/request timed out - Empty error objects ({} or '') from SDK issues - Fixed resume() to also release concurrency on fatal errors (was missing) - Stricter agent-not-found detection: requires 'agent' AND ('not found' OR 'not registered' OR 'does not exist') - prevents false positives from generic 'undefined' errors - Removed overly broad patterns ('network', 'timeout') that could match non-transient errors like 'invalid timeout value' Ultraworked with Sisyphus Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
de838fc to
e243dd4
Compare
|
@vmlinuzx I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 issues found across 1 file
Confidence score: 3/5
- Empty SDK errors in
src/features/background-agent/manager.tsare treated as fatal because the transient-check never matches, so benign failures can bubble up and halt background processing. - Transient network failures leave background tasks stuck running in
src/features/background-agent/manager.ts, leaking concurrency slots until TTL and risking throughput degradation. - Pay close attention to
src/features/background-agent/manager.ts– error classification and stuck-task handling need fixes.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/background-agent/manager.ts">
<violation number="1" location="src/features/background-agent/manager.ts:221">
P2: Empty SDK error objects are misclassified as fatal because `String(error)` returns `[object Object]`, so the `errorMessage === "{}"` transient check never triggers</violation>
<violation number="2" location="src/features/background-agent/manager.ts:225">
P2: Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
|
|
||
| if (isTransientNetworkError) { | ||
| log("[background-agent] Transient network error, letting polling detect actual status:", sessionID) | ||
| return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/background-agent/manager.ts, line 225:
<comment>Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks</comment>
<file context>
@@ -186,15 +186,61 @@ export class BackgroundManager {
+
+ if (isTransientNetworkError) {
+ log("[background-agent] Transient network error, letting polling detect actual status:", sessionID)
+ return
+ }
+
</file context>
String(emptyErrorObject) returns '[object Object]', not '{}', so the
transient error check was missing empty SDK error objects. This caused
false failure notifications when the session was actually still running.
Applied fix to both launch() and resume() error handlers.
Summary
Fixes false-positive background task failure notifications during high API load (e.g., Antigravity during Anthropic OAuth migration).
Problem
When API providers are under heavy load,
session.prompt()throws network errors (timeouts, connection drops) even when the background session is still running successfully.Before: ALL errors immediately marked task as 'error' and notified parent agent → Agent sees "Background agents failed" → Pivots to fallback strategies unnecessarily
After: Only fatal errors (agent not found, not registered) mark task as failed. Network errors are logged and polling detects actual completion/failure.
Root Cause Analysis
From logs during high Antigravity load:
The empty error
{}indicates a network/SDK issue, not an agent failure. The task continued running and completed successfully, but the agent had already received the failure notification.Changes
Testing
Summary by cubic
Prevents false background task failure notifications when session.prompt() hits transient network/SDK errors. Errors are fatal by default; only known network timeouts/drops are skipped, and polling confirms real completion.
Written for commit 86c1366. Summary will update on new commits.