fix(background-agent): prevent false failure notifications on network errors #837

vmlinuzx · 2026-01-16T03:04:06Z

Summary

Fixes false-positive background task failure notifications during high API load (e.g., Antigravity during Anthropic OAuth migration).

Problem

When API providers are under heavy load, session.prompt() throws network errors (timeouts, connection drops) even when the background session is still running successfully.

Before: ALL errors immediately marked task as 'error' and notified parent agent → Agent sees "Background agents failed" → Pivots to fallback strategies unnecessarily

After: Only fatal errors (agent not found, not registered) mark task as failed. Network errors are logged and polling detects actual completion/failure.

Root Cause Analysis

From logs during high Antigravity load:

[background-agent] promptAsync error: {}
[background-agent] notifyParentSession called for task: "bg_09c21064"  ← Error notification sent
...
[background-agent] Sent notification: {"taskId":"bg_09c21064","allComplete":true}  ← Task actually completed!

The empty error {} indicates a network/SDK issue, not an agent failure. The task continued running and completed successfully, but the agent had already received the failure notification.

Changes

Only treat errors as fatal if they clearly indicate agent/session creation issues
Non-fatal errors (network, timeout, SDK) are logged but don't mark task as failed
Polling mechanism detects actual completion/failure status

Testing

All 46 background-agent tests pass
Type check passes
Build succeeds

Summary by cubic

Prevents false background task failure notifications when session.prompt() hits transient network/SDK errors. Errors are fatal by default; only known network timeouts/drops are skipped, and polling confirms real completion.

Bug Fixes
- Narrow transient handling: only skip for specific network/SDK errors; added codes/signatures (“network request failed”, “connection/request timed out”, “[object Object]”). Removed broad matches that misclassified errors.
- Stricter agent-not-found detection: requires “agent” plus “not found”/“not registered”/“does not exist”; removed broad “undefined” matches.
- Add structured logging (sessionID, agent, error name/message) for easier debugging.
- Release concurrency on resume fatal errors; standardized messages (“Background task failed…”, “Resume failed…”).

^{Written for commit 86c1366. Summary will update on new commits.}

github-actions · 2026-01-16T03:04:15Z

Thank you for your contribution! Before we can merge this PR, we need you to sign our Contributor License Agreement (CLA).

To sign the CLA, please comment on this PR with:

I have read the CLA Document and I hereby sign the CLA

This is a one-time requirement. Once signed, all your future contributions will be automatically accepted.

I have read the CLA Document and I hereby sign the CLA

1 out of 2 committers have signed the CLA.
✅ (vmlinuzx)[https://github.com/vmlinuzx]
❌ @dc
DC seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You can retrigger this bot by commenting recheck in this Pull Request.}_{Posted by the CLA Assistant Lite bot.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8057c03cc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-01-16T03:06:28Z

src/features/background-agent/manager.ts

+        errorMessage.includes("agent.name") || 
+        errorMessage.includes("undefined") ||
+        errorMessage.includes("not found") ||
+        errorMessage.includes("not registered") ||
+        errorMessage.includes("does not exist")


Treat other hard prompt failures as fatal

The new fatal filter only considers a handful of substrings (agent.name, undefined, not found, etc.) as fatal, so a prompt failure due to other hard errors (e.g., auth/permission errors, invalid request payloads) will be treated as non‑fatal and the task will stay running without releasing concurrency or notifying the parent. Since a failed prompt means the background session may never start, polling can’t advance (status/messages calls will error), leaving the task stuck until the 30‑minute TTL. Consider handling other non‑transient failures as fatal or explicitly checking for auth/permission errors before returning.

Useful? React with 👍 / 👎.

cubic-dev-ai

1 issue found across 1 file

Confidence score: 3/5

Prompt exceptions in src/features/background-agent/manager.ts are swallowed, so failed tasks linger and keep concurrency slots occupied until TTL cleanup, which can block new work.
The issue looks confined to the background agent manager, so once the error propagation is fixed the rest of the PR should behave as intended.
Pay close attention to src/features/background-agent/manager.ts – ensure prompt errors propagate so stuck tasks release concurrency immediately.

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:194">
P1: Non-fatal prompt errors are swallowed, leaving tasks stuck running with concurrency held and no parent notification until TTL cleanup.</violation>
</file>

Since this is your first cubic review, here's how it works:

cubic automatically reviews your code and comments on bugs and improvements
Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
Ask questions if you need clarification on any suggestion

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

src/features/background-agent/manager.ts

kdcokenny · 2026-01-16T12:21:29Z

@vmlinuzx please sign the cli and resolve the cubic ai conversations.

…sient only Address PR code-yeongyu#837 feedback from cubic AI and codex: - INVERTED LOGIC: Treat errors as fatal by default instead of only catching specific 'not found' patterns. This fixes the P1 issue where auth/permission/payload errors were swallowed, leaving tasks stuck until TTL cleanup. - Only skip notification for known TRANSIENT network errors: - TimeoutError, AbortError, FetchError (error names) - ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND (Node.js errors) - EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH (additional network) - socket hang up, network request failed, connection/request timed out - Empty error objects ({} or '') from SDK issues - Fixed resume() to also release concurrency on fatal errors (was missing) - Stricter agent-not-found detection: requires 'agent' AND ('not found' OR 'not registered' OR 'does not exist') - prevents false positives from generic 'undefined' errors - Removed overly broad patterns ('network', 'timeout') that could match non-transient errors like 'invalid timeout value' Ultraworked with Sisyphus Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

vmlinuzx · 2026-01-16T15:37:52Z

Addressed Feedback

Thanks for the thorough review! I've pushed a fix that addresses the cubic AI and codex feedback:

Changes Made

INVERTED LOGIC - Errors are now fatal by default. Only known transient network errors are swallowed. This fixes the P1 issue.
Tightened transient error detection - Removed overly broad patterns (network, timeout) that could match non-transient errors. Now only matches specific error codes/names:
- Error names: TimeoutError, AbortError, FetchError
- Node.js codes: ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND, EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH
- Specific messages: socket hang up, network request failed, connection timed out, request timed out
- Empty SDK errors: {} or ""
Fixed concurrency leak in resume() - Fatal errors now properly release concurrency slot.
Stricter agent-not-found detection - Requires agent AND (not found | not registered | does not exist) to avoid false positives from generic undefined errors.

Verification

✅ All 46 background-agent tests pass
✅ Type check passes
✅ Build succeeds

@cubic-dev-ai Please re-review!

cubic-dev-ai · 2026-01-16T15:38:07Z

Addressed Feedback

Thanks for the thorough review! I've pushed a fix that addresses the cubic AI and codex feedback:

Changes Made

...

@vmlinuzx I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

1 issue found across 1 file

Confidence score: 3/5

Empty SDK errors in src/features/background-agent/manager.ts slip past the transient-error handling, meaning what should be retried is treated as fatal and the agent may stop unnecessarily.
Given the medium severity and likelihood of real user impact, merging now carries moderate risk unless the transient-error detection is updated.
Pay close attention to src/features/background-agent/manager.ts - transient SDK errors aren’t classified correctly.

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:208">
P2: Empty SDK error objects won’t match the transient-error check because `String(error)` produces `[object Object]`, so these cases remain fatal.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

src/features/background-agent/manager.ts

vmlinuzx · 2026-01-16T15:45:52Z

I have read the CLA Document and I hereby sign the CLA

cubic-dev-ai

2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:207">
P2: Empty-object SDK/network errors are still treated as fatal because `String(error)` produces `[object Object]`, so the transient check for `{}` never matches, causing false failure notifications.</violation>

<violation number="2" location="src/features/background-agent/manager.ts:420">
P1: Concurrency key is released twice on resume errors, over-releasing the concurrency limit.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

src/features/background-agent/manager.ts

… errors When API providers are under heavy load (e.g., Antigravity during the Anthropic OAuth migration), session.prompt() may throw network errors (timeouts, connection drops) even when the background session is still running successfully. Previously, ALL errors from session.prompt() would immediately mark the task as 'error' and notify the parent agent of failure. This caused agents to falsely believe their background tasks failed, triggering unnecessary fallback behavior ('Background agents failed. Pivoting to direct parallel searches'). Now, only fatal errors (agent not found, not registered, etc.) are treated as task failures. Non-fatal errors are logged and the task continues - polling will detect actual completion/failure. Fixes false-positive failure notifications during high API load.

…sient only Address PR code-yeongyu#837 feedback from cubic AI and codex: - INVERTED LOGIC: Treat errors as fatal by default instead of only catching specific 'not found' patterns. This fixes the P1 issue where auth/permission/payload errors were swallowed, leaving tasks stuck until TTL cleanup. - Only skip notification for known TRANSIENT network errors: - TimeoutError, AbortError, FetchError (error names) - ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND (Node.js errors) - EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH (additional network) - socket hang up, network request failed, connection/request timed out - Empty error objects ({} or '') from SDK issues - Fixed resume() to also release concurrency on fatal errors (was missing) - Stricter agent-not-found detection: requires 'agent' AND ('not found' OR 'not registered' OR 'does not exist') - prevents false positives from generic 'undefined' errors - Removed overly broad patterns ('network', 'timeout') that could match non-transient errors like 'invalid timeout value' Ultraworked with Sisyphus Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

vmlinuzx · 2026-01-16T15:53:24Z

@cubic-dev-ai

cubic-dev-ai · 2026-01-16T15:53:29Z

@cubic-dev-ai

@vmlinuzx I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

2 issues found across 1 file

Confidence score: 3/5

Empty SDK errors in src/features/background-agent/manager.ts are treated as fatal because the transient-check never matches, so benign failures can bubble up and halt background processing.
Transient network failures leave background tasks stuck running in src/features/background-agent/manager.ts, leaking concurrency slots until TTL and risking throughput degradation.
Pay close attention to src/features/background-agent/manager.ts – error classification and stuck-task handling need fixes.

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:221">
P2: Empty SDK error objects are misclassified as fatal because `String(error)` returns `[object Object]`, so the `errorMessage === "{}"` transient check never triggers</violation>

<violation number="2" location="src/features/background-agent/manager.ts:225">
P2: Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

src/features/background-agent/manager.ts

cubic-dev-ai · 2026-01-16T15:59:02Z

src/features/background-agent/manager.ts

+
+      if (isTransientNetworkError) {
+        log("[background-agent] Transient network error, letting polling detect actual status:", sessionID)
+        return


P2: Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At src/features/background-agent/manager.ts, line 225: <comment>Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks</comment> <file context> @@ -186,15 +186,61 @@ export class BackgroundManager { + + if (isTransientNetworkError) { + log("[background-agent] Transient network error, letting polling detect actual status:", sessionID) + return + } + </file context>

String(emptyErrorObject) returns '[object Object]', not '{}', so the transient error check was missing empty SDK error objects. This caused false failure notifications when the session was actually still running. Applied fix to both launch() and resume() error handlers.

chatgpt-codex-connector bot reviewed Jan 16, 2026

View reviewed changes

cubic-dev-ai bot reviewed Jan 16, 2026

View reviewed changes

src/features/background-agent/manager.ts Outdated Show resolved Hide resolved

cubic-dev-ai bot reviewed Jan 16, 2026

View reviewed changes

src/features/background-agent/manager.ts Outdated Show resolved Hide resolved

cubic-dev-ai bot reviewed Jan 16, 2026

View reviewed changes

src/features/background-agent/manager.ts Show resolved Hide resolved

src/features/background-agent/manager.ts Show resolved Hide resolved

vmlinuzx force-pushed the fix/background-agent-false-failures branch from f1ae594 to de838fc Compare January 16, 2026 15:49

vmlinuzx force-pushed the fix/background-agent-false-failures branch from de838fc to e243dd4 Compare January 16, 2026 15:49

github-actions bot added a commit that referenced this pull request Jan 16, 2026

@vmlinuzx has signed the CLA in #837

a5097a4

cubic-dev-ai bot reviewed Jan 16, 2026

View reviewed changes

fix(background-agent): prevent false failure notifications on network errors #837

Are you sure you want to change the base?

fix(background-agent): prevent false failure notifications on network errors #837

Conversation

vmlinuzx commented Jan 16, 2026 • edited by cubic-dev-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Cause Analysis

Changes

Testing

Summary by cubic

Uh oh!

github-actions bot commented Jan 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kdcokenny commented Jan 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vmlinuzx commented Jan 16, 2026

Addressed Feedback

Changes Made

Verification

Uh oh!

cubic-dev-ai bot commented Jan 16, 2026

Addressed Feedback

Changes Made

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vmlinuzx commented Jan 16, 2026

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

vmlinuzx commented Jan 16, 2026

Uh oh!

cubic-dev-ai bot commented Jan 16, 2026

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai bot Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vmlinuzx commented Jan 16, 2026 •

edited by cubic-dev-ai bot

Loading

github-actions bot commented Jan 16, 2026 •

edited

Loading

kdcokenny commented Jan 16, 2026 •

edited

Loading