Skip to content

Conversation

@vmlinuzx
Copy link

@vmlinuzx vmlinuzx commented Jan 16, 2026

Summary

Fixes false-positive background task failure notifications during high API load (e.g., Antigravity during Anthropic OAuth migration).

Problem

When API providers are under heavy load, session.prompt() throws network errors (timeouts, connection drops) even when the background session is still running successfully.

Before: ALL errors immediately marked task as 'error' and notified parent agent → Agent sees "Background agents failed" → Pivots to fallback strategies unnecessarily

After: Only fatal errors (agent not found, not registered) mark task as failed. Network errors are logged and polling detects actual completion/failure.

Root Cause Analysis

From logs during high Antigravity load:

[background-agent] promptAsync error: {}
[background-agent] notifyParentSession called for task: "bg_09c21064"  ← Error notification sent
...
[background-agent] Sent notification: {"taskId":"bg_09c21064","allComplete":true}  ← Task actually completed!

The empty error {} indicates a network/SDK issue, not an agent failure. The task continued running and completed successfully, but the agent had already received the failure notification.

Changes

  • Only treat errors as fatal if they clearly indicate agent/session creation issues
  • Non-fatal errors (network, timeout, SDK) are logged but don't mark task as failed
  • Polling mechanism detects actual completion/failure status

Testing

  • All 46 background-agent tests pass
  • Type check passes
  • Build succeeds

Summary by cubic

Prevents false background task failure notifications when session.prompt() hits transient network/SDK errors. Errors are fatal by default; only known network timeouts/drops are skipped, and polling confirms real completion.

  • Bug Fixes
    • Narrow transient handling: only skip for specific network/SDK errors; added codes/signatures (“network request failed”, “connection/request timed out”, “[object Object]”). Removed broad matches that misclassified errors.
    • Stricter agent-not-found detection: requires “agent” plus “not found”/“not registered”/“does not exist”; removed broad “undefined” matches.
    • Add structured logging (sessionID, agent, error name/message) for easier debugging.
    • Release concurrency on resume fatal errors; standardized messages (“Background task failed…”, “Resume failed…”).

Written for commit 86c1366. Summary will update on new commits.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 16, 2026

Thank you for your contribution! Before we can merge this PR, we need you to sign our Contributor License Agreement (CLA).

To sign the CLA, please comment on this PR with:

I have read the CLA Document and I hereby sign the CLA

This is a one-time requirement. Once signed, all your future contributions will be automatically accepted.


I have read the CLA Document and I hereby sign the CLA


1 out of 2 committers have signed the CLA.
✅ (vmlinuzx)[https://github.com/vmlinuzx]
@dc
DC seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8057c03cc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 188 to 192
errorMessage.includes("agent.name") ||
errorMessage.includes("undefined") ||
errorMessage.includes("not found") ||
errorMessage.includes("not registered") ||
errorMessage.includes("does not exist")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat other hard prompt failures as fatal

The new fatal filter only considers a handful of substrings (agent.name, undefined, not found, etc.) as fatal, so a prompt failure due to other hard errors (e.g., auth/permission errors, invalid request payloads) will be treated as non‑fatal and the task will stay running without releasing concurrency or notifying the parent. Since a failed prompt means the background session may never start, polling can’t advance (status/messages calls will error), leaving the task stuck until the 30‑minute TTL. Consider handling other non‑transient failures as fatal or explicitly checking for auth/permission errors before returning.

Useful? React with 👍 / 👎.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file

Confidence score: 3/5

  • Prompt exceptions in src/features/background-agent/manager.ts are swallowed, so failed tasks linger and keep concurrency slots occupied until TTL cleanup, which can block new work.
  • The issue looks confined to the background agent manager, so once the error propagation is fixed the rest of the PR should behave as intended.
  • Pay close attention to src/features/background-agent/manager.ts – ensure prompt errors propagate so stuck tasks release concurrency immediately.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:194">
P1: Non-fatal prompt errors are swallowed, leaving tasks stuck running with concurrency held and no parent notification until TTL cleanup.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@kdcokenny
Copy link
Collaborator

kdcokenny commented Jan 16, 2026

@vmlinuzx please sign the cli and resolve the cubic ai conversations.

vmlinuzx pushed a commit to vmlinuzx/oh-my-opencode that referenced this pull request Jan 16, 2026
…sient only

Address PR code-yeongyu#837 feedback from cubic AI and codex:

- INVERTED LOGIC: Treat errors as fatal by default instead of only
  catching specific 'not found' patterns. This fixes the P1 issue where
  auth/permission/payload errors were swallowed, leaving tasks stuck
  until TTL cleanup.

- Only skip notification for known TRANSIENT network errors:
  - TimeoutError, AbortError, FetchError (error names)
  - ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND (Node.js errors)
  - EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH (additional network)
  - socket hang up, network request failed, connection/request timed out
  - Empty error objects ({} or '') from SDK issues

- Fixed resume() to also release concurrency on fatal errors (was missing)

- Stricter agent-not-found detection: requires 'agent' AND ('not found'
  OR 'not registered' OR 'does not exist') - prevents false positives
  from generic 'undefined' errors

- Removed overly broad patterns ('network', 'timeout') that could match
  non-transient errors like 'invalid timeout value'

Ultraworked with Sisyphus

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@vmlinuzx
Copy link
Author

Addressed Feedback

Thanks for the thorough review! I've pushed a fix that addresses the cubic AI and codex feedback:

Changes Made

  1. INVERTED LOGIC - Errors are now fatal by default. Only known transient network errors are swallowed. This fixes the P1 issue.

  2. Tightened transient error detection - Removed overly broad patterns (network, timeout) that could match non-transient errors. Now only matches specific error codes/names:

    • Error names: TimeoutError, AbortError, FetchError
    • Node.js codes: ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND, EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH
    • Specific messages: socket hang up, network request failed, connection timed out, request timed out
    • Empty SDK errors: {} or ""
  3. Fixed concurrency leak in resume() - Fatal errors now properly release concurrency slot.

  4. Stricter agent-not-found detection - Requires agent AND (not found | not registered | does not exist) to avoid false positives from generic undefined errors.

Verification

  • ✅ All 46 background-agent tests pass
  • ✅ Type check passes
  • ✅ Build succeeds

@cubic-dev-ai Please re-review!

@cubic-dev-ai
Copy link

cubic-dev-ai bot commented Jan 16, 2026

Addressed Feedback

Thanks for the thorough review! I've pushed a fix that addresses the cubic AI and codex feedback:

Changes Made

...

@vmlinuzx I have started the AI code review. It will take a few minutes to complete.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file

Confidence score: 3/5

  • Empty SDK errors in src/features/background-agent/manager.ts slip past the transient-error handling, meaning what should be retried is treated as fatal and the agent may stop unnecessarily.
  • Given the medium severity and likelihood of real user impact, merging now carries moderate risk unless the transient-error detection is updated.
  • Pay close attention to src/features/background-agent/manager.ts - transient SDK errors aren’t classified correctly.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:208">
P2: Empty SDK error objects won’t match the transient-error check because `String(error)` produces `[object Object]`, so these cases remain fatal.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@vmlinuzx
Copy link
Author

I have read the CLA Document and I hereby sign the CLA

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:207">
P2: Empty-object SDK/network errors are still treated as fatal because `String(error)` produces `[object Object]`, so the transient check for `{}` never matches, causing false failure notifications.</violation>

<violation number="2" location="src/features/background-agent/manager.ts:420">
P1: Concurrency key is released twice on resume errors, over-releasing the concurrency limit.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

… errors

When API providers are under heavy load (e.g., Antigravity during the
Anthropic OAuth migration), session.prompt() may throw network errors
(timeouts, connection drops) even when the background session is still
running successfully.

Previously, ALL errors from session.prompt() would immediately mark the
task as 'error' and notify the parent agent of failure. This caused
agents to falsely believe their background tasks failed, triggering
unnecessary fallback behavior ('Background agents failed. Pivoting to
direct parallel searches').

Now, only fatal errors (agent not found, not registered, etc.) are
treated as task failures. Non-fatal errors are logged and the task
continues - polling will detect actual completion/failure.

Fixes false-positive failure notifications during high API load.
@vmlinuzx vmlinuzx force-pushed the fix/background-agent-false-failures branch from f1ae594 to de838fc Compare January 16, 2026 15:49
vmlinuzx pushed a commit to vmlinuzx/oh-my-opencode that referenced this pull request Jan 16, 2026
…sient only

Address PR code-yeongyu#837 feedback from cubic AI and codex:

- INVERTED LOGIC: Treat errors as fatal by default instead of only
  catching specific 'not found' patterns. This fixes the P1 issue where
  auth/permission/payload errors were swallowed, leaving tasks stuck
  until TTL cleanup.

- Only skip notification for known TRANSIENT network errors:
  - TimeoutError, AbortError, FetchError (error names)
  - ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND (Node.js errors)
  - EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH (additional network)
  - socket hang up, network request failed, connection/request timed out
  - Empty error objects ({} or '') from SDK issues

- Fixed resume() to also release concurrency on fatal errors (was missing)

- Stricter agent-not-found detection: requires 'agent' AND ('not found'
  OR 'not registered' OR 'does not exist') - prevents false positives
  from generic 'undefined' errors

- Removed overly broad patterns ('network', 'timeout') that could match
  non-transient errors like 'invalid timeout value'

Ultraworked with Sisyphus

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…sient only

Address PR code-yeongyu#837 feedback from cubic AI and codex:

- INVERTED LOGIC: Treat errors as fatal by default instead of only
  catching specific 'not found' patterns. This fixes the P1 issue where
  auth/permission/payload errors were swallowed, leaving tasks stuck
  until TTL cleanup.

- Only skip notification for known TRANSIENT network errors:
  - TimeoutError, AbortError, FetchError (error names)
  - ETIMEDOUT, ECONNRESET, ECONNREFUSED, ENOTFOUND (Node.js errors)
  - EAI_AGAIN, EPIPE, ENETUNREACH, EHOSTUNREACH (additional network)
  - socket hang up, network request failed, connection/request timed out
  - Empty error objects ({} or '') from SDK issues

- Fixed resume() to also release concurrency on fatal errors (was missing)

- Stricter agent-not-found detection: requires 'agent' AND ('not found'
  OR 'not registered' OR 'does not exist') - prevents false positives
  from generic 'undefined' errors

- Removed overly broad patterns ('network', 'timeout') that could match
  non-transient errors like 'invalid timeout value'

Ultraworked with Sisyphus

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@vmlinuzx vmlinuzx force-pushed the fix/background-agent-false-failures branch from de838fc to e243dd4 Compare January 16, 2026 15:49
github-actions bot added a commit that referenced this pull request Jan 16, 2026
@vmlinuzx
Copy link
Author

@cubic-dev-ai

@cubic-dev-ai
Copy link

cubic-dev-ai bot commented Jan 16, 2026

@cubic-dev-ai

@vmlinuzx I have started the AI code review. It will take a few minutes to complete.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 1 file

Confidence score: 3/5

  • Empty SDK errors in src/features/background-agent/manager.ts are treated as fatal because the transient-check never matches, so benign failures can bubble up and halt background processing.
  • Transient network failures leave background tasks stuck running in src/features/background-agent/manager.ts, leaking concurrency slots until TTL and risking throughput degradation.
  • Pay close attention to src/features/background-agent/manager.ts – error classification and stuck-task handling need fixes.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:221">
P2: Empty SDK error objects are misclassified as fatal because `String(error)` returns `[object Object]`, so the `errorMessage === "{}"` transient check never triggers</violation>

<violation number="2" location="src/features/background-agent/manager.ts:225">
P2: Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


if (isTransientNetworkError) {
log("[background-agent] Transient network error, letting polling detect actual status:", sessionID)
return
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/background-agent/manager.ts, line 225:

<comment>Transient network errors leave tasks stuck running and hold concurrency until TTL, causing slot leaks</comment>

<file context>
@@ -186,15 +186,61 @@ export class BackgroundManager {
+      
+      if (isTransientNetworkError) {
+        log("[background-agent] Transient network error, letting polling detect actual status:", sessionID)
+        return
+      }
+      
</file context>

String(emptyErrorObject) returns '[object Object]', not '{}', so the
transient error check was missing empty SDK error objects. This caused
false failure notifications when the session was actually still running.

Applied fix to both launch() and resume() error handlers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants