🤖 feat(goals): long-horizon auto-continuation with budgets#3235
🤖 feat(goals): long-horizon auto-continuation with budgets#3235
Conversation
|
Preview deployment for your docs. Learn more about Mintlify Previews.
💡 Tip: Enable Workflows to automatically generate PRs for you. |
|
/coder-agents-review |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa89082972
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
First-pass review (Netero only). This is a mechanical scan; the full review panel has not yet reviewed this PR.
Netero raised 6 findings. 5 were dropped during verification: 3 cited an em-dash convention that does not exist in mux, and 2 cited a Storybook-only testing convention that mux does not follow (88 standalone .test.tsx files exist in src/browser). 1 P2 survives.
The surviving finding is a permanently empty UI section that ships to users with no path to displaying real data. The panel will review after this is addressed.
🤖 This review was automatically generated with Coder Agents.
|
@codex review P1 issues addressed in
Please take another look. |
|
/coder-agents-review |
There was a problem hiding this comment.
Round 2 is blocked. The open finding from round 1 has no author response and the code is unchanged.
DEREM-6 (P2, GoalTab.tsx:374-380): "Recent continuations" section is a permanent placeholder with no data source. No continuations prop, no schema field, no data pipeline. The author's e488c4ebd commit addressed the two Codex P1 findings but did not address or acknowledge DEREM-6.
Further review is blocked until the author responds to DEREM-6 (fix, acknowledge, or contest) or pushes a commit addressing it. The full review panel will run once this is resolved.
🤖 This review was automatically generated with Coder Agents.
|
Codex Review: Didn't find any major issues. Keep it up! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@coder-agents-review DEREM-6 addressed in the latest force-push: deleted the empty "Recent continuations" section from The full review panel can run now. |
|
/coder-agents-review |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a26bcdd0a4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review Two new P1s addressed in
Regression tests added for both. |
|
/coder-agents-review |
|
Codex Review: Didn't find any major issues. 👍 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Second Netero pass (round 3). The full review panel has not yet reviewed this PR.
DEREM-6 (placeholder UI) verified fixed. 4 new findings: 1 P2, 3 P3. The P2 is a test reliability issue (raw setTimeout for coordination, violating mux AGENTS.md). The P3s are dead code, missing test coverage, and a silently divergent helper. The panel will review after the P2 is addressed.
🤖 This review was automatically generated with Coder Agents.
|
@codex review Two more rounds addressed in |
|
/coder-agents-review |
|
Addressed all 9 threads from the previous P2
P3
/coder-agents-review |
|
/coder-agents-review |
There was a problem hiding this comment.
Re-review round (3 reviewers). All 9 prior findings verified fixed. The author's responsiveness has been exemplary across 7 rounds.
1 P2, 4 P3, 2 Nit new this round plus 3 Netero findings (DEREM-40 through DEREM-42). The P2 is a remaining 500 escape in the DEREM-36 fix: the wrapper catches two typed exceptions but a plain Error("Goal objective is required.") still leaks through the oRPC handler.
The outstanding findings are converging on diminishing returns: test coverage gaps for specific branches, a test utility that should be shared, and stale comments. The architecture, state machine, and concurrency fixes from prior rounds are solid. Once the P2 is addressed, this should be close to landing.
Process note from Mafu-san: drainPendingDispatches is now duplicated identically across 3 test files (DEREM-41). The agent acknowledged the duplication in comments but replicated rather than extracted. The PR description's "358 goal-specific tests" claim remains unverifiable from the diff (~188 new test functions countable).
🤖 This review was automatically generated with Coder Agents.
|
Addressed all 7 new threads from the previous round (DEREM-40..46): P2
P3
Nits
/coder-agents-review |
|
/coder-agents-review |
There was a problem hiding this comment.
Re-review round (3 reviewers). All 7 prior findings verified fixed. Netero clean (no findings). One new P2 and one Nit from the panel.
The P2 is the last unhandled-rejection escape path: applyPendingAfterStreamEnd calls setGoalImmediately directly (bypassing the DEREM-36 wrapper), and two of its three call sites use void. A WorkspaceGoalTransitionError from a queued mutation (e.g., pausing an already-paused goal) becomes an unhandled Promise rejection. The fix is a try/catch, same pattern as DEREM-14.
The Nit is the sibling-miss pattern one more time: waitForCondition was left as a local copy in the same commit that extracted drainPendingDispatches to the shared testDispatchHelpers.ts. Bisky and Mafu-san both flagged it.
Once the P2 is wrapped, this PR is ready to land.
🤖 This review was automatically generated with Coder Agents.
|
Addressed both threads from round 3 of the review (DEREM-47 + DEREM-48): P2 DEREM-47 Nit DEREM-48 Moved
/coder-agents-review |
|
/coder-agents-review |
There was a problem hiding this comment.
Clean. 48 findings across 9 rounds, all P0-P2 resolved. Two minor items noted below (P3 + Nit), neither blocking.
The architecture has been verified end-to-end: IdleDispatcher priority model, goal state machine persistence and restart recovery, optimistic-concurrency retry consolidation, experiment gating on all hot paths, error-handling contract (typed Result through oRPC), tool schema with goalId for conflict detection, and budget enforcement across pause/resume/model-switch. The author addressed every finding with code, tests, and clear explanations across 7 fix rounds.
Non-blocking notes:
P3 ActiveGoalsWarningToast.test.tsx: The enabled=false path (experiment off) has no test. If the guard were removed, users who toggled the experiment off mid-session would get spurious "N active goals" warnings. One-line test.
Nit workspaceService.test.ts:7570: Inline Date.now() poll loop instead of the centralized waitForCondition from testDispatchHelpers.ts. Same file already imports drainPendingDispatches from there.
Ship it.
🤖 This review was automatically generated with Coder Agents.
|
Addressed both threads from round 4 (DEREM-49 + DEREM-50): P3 DEREM-49 Added two regression tests in Nit DEREM-50 Replaced the inline
/coder-agents-review |
|
/coder-agents-review |
There was a problem hiding this comment.
Clean. All 50 prior findings resolved. 2 P3 and 1 Nit noted below, none blocking.
50 findings across 10 rounds. 42 fixed by the author, 8 dropped (wrong project conventions). The goal lifecycle, state machine persistence, concurrency model, experiment gating, error-handling contract, and test coverage have all been verified across multiple review rounds. CI green.
Non-blocking notes:
P3 [DEREM-52] assertPricedModelForBudgetedGoal (called from sendMessage/resumeStream) has no isExperimentEnabled() gate, same class as DEREM-19/DEREM-37. Off-experiment users incur one extra goal.json ENOENT read + pushSnapshot(null) write per send. Add the same if (!this.goalsEnabled) return guard used by sibling paths.
P3 [DEREM-51] hasBudgetedActiveGoal in budgetPricing.ts is exported but has zero production callers (superseded by hasBudgetedResumableGoal). Dead code.
Nit [DEREM-53] GoalSetInputSchema accepts budget_limited as a caller-supplied status. validateStatusTransition has no guard for it, so a caller can set status: "budget_limited" on a paused goal, which applyBudgetDrivenStatus then re-arms to active, bypassing the normal resume flow. Restrict the input schema to ["active", "paused", "complete"].
Ship it.
🤖 This review was automatically generated with Coder Agents.
|
Addressed all 3 threads from round 5 (DEREM-51..53): P3 DEREM-51 Deleted the unused P3 DEREM-52 Added an experiment-off short-circuit at the top of Nit DEREM-53 Added
/coder-agents-review |
|
/coder-agents-review |
There was a problem hiding this comment.
All 53 prior findings resolved. 1 P3 and 1 Nit noted below, neither blocking.
This is round 11. The PR has been through a 22-reviewer panel, 7 fix rounds, and 4 verification rounds. Every P0-P2 is fixed. The remaining findings are edge-case semantics (restart recovery for a rare user-origin budget exhaustion scenario) and test helper consolidation. The goal lifecycle, concurrency model, experiment gating, error contract, and test coverage are all verified solid.
P3 [DEREM-54] recoverPendingDispatchAfterRestart arms the budget wrapup for any budget_limited goal with budgetLimitInjectedForGoalId === null, including goals where the wrapup was intentionally suppressed because a user-sent message (not a continuation) exhausted the budget. After restart, the synthesized GOAL_CONTINUATION_KIND stamp bypasses the suppression guard. The fix is to persist originKind alongside the goal record. Low-probability scenario (user message must exceed budget) with benign consequence (an extra summarization continuation).
Nit [DEREM-55] enableGoalsExperiment helper duplicated in workspaceGoalService.test.ts and agentSession.budgetGate.test.ts. Extract to testDispatchHelpers.ts.
Ship it.
🤖 This review was automatically generated with Coder Agents.
Adds a workspace-scoped, persistent **Goal** primitive that the agent
auto-continues against an explicit dollar budget after each turn ends.
Inspired by OpenAI Codex's `/goal` but adapted to Mux's primitives.
## What ships
- `/goal "<objective>" [--budget $X[.YY]] [--turn-cap N]` slash command
plus parity in the command palette (set / pause / resume / complete /
clear / edit budget). Workspaces hold at most one active goal.
- New `IdleDispatcher` that serializes per-workspace dispatches and is
shared with `HeartbeatService`; goals get priority 100 (suppresses
background heartbeats while a goal is active).
- Cost / budget / turn / elapsed accounting persisted to
`~/.mux/sessions/<workspaceId>/goal.json` with atomic writes and an
optimistic-concurrency `expectedGoalId` on `setGoal`.
- Auto-continuation: at every stream-end, if the goal is `Active` and
the workspace is idle, a synthetic `<untrusted_objective>`-wrapped
user message is injected to drive the next turn (60s cooldown,
serialized globally).
- BudgetLimited wrap-up: once `costCents >= budgetCents`, exactly one
graceful "stop and summarize" continuation fires before further
dispatches are suppressed.
- Auto-pause hooks: any non-synthetic user message pauses an active
goal before streaming; partial-message recovery on startup arms a
user-acknowledgment gate that blocks continuations until the user
posts again.
- Sub-agent task costs are attributed up to the parent's active goal
on task completion (children do not inherit the goal).
- Forked workspaces inherit a snapshot of the parent goal but reset
accounting and default to `Paused` to avoid runaway loops.
- Two model-facing tools: `get_goal` (read-only) and
`complete_goal(summary)` — conditionally injected when the GOALS
experiment is on, the workspace has an active/budget-limited goal,
and the agent is exec-like with editing capabilities.
- UI: left-sidebar `Target` pill on every workspace with a goal,
conditional right-sidebar Goal tab (status + accounting +
Pause/Resume/Mark complete/Clear), a settings section under
Experiments, and custom transcript cards for `get_goal` /
`complete_goal` that mirror the GoalTab styling.
- Live snapshot fan-out via `WorkspaceService.emit("activity")` so
pill + tab refresh without a reload.
- Telemetry: lifecycle events (`goal_created/replaced/cleared/paused/
resumed/completed/budget_limited/budget_changed/continuation_fired/
wrapup_fired/crash_gate_set`) emitted through `AnalyticsService`.
Gated by `EXPERIMENT_IDS.GOALS` (off by default, user-overridable).
## Notes
- Dollars (cents internally), not tokens, because Mux supports many
models with different token economics. `turnCap` is a secondary
guardrail. Wall-clock time is display-only.
- Compaction streams are excluded from goal accounting; failed /
aborted streams are counted (provider cost is still paid).
- Synthetic continuation prompts XML-escape the objective inside an
`<untrusted_objective>` block and explicitly tell the model to
treat the objective as data, not instructions.
---
_Generated with `mux` • Model: `anthropic:claude-opus-4-7` • Thinking: `max` • Cost: `$768.23`_
<!-- mux-attribution: model=anthropic:claude-opus-4-7 thinking=max costs=768.23 -->
|
Addressed both threads from round 6 (DEREM-54 + DEREM-55): P3 DEREM-54 Persisted Nit DEREM-55 Extracted
/coder-agents-review |
|
/coder-agents-review |
There was a problem hiding this comment.
Clean. Zero open findings. Netero and panel both report no new issues.
55 findings across 12 rounds, all resolved. The budgetLimitOriginKind persistence from the last round was the final piece: restart recovery now correctly distinguishes user-origin from continuation-origin budget exhaustion. Test helpers are consolidated in testDispatchHelpers.ts. The public oRPC schema is locked down to caller-settable statuses only.
Nothing left to flag. Ship it.
🤖 This review was automatically generated with Coder Agents.
Summary
Adds a workspace-scoped, persistent Goal primitive that the agent auto-continues against an explicit dollar budget after each turn ends. Inspired by OpenAI Codex's
/goalbut adapted to Mux's primitives: synthetic visible user messages instead of adeveloperrole, dollar budgets instead of tokens (we already track normalized cost viasessionUsageService), a sharedIdleDispatcherconsolidating heartbeats and goal continuations, and conditionally-injected model tools.Gated end-to-end by
EXPERIMENT_IDS.GOALS(off by default, user-overridable). Off the experiment, the only externally observable change is the new ESM module graph; runtime behavior is identical tomain.Background
Long-horizon agent runs on Mux today require the user to nudge the agent ("continue", "what next") at every stream-end. Codex solved this with
/goal, but its model is token-budget + developer-role. Mux supports many providers with wildly different token economics, has no developer role, and already has an idle scheduler shared by heartbeats — so the design adapts the spec rather than copying it. The accompanying RFC (separategoals-planningbranch — not part of this PR to keep history clean) covers the full architecture, state diagram, accounting boundaries, and edge cases.Implementation
Backend
WorkspaceGoalService(new) owns goal lifecycle. Persists each workspace's goal at~/.mux/sessions/<id>/goal.jsonviawriteFileAtomic+ per-workspace file locks. Optimistic-concurrency onsetGoalvia an optionalexpectedGoalIdreturns a typedResult<GoalRecordV1, GoalSetError>so the frontend can retry without throwing.IdleDispatcher(new, also extracted fromHeartbeatService) serializes per-workspace dispatches with priority-ordered consumers — goals = 100, heartbeats = 50 — so an active goal naturally suppresses heartbeats. Global concurrency cap of 1.Active→Paused(manual user message or/goal pause),Active→BudgetLimited(cost ≥ budget),BudgetLimited→ exactly one wrap-up continuation → done,*→Complete(modelcomplete_goalor/goal complete). Setting an identical objective is idempotent (preservesgoalId+ accounting).agentSessionstream-end, if the goal isActiveand the workspace is idle, anIdleDispatchercandidate fires a synthetic user message wrapping the objective in<untrusted_objective>. 60s cooldown, gated byrequireUserAcknowledgmentSinceMsif a partial assistant message was recovered on startup.taskcompletion. Children do not inherit goals (avoids recursive budget exhaustion).Paused.get_goal()andcomplete_goal(summary)are conditionally registered when GOALS is on, the workspace has an active/budget-limited goal, and the agent has editing capabilities.create_goalwas deliberately omitted — goals must be user-initiated.Frontend
Target-icon pill on workspaces with a goal (live status, e.g.Target $0.04 / $1.00).get_goalandcomplete_goal(replacing the generic▶ tool_namerow), styled to match the GoalTab./goal "<obj>",/goal pause,/goal resume,/goal complete,/goal clear,/goal budget. All have command-palette equivalents.WorkspaceGoalService.pushSnapshotinvokes a callback wired incoreServicestoworkspaceService.emit("activity"), so the pill + tab refresh without a reload.Telemetry
AnalyticsService.recordGoalLifecycleEventemits a closed set of events:goal_created/replaced/cleared/paused/resumed/completed/budget_limited/budget_changed/continuation_fired/wrapup_fired/crash_gate_set. All carry buckets, never raw text.Validation
make static-checkgreen (typecheck + ESLint + Prettier + docs link checking)WorkspaceGoalService,IdleDispatcher,agentSession.goalAutoPause,goal.test(tools),GoalTab,goalToolUtils, slash-command parser, command palette, and the dispatcher registrydev-server-sandbox: kicked off two goals (one auto-completing, one budget-limited), confirmed the synthetic continuation fires within ~180ms ofsetGoal, the cost accumulator + budget gating advance correctly, and Pause/Resume/Complete/Clear all round-trip through the UIRisks
Medium. The dispatcher serialization, accounting accumulator, and stream-end hook all touch hot paths shared with heartbeats and sub-agents. Mitigations:
EXPERIMENT_IDS.GOALSgates every runtime branch — off by defaultIdleDispatcherwas extracted as a pure refactor in the first wave (no behavior change toHeartbeatService) and covered by 28 dedicated tests before goal logic was layered onworkspaceFileLocks+writeFileAtomic; corruptedgoal.jsonis treated as no goal rather than crashing the workspacegetGoalContinuationRuntimeStatereportingisInitializingindefinitely)Pains
The biggest sink was discovering, mid-dogfood, that the auto-continuation loop never fired because
WorkspaceService.getGoalContinuationRuntimeStatereturnedisInitializing: initState != nullinstead ofinitState?.status === "running".InitStateManagerkeeps the persisted status forever after init succeeds, so the eligibility check kept deferring withreason: "initializing". Existing tests stubbed the bridge, so the bug went undetected. Now covered by an integration test wiring the realWorkspaceServicethrough the real dispatcher.A separate snag:
coreServices.ts(used bymux server/ dev-server-sandbox) was constructing the goal service but never callingregisterGoalContinuationConsumer. Now wired in bothcoreServicesandserviceContainer.