From 074195a84e44157427cbb4380d66d283d72ca05d Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 4 May 2026 15:27:06 +0000 Subject: [PATCH 1/2] Initial plan From a9dc542c253e25a0529f20afc93fb11a5502fb5a Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 4 May 2026 15:35:19 +0000 Subject: [PATCH 2/2] Add cost-control deep-dive report Agent-Logs-Url: https://github.com/githubnext/autoloop/sessions/a0cff281-ac21-4c37-a270-c1410ce4b382 Co-authored-by: mrjf <180956+mrjf@users.noreply.github.com> --- AGENTS.md | 1 + docs/cost-control.md | 582 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 583 insertions(+) create mode 100644 docs/cost-control.md diff --git a/AGENTS.md b/AGENTS.md index 1ee2120..c77a40d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -82,6 +82,7 @@ Programs can include an Evolution Strategy section (inspired by OpenEvolve) that - **Agentic Workflows**: https://github.com/github/gh-aw - **Quick Start**: https://github.github.com/gh-aw/setup/quick-start/ - **Autoloop Examples**: See the [example programs](.autoloop/programs/) included in this repo +- **Cost Control**: See [docs/cost-control.md](docs/cost-control.md) for a deep-dive report on cost-control possibilities for the harnesses Autoloop supports (token/$ budgets per iteration / day / week / month, cheaper-model strategies, and concrete proposals) ## Conventions diff --git a/docs/cost-control.md b/docs/cost-control.md new file mode 100644 index 0000000..a7043d8 --- /dev/null +++ b/docs/cost-control.md @@ -0,0 +1,582 @@ +# Cost Control for Autoloop + +Status: design report (not yet implemented). Resolves the "Dig in to +cost control possibilities" issue. + +Autoloop runs autonomously and indefinitely. Each iteration spends +both GitHub Actions minutes and AI inference tokens, so cost control +is a first-class concern: a misconfigured program could run every six +hours forever, and one bad iteration could blow a per-run token +budget. + +This document is a deep dive into what cost-control mechanisms are +*available today* (in [GitHub Agentic Workflows][gh-aw], the harness +that Autoloop is built on, and in the engines it can drive), what we +*currently use*, and what we *should add* — both at the gh-aw layer +and at the Autoloop layer. + +[gh-aw]: https://github.github.com/gh-aw/ + +## TL;DR + +- **Today, `workflows/autoloop.md` sets only one cost lever:** + `engine: copilot` plus `timeout-minutes: 45`. There is no model + pin, no per-iteration token cap, no per-program $/token budget, + and no daily/weekly/monthly cap. +- **gh-aw already gives us most of the primitives we need at the + workflow level:** `engine.model` (cheap-model selection), + `timeout-minutes` / `stop-after` (wall-clock caps), `rate-limit` + (per-user trigger caps), `skip-if-match` (skip the agent before + inference), `safe-outputs.*.max` (cap write-amplification), and + `gh aw logs` / `gh aw audit` for measurement. +- **gh-aw does *not* currently expose a hard per-run "max $" or + "max tokens" knob** for most engines. The only token-budget-shaped + knob is Claude's `max-turns`, plus the BYOK overrides + `COPILOT_PROVIDER_MAX_PROMPT_TOKENS` / + `COPILOT_PROVIDER_MAX_OUTPUT_TOKENS` for Copilot in BYOK mode. +- **Autoloop should add cost control at three layers:** + 1. **Per-program frontmatter** in `program.md` — `model:`, + `cost.per-iteration.max-{tokens,usd}`, `cost.budget` with a + `day/week/month` window. + 2. **Scheduler-side enforcement** in + `workflows/scripts/autoloop_scheduler.py` — read accumulated + spend from `gh aw logs --json`, compare against the program's + budget, and refuse to schedule a program that has exceeded it. + This is the only mechanism today that can enforce a real + `$/week` cap. + 3. **In-iteration cheap-model selection** — let a program declare a + `model:` override that the workflow honours when it dispatches + the agent for that program (instead of a single global engine). +- **Cheap-model strategy:** the highest-leverage default is to switch + the autoloop *workflow* itself to a small model + (`gpt-4.1-mini` / `claude-haiku-4-5`) for the bookkeeping/scheduling + work, and only escalate to a frontier model for the iteration's + actual proposal step — ideally as a per-program choice. + +--- + +## 1. Where cost is incurred in one Autoloop iteration + +A single scheduled run of `autoloop.md` executes (in order): + +| Phase | Job(s) | Tokens? | Actions minutes? | +|---|---|---|---| +| **Pre-activation** | runner setup, repo checkout, repo-memory clone | none | yes (~30–60 s + ~1.5 min runner overhead) | +| **Scheduler** | `autoloop_scheduler.py` (Python, stdlib-only) | none | yes (~5–15 s) | +| **Agent** | gh-aw runs the engine (Copilot today) on the autoloop prompt | **yes — dominant** | yes (1–15+ min) | +| **Evaluation sub-process** | `program.md`'s `Evaluation` command, run inside the agent step | depends on the program | counted in the agent step | +| **Safe outputs** | commit, push, PR/issue update jobs | none | yes (~30–60 s each) | + +The cost picture is therefore: + +``` +total_iteration_cost ≈ Actions_minutes_cost(all jobs) + + inference_cost(agent job) + + program_eval_cost +``` + +Two important observations: + +1. **Inference dominates** for any non-trivial program. The autoloop + prompt is large (~2k–5k tokens of system prompt plus the program + state file plus iteration history) and the agent is allowed up to + 45 minutes to think and call tools. A single iteration with a + frontier Copilot model is in the order of one to a few "premium + requests"; with `claude` or `codex` on direct API keys it is + typically a few cents to a few dollars depending on prompt and + tool-call volume. +2. **The evaluation step is part of the program**, not the harness. + Programs whose evaluation calls an LLM (e.g. autoresearch-style + training/inference loops) can spend orders of magnitude more than + the agent itself. Cost control must therefore be expressible at + the *program* level, not just the workflow level. + +There are also two failure modes that pure budgeting must catch: + +- **Runaway iteration** — the agent gets stuck in a tool loop and + burns the full 45-minute timeout every time. Today, only + `timeout-minutes` bounds this. +- **Runaway schedule** — a program with `every 5m` keeps firing + forever, well after it has stopped making progress. Today, nothing + bounds this except humans noticing and editing the schedule. + +--- + +## 2. What gh-aw gives us today + +This section catalogues every cost-relevant lever exposed by gh-aw +that Autoloop could (but mostly does not) use. + +### 2.1 Engine and model selection + +`engine:` accepts `copilot` (default), `claude`, `codex`, `gemini`, +`crush` (experimental), or `opencode` (experimental). Each engine has +distinct billing: + +| Engine | Billed to | Unit | Per-run order of magnitude | +|---|---|---|---| +| `copilot` | account owning `COPILOT_GITHUB_TOKEN` | premium requests | 1–2 per run | +| `claude` | Anthropic account on `ANTHROPIC_API_KEY` | tokens | $0.01–$1+ per run | +| `codex` | OpenAI account on `OPENAI_API_KEY` | tokens | $0.01–$1+ per run | +| `gemini` | Google account on `GEMINI_API_KEY` | tokens | $0.005–$0.50+ per run | +| `crush`, `opencode` | varies (typically Copilot token) | varies | varies | + +**Cheap-model selection** is via `engine.model`: + +```yaml +engine: + id: copilot + model: gpt-4.1-mini # ~10–20× cheaper than gpt-5 +# or +engine: + id: claude + model: claude-haiku-4-5 # ~10× cheaper than claude-sonnet +# or +engine: + id: codex + model: gpt-4o-mini +# or +engine: + id: gemini + model: gemini-2.5-flash # cheapest reliable Gemini tier +``` + +This is the single biggest cost lever and is currently unused by +Autoloop. + +### 2.2 Per-engine turn / token caps + +| Knob | Engines | What it caps | +|---|---|---| +| `max-turns` | claude only | number of agent turns per run | +| `max-continuations` | copilot only | autopilot continuation runs | +| `COPILOT_PROVIDER_MAX_PROMPT_TOKENS` (BYOK env) | copilot in BYOK mode | input tokens per call | +| `COPILOT_PROVIDER_MAX_OUTPUT_TOKENS` (BYOK env) | copilot in BYOK mode | output tokens per call | + +There is **no portable `token-budget:` field** in current gh-aw — +inference token caps only exist for Claude (via `max-turns`, which is +indirect) and for Copilot in BYOK mode. + +### 2.3 Wall-clock caps (Actions-minutes control) + +- `timeout-minutes:` — caps the agent step. Autoloop sets `45`. +- `stop-after: +48h` — cancels the run if too much time has passed + since the trigger. Useful for never-let-an-iteration-loop-overnight + semantics. Currently unused by Autoloop. +- Job-level `timeout-minutes` on safe-output jobs is implicit + (defaults to 360 minutes). + +### 2.4 Skip-the-agent gates (highest-leverage savings) + +`skip-if-match` / `skip-if-no-match` run during the cheap +pre-activation job and cancel the workflow before the agent fires: + +```yaml +on: + schedule: every 6h + skip-if-no-match: 'label:autoloop-program -label:autoloop-paused' +``` + +Today the Autoloop scheduler itself decides whether to run anything, +so this is partially redundant — but it could be used to *guarantee* +that paused programs can never trigger an agent run. + +### 2.5 Trigger throttling + +- `rate-limit: { max, window, ignored-roles }` — caps per-user + triggers. Useful for slash-command floods, irrelevant for the + scheduled trigger. +- `concurrency` — serialises runs. Autoloop should set this to + prevent two scheduled runs racing if the previous one ran long. + +### 2.6 Safe-output caps + +`safe-outputs.*.max` and `safe-outputs.*.environment` cap the number +of writes per run and gate them behind manual approval. These cap +*amplification*, not inference cost, but are still important for +governance. + +### 2.7 Measurement: `gh aw logs` and `gh aw audit` + +The most important fact for budget enforcement is that gh-aw exposes +historical per-run cost as JSON: + +```bash +gh aw logs --start-date -7d --json \ + | jq '.runs[] | {workflow, duration, estimated_cost, token_usage}' +``` + +Per-run fields include `duration`, `token_usage`, `estimated_cost`, +`workflow_name`, `agent` (engine ID); per-episode roll-ups include +`total_tokens`, `total_effective_tokens`, `total_estimated_cost`. + +For Copilot, `total_estimated_cost` is a heuristic — the source of +truth is `total_effective_tokens`. For Claude/Codex/Gemini the cost +is directly billable. + +This is the hook Autoloop needs for a real per-program $/week budget: +the scheduler can read recent cost, attribute it to programs by +parsing the per-iteration commit message, and enforce a cap. + +The same data is also exposed as an MCP tool (`agentic-workflows`), +so an Autoloop iteration could itself read its program's recent cost +and decide to throttle. + +### 2.8 Bring-your-own-key (BYOK) for Copilot + +Copilot can be redirected to OpenAI / Anthropic / Azure / Ollama via +`COPILOT_PROVIDER_BASE_URL`. This lets a single-engine workflow swap +billing accounts without changing the harness — useful for routing +expensive autoloop runs to a sponsored Anthropic account while +keeping cheap routine runs on the default Copilot account. + +--- + +## 3. What Autoloop uses today + +From `workflows/autoloop.md`: + +```yaml +engine: copilot +timeout-minutes: 45 +safe-outputs: + add-comment: { max: 7 } + create-pull-request: { max: 1 } + push-to-pull-request-branch: { max: 1 } + create-issue: { max: 1 } + update-issue: { max: 3 } + add-labels: { max: 2 } + remove-labels: { max: 2 } +``` + +That is the entire cost-control surface today: + +- ✅ uses Copilot (cheapest tier per request, but uses the default + frontier model — no `engine.model` override) +- ✅ caps wall clock at 45 minutes +- ✅ caps write amplification via `safe-outputs.*.max` +- ❌ no `engine.model` (always uses the Copilot default) +- ❌ no `stop-after` (a stuck run can re-fire on the next schedule) +- ❌ no `rate-limit` (slash-command floods cost real money) +- ❌ no `concurrency` group (two scheduled runs can race) +- ❌ no per-program model override +- ❌ no per-program $/token budget (the only "budget" is the + human-edited `target-metric` — once met, the program is marked + completed) +- ❌ no daily/weekly/monthly spend cap at any layer + +--- + +## 4. How to express cost control per job + +This is the core of the question. The answers split by *time scale* +and *enforcement layer*. + +### 4.1 Per-iteration ($ or tokens per run) + +**Goal:** an iteration is allowed up to `$X` or `T` tokens; if it +exceeds the cap, the agent step is killed and the iteration is +logged as `error`. + +**Today, with stock gh-aw:** + +| Engine | Per-iteration cap mechanism | +|---|---| +| `claude` | `max-turns: N` (indirect: bounds turns, not tokens) + `timeout-minutes` | +| `codex`, `gemini` | `timeout-minutes` only | +| `copilot` (default) | `timeout-minutes` only — but per-run cost is naturally bounded to ~1–2 premium requests | +| `copilot` (BYOK) | `COPILOT_PROVIDER_MAX_PROMPT_TOKENS` / `COPILOT_PROVIDER_MAX_OUTPUT_TOKENS` per call | + +**Recommended Autoloop addition:** + +Add a frontmatter field on each program: + +```markdown +--- +schedule: every 6h +target-metric: 0.95 +metric-direction: higher +cost: + per-iteration: + max-tokens: 200000 # hard cap; iteration is rejected if exceeded + max-usd: 1.00 # hard cap; iteration is rejected if exceeded +--- +``` + +This is enforced in two places: + +1. **Pre-iteration**, the workflow translates the cap to whatever + per-engine knob exists (`max-turns` for Claude; + `COPILOT_PROVIDER_MAX_*` for Copilot BYOK; `timeout-minutes` for + everything else). +2. **Post-iteration**, the agent reads the run's actual cost from + `gh aw audit ` (the `agentic-workflows` MCP tool) and, if + it exceeded the cap, marks the iteration `rejected: cost-overrun` + in the state file's Iteration History — even if the metric + improved. This produces the ratchet behaviour we want: a one-off + spike does not pollute the budget but also gets recorded as a + lesson. + +### 4.2 Per-program ($ or tokens per day / week / month) + +**Goal:** a program is allowed up to `$X / week`. Once exceeded, the +scheduler skips it until the window rolls. + +**Today, with stock gh-aw:** *no native mechanism.* This is the gap +the issue is asking us to fill. + +**Recommended Autoloop addition:** + +The scheduler in `workflows/scripts/autoloop_scheduler.py` already +owns the "which program runs next" decision. Extend it to: + +1. Parse a `cost.budget` block from the program frontmatter: + + ```markdown + --- + cost: + budget: + window: weekly # daily | weekly | monthly | rolling-Nh + max-usd: 20.00 + max-tokens: 5000000 + on-exceeded: pause # pause | skip-until-window-rolls | warn + --- + ``` + +2. On each scheduling pass, fetch recent runs via + `gh aw logs --start-date - --json` and group by program + (using a label or commit-message convention; see §6.1). Sum + `estimated_cost` and `token_usage` per program. + +3. If a program is over budget, treat it as not-due and append a + `pause_reason: "budget-exceeded: $21.40 of $20.00 weekly cap"` to + its state file. Optionally post a status comment. + +4. When the budget window rolls (or a human edits the cap), the + program becomes due again automatically. + +This is a small change to the scheduler — it already reads JSON +state — and it gives us the only real `$/week` cap available without +upstream gh-aw changes. + +### 4.3 Repository-wide cap + +**Goal:** the whole Autoloop installation in this repo costs no more +than `$X / month`, regardless of how many programs exist. + +Same scheduler hook as §4.2, with a single `repo-budget.md` (or a +section in `AGENTS.md`) declaring the cap. If the repo-wide cap is +exceeded, *all* programs are skipped until the window rolls, and a +single status issue is opened/updated. + +This is important because Autoloop encourages adding programs +liberally: without a repo cap, N programs scale cost linearly with +no safety net. + +### 4.4 Per-engine / per-account cap + +For `copilot`, the practical cap is "premium requests on the service +account that owns `COPILOT_GITHUB_TOKEN`". This is set in the GitHub +billing UI for that account, not in gh-aw, and is enforced by GitHub +itself — over-budget runs simply fail at the API layer. + +**Recommendation:** document the convention of using a *dedicated +service account* per Autoloop installation, with a GitHub-billing-side +spending limit set explicitly. This is the only truly hard cap +available for Copilot today. + +For `claude` / `codex` / `gemini`, set per-key spending limits in the +respective provider consoles. Same logic. + +--- + +## 5. How to better use cheaper models + +Three independent strategies, in increasing leverage: + +### 5.1 Strategy A: cheap default, expensive escalation + +The autoloop prompt is *mostly* bookkeeping: read state, parse +history, decide what to try, run a tool, write a comment, update a +state file. A small model (`gpt-4.1-mini`, `claude-haiku-4-5`, +`gemini-2.5-flash`) is more than capable of the bookkeeping; the +expensive part is the *proposal* (the actual code change). + +**Implementation:** keep `engine: copilot` but add +`engine.model: gpt-4.1-mini` as the default. Programs that want a +frontier model for proposals declare: + +```markdown +--- +model: gpt-5 # opt in to the expensive default +--- +``` + +The workflow reads this via the scheduler's output (already a JSON +blob written to `/tmp/gh-aw/autoloop.json`) and, if present, sets +`COPILOT_MODEL` for that run via `engine.env`. + +### 5.2 Strategy B: per-phase routing + +Split the iteration into phases and route each to a different model: + +| Phase | Suggested model | Why | +|---|---|---| +| State parsing, scheduling decision | smallest available | pure structured-data work | +| Lessons-learned summarisation / compaction | small reasoning model | summarisation, no code | +| Proposal (the actual change) | frontier reasoning model | the only step where capability matters | +| Evaluation log analysis | small model | structured output | +| State file update / commit message | smallest available | templated text | + +This is harder to express in stock gh-aw (one workflow = one engine). +Two paths: + +1. **Multi-job workflow:** split the autoloop workflow into a chain + of runs (`workflow_call` or `dispatch-workflow`), each with its + own `engine.model`. Episode-level cost data + (`gh aw logs --json .episodes[]`) makes this observable. +2. **In-iteration BYOK switching:** the agent itself, mid-run, + shells out to a cheaper model for sub-tasks via a small helper + script. Less elegant; only worth it if (1) is too disruptive. + +For now, Strategy A captures most of the savings of Strategy B with +a fraction of the complexity. + +### 5.3 Strategy C: quality-aware downgrade based on history + +The state file's Iteration History records, per iteration, whether +the proposal was accepted, rejected, or errored. The scheduler can +read this to *automatically* choose the model: + +- N consecutive `accepted` iterations → downgrade to a cheaper model + (the program is "easy" right now). +- M consecutive `rejected` iterations on the cheap model → upgrade + back to the frontier model (the program is "hard" right now). + +This is the autoresearch-flavoured approach: let the optimisation +loop optimise its own cost. Fits naturally into Autoloop's existing +ratchet philosophy. + +--- + +## 6. Concrete proposals for Autoloop + +Roughly in priority order; each is independently shippable. + +### 6.1 (P0) Add a `program` label to runs for cost attribution + +`gh aw logs --json` reports per-run cost but does not natively +distinguish *which program* a given run was for — they all share the +workflow name `autoloop`. Without attribution, no per-program budget +is possible. + +**Fix:** when the autoloop workflow starts, write the selected +program name to a run-level annotation that survives in the logs +output. Two cheap options: + +- Set a `concurrency.group: gh-aw-autoloop-` (already would + be useful for serialisation; doubles as an attribution key visible + in `gh aw logs --json`). +- Tag the agent commit with `Autoloop-Program: ` in the commit + trailer (already done by the workflow's accepted commits) and use + that for offline attribution. + +### 6.2 (P0) Add `engine.model` to `workflows/autoloop.md` + +Default the workflow to a cheap model (e.g. `gpt-4.1-mini`). +Document the override path for programs that need a frontier model. + +This is a one-line frontmatter change with the largest expected cost +reduction (estimated 5–10× for typical programs). + +### 6.3 (P0) Add `concurrency` and `stop-after` to the workflow + +```yaml +concurrency: + group: gh-aw-autoloop + cancel-in-progress: false +stop-after: +90m +``` + +Prevents two schedules racing and prevents a stuck run from re-firing +on the next 6h schedule until the current one is done. + +### 6.4 (P1) Frontmatter `cost:` block in `program.md` + +Spec proposed in §4.1 and §4.2. Implementation in +`workflows/scripts/autoloop_scheduler.py`: + +- Extend `parse_program_frontmatter` to return a 6-tuple, adding a + `cost_config` dict — or, preferably, refactor it to return a + dataclass; the 5-tuple is already at the readability cliff. +- Extend the scheduling decision to consult cost. + +### 6.5 (P1) Repo-wide budget in `AGENTS.md` + +A single fenced block: + +````markdown +```autoloop-budget +window: monthly +max-usd: 100.00 +on-exceeded: pause-all +``` +```` + +Read once per scheduler invocation; enforced before per-program +checks. Cheapest possible governance lever. + +### 6.6 (P2) Quality-aware model downgrade (Strategy C) + +Implement after 6.1–6.5 are in and we have ≥ 1 month of cost data +to validate the heuristics on. + +### 6.7 (P2) Per-phase model routing (Strategy B) + +Defer. Only worth doing if Strategy A + Strategy C combined still +leave inference cost as the dominant line item. + +### 6.8 (P3) Upstream `token-budget:` to gh-aw + +The cleanest end state is a portable `token-budget: N` in gh-aw +frontmatter that all engines respect. This is an upstream feature +request; track separately. Until it lands, Autoloop's per-program +budget is enforced post-hoc by the scheduler (§4.1 step 2), which is +sufficient for a $/week ratchet but does not stop a single runaway +iteration mid-flight. + +--- + +## 7. Summary of the cost-control surface (proposed final state) + +| Layer | Mechanism | Time scale | Hard or soft | +|---|---|---|---| +| GitHub billing | per-token-account spend cap | month | hard (provider-enforced) | +| gh-aw workflow | `engine.model` (cheap default) | per-iteration | soft (cost reduction, not cap) | +| gh-aw workflow | `timeout-minutes` (45) | per-iteration | hard (kills agent) | +| gh-aw workflow | `stop-after: +90m` | per-trigger | hard (cancels run) | +| gh-aw workflow | `concurrency` | parallel runs | hard (queues) | +| gh-aw workflow | `safe-outputs.*.max` | per-iteration | hard (caps writes) | +| gh-aw engine | `max-turns` (claude) / `COPILOT_PROVIDER_MAX_*` (BYOK) | per-iteration | hard (engine-enforced) | +| Autoloop program | `cost.per-iteration.max-{tokens,usd}` | per-iteration | soft (post-hoc reject) | +| Autoloop program | `cost.budget.{window,max-*}` | day/week/month | hard (scheduler refuses) | +| Autoloop repo | `autoloop-budget` block in `AGENTS.md` | day/week/month | hard (scheduler refuses) | +| Autoloop scheduler | quality-aware downgrade | rolling | soft (cost reduction) | + +The combination of *cheap default*, *per-iteration soft cap*, +*per-program hard $/week cap*, and *per-account hard $/month cap* +gives defense in depth without leaning on any one mechanism. + +--- + +## 8. References + +- gh-aw cost management: +- gh-aw engines reference: +- gh-aw rate-limiting controls: +- gh-aw effective-tokens spec: +- GitHub Copilot billing: +- `workflows/autoloop.md` — current Autoloop workflow definition +- `workflows/scripts/autoloop_scheduler.py` — scheduler that owns the + "which program runs next" decision and is the natural enforcement + point for budgets