AltimateAI · anandgupta42 · May 11, 2026 · May 11, 2026 · May 11, 2026 · May 11, 2026
diff --git a/.opencode/skills/dbt-develop/SKILL.md b/.opencode/skills/dbt-develop/SKILL.md
diff --git a/.opencode/skills/dbt-schema-verify/SKILL.md b/.opencode/skills/dbt-schema-verify/SKILL.md
@@ -0,0 +1,144 @@
+---
+name: dbt-schema-verify
+applyPaths:
+  - "dbt_project.yml"
+  - "**/dbt_project.yml"
+description: |
+  REQUIRED after building or modifying ANY dbt model that has columns declared
+  in `schema.yml` / `_models.yml`. Run `altimate-dbt schema-verify --model
+  <name>` to diff actual columns against the spec, and treat any `mismatch`
+  verdict as "not done."
+
+  The most common reason "the build is green but the tests still fail" is
+  that the model produces the right *data values* in the wrong *column
+  shape* — extra columns, missing columns, wrong order, wrong types. Many
+  dbt equality tests grade the column tuple `(name, type, position)`
+  exactly, and the agent's prior bias is to add "helpful" extras
+  (`p1`/`p2`/`p3` rank breakdowns, name-resolved variants, lineage
+  metadata) or reorder columns "more logically." Both break the contract.
+
+  This skill enforces the mechanical check that catches those bugs before
+  declaring done. Use it before declaring any model task complete.
+---
+
+# dbt schema-verify
+
+## When to invoke this skill — every time
+
+Run `altimate-dbt schema-verify --model <name>` before declaring any of the
+following tasks complete:
+
+- Creating a new dbt model that has (or will have) a `schema.yml` entry
+- Modifying an existing model whose `schema.yml` declares columns
+- Refactoring a CTE into its own intermediate model
+- Renaming columns or changing their order
+- Changing materialization config in a way that re-creates the table
+- Any task that says "match the schema", "produce these columns", "the
+  output should have columns X, Y, Z", or references a `_models.yml`
+- Any task with `AUTO_*_equality` or `AUTO_*_existence` tests on a model
+
+If the task touched N models, run schema-verify on **all N of them**, not
+just the last one. A `build` is not a verify.
+
+## How to run it
+
+```bash
+altimate-dbt schema-verify --model <name>
+```
+
+**Note**: `altimate-dbt build --model <name>` already runs schema-verify
+automatically after a successful build and includes the verdict in its
+response under a `schema_verify` field. You will see the diff in the same
+result that reported the build outcome — read it there before deciding
+the task is done. If you need to re-check after editing, call
+`schema-verify` directly.
+
+Returns a structured JSON result:
+
+```json
+{
+  "model": "int_asana__project_user_agg",
+  "verdict": "mismatch",
+  "expected_columns": ["project_id", "users", "number_of_users_involved"],
+  "actual_columns": ["project_id", "users"],
+  "columns_extra": [],
+  "columns_missing": ["number_of_users_involved"],
+  "columns_reordered": [],
+  "type_mismatches": []
+}
+```
+
+## How to read the verdict
+
+| verdict | meaning | what to do |
+|---|---|---|
+| `match` | actual columns match the spec exactly (case-insensitive on names) | DONE — proceed |
+| `mismatch` | one or more of `columns_extra`, `columns_missing`, `columns_reordered`, `type_mismatches` is non-empty | NOT DONE — read the diff, fix the model SQL, rebuild, re-run schema-verify |
+| `no-spec` | the model has no columns declared in `schema.yml` | DONE for shape-fidelity purposes — no contract to verify against |
+
+## How to act on a `mismatch`
+
+For each non-empty list, the fix is mechanical:
+
+| Field | What it means | What to change in the model SQL |
+|---|---|---|
+| `columns_extra` | columns in your model NOT in the spec | REMOVE them from the `SELECT` |
+| `columns_missing` | columns in the spec NOT in your model | ADD them to the `SELECT` (compute them, or rename an existing column if you used a synonym) |
+| `columns_reordered` | columns present in both but at different positions | REORDER the columns in your `SELECT` to match the spec's order |
+| `type_mismatches` | declared `data_type` in spec disagrees with the warehouse's reported type | CAST in the `SELECT` or change the upstream source |
+
+Then run `altimate-dbt build --model <name>` again, then re-run
+`altimate-dbt schema-verify --model <name>` until verdict is `match`.
+
+## Iron Rules
+
+1. **The verdict is the source of truth, not your inspection.** Reading the
+   columns yourself and concluding "looks right to me" does not count.
+   Run the command and read its output.
+2. **A `mismatch` is "not done", even if the build is green.** dbt build
+   only proves the SQL compiled and ran without errors. It does not prove
+   the column shape is correct. Equality tests grade shape AND values.
+3. **Do not reinterpret the spec to make the model right.** The spec is
+   the contract. If the spec lists `supplier_company` and your model has
+   `supplier_id`, the answer is to fix your model, not to argue that
+   `supplier_id` is more useful.
+4. **Run schema-verify on every model touched, not just the last one.**
+   The most common "almost-pass" is N-1 models passing and the Nth one
+   silently failing on column shape. Walk the list.
+5. **Skip only on `no-spec`.** Do not skip on the grounds that the model
+   is small, or trivial, or "obvious." The spec is small only because
+   the dbt project author already curated it.
+
+## Fallback when altimate-dbt is unavailable
+
+If `which altimate-dbt` returns nothing, do the same diff by hand:
+
+```bash
+# 1. Read expected columns from schema.yml
+cat models/**/schema.yml | grep -A 50 "name: <name>"   # or yq
+
+# 2. Read actual columns from the materialized table
+dbt show --select <name> --limit 0
+```
+
+Compare the two ordered lists. Produce the same four-bucket diff
+(`columns_extra`, `columns_missing`, `columns_reordered`,
+`type_mismatches`) in your head, and apply the same fix logic. The
+mechanics don't change; only the tool name does.
+
+## What this skill does NOT cover
+
+- **Value-level correctness** — passing schema-verify only proves shape;
+  whether the *values* in each column are right is a separate check
+  (`altimate-dbt test` + dbt unit tests). Generate unit tests with the
+  `dbt-unit-tests` skill when the model has non-trivial transformation
+  logic.
+- **Row count** — schema-verify compares columns, not rows. If a refactor
+  drops rows that should be preserved (common when extracting a CTE into
+  its own model — see `dbt-develop`'s "Refactoring a CTE into its own
+  model" section), schema-verify will pass while equality tests fail.
+  Check row counts separately.
+- **Custom tests** — `check_*` and other non-AUTO tests check
+  task-specific business rules, not column shape. schema-verify can pass
+  while a custom test fails. Read the custom test SQL to understand
+  what's being asserted.
diff --git a/.opencode/skills/dbt-unit-tests/SKILL.md b/.opencode/skills/dbt-unit-tests/SKILL.md
@@ -32,6 +32,19 @@ description: Generate dbt unit tests automatically for any model. Analyzes SQL l
 3. **Use sql format for ephemeral models.** Dict format fails silently for ephemeral upstreams.
 4. **Never weaken a test to make it pass.** If the test fails, the model logic may be wrong. Investigate before changing expected values.
 5. **Compile before committing.** Always run `altimate-dbt test --model <name>` to verify tests compile and execute.
+6. **Mock data MUST exercise the failure modes of every SQL construct in the model.** A unit test that only covers the happy path validates that the model handles easy inputs — it does not validate correctness. Before writing `given:` rows, list every SQL construct in the model and the boundary case it can mishandle, then ensure at least one mock row triggers each. Universal cases to always cover when the construct appears:
+   - **`LEFT JOIN` / `LEFT OUTER JOIN`** → at least one parent row with **no matching child** (catches `COUNT(*)` phantom rows, `SUM` over `NULL`, fan-out / dropout)
+   - **`INNER JOIN`** → at least one parent row whose child is filtered out by the JOIN condition (catches missing rows)
+   - **`COUNT(*)` / `COUNT(<col>)`** → row where the counted column is `NULL` (catches `COUNT(*)` vs `COUNT(col)` divergence)
+   - **`NULLIF(x, y)`** → row where `x = y` (so the result is `NULL`, exercising downstream `NULL`-handling)
+   - **`/` division** → row where the denominator is `0` or `NULL`
+   - **`CASE WHEN`** → at least one row matching each branch, including the implicit `ELSE NULL` if no explicit `ELSE` is set
+   - **`COALESCE` / `IFNULL`** → row where every argument is `NULL`
+   - **Window functions (`OVER`)** → an empty partition, a partition of size 1, and a row at the partition boundary
+   - **Date arithmetic / date spines** → a row at the start of range, end of range, and a gap day with no events
+   - **Aggregations with `GROUP BY`** → at least one group of size 1 (often masks fan-out bugs) and one group whose key is `NULL`
+   - **Incremental merge keys** → both an "insert" row and an "update" row matching an existing key
+   If you can't think of a failure mode for a construct, you don't yet understand it well enough to test it — read the SQL again before guessing inputs.
 
 ## Core Workflow: Analyze -> Generate -> Refine -> Validate -> Write
 

diff --git a/benchmark/ade-bench/README.md b/benchmark/ade-bench/README.md
@@ -0,0 +1,134 @@
+# Reproducing altimate-code on ADE-Bench
+
+This folder contains everything you need to plug altimate-code into [ADE-Bench](https://github.com/dbt-labs/ade-bench) (dbt Labs's Analytics & Data Engineering benchmark) and reproduce the **81.3% pass rate** reported in [`../../research/kimi-k26-ade-bench-2026-05-10/findings.md`](../../research/kimi-k26-ade-bench-2026-05-10/findings.md).
+
+It deliberately does **not** ship the trace files, the per-trial result JSONs, the seed DuckDB databases, or the prebuilt 130 MB tarball — those are either large binaries or run outputs. Everything here is source code + scripts + 4 short patches against upstream ade-bench. Run the steps below and you'll get equivalent data.
+
+## What's in this folder
+
+```
+benchmark/ade-bench/
+├── README.md                              ← you are here
+├── altimate_code_agent/                   ← drop-in agent module for ade-bench
+│   ├── __init__.py
+│   ├── altimate_code_agent.py             ← the AltimateCodeAgent class
+│   ├── altimate-code-setup.sh             ← installs altimate-code inside the trial container
+│   └── build-local-tarball.sh             ← builds the linux/x64+arm64 tarball from source
+└── patches/                               ← 4 small patches to upstream ade-bench
+    ├── 01-agent_name.py.patch
+    ├── 02-agent_factory.py.patch
+    ├── 03-installed_agents_init.py.patch
+    └── 04-agent_setup.py.patch
+```
+
+The agent module is ~280 lines of Python + ~80 lines of shell. The 4 patches add a total of ~12 lines across the upstream tree. Nothing here is benchmark-targeted — the agent module just wires altimate-code into ade-bench's pluggable `--agent` mechanism the same way the upstream `claude`, `codex`, `gemini`, and `macro` agents are wired in.
+
+## Prerequisites
+
+- **Docker Desktop** ≥ 4.0, configured with **≥ 8 GiB memory** (12 GiB recommended for concurrency=6). Lower than 6 GiB causes `npm install` inside the trial container to OOM-swap and trip the setup timeout.
+- **macOS, Linux, or WSL2.** Apple Silicon is fine — the tarball builder produces both linux/amd64 and linux/arm64 binaries so the container runs natively on either host arch.
+- **bun ≥ 1.3** on the host (`brew install oven-sh/bun/bun` or [bun.sh](https://bun.sh)) for building the altimate-code tarball.
+- **Python ≥ 3.10** and [`uv`](https://docs.astral.sh/uv/getting-started/installation/) for the ade-bench harness.
+- **`gh` CLI** authenticated to GitHub (used to download ade-bench's shared seed databases).
+- **An OpenRouter API key** (`OPENROUTER_API_KEY`). Any LLM provider altimate-code supports will work; the published results use `moonshotai/kimi-k2.6-20260420` via OpenRouter, baseURL `https://openrouter.ai/api/v1`.
+
+## End-to-end reproduction (~30 min setup + ~1–2 h benchmark)
+
+```bash
+# === 0. Clone altimate-code (this repo) and ade-bench side by side ===
+mkdir -p ~/ade-bench-repro && cd ~/ade-bench-repro
+git clone https://github.com/AltimateAI/altimate-code
+git clone https://github.com/dbt-labs/ade-bench
+cd ade-bench
+
+# === 1. Wire altimate-code into ade-bench ===
+# a) Drop the agent module in:
+cp -r ../altimate-code/benchmark/ade-bench/altimate_code_agent \
+      ade_bench/agents/installed_agents/altimate_code
+
+# b) Apply the 4 small patches that register the agent + route AGENTS.md to it:
+for p in ../altimate-code/benchmark/ade-bench/patches/*.patch; do
+  git apply "$p"
+done
+
+# === 2. Install the ade-bench harness ===
+uv venv && source .venv/bin/activate
+uv pip install -e .
+
+# === 3. Download the shared seed databases ===
+mkdir -p shared/databases/duckdb
+gh release download databases --repo dbt-labs/ade-bench \
+  --pattern "*.duckdb" --dir shared/databases/duckdb
+
+# === 4. Build the altimate-code tarball from source ===
+# Produces ade_bench/agents/installed_agents/altimate_code/altimate-code-local.tgz
+# (~130 MB, contains linux/amd64 + linux/arm64 binaries + skills + dbt-tools)
+./ade_bench/agents/installed_agents/altimate_code/build-local-tarball.sh
+
+# === 5. Run the benchmark ===
+export OPENROUTER_API_KEY=sk-or-v1-...
+export DEFAULT_AGENT_TIMEOUT_SEC=1800     # 30 min wall cap per trial
+export SETUP_TIMEOUT_SEC=300              # 5 min cap on dbt-deps + altimate-code install
+export DEFAULT_TEST_TIMEOUT_SEC=120       # test-phase cap
+
+ade run all \
+  --db duckdb \
+  --project-type dbt \
+  --agent altimate \
+  --model openrouter/moonshotai/kimi-k2.6-20260420 \
+  --no-rebuild \
+  --n-concurrent-trials 6 \
+  --max-episodes 80
+```
+
+After the run, `ade view` opens the local HTML dashboard with per-trial detail (transcript, file diffs, dbt test output, cost & token counts).
+
+## How the agent module works
+
+`altimate_code_agent.py` defines `AltimateCodeAgent(AbstractInstalledAgent)`, which:
+
+1. **`_install_agent_script`** returns the path to `altimate-code-setup.sh`. ade-bench copies the script into `/installed-agent/install-agent.sh` inside each trial container and sources it.
+2. **`perform_task`** (overridden) also copies the locally-built tarball to `/installed-agent/altimate-code-local.tgz` before invoking the install script. Inside the container, `altimate-code-setup.sh` does `npm install -g /installed-agent/altimate-code-local.tgz`, picks the right per-arch binary (`uname -m`), and writes `~/.config/altimate-code/altimate-code.json` with the OpenRouter provider config.
+3. **`_run_agent_commands`** emits `altimate-code run --format json --yolo --model <model_id> --max-turns 80 <task_prompt>` and tee's the JSON event stream so the harness can parse per-step token counts, cost, and tool usage.
+4. **`AltimateCodeParser`** reads `step_finish` events out of the JSON stream and aggregates per-trial cost, runtime, turn count, input/output/cache token totals.
+5. **`AltimateCodeLogFormatter`** renders a human-readable transcript for the per-trial HTML dashboard.
+
+The 4 patches register `AgentName.ALTIMATE_CODE = "altimate"` and route the shared `AGENTS.md` baseline config (the same file Codex receives) into the container — putting altimate-code on equal footing with the other benchmarked agents.
+
+## Knobs
+
+Most behavior comes from environment variables read by the ade-bench harness and altimate-code's setup script. The relevant ones:
+
+| Variable | Default | What it controls |
+|---|---|---|
+| `OPENROUTER_API_KEY` | (required if `--model openrouter/...`) | OpenRouter API key. Baked into `~/.config/altimate-code/altimate-code.json` at container setup time. |
+| `OPENROUTER_MODEL_ID` | `moonshotai/kimi-k2.6-20260420` | Override only if you want a different OpenRouter-routed model. The `--model` flag must match: `openrouter/<this-id>`. |
+| `AZURE_RESOURCE_NAME` + `AZURE_API_KEY` | unset | Optional. If both are set, an `azure-foundry` provider is also registered against `https://<resource>.services.ai.azure.com/openai/v1`. Lets you A/B against an Azure-hosted Kimi or other Foundry deployment. |
+| `AZURE_DEPLOYMENT_NAME` | `Kimi-K2.6` | Azure Foundry deployment name (used only if Azure env vars are set). |
+| `DEFAULT_AGENT_TIMEOUT_SEC` | 180 (upstream); set to **1800** for these runs | Wall-clock cap per trial. Kimi-K2.6 spends ~89% of wall time reasoning; lower caps will cause hard tasks to time out. |
+| `SETUP_TIMEOUT_SEC` | 120 (upstream); set to **300** | Cap on the install phase. With ≥ 8 GiB Docker memory you rarely need more than 60 s; 300 s gives a margin under concurrent load. |
+| `DEFAULT_TEST_TIMEOUT_SEC` | 30 (upstream); set to **120** | Cap on the post-agent dbt-test phase. A few tasks have ~15 sub-tests that exceed 30 s on the first run. |
+
+`--n-concurrent-trials 6` was the sweet spot for a 12 GiB Docker / 8 CPU host. Higher concurrency works on a beefier host but `npm install` inside each container is the main bottleneck — 6 simultaneous installs comfortably finish in ~30 s; 10 starts to thrash.
+
+## Troubleshooting
+
+- **`agent_setup_timeout` on most trials.** Bump Docker memory. Symptom is `npm install -g /installed-agent/altimate-code-local.tgz` swapping for minutes. Anything below 6 GiB will do this.
+- **`Error response from daemon: 500 ...` from Docker.** Container created during memory pressure. Same fix: bump Docker memory + restart Docker Desktop.
+- **`Cannot find package @altimateai/altimate-code-linux-arm64` during npm install.** You're running an older copy of `altimate-code-setup.sh` that expected the per-arch optionalDependencies layout. Re-copy the script from `altimate_code_agent/altimate-code-setup.sh` — it uses the cached-binary trick that ships both archs inside one tarball.
+- **`OSError: [Errno 63] File name too long: 'tasks/airbnb007 airbnb009 ...'`** when re-running specific tasks. Caused by shell-quoting in some setups; pass each task ID as a separate argv item, not a single space-separated string.
+- **Pass rate noticeably lower than 81.3% on a fresh run.** First check: did the agent actually call OpenRouter (not a stale Azure config)? Inside one of the trial containers, `cat ~/.config/altimate-code/altimate-code.json | jq '.provider | keys'` should list `openrouter`. Second: are you using `--n-concurrent-trials 1` against the original Azure deployment by mistake? That hit 100 K TPM throttling in early runs.
+
+## What's intentionally NOT in this folder
+
+- **Trace data / `results.json` / `agent.log`** — those live under `experiments/` after a run. Re-run to regenerate.
+- **The 130 MB built tarball (`altimate-code-local.tgz`)** — rebuild with `build-local-tarball.sh` (~5–10 min the first time, ~30 s on subsequent builds while bun cache is warm).
+- **Seed databases (`*.duckdb`)** — pulled from `dbt-labs/ade-bench` GitHub releases by step 3 above. They're large (300–500 MB total).
+- **Per-task ground-truth seeds and test SQL** — those live in upstream ade-bench's `tasks/<id>/` and are never sent to the agent during a run.
+
+## Pointers
+
+- The behavioral analysis of the run: [`../../research/kimi-k26-ade-bench-2026-05-10/findings.md`](../../research/kimi-k26-ade-bench-2026-05-10/findings.md)
+- altimate-code source: this repository
+- ade-bench source: https://github.com/dbt-labs/ade-bench
+- OpenRouter Kimi-K2.6 model card: https://openrouter.ai/moonshotai/kimi-k2.6-20260420
diff --git a/benchmark/ade-bench/altimate_code_agent/__init__.py b/benchmark/ade-bench/altimate_code_agent/__init__.py
@@ -0,0 +1,5 @@
+from ade_bench.agents.installed_agents.altimate_code.altimate_code_agent import (
+    AltimateCodeAgent,
+)
+
+__all__ = ["AltimateCodeAgent"]