From bd12feaab1ab00ad1c3dca56fac79ec520c4424a Mon Sep 17 00:00:00 2001 From: Nigel Jones Date: Fri, 16 Jan 2026 13:53:58 +0000 Subject: [PATCH 1/2] docs: add AGENTS.md guidelines for AI coding assistants Add structured guidance for AI assistants working with Mellea: - AGENTS.md for contributors modifying Mellea internals - docs/AGENTS_TEMPLATE.md for downstream projects to copy --- AGENTS.md | 78 +++++++++++++++++ docs/AGENTS_TEMPLATE.md | 183 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 261 insertions(+) create mode 100644 AGENTS.md create mode 100644 docs/AGENTS_TEMPLATE.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..b50bb8c7 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,78 @@ + + +# Agent Guidelines for Mellea Contributors + +> **Which guide?** Modifying `mellea/`, `cli/`, or `test/` → this file. Writing code that imports Mellea → [`docs/AGENTS_TEMPLATE.md`](docs/AGENTS_TEMPLATE.md). + +## 1. Quick Reference +```bash +pre-commit install # Required: install git hooks +uv sync # Install deps & fix lockfile +uv run pytest -m "not qualitative" # Fast loop (unit tests only) +uv run pytest # Full suite (includes LLM tests) +uv run pytest -m integration # Tests requiring API keys +uv run ruff format . && uv run ruff check . # Lint & format +``` +**Branches**: `feat/topic`, `fix/issue-id`, `docs/topic` + +## 2. Directory Structure +| Path | Contents | +|------|----------| +| `mellea/stdlib` | Core: Sessions, Genslots, Requirements, Sampling, Context | +| `mellea/backends` | Providers: HF, OpenAI, Ollama, Watsonx, LiteLLM | +| `mellea/helpers` | Utilities, logging, model ID tables | +| `cli/` | CLI commands (`m serve`, `m alora`, `m decompose`, `m eval`) | +| `test/` | All tests. Unmarked = unit tests (no network/API keys) | +| `scratchpad/` | Experiments (git-ignored) | + +## 3. 
Test Markers +- `@pytest.mark.qualitative` — LLM output quality tests (skipped in CI) +- `@pytest.mark.integration` — Requires API keys +- **Unmarked** — Pure unit tests: no network, deterministic + +⚠️ Don't add `qualitative` to trivial tests—keep the fast loop fast. + +## 4. Coding Standards +- **Types required** on all core functions +- **Docstrings are prompts** — be specific, the LLM reads them +- **Google-style docstrings** +- **Ruff** for linting/formatting +- Use `...` in `@generative` function bodies +- Prefer primitives over classes + +## 5. Commits & Hooks +[Angular format](https://github.com/angular/angular/blob/main/CONTRIBUTING.md#commit): `feat:`, `fix:`, `docs:`, `test:`, `refactor:`, `release:` + +Pre-commit runs: ruff, mypy, uv-lock, codespell + +## 6. Timing +> **Don't cancel**: `pytest` (full) and `pre-commit --all-files` may take minutes. Canceling mid-run can corrupt state. + +## 7. Common Issues +| Problem | Fix | +|---------|-----| +| `ComponentParseError` | Add examples to docstring | +| `uv.lock` out of sync | Run `uv sync` | +| Ollama refused | Run `ollama serve` | + +## 8. Self-Review (before notifying user) +1. `uv run pytest -m "not qualitative"` passes? +2. `ruff format` and `ruff check` clean? +3. New functions typed with concise docstrings? +4. Unit tests added for new functionality? +5. Avoided over-engineering? + +## 9. Writing Tests +- Place tests in `test/` mirroring source structure +- Name files `test_*.py` (required for pydocstyle) +- Use `gh_run` fixture for CI-aware tests (see `conftest.py`) +- **No LLM calls** in unmarked tests—mock or mark `qualitative` +- If a test fails, fix the **code**, not the test (unless the test was wrong) + +## 10. Feedback Loop +Found a bug, workaround, or pattern? 
Update the docs: +- **Issue/workaround?** → Add to Section 7 (Common Issues) in this file +- **Usage pattern?** → Add to [`docs/AGENTS_TEMPLATE.md`](docs/AGENTS_TEMPLATE.md) +- **New pitfall?** → Add warning near relevant section diff --git a/docs/AGENTS_TEMPLATE.md b/docs/AGENTS_TEMPLATE.md new file mode 100644 index 00000000..a1543c1e --- /dev/null +++ b/docs/AGENTS_TEMPLATE.md @@ -0,0 +1,183 @@ + + +# Mellea Usage Guidelines + +> **This file**: For code that *imports* Mellea. For Mellea internals, see [`../AGENTS.md`](../AGENTS.md). + +Copy below into your `AGENTS.md` or system prompt. + +--- + +### Library: Mellea +Use `mellea` for LLM interactions. No direct OpenAI/Anthropic calls or LangChain OutputParsers. + +**Prerequisites**: `pip install mellea` · [Docs](https://mellea.ai) · [Repo](https://github.com/generative-computing/mellea) + +#### 1. The `@generative` Pattern +**Don't** write prompt templates or regex parsers: +```python +# BAD - don't do this +response = openai.chat.completions.create(...) +age = int(re.search(r"\d+", response).group()) +``` +**Do** use typed function signatures: +```python +from mellea import generative, start_session + +@generative +def extract_age(text: str) -> int: + """Extract the user's age from text.""" + ... + +m = start_session() +age = extract_age(m, text="Alice is 30") # Returns int(30) +``` + +#### 2. Complex Types +```python +from pydantic import BaseModel +from mellea import generative + +class UserProfile(BaseModel): + name: str + age: int + interests: list[str] + +@generative +def parse_profile(bio: str) -> UserProfile: ... +``` + +#### 3. 
Chain-of-Thought +Add `reasoning` field to force the LLM to "think" before answering: +```python +from typing import Literal +from pydantic import BaseModel, Field + +class AnalysisResult(BaseModel): + reasoning: str # LLM fills first + conclusion: Literal["approve", "reject"] + confidence: float = Field(ge=0.0, le=1.0) + +@generative +def analyze_document(doc: str) -> AnalysisResult: ... +``` + +#### 4. Control Flow +Use Python `if/for/while`. No graph frameworks needed: +```python +if analyze_sentiment(m, email) == "negative": + draft = draft_apology(m, email) +else: + draft = draft_response(m, email) +``` + +#### 5. Instruct-Validate-Repair +For strict requirements, use `m.instruct()`: +```python +from mellea.stdlib.requirements import req, simple_validate +from mellea.stdlib.sampling import RejectionSamplingStrategy + +email = m.instruct( + "Write an invite for {{name}}", + requirements=[ + req("Must be formal"), + req("Lowercase only", validation_fn=simple_validate(lambda x: x.islower())) + ], + strategy=RejectionSamplingStrategy(loop_budget=3), + user_variables={"name": "Alice"} +) +``` + +#### 6. Small Model Fix +Small models (1B-8B) can't calculate. Extract params with LLM, compute in Python: +```python +from pydantic import BaseModel + +class PhysicsParams(BaseModel): + speed_a: float + speed_b: float + delay_hours: float + +@generative +def extract_params(text: str) -> PhysicsParams: + """EXTRACT numbers only. Do not calculate.""" + ... + +def calculate_gap(p: PhysicsParams) -> float: + return p.speed_a * p.delay_hours +``` + +#### 7. One-Shot Examples +If model struggles, add examples to docstring: +```python +@generative +def identify_fruit(text: str) -> str | None: + """ + Extract fruit from text, or None if none mentioned. + Ex: "I ate an apple" -> "apple" + Ex: "The sky is blue" -> None + """ + ... +``` + +#### 8. 
Backend Config +```python +from mellea import start_session +from mellea.backends.model_options import ModelOption + +m = start_session( + model_id="granite3.3:8b", + model_options={ModelOption.TEMPERATURE: 0.0, ModelOption.MAX_NEW_TOKENS: 500} +) +``` +Options: `TEMPERATURE`, `MAX_NEW_TOKENS`, `SYSTEM_PROMPT`, `SEED`, `TOOLS`, `CONTEXT_WINDOW`, `THINKING`, `STREAM` + +#### 9. Async +```python +@generative +async def extract_age(text: str) -> int: + """Extract age.""" + ... + +result = await extract_age(m, text="Alice is 30") +``` +Session methods: `ainstruct`, `achat`, `aact`, `avalidate`, `aquery`, `atransform` + +#### 10. Auth +- **Ollama**: `start_session()` (no setup) +- **OpenAI**: `export OPENAI_API_KEY="..."` +- **Watsonx**: `export WATSONX_API_KEY="..."`, `WATSONX_URL`, `WATSONX_PROJECT_ID` + +**Never hardcode API keys.** + +#### 11. Anti-Patterns +- **Don't** retry `@generative` calls — Mellea handles retries internally +- **Don't** use `json.loads()` — use typed returns +- **Don't** wrap single functions in classes +- **Do** use `try/except` at app boundaries for network errors + +#### 12. Debugging +```python +from mellea.core import FancyLogger +FancyLogger.get_logger().setLevel("DEBUG") +``` +- `m.last_prompt()` — see exact prompt sent + +#### 13. Common Errors +| Error | Fix | +|-------|-----| +| `ComponentParseError` | LLM output didn't match type—add docstring examples | +| `TypeError: missing positional argument` | First arg must be session `m` | +| `ConnectionRefusedError` | Run `ollama serve` | +| Output wrong/None | Model too small—try larger or add `reasoning` field | + +#### 14. Testing +```bash +uv run pytest -m "not qualitative" # Fast loop +uv run pytest # Full (verify prompts work) +``` + +#### 15. Feedback +Found a workaround or pattern? Add it to Section 13 (Common Errors) above, or update this file with new guidance. 
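The instruct-validate-repair loop in Section 5 is, at its core, bounded rejection sampling. Below is a pure-Python sketch of that control flow with a stubbed generator standing in for the LLM call — `instruct_with_repair` and its parameters are illustrative names, not Mellea's API:

```python
from collections.abc import Callable


def instruct_with_repair(
    generate: Callable[[], str],
    validators: list[Callable[[str], bool]],
    loop_budget: int = 3,
) -> str:
    """Regenerate until every validator passes, or raise when the budget is spent."""
    last = ""
    for _ in range(loop_budget):
        last = generate()  # one LLM sample per loop iteration
        if all(check(last) for check in validators):
            return last
    raise RuntimeError(f"no sample passed validation in {loop_budget} attempts: {last!r}")


# Stubbed "model": the first draft violates the lowercase rule, the second passes.
drafts = iter(["Dear Alice, please join us.", "dear alice, please join us."])
result = instruct_with_repair(lambda: next(drafts), [str.islower])
```

`RejectionSamplingStrategy(loop_budget=3)` plays roughly the role of this loop inside `m.instruct()`.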
From ed490564aa5aebc6f64f9563e64fa1847d810b53 Mon Sep 17 00:00:00 2001 From: Nigel Jones Date: Fri, 16 Jan 2026 14:30:47 +0000 Subject: [PATCH 2/2] docs: fix AGENTS.md to match actual test infrastructure - Use uv sync --all-extras --all-groups (required for tests) - Add ollama serve requirement - Remove non-existent integration marker - Fix test timing expectations (~2 min, not instant) - Remove contradictory unmarked test guidance --- AGENTS.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index b50bb8c7..b7b2950d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -9,10 +9,10 @@ AGENTS.md — Instructions for AI coding assistants (Claude, Cursor, Copilot, Co ## 1. Quick Reference ```bash pre-commit install # Required: install git hooks -uv sync # Install deps & fix lockfile -uv run pytest -m "not qualitative" # Fast loop (unit tests only) -uv run pytest # Full suite (includes LLM tests) -uv run pytest -m integration # Tests requiring API keys +uv sync --all-extras --all-groups # Install all deps (required for tests) +ollama serve # Start Ollama (required for most tests) +uv run pytest -m "not qualitative" # Skips LLM quality tests (~2 min) +uv run pytest # Full suite (includes LLM quality tests) uv run ruff format . && uv run ruff check . # Lint & format ``` **Branches**: `feat/topic`, `fix/issue-id`, `docs/topic` @@ -24,13 +24,12 @@ uv run ruff format . && uv run ruff check . # Lint & format | `mellea/backends` | Providers: HF, OpenAI, Ollama, Watsonx, LiteLLM | | `mellea/helpers` | Utilities, logging, model ID tables | | `cli/` | CLI commands (`m serve`, `m alora`, `m decompose`, `m eval`) | -| `test/` | All tests. Unmarked = unit tests (no network/API keys) | +| `test/` | All tests (run from repo root) | | `scratchpad/` | Experiments (git-ignored) | ## 3. 
Test Markers -- `@pytest.mark.qualitative` — LLM output quality tests (skipped in CI) -- `@pytest.mark.integration` — Requires API keys -- **Unmarked** — Pure unit tests: no network, deterministic +- `@pytest.mark.qualitative` — LLM output quality tests (skipped in CI via `CICD=1`) +- **Unmarked** — Unit tests (may still require Ollama running locally) ⚠️ Don't add `qualitative` to trivial tests—keep the fast loop fast. @@ -67,8 +66,8 @@ Pre-commit runs: ruff, mypy, uv-lock, codespell ## 9. Writing Tests - Place tests in `test/` mirroring source structure - Name files `test_*.py` (required for pydocstyle) -- Use `gh_run` fixture for CI-aware tests (see `conftest.py`) -- **No LLM calls** in unmarked tests—mock or mark `qualitative` +- Use `gh_run` fixture for CI-aware tests (see `test/conftest.py`) +- Mark tests checking LLM output quality with `@pytest.mark.qualitative` - If a test fails, fix the **code**, not the test (unless the test was wrong) ## 10. Feedback Loop
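The `CICD=1` skip described in the revised Test Markers section can be wired up with pytest's standard collection hook. This is a hedged sketch of one plausible wiring — the repo's actual `test/conftest.py` may implement it differently:

```python
# Sketch of CI-aware marker handling for a conftest.py: when CICD=1,
# every test carrying @pytest.mark.qualitative is skipped at collection time.
import os

import pytest


def pytest_collection_modifyitems(config, items):
    if os.environ.get("CICD") == "1":
        skip_llm = pytest.mark.skip(reason="qualitative tests are skipped in CI")
        for item in items:
            if "qualitative" in item.keywords:  # marker names appear in item.keywords
                item.add_marker(skip_llm)
```

Running `CICD=1 uv run pytest` with a hook like this collects the full suite but reports qualitative tests as skipped, matching the `uv run pytest -m "not qualitative"` fast loop.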