Cross-LLM consistency — model-awareness, evaluation pipeline, and prompt hardening #127
Summary
PromptKit is designed as a model-agnostic prompt composition system, but the prompts it assembles are ultimately executed by LLMs — and different LLMs interpret the same instructions with measurably different fidelity. A prompt that reliably produces a well-structured investigation-report on GPT-4o may produce an incomplete or re-ordered output on a smaller model, or may omit epistemic labels on a model that wasn't trained to follow that convention strictly.
This issue proposes a plan to make PromptKit model-aware and to drive toward predictable, deterministic outputs regardless of which LLM executes the assembled prompt.
Problem Statement
PromptKit's value proposition is that composing the right persona + protocols + format + template produces a reliable, high-quality output. But "reliable" is currently an implicit assumption — there is no mechanism to:
- Measure how much output quality and structure vary across LLMs for a given prompt
- Identify which prompt components or phrasings are fragile across model families
- Harden components against known model-specific failure modes
- Signal to users which templates have been validated on which models
The practical effect is that a PromptKit user running review-code on Claude Sonnet gets a different experience than one running it on GPT-4.1 or Gemini 2.0 Flash — not because the task differs, but because the prompt is inadvertently model-tuned by whoever authored it.
Dimensions of Variation (Known Risk Areas)
| Dimension | Example Failure Mode |
|---|---|
| Format adherence | Model re-orders sections, omits required fields, invents section names |
| Protocol compliance | Model skips phases (e.g., hypothesis generation), treats multi-phase protocol as a checklist |
| Epistemic labeling | Model omits KNOWN/INFERRED/ASSUMED tags or uses them inconsistently |
| Section completeness | Model writes "None" instead of "None identified" or omits empty sections entirely |
| Instruction following precision | Model ignores quantitative constraints (e.g., "re-verify 3–5 specific claims") |
| Non-goal enforcement | Model expands scope beyond stated non-goals |
| Self-verification depth | Model produces shallow verification ("I have reviewed the above") vs. genuine re-checking |
Proposed Enhancement Plan
Phase 1 — Evaluation Framework
Define a prompt portability evaluation methodology:
- Select a representative set of PromptKit templates (covering each category)
- Define golden inputs: deterministic, minimal input fixtures (e.g., a known buggy C snippet for `investigate-bug`)
- Define a scoring rubric for each template covering: section presence, field completeness, protocol phase coverage, epistemic label usage, non-goal adherence
- Execute each template × input pair against a matrix of target LLMs
- Record structured results (pass/fail per rubric criterion, plus qualitative notes)
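As a sketch of how such a rubric could be checked mechanically, the snippet below scores one model output against regex criteria. The criterion names, section headers, and epistemic labels are illustrative assumptions, not an actual PromptKit rubric:

```python
import re

# Illustrative rubric: each criterion is a regex the raw model output must
# match. Section names and labels here are hypothetical placeholders.
RUBRIC = {
    "summary_section_present": r"(?m)^#+\s*Summary\b",
    "findings_section_present": r"(?m)^#+\s*Findings\b",
    "epistemic_labels_used": r"\b(KNOWN|INFERRED|ASSUMED)\b",
}

def score_output(output: str, rubric: dict[str, str]) -> dict[str, bool]:
    """Return pass/fail per rubric criterion for one model output."""
    return {name: bool(re.search(pattern, output))
            for name, pattern in rubric.items()}

def summarize(results: dict[str, bool]) -> str:
    """Compact per-output summary for the structured results log."""
    passed = sum(results.values())
    return f"{passed}/{len(results)} criteria passed"
```

Qualitative notes would still be recorded by hand; only the structural pass/fail checks lend themselves to automation like this.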
New component candidates:
- Protocol: `model-portability` — authoring guidelines that make PromptKit components robust across model families (e.g., prefer numbered phases over bullet lists, always use imperative mood, avoid ambiguous pronouns, bound instruction scope explicitly)
- Template: `evaluate-prompt-portability` — systematic evaluation of a PromptKit prompt against multiple LLMs using a scoring rubric
- Format: `portability-report` — structured output capturing per-model, per-criterion scores and recommended prompt changes
Phase 2 — CI Pipeline Integration
Integrate evaluation into CI/CD:
- Add a GitHub Actions workflow that runs a selected subset of golden-input × template pairs against configurable LLM endpoints (using GitHub Models or another API provider)
- The workflow compares structured output against rubric expectations (regex/schema checks for required fields, section headers, epistemic label presence)
- Failures surface as PR checks — a protocol change that breaks format adherence on a target model is caught before merge
- Results are stored as workflow artifacts for trend analysis over time
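A minimal workflow sketch along these lines is given below. The action versions are real, but the script path, secret name, and model list are placeholder assumptions:

```yaml
# Hypothetical sketch of .github/workflows/evaluate-portability.yml;
# script path, secret name, and model IDs are placeholders.
name: evaluate-portability
on:
  schedule:
    - cron: "0 3 * * *"   # nightly, to keep API cost bounded
  workflow_dispatch: {}    # manual runs for per-PR investigation
jobs:
  evaluate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: [claude-sonnet, gpt-4o, gemini-2.0-flash]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Run golden-input evaluations
        run: python tests/run_portability_eval.py --model "${{ matrix.model }}"
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
      - uses: actions/upload-artifact@v4
        with:
          name: portability-results-${{ matrix.model }}
          path: results/
```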
Open questions:
- Which LLMs to target in CI? (Cost, API availability, model stability — suggest: Claude Sonnet, GPT-4o, Gemini 2.0 Flash, Llama 3 as a baseline)
- Should evaluation be per-PR (expensive) or nightly (cheaper, lower signal)?
- How to handle non-determinism — temperature=0 where supported, seeded prompts?
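For the non-determinism question, one mitigation is to run each prompt a few times (at temperature 0 where supported) and require agreement before trusting the result. `call_model` below is a hypothetical callable standing in for whatever client the pipeline ends up using:

```python
from collections import Counter
from typing import Callable

def stable_output(call_model: Callable[[str], str],
                  prompt: str, runs: int = 3) -> tuple[str, bool]:
    """Call a model several times and return the majority output,
    plus a flag for whether all runs agreed exactly.

    `call_model` is a hypothetical (prompt) -> str callable.
    """
    outputs = [call_model(prompt) for _ in range(runs)]
    best, freq = Counter(outputs).most_common(1)[0]
    return best, freq == runs  # (canonical output, fully deterministic?)
```

Exact string agreement is a strict criterion; a rubric-level comparison (same pass/fail vector) would tolerate harmless wording drift.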
Phase 3 — Prompt Hardening Feedback Loop
Use evaluation data to improve PromptKit components:
- For each discovered fragility, trace it to the responsible component (persona, protocol, format, or template)
- Apply targeted rewrites following the `model-portability` protocol
- Re-evaluate after the rewrite to confirm regression closure
- Add `model_notes` to template frontmatter recording known limitations and validated models:

```yaml
model_notes:
  validated_on: [claude-sonnet-4, gpt-4o, gemini-2.0-flash]
  known_issues:
    - model: gpt-4.1-mini
      issue: "Omits Phase 3 self-verification step; adds a shallow summary instead"
      workaround: "Add explicit 'You MUST execute Phase 3...' reminder at end of protocol"
```

Phase 4 — Model Compatibility Matrix (Documentation)
Publish a model compatibility matrix in the docs:
- Per-template, per-model compatibility scores (Verified ✅ / Partial ⚠️ / Known Issues ❌ / Not Tested ?)
- Guidance for users on which models to prefer for high-stakes tasks
- Link to evaluation run artifacts for auditability
Scope of Changes
| Area | Change |
|---|---|
| `protocols/guardrails/` | New `model-portability.md` protocol |
| `templates/` | New `evaluate-prompt-portability.md` template |
| `formats/` | New `portability-report.md` format |
| `manifest.yaml` | Register new components |
| `.github/workflows/` | New `evaluate-portability.yml` CI workflow |
| `docs/` | Model compatibility matrix |
| `tests/` | Golden input fixtures + rubric definitions |
Success Criteria
- At least 10 representative templates evaluated against ≥ 3 LLMs
- Evaluation results are reproducible (deterministic inputs, recorded outputs)
- At least one prompt hardening cycle completed and verified
- CI workflow runs evaluations and reports pass/fail per template × model
- `model_notes` frontmatter populated for all evaluated templates
- Model compatibility matrix published in docs
Related
- This enhancement extends the existing `self-verification` and `anti-hallucination` guardrails — those protocols assume the model will follow instructions, but don't harden the instructions against model-specific failure modes.
- The `extend-library` interactive template is the recommended entry point for designing the new components (`model-portability`, `evaluate-prompt-portability`, `portability-report`).
- The `profile-session` template (session log analysis) is complementary — it can help identify which protocol phases are being skipped in practice.