Cross-LLM consistency — model-awareness, evaluation pipeline, and prompt hardening #127
Summary
PromptKit is designed as a model-agnostic prompt composition system, but the prompts it assembles are ultimately executed by LLMs — and different LLMs interpret the same instructions with measurably different fidelity. A prompt that reliably produces a well-structured investigation-report on GPT-4o may produce an incomplete or re-ordered output on a smaller model, or may omit epistemic labels on a model that wasn't trained to follow that convention strictly.
This issue proposes a plan to make PromptKit model-aware and to drive toward predictable, deterministic outputs regardless of which LLM executes the assembled prompt.
Problem Statement
PromptKit's value proposition is that composing the right persona + protocols + format + template produces a reliable, high-quality output. But "reliable" is currently an implicit assumption — there is no mechanism to:
- Measure how much output quality and structure vary across LLMs for a given prompt
- Identify which prompt components or phrasings are fragile across model families
- Harden components against known model-specific failure modes
- Signal to users which templates have been validated on which models
The practical effect is that a PromptKit user running review-code on Claude Sonnet gets a different experience than one running it on GPT-4.1 or Gemini 2.0 Flash — not because the task differs, but because the prompt is inadvertently model-tuned by whoever authored it.
Dimensions of Variation (Known Risk Areas)
| Dimension | Example Failure Mode |
|---|---|
| Format adherence | Model re-orders sections, omits required fields, invents section names |
| Protocol compliance | Model skips phases (e.g., hypothesis generation), treats multi-phase protocol as a checklist |
| Epistemic labeling | Model omits KNOWN/INFERRED/ASSUMED tags or uses them inconsistently |
| Section completeness | Model writes "None" instead of "None identified" or omits empty sections entirely |
| Instruction following precision | Model ignores quantitative constraints (e.g., "re-verify 3–5 specific claims") |
| Non-goal enforcement | Model expands scope beyond stated non-goals |
| Self-verification depth | Model produces shallow verification ("I have reviewed the above") vs. genuine re-checking |
Proposed Enhancement Plan
Phase 1 — Evaluation Framework
Define a prompt portability evaluation methodology:
- Select a representative set of PromptKit templates (covering each category)
- Define golden inputs: deterministic, minimal input fixtures (e.g., a known buggy C snippet for `investigate-bug`)
- Define a scoring rubric for each template covering: section presence, field completeness, protocol phase coverage, epistemic label usage, non-goal adherence
- Execute each template × input pair against a matrix of target LLMs
- Record structured results (pass/fail per rubric criterion, plus qualitative notes)
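As a sketch of how such a rubric could be checked mechanically, the snippet below scores one model output against regex criteria. The criterion names, section headers, and epistemic labels are illustrative assumptions, not an actual PromptKit rubric:

```python
import re

# Illustrative rubric: each criterion is a regex the raw model output must
# match. Section names and labels here are hypothetical placeholders.
RUBRIC = {
    "summary_section_present": r"(?m)^#+\s*Summary\b",
    "findings_section_present": r"(?m)^#+\s*Findings\b",
    "epistemic_labels_used": r"\b(KNOWN|INFERRED|ASSUMED)\b",
}

def score_output(output: str, rubric: dict[str, str]) -> dict[str, bool]:
    """Return pass/fail per rubric criterion for one model output."""
    return {name: bool(re.search(pattern, output))
            for name, pattern in rubric.items()}

def summarize(results: dict[str, bool]) -> str:
    """Compact per-output summary for the structured results log."""
    passed = sum(results.values())
    return f"{passed}/{len(results)} criteria passed"
```

Qualitative notes would still be recorded by hand; only the structural pass/fail checks lend themselves to automation like this.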
New component candidates:
- Protocol: `model-portability` — authoring guidelines that make PromptKit components robust across model families (e.g., prefer numbered phases over bullet lists, always use imperative mood, avoid ambiguous pronouns, bound instruction scope explicitly)
- Template: `evaluate-prompt-portability` — systematic evaluation of a PromptKit prompt against multiple LLMs using a scoring rubric
- Format: `portability-report` — structured output capturing per-model, per-criterion scores and recommended prompt changes
Phase 2 — CI Pipeline Integration
Integrate evaluation into CI/CD:
- Add a GitHub Actions workflow that runs a selected subset of golden-input × template pairs against configurable LLM endpoints (using GitHub Models or another API provider)
- The workflow compares structured output against rubric expectations (regex/schema checks for required fields, section headers, epistemic label presence)
- Failures surface as PR checks — a protocol change that breaks format adherence on a target model is caught before merge
- Results are stored as workflow artifacts for trend analysis over time
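A minimal workflow sketch along these lines is given below. The action versions are real, but the script path, secret name, and model list are placeholder assumptions:

```yaml
# Hypothetical sketch of .github/workflows/evaluate-portability.yml;
# script path, secret name, and model IDs are placeholders.
name: evaluate-portability
on:
  schedule:
    - cron: "0 3 * * *"   # nightly, to keep API cost bounded
  workflow_dispatch: {}    # manual runs for per-PR investigation
jobs:
  evaluate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: [claude-sonnet, gpt-4o, gemini-2.0-flash]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Run golden-input evaluations
        run: python tests/run_portability_eval.py --model "${{ matrix.model }}"
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
      - uses: actions/upload-artifact@v4
        with:
          name: portability-results-${{ matrix.model }}
          path: results/
```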
Open questions:
- Which LLMs to target in CI? (Cost, API availability, model stability — suggest: Claude Sonnet, GPT-4o, Gemini 2.0 Flash, Llama 3 as a baseline)
- Should evaluation be per-PR (expensive) or nightly (cheaper, lower signal)?
- How to handle non-determinism — temperature=0 where supported, seeded prompts?
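For the non-determinism question, one mitigation is to run each prompt a few times (at temperature 0 where supported) and require agreement before trusting the result. `call_model` below is a hypothetical callable standing in for whatever client the pipeline ends up using:

```python
from collections import Counter
from typing import Callable

def stable_output(call_model: Callable[[str], str],
                  prompt: str, runs: int = 3) -> tuple[str, bool]:
    """Call a model several times and return the majority output,
    plus a flag for whether all runs agreed exactly.

    `call_model` is a hypothetical (prompt) -> str callable.
    """
    outputs = [call_model(prompt) for _ in range(runs)]
    best, freq = Counter(outputs).most_common(1)[0]
    return best, freq == runs  # (canonical output, fully deterministic?)
```

Exact string agreement is a strict criterion; a rubric-level comparison (same pass/fail vector) would tolerate harmless wording drift.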
Phase 3 — Prompt Hardening Feedback Loop
Use evaluation data to improve PromptKit components:
- For each discovered fragility, trace it to the responsible component (persona, protocol, format, or template)
- Apply targeted rewrites following the `model-portability` protocol
- Re-evaluate after the rewrite to confirm regression closure
- Add `model_notes` to template frontmatter recording known limitations and validated models:

```yaml
model_notes:
  validated_on: [claude-sonnet-4, gpt-4o, gemini-2.0-flash]
  known_issues:
    - model: gpt-4.1-mini
      issue: "Omits Phase 3 self-verification step; adds a shallow summary instead"
      workaround: "Add explicit 'You MUST execute Phase 3...' reminder at end of protocol"
```

Phase 4 — Model Compatibility Matrix (Documentation)
Publish a model compatibility matrix in the docs:
- Per-template, per-model compatibility scores (Verified ✅ / Partial ⚠️ / Known Issues ❌ / Not Tested ?)
- Guidance for users on which models to prefer for high-stakes tasks
- Link to evaluation run artifacts for auditability
Scope of Changes
| Area | Change |
|---|---|
| `protocols/guardrails/` | New `model-portability.md` protocol |
| `templates/` | New `evaluate-prompt-portability.md` template |
| `formats/` | New `portability-report.md` format |
| `manifest.yaml` | Register new components |
| `.github/workflows/` | New `evaluate-portability.yml` CI workflow |
| `docs/` | Model compatibility matrix |
| `tests/` | Golden input fixtures + rubric definitions |
Success Criteria
- At least 10 representative templates evaluated against ≥ 3 LLMs
- Evaluation results are reproducible (deterministic inputs, recorded outputs)
- At least one prompt hardening cycle completed and verified
- CI workflow runs evaluations and reports pass/fail per template × model
- `model_notes` frontmatter populated for all evaluated templates
- Model compatibility matrix published in docs
Related
- This enhancement extends the existing `self-verification` and `anti-hallucination` guardrails — those protocols assume the model will follow instructions, but don't harden the instructions against model-specific failure modes.
- The `extend-library` interactive template is the recommended entry point for designing the new components (`model-portability`, `evaluate-prompt-portability`, `portability-report`).
- The `profile-session` template (session log analysis) is complementary — it can help identify which protocol phases are being skipped in practice.