Eval framework. Define correct, test against it, get results.
Updated Feb 17, 2026 · Go
Evidence & eval layer for strategic intelligence agents. Markdown protocol, JSON schemas, CLI checks, and a stdio MCP server that help AI agents produce auditable, schema-valid, evidence-aware policy / sanctions / regulation / geopolitical-risk briefs. Not a factuality verifier.
Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.
A web-based interactive demo for the GuessArena evaluation framework
4-model parallel planning workflow with eval framework — Claude, Gemini, Codex, GLM-5 · OpenClaw ecosystem
Binary safety verdicts (SAFE/HELD/LEAK/MISS/BROKE) + persona fan-out for LLM pipeline evals
Open-source evaluation framework for AI agents. Define test suites with rubrics, run your agent, get LLM-as-judge scores against criteria, inspect full execution traces, and diff runs to catch behavioral regressions.
MCP server exposing portfolio tools (Semantic Search, Eval Framework, Observability) via Model Context Protocol
Curated AI agent evaluation skills from Microsoft's Eval Guide — plan, generate, run, and interpret eval suites for Copilot Studio agents