eval-framework

Here are 9 public repositories matching this topic...

greynewell / matchspec

Eval framework. Define correct, test against it, get results.

Updated Feb 17, 2026
Go

vassiliylakhonin / agenda-intelligence-md

Evidence & eval layer for strategic intelligence agents. Markdown protocol, JSON schemas, CLI checks, and a stdio MCP server that help AI agents produce auditable, schema-valid, evidence-aware policy / sanctions / regulation / geopolitical-risk briefs. Not a factuality verifier.

json-schema risk-analysis mcp policy-analysis sanctions ai-agents regulatory-compliance llm-tools llm-evaluation model-context-protocol mcp-server strategic-intelligence eval-framework geopolitical-risk agent-infrastructure

Updated May 18, 2026
Python

abhijeetnardele24-hash / dev-eval-innovator

Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.

python ci developer-tools prompt-engineering llm-testing llm-evals openai-compatible eval-framework

Updated Apr 13, 2026
Python

Duguce / GuessArena-Demo

A web-based interactive demo for the GuessArena evaluation framework

flask demo large-language-models chatgpt deepseek guessarena eval-framework

Updated Nov 15, 2025
HTML

cyperx84 / multiplan

4-model parallel planning workflow with eval framework — Claude, Gemini, Codex, GLM-5 · OpenClaw ecosystem

cli golang multi-model ai-planning llm eval-framework openclaw

Updated Mar 20, 2026
Go

svetkis / triage-voice-eval

Binary safety verdicts (SAFE/HELD/LEAK/MISS/BROKE) + persona fan-out for LLM pipeline evals

python testing jailbreak evaluation safety safety-critical guardrails llm prompt-injection crisis-detection eval-framework verdicts

Updated May 5, 2026
Python

RitikPatill / rubric-lab

Open-source evaluation framework for AI agents. Define test suites with rubrics, run your agent, get LLM-as-judge scores against criteria, inspect full execution traces, and diff runs to catch behavioral regressions.

python open-source ai nextjs agents llm anthropic llm-as-judge agentic-ai agent-evals eval-framework agent-tracing

Updated May 12, 2026
Python

vola-trebla / toad-mcp-server

MCP server exposing portfolio tools (Semantic Search, Eval Framework, Observability) via Model Context Protocol

typescript mcp semantic-search claude ai-tools llm prompt-management model-context-protocol mcp-server eval-framework

Updated Mar 17, 2026
TypeScript

varunk130 / AI-Eval-Skills

Curated AI agent evaluation skills from Microsoft's Eval Guide — plan, generate, run, and interpret eval suites for Copilot Studio agents

ai-agents ai-evaluation llm-evaluation claude-code agent-evaluation eval-framework claude-skills

Updated May 16, 2026
