Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) by alliscode · Pull Request #6101 · microsoft/agent-framework

alliscode · 2026-05-27T00:51:54Z

Motivation and Context

Integrates Foundry Adaptive Evals (rubric-generation) into Agent Framework's Python eval surface, strictly additively on top of the existing FoundryEvals integration (ADR 0023). Adds:

New core typed surfaces — GeneratedEvaluatorRef, RubricDimension, RubricScore, EvalGenerationSource (all @experimental(EVALS)).
FoundryEvals accepts GeneratedEvaluatorRef mixed into the existing evaluators= sequence and emits the correct azure_ai_evaluator testing-criteria.
Per-dimension rubric scores parsed off output_item.results[*].properties.rubric_scores into EvalScoreResult.dimensions.
Assertion helpers on EvalResults — assert_score_at_least, assert_dimension_score_at_least, assert_no_failed_items — for CI gating.
BaseAgent.as_eval_source() / Workflow.as_eval_source() to package the richest source available (instructions, tool defs, context-provider classes, topology) for rubric generation, with conservative privacy defaults.
FoundryEvals.generate_rubric(...) — orchestrates the beta.evaluators.create_generation_job LRO, polls to terminal status, returns a pinned GeneratedEvaluatorRef.
YAML config loader (load_evaluators_from_yaml) + end-to-end sample under python/samples/05-end-to-end/evaluation/foundry_evals/.
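To make the CI-gating idea concrete, here is a minimal, self-contained sketch of what a per-dimension assertion could look like. `DimensionScore`, `ItemResult`, and `assert_dimension_at_least` are hypothetical stand-ins written for illustration; they are not the types or signatures added by this PR, and the parameter names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DimensionScore:
    """Hypothetical stand-in for a per-dimension rubric score."""

    name: str
    score: float


@dataclass
class ItemResult:
    """Hypothetical stand-in for one evaluated item."""

    evaluator: str
    dimensions: list[DimensionScore] = field(default_factory=list)


def assert_dimension_at_least(items, dimension, threshold, *, evaluator=None, require_applicable=True):
    """Raise AssertionError if any matching dimension score is below threshold."""
    found_any = False
    for item in items:
        # Optionally restrict the check to a single named evaluator.
        if evaluator is not None and item.evaluator != evaluator:
            continue
        for dim in item.dimensions:
            if dim.name != dimension:
                continue
            found_any = True
            if dim.score < threshold:
                raise AssertionError(f"{item.evaluator}/{dimension}: {dim.score} < {threshold}")
    # Fail loudly when nothing was scored, so a misspelled name cannot pass silently.
    if require_applicable and not found_any:
        raise AssertionError(f"no scores found for dimension {dimension!r}")


items = [
    ItemResult("support_rubric", [DimensionScore("accuracy", 4.0), DimensionScore("tone", 3.0)]),
    ItemResult("support_rubric", [DimensionScore("accuracy", 5.0)]),
]
assert_dimension_at_least(items, "accuracy", 4.0)  # passes: both scores are >= 4.0
```

In a CI job, a helper of this shape would turn a low rubric score into a failing test rather than a warning in a log.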

The testing-criterion side is unchanged on the wire — we already emit azure_ai_evaluator for builtin.* names, so the new path just supplies a custom evaluator_name + pinned evaluator_version.
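As a rough illustration of that point, the sketch below builds one criterion for a built-in name and one for a generated evaluator. The `GeneratedRef` class, the `build_criterion` function, and the exact dict layout (including the `name` key) are assumptions made for this example; only `azure_ai_evaluator`, `evaluator_name`, and `evaluator_version` come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class GeneratedRef:
    """Hypothetical stand-in for a generated-evaluator reference."""

    name: str
    version: Optional[str] = None


def build_criterion(evaluator: Union[str, GeneratedRef]) -> dict:
    """Build one azure_ai_evaluator criterion (the dict layout is an assumption)."""
    if isinstance(evaluator, str):
        # Built-in evaluators are addressed by name only.
        return {"type": "azure_ai_evaluator", "name": evaluator, "evaluator_name": evaluator}
    criterion = {
        "type": "azure_ai_evaluator",
        "name": evaluator.name,
        "evaluator_name": evaluator.name,
    }
    # A generated evaluator additionally carries its pinned version, when set.
    if evaluator.version is not None:
        criterion["evaluator_version"] = evaluator.version
    return criterion


builtin = build_criterion("builtin.relevance")  # example built-in name
generated = build_criterion(GeneratedRef("support_rubric", "3"))
```

Both paths produce the same criterion type; the generated path differs only in the extra version field.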

The generate_rubric helper gracefully degrades when the installed azure-ai-projects version pre-dates the rubric generation APIs (raises a clear NotImplementedError with install guidance).
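The degradation pattern described here can be sketched with a lazy import that converts a missing module or missing symbol into a `NotImplementedError`. The function name `load_generation_inputs_type` and the wording of the install guidance are assumptions for illustration; the PR does not state the exact message or minimum version.

```python
import importlib


def load_generation_inputs_type(module_name: str = "azure.ai.projects.models"):
    """Return the EvaluatorGenerationInputs class, or raise NotImplementedError."""
    try:
        # Import lazily so the package still loads when the SDK is older or absent.
        module = importlib.import_module(module_name)
        return getattr(module, "EvaluatorGenerationInputs")
    except (ImportError, AttributeError) as exc:
        # Illustrative guidance only; the real message is not quoted in the PR.
        raise NotImplementedError(
            "Rubric generation requires a newer azure-ai-projects release; "
            "upgrade the package and try again."
        ) from exc
```

Catching `AttributeError` as well as `ImportError` covers the case where the package is installed but pre-dates the rubric generation types.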

Description

6 commits, one per logical phase. Reviewers can step through them in order:

feat(evals): GeneratedEvaluatorRef + RubricDimension/RubricScore types — core only, no Foundry coupling.
feat(foundry-evals): accept GeneratedEvaluatorRef in evaluators= — wires phase 1 into _build_testing_criteria and preserves refs through _filter_tool_evaluators.
feat(evals): parse rubric_scores from output items + assertion helpers — adds EvalScoreResult.dimensions, the three assert_* helpers, and _extract_rubric_scores in _foundry_evals.py.
feat(evals): agent.as_eval_source / workflow.as_eval_source — new EvalGenerationSource core type plus BaseAgent.as_eval_source(...) / Workflow.as_eval_source(...) source-export helpers.
feat(foundry-evals): generate_rubric helper — LRO orchestrator. Imports azure.ai.projects.models.EvaluatorGenerationInputs etc. lazily; raises NotImplementedError with install guidance when unavailable.
feat(foundry-evals): YAML config loader + sample — load_evaluators_from_yaml and evaluate_with_generated_rubric_sample.py.
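For the third commit, a minimal sketch of reading per-dimension scores off `output_item.results[*].properties.rubric_scores` might look like the following. The function name `extract_rubric_scores` and the entry keys `dimension` and `score` are assumptions for illustration; the PR names the path but does not spell out the shape of each entry.

```python
from typing import Any


def extract_rubric_scores(output_item: dict[str, Any]) -> list[tuple[str, float]]:
    """Collect (dimension, score) pairs from results[*].properties.rubric_scores."""
    pairs = []
    for result in output_item.get("results") or []:
        # Results from evaluators without rubrics may have no properties at all.
        properties = result.get("properties") or {}
        for entry in properties.get("rubric_scores") or []:
            # The "dimension" and "score" keys are assumed for this sketch.
            if "dimension" in entry and "score" in entry:
                pairs.append((str(entry["dimension"]), float(entry["score"])))
    return pairs


sample = {
    "results": [
        {
            "name": "support_rubric",
            "properties": {
                "rubric_scores": [
                    {"dimension": "accuracy", "score": 4},
                    {"dimension": "tone", "score": 3.5},
                ]
            },
        },
        {"name": "builtin.relevance", "properties": None},
    ]
}
print(extract_rubric_scores(sample))  # prints [('accuracy', 4.0), ('tone', 3.5)]
```

Tolerating missing or `None` values at each level keeps results from non-rubric evaluators from breaking the parse.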

.NET parity

Not in this PR. Planned as a follow-up 6-commit stack against Microsoft.Agents.AI + Microsoft.Agents.AI.Foundry — unblocked today (the rubric APIs are in Azure.AI.Projects 2.1.0-beta.2, already pinned).

Contribution Checklist

The code builds clean without any errors or warnings
Follows the Coding Guidelines
All unit tests pass (85 core local-eval + 285 foundry; ruff check + format clean; pyright clean on changed files)
I have added unit tests where appropriate
I have updated the documentation accordingly (sample added; ADR follow-up tracked)

moonbox3 · 2026-05-27T00:56:47Z

Python Test Coverage Report

File	Stmts	Miss	Cover	Missing
packages/core/agent_framework
_agents.py	415	52	87%	509, 518, 573, 1068, 1113, 1186–1190, 1250, 1278, 1315, 1336, 1356–1357, 1362, 1409, 1451, 1473, 1475, 1488, 1494, 1539, 1541, 1550–1555, 1560, 1562, 1568–1569, 1576, 1578–1579, 1587–1588, 1591–1593, 1603–1608, 1612, 1617, 1619
_evaluation.py	851	83	90%	164, 172, 488, 490, 534, 583, 585, 601, 636, 846, 867–868, 1013–1015, 1172, 1175, 1254–1256, 1261, 1298–1301, 1357–1358, 1361, 1367–1369, 1373, 1406–1408, 1464, 1500, 1512–1514, 1519, 1543–1548, 1641, 1719–1720, 1722–1726, 1732, 1771, 2119, 2121, 2129, 2139, 2143, 2188, 2206–2207, 2285, 2287, 2293, 2301, 2316, 2354, 2360–2364, 2396, 2427–2428, 2430, 2455–2456, 2461
packages/core/agent_framework/_workflows
_workflow.py	341	26	92%	60, 62, 67, 91, 96, 157, 193, 401–403, 405–406, 450, 452, 478, 512, 676, 887, 908, 956, 968, 974, 979, 999–1001
packages/foundry/agent_framework_foundry
_evals_config.py	126	11	91%	172–173, 203, 209, 225, 230, 258, 264, 356, 370, 374
_foundry_evals.py	470	44	90%	431, 477–478, 584, 589, 772, 839, 1028–1031, 1033–1038, 1040–1042, 1044, 1059, 1080–1082, 1101, 1105, 1109, 1114, 1116–1125, 1138, 1168, 1170, 1187–1188
TOTAL	36924	4378	88%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
7323	34 💤	0 ❌	0 🔥	1m 55s ⏱️

Copilot

Pull request overview

Adds an additive Python integration for Foundry Adaptive Evals rubric-generation, extending the existing Foundry evals surface with typed rubric references, rubric score parsing, CI-friendly assertions, source export helpers (agent/workflow → generation sources), a YAML config loader, and an end-to-end sample.

Changes:

Introduces new experimental core evaluation types (GeneratedEvaluatorRef, RubricDimension, RubricScore, EvalGenerationSource) plus agent/workflow “export to eval source” helpers.
Extends FoundryEvals to accept generated rubric evaluators, generate new rubrics via Foundry LROs, and parse per-dimension rubric scores into results.
Adds YAML-based evaluator config loading + sample, plus tests covering the new behaviors.

Summary per file

File	Description
python/samples/05-end-to-end/evaluation/foundry_evals/evaluators.yaml	Sample YAML config describing a generated rubric evaluator.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_with_generated_rubric_sample.py	End-to-end sample that generates a rubric, runs evals, and asserts quality gates.
python/packages/foundry/tests/test_foundry_evals.py	Adds unit tests for generated rubric refs, filtering behavior, rubric score extraction, and rubric-generation orchestration.
python/packages/foundry/tests/test_evals_config.py	New tests for YAML-driven rubric-generation config parsing and source building.
python/packages/foundry/agent_framework_foundry/_foundry_evals.py	Implements generated rubric support in criteria building, rubric score extraction, and `generate_rubric()` LRO polling + conversion to `GeneratedEvaluatorRef`.
python/packages/foundry/agent_framework_foundry/_evals_config.py	New YAML config schema + loader + source builder for rubric generation.
python/packages/foundry/agent_framework_foundry/__init__.py	Exposes the new evals-config loader/schemas at the foundry package surface.
python/packages/core/tests/core/test_local_eval.py	Adds tests for new rubric assertion helpers and agent/workflow eval-source export helpers.
python/packages/core/agent_framework/_workflows/_workflow.py	Adds `Workflow.as_eval_source()` convenience wrapper.
python/packages/core/agent_framework/_evaluation.py	Adds new rubric-related types, result fields, assertion helpers, and eval-source export helpers.
python/packages/core/agent_framework/_agents.py	Adds `BaseAgent.as_eval_source()` convenience wrapper.
python/packages/core/agent_framework/__init__.py	Re-exports new evaluation types/helpers at top-level.

Copilot's findings

Files reviewed: 12/12 changed files
Comments generated: 4

github-actions

Automated Code Review

Reviewers: 4 | Confidence: 82% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach

Automated review by alliscode's agents

Addresses 4 Copilot review comments on PR microsoft#6101:

1. assert_dimension_score_at_least: drop the (not evaluator or found_any) guard so require_applicable=True correctly raises when the named evaluator produces no entries for the dimension. Adds TestRubricAssertions covering the regression.
2. GeneratedEvaluatorRef docstring: reword to describe actual behaviour (pinning recommended, not required) so it matches the dataclass default and FoundryEvals warning path.
3. _poll_generation_job: switch from asyncio.get_event_loop() to get_running_loop() and bound the per-iteration sleep by remaining time, matching _poll_eval_run.
4. generate_rubric: type category as Literal['quality','safety'] and validate at the entry point with a ValueError; drop the silent 'invalid -> quality' rewrite in _generation_job_to_ref. Adds a regression test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
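The polling and validation fixes in that commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `poll_until_terminal`, `validate_category`, the `TERMINAL_STATUSES` names, and the default timeout and interval are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Literal, get_args

Category = Literal["quality", "safety"]
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}  # assumed status names


def validate_category(category: str) -> str:
    """Reject unknown categories up front instead of silently rewriting them."""
    if category not in get_args(Category):
        raise ValueError(f"category must be one of {get_args(Category)}, got {category!r}")
    return category


async def poll_until_terminal(
    get_status: Callable[[], Awaitable[str]],
    *,
    timeout: float = 5.0,
    interval: float = 0.5,
) -> str:
    """Poll get_status() until a terminal status is seen or the timeout elapses."""
    # get_running_loop() raises outside a running loop instead of creating one.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await get_status()
        if status in TERMINAL_STATUSES:
            return status
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        # Never sleep past the deadline, even if interval is larger than what is left.
        await asyncio.sleep(min(interval, remaining))
```

Bounding each sleep by the remaining time keeps the total wait close to `timeout` rather than overshooting by up to one full interval.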

Copilot AI review requested due to automatic review settings May 27, 2026 00:51

Copilot started reviewing on behalf of alliscode May 27, 2026 00:52

moonbox3 added the python label May 27, 2026

Copilot AI reviewed May 27, 2026

github-actions Bot reviewed May 27, 2026

alliscode and others added 6 commits May 27, 2026 08:06

Python: feat(evals): RubricScore type + EvalScoreResult.dimensions (e45b934)
Python: feat(foundry-evals): RubricDimension + GeneratedEvaluatorRef + accept in evaluators= (e5830dd)
Python: feat(evals): parse rubric_scores from output items + assertion helpers (4bc6046)
Python: feat(evals): BaseAgent.as_eval_source / Workflow.as_eval_source (38d51d1)
Python: feat(foundry-evals): EvalGenerationSource + generate_rubric helper (a9e4676)
Python: feat(foundry-evals): YAML config loader + sample (4c7f94f)

Each commit carries the trailer Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

alliscode force-pushed the adaptive-evals branch from 08f3b46 to 4c7f94f May 27, 2026 15:23

alliscode wants to merge 7 commits into microsoft:main from alliscode:adaptive-evals

3 participants
