Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) #6101

Draft
alliscode wants to merge 7 commits into microsoft:main from alliscode:adaptive-evals

@alliscode
Member

Motivation and Context

Integrates Foundry Adaptive Evals (rubric-generation) into Agent Framework's Python eval surface, strictly additively on top of the existing FoundryEvals integration (ADR 0023). Adds:

  • New core typed surfaces — GeneratedEvaluatorRef, RubricDimension, RubricScore, EvalGenerationSource (all @experimental(EVALS)).
  • FoundryEvals accepts GeneratedEvaluatorRef mixed into the existing evaluators= sequence and emits the correct azure_ai_evaluator testing-criteria.
  • Per-dimension rubric scores parsed off output_item.results[*].properties.rubric_scores into EvalScoreResult.dimensions.
  • Assertion helpers on EvalResults (assert_score_at_least, assert_dimension_score_at_least, assert_no_failed_items) for CI gating.
  • BaseAgent.as_eval_source() / Workflow.as_eval_source() to package the richest source available (instructions, tool defs, context-provider classes, topology) for rubric generation, with conservative privacy defaults.
  • FoundryEvals.generate_rubric(...) — orchestrates the beta.evaluators.create_generation_job LRO, polls to terminal status, returns a pinned GeneratedEvaluatorRef.
  • YAML config loader (load_evaluators_from_yaml) + end-to-end sample under python/samples/05-end-to-end/evaluation/foundry_evals/.
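
To make the rubric-score parsing and the dimension gate listed above more concrete, here is a minimal sketch. It assumes each `rubric_scores` entry is a mapping with `dimension` and `score` keys, which this description does not specify. `DimensionScore`, `extract_rubric_scores`, and `assert_dimension_at_least` are hypothetical names for illustration, not the framework's API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DimensionScore:
    """Hypothetical stand-in for a per-dimension rubric score."""

    dimension: str
    score: float


def extract_rubric_scores(output_item: dict[str, Any]) -> list[DimensionScore]:
    """Collect scores found at output_item["results"][*]["properties"]["rubric_scores"]."""
    scores: list[DimensionScore] = []
    for result in output_item.get("results") or []:
        properties = result.get("properties") or {}
        # Assumed entry shape: {"dimension": str, "score": number}.
        for entry in properties.get("rubric_scores") or []:
            scores.append(DimensionScore(str(entry["dimension"]), float(entry["score"])))
    return scores


def assert_dimension_at_least(
    scores: list[DimensionScore], dimension: str, threshold: float
) -> None:
    """Raise AssertionError if the dimension is missing or its lowest score is below threshold."""
    matching = [s.score for s in scores if s.dimension == dimension]
    if not matching:
        raise AssertionError(f"no scores found for dimension {dimension!r}")
    lowest = min(matching)
    if lowest < threshold:
        raise AssertionError(f"{dimension!r} scored {lowest}, below threshold {threshold}")
```

In a CI job, a gate like this would run after the eval completes, so that a missing or low dimension score fails the build instead of passing silently.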

The testing-criterion side is unchanged on the wire — we already emit azure_ai_evaluator for builtin.* names, so the new path just supplies a custom evaluator_name + pinned evaluator_version.
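
As a rough illustration of that wire shape, the sketch below builds a criterion dictionary. Only the `azure_ai_evaluator` type and the `evaluator_name` / `evaluator_version` fields come from this description. The `name` field and the helper `build_testing_criterion` are assumptions made for the example, not the actual `_build_testing_criteria` implementation.

```python
from typing import Any, Optional


def build_testing_criterion(
    evaluator_name: str, evaluator_version: Optional[str] = None
) -> dict[str, Any]:
    """Build an azure_ai_evaluator criterion for a builtin or generated evaluator."""
    criterion: dict[str, Any] = {
        "type": "azure_ai_evaluator",
        "name": evaluator_name,  # assumed display-name field
        "evaluator_name": evaluator_name,
    }
    # Generated rubric evaluators additionally carry a pinned version.
    if evaluator_version is not None:
        criterion["evaluator_version"] = evaluator_version
    return criterion
```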

The generate_rubric helper gracefully degrades when the installed azure-ai-projects version pre-dates the rubric generation APIs (raises a clear NotImplementedError with install guidance).
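
The degradation pattern can be sketched with a lazy import that is resolved only when the feature is used. `load_optional_symbol` is a hypothetical helper and the error wording is illustrative; the real helper's message and import sites may differ.

```python
import importlib
from typing import Any


def load_optional_symbol(module_name: str, attr_name: str, package_hint: str) -> Any:
    """Import attr_name from module_name lazily, or raise NotImplementedError with guidance."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)
    except (ImportError, AttributeError) as exc:
        # Covers both a missing package and an installed version that lacks the symbol.
        raise NotImplementedError(
            f"{attr_name} is not available in the installed {package_hint}; "
            f"upgrade with: pip install --upgrade {package_hint}"
        ) from exc
```

A call such as `load_optional_symbol("azure.ai.projects.models", "EvaluatorGenerationInputs", "azure-ai-projects")` would then fail only when rubric generation is actually requested, not at import time.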

Description

6 commits, one per logical phase. Reviewers can step through them in order:

  1. feat(evals): GeneratedEvaluatorRef + RubricDimension/RubricScore types — core only, no Foundry coupling.
  2. feat(foundry-evals): accept GeneratedEvaluatorRef in evaluators= — wires phase 1 into _build_testing_criteria and preserves refs through _filter_tool_evaluators.
  3. feat(evals): parse rubric_scores from output items + assertion helpers — adds EvalScoreResult.dimensions, the three assert_* helpers, and _extract_rubric_scores in _foundry_evals.py.
  4. feat(evals): agent.as_eval_source / workflow.as_eval_source — new EvalGenerationSource core type plus BaseAgent.as_eval_source(...) / Workflow.as_eval_source(...) source-export helpers.
  5. feat(foundry-evals): generate_rubric helper — LRO orchestrator. Imports azure.ai.projects.models.EvaluatorGenerationInputs etc. lazily; raises NotImplementedError with install guidance when unavailable.
  6. feat(foundry-evals): YAML config loader + sample (load_evaluators_from_yaml and evaluate_with_generated_rubric_sample.py).

.NET parity

Not in this PR. Planned as a follow-up 6-commit stack against Microsoft.Agents.AI + Microsoft.Agents.AI.Foundry — unblocked today (the rubric APIs are in Azure.AI.Projects 2.1.0-beta.2, already pinned).

Contribution Checklist

  • The code builds clean without any errors or warnings
  • Follows the Coding Guidelines
  • All unit tests pass (85 core local-eval + 285 foundry; ruff check + format clean; pyright clean on changed files)
  • I have added unit tests where appropriate
  • I have updated the documentation accordingly (sample added; ADR follow-up tracked)

Copilot AI review requested due to automatic review settings May 27, 2026 00:51
@moonbox3
Contributor

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| packages/core/agent_framework | | | | |
| _agents.py | 415 | 52 | 87% | 509, 518, 573, 1068, 1113, 1186–1190, 1250, 1278, 1315, 1336, 1356–1357, 1362, 1409, 1451, 1473, 1475, 1488, 1494, 1539, 1541, 1550–1555, 1560, 1562, 1568–1569, 1576, 1578–1579, 1587–1588, 1591–1593, 1603–1608, 1612, 1617, 1619 |
| _evaluation.py | 851 | 83 | 90% | 164, 172, 488, 490, 534, 583, 585, 601, 636, 846, 867–868, 1013–1015, 1172, 1175, 1254–1256, 1261, 1298–1301, 1357–1358, 1361, 1367–1369, 1373, 1406–1408, 1464, 1500, 1512–1514, 1519, 1543–1548, 1641, 1719–1720, 1722–1726, 1732, 1771, 2119, 2121, 2129, 2139, 2143, 2188, 2206–2207, 2285, 2287, 2293, 2301, 2316, 2354, 2360–2364, 2396, 2427–2428, 2430, 2455–2456, 2461 |
| packages/core/agent_framework/_workflows | | | | |
| _workflow.py | 341 | 26 | 92% | 60, 62, 67, 91, 96, 157, 193, 401–403, 405–406, 450, 452, 478, 512, 676, 887, 908, 956, 968, 974, 979, 999–1001 |
| packages/foundry/agent_framework_foundry | | | | |
| _evals_config.py | 126 | 11 | 91% | 172–173, 203, 209, 225, 230, 258, 264, 356, 370, 374 |
| _foundry_evals.py | 470 | 44 | 90% | 431, 477–478, 584, 589, 772, 839, 1028–1031, 1033–1038, 1040–1042, 1044, 1059, 1080–1082, 1101, 1105, 1109, 1114, 1116–1125, 1138, 1168, 1170, 1187–1188 |
| TOTAL | 36924 | 4378 | 88% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 7323 | 34 💤 | 0 ❌ | 0 🔥 | 1m 55s ⏱️ |

Contributor

Copilot AI left a comment

Pull request overview

Adds an additive Python integration for Foundry Adaptive Evals rubric-generation, extending the existing Foundry evals surface with typed rubric references, rubric score parsing, CI-friendly assertions, source export helpers (agent/workflow → generation sources), a YAML config loader, and an end-to-end sample.

Changes:

  • Introduces new experimental core evaluation types (GeneratedEvaluatorRef, RubricDimension, RubricScore, EvalGenerationSource) plus agent/workflow “export to eval source” helpers.
  • Extends FoundryEvals to accept generated rubric evaluators, generate new rubrics via Foundry LROs, and parse per-dimension rubric scores into results.
  • Adds YAML-based evaluator config loading + sample, plus tests covering the new behaviors.

Summary per file:

| File | Description |
|---|---|
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluators.yaml | Sample YAML config describing a generated rubric evaluator. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_with_generated_rubric_sample.py | End-to-end sample that generates a rubric, runs evals, and asserts quality gates. |
| python/packages/foundry/tests/test_foundry_evals.py | Adds unit tests for generated rubric refs, filtering behavior, rubric score extraction, and rubric-generation orchestration. |
| python/packages/foundry/tests/test_evals_config.py | New tests for YAML-driven rubric-generation config parsing and source building. |
| python/packages/foundry/agent_framework_foundry/_foundry_evals.py | Implements generated rubric support in criteria building, rubric score extraction, and generate_rubric() LRO polling + conversion to GeneratedEvaluatorRef. |
| python/packages/foundry/agent_framework_foundry/_evals_config.py | New YAML config schema + loader + source builder for rubric generation. |
| python/packages/foundry/agent_framework_foundry/__init__.py | Exposes the new evals-config loader/schemas at the foundry package surface. |
| python/packages/core/tests/core/test_local_eval.py | Adds tests for new rubric assertion helpers and agent/workflow eval-source export helpers. |
| python/packages/core/agent_framework/_workflows/_workflow.py | Adds Workflow.as_eval_source() convenience wrapper. |
| python/packages/core/agent_framework/_evaluation.py | Adds new rubric-related types, result fields, assertion helpers, and eval-source export helpers. |
| python/packages/core/agent_framework/_agents.py | Adds BaseAgent.as_eval_source() convenience wrapper. |
| python/packages/core/agent_framework/__init__.py | Re-exports new evaluation types/helpers at top-level. |

Copilot's findings

  • Files reviewed: 12/12 changed files
  • Comments generated: 4

Comment thread python/packages/core/agent_framework/_evaluation.py
Comment thread python/packages/core/agent_framework/_evaluation.py Outdated
Comment thread python/packages/foundry/agent_framework_foundry/_foundry_evals.py Outdated
Comment thread python/packages/foundry/agent_framework_foundry/_foundry_evals.py Outdated

@github-actions github-actions Bot left a comment

Automated Code Review

Reviewers: 4 | Confidence: 82% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach


Automated review by alliscode's agents

alliscode and others added 6 commits May 27, 2026 08:06
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…+ accept in evaluators=

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n helpers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…elper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses 4 Copilot review comments on PR microsoft#6101:

1. assert_dimension_score_at_least: drop the (not evaluator or found_any) guard so require_applicable=True correctly raises when the named evaluator produces no entries for the dimension. Adds TestRubricAssertions covering the regression.

2. GeneratedEvaluatorRef docstring: reword to describe actual behaviour (pinning recommended, not required) so it matches the dataclass default and FoundryEvals warning path.

3. _poll_generation_job: switch from asyncio.get_event_loop() to get_running_loop() and bound the per-iteration sleep by remaining time, matching _poll_eval_run.

4. generate_rubric: type category as Literal['quality','safety'] and validate at the entry point with a ValueError; drop the silent 'invalid -> quality' rewrite in _generation_job_to_ref. Adds a regression test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
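
Review items 3 and 4 above can be illustrated with a small sketch: a poll loop that uses `asyncio.get_running_loop()` and caps each sleep by the time remaining, plus an entry-point check on `category`. The function names, the terminal status strings, and the `get_status` callable are assumptions for the example; they are not taken from `_poll_generation_job` or `generate_rubric`.

```python
import asyncio
from typing import Awaitable, Callable, Literal, get_args

Category = Literal["quality", "safety"]
# Assumed terminal status names; the service's actual values may differ.
TERMINAL_STATUSES = frozenset({"succeeded", "failed", "canceled"})


def validate_category(category: str) -> str:
    """Reject unknown categories up front instead of silently rewriting them."""
    allowed = get_args(Category)
    if category not in allowed:
        raise ValueError(f"category must be one of {allowed}, got {category!r}")
    return category


async def poll_until_terminal(
    get_status: Callable[[], Awaitable[str]], *, timeout: float, interval: float
) -> str:
    """Poll get_status until it returns a terminal status or the timeout elapses."""
    loop = asyncio.get_running_loop()  # valid only inside a running coroutine
    deadline = loop.time() + timeout
    while True:
        status = await get_status()
        if status in TERMINAL_STATUSES:
            return status
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        # Never sleep past the deadline.
        await asyncio.sleep(min(interval, remaining))
```

Capping the sleep by the remaining time keeps the total wait close to `timeout` even when `interval` is large relative to it.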

3 participants