Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) #6101
Draft
alliscode wants to merge 7 commits into
Pull request overview
Adds an additive Python integration for Foundry Adaptive Evals rubric-generation, extending the existing Foundry evals surface with typed rubric references, rubric score parsing, CI-friendly assertions, source export helpers (agent/workflow → generation sources), a YAML config loader, and an end-to-end sample.
Changes:
- Introduces new experimental core evaluation types (`GeneratedEvaluatorRef`, `RubricDimension`, `RubricScore`, `EvalGenerationSource`) plus agent/workflow "export to eval source" helpers.
- Extends `FoundryEvals` to accept generated rubric evaluators, generate new rubrics via Foundry LROs, and parse per-dimension rubric scores into results.
- Adds YAML-based evaluator config loading + sample, plus tests covering the new behaviors.
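To make the per-dimension parsing concrete, here is a minimal sketch of pulling rubric scores out of an output item. The PR description only names the path `output_item.results[*].properties.rubric_scores`; the entry keys (`dimension`, `score`, `reason`), the `RubricScoreSketch` type, and the function name are assumptions for illustration and may not match the PR's actual `RubricScore` or `_extract_rubric_scores`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class RubricScoreSketch:
    """Hypothetical stand-in for the PR's RubricScore type."""

    dimension: str
    score: float
    reason: Optional[str] = None


def extract_rubric_scores(output_item: dict[str, Any]) -> list[RubricScoreSketch]:
    """Collect scores found under results[*].properties.rubric_scores.

    The entry keys used here are assumed, not taken from the Foundry API.
    """
    scores: list[RubricScoreSketch] = []
    for result in output_item.get("results") or []:
        properties = result.get("properties") or {}
        for entry in properties.get("rubric_scores") or []:
            # Skip malformed entries instead of failing the whole run.
            if not isinstance(entry, dict):
                continue
            if "dimension" not in entry or "score" not in entry:
                continue
            try:
                value = float(entry["score"])
            except (TypeError, ValueError):
                continue
            scores.append(
                RubricScoreSketch(str(entry["dimension"]), value, entry.get("reason"))
            )
    return scores
```

Results that carry no `properties` or no `rubric_scores` simply contribute nothing, which keeps built-in evaluators and generated rubric evaluators on the same code path.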
Show a summary per file
| File | Description |
|---|---|
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluators.yaml | Sample YAML config describing a generated rubric evaluator. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_with_generated_rubric_sample.py | End-to-end sample that generates a rubric, runs evals, and asserts quality gates. |
| python/packages/foundry/tests/test_foundry_evals.py | Adds unit tests for generated rubric refs, filtering behavior, rubric score extraction, and rubric-generation orchestration. |
| python/packages/foundry/tests/test_evals_config.py | New tests for YAML-driven rubric-generation config parsing and source building. |
| python/packages/foundry/agent_framework_foundry/_foundry_evals.py | Implements generated rubric support in criteria building, rubric score extraction, and generate_rubric() LRO polling + conversion to GeneratedEvaluatorRef. |
| python/packages/foundry/agent_framework_foundry/_evals_config.py | New YAML config schema + loader + source builder for rubric generation. |
| `python/packages/foundry/agent_framework_foundry/__init__.py` | Exposes the new evals-config loader/schemas at the foundry package surface. |
| python/packages/core/tests/core/test_local_eval.py | Adds tests for new rubric assertion helpers and agent/workflow eval-source export helpers. |
| python/packages/core/agent_framework/_workflows/_workflow.py | Adds Workflow.as_eval_source() convenience wrapper. |
| python/packages/core/agent_framework/_evaluation.py | Adds new rubric-related types, result fields, assertion helpers, and eval-source export helpers. |
| python/packages/core/agent_framework/_agents.py | Adds BaseAgent.as_eval_source() convenience wrapper. |
| `python/packages/core/agent_framework/__init__.py` | Re-exports new evaluation types/helpers at top-level. |
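The assertion helpers listed for `_evaluation.py` are meant for CI gating. The sketch below shows one way a per-dimension gate could behave, including the `require_applicable` semantics that this PR's review-fix commit message describes. The `ItemScore` type, the function name, and the signature are hypothetical and are not the PR's actual `assert_dimension_score_at_least`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ItemScore:
    """Hypothetical per-item result: evaluator name plus per-dimension scores."""

    evaluator: str
    dimensions: dict[str, float] = field(default_factory=dict)


def assert_dimension_at_least_sketch(
    items: list[ItemScore],
    dimension: str,
    threshold: float,
    *,
    evaluator: Optional[str] = None,
    require_applicable: bool = True,
) -> None:
    """Raise AssertionError if any matching score is below the threshold."""
    found_any = False
    for item in items:
        # Optionally restrict the check to one named evaluator.
        if evaluator is not None and item.evaluator != evaluator:
            continue
        if dimension not in item.dimensions:
            continue
        found_any = True
        score = item.dimensions[dimension]
        if score < threshold:
            raise AssertionError(
                f"{item.evaluator}: {dimension}={score} is below {threshold}"
            )
    # Raise when nothing matched, even if an evaluator name was given.
    if require_applicable and not found_any:
        raise AssertionError(f"no scores found for dimension {dimension!r}")
```

A CI job could call this once per gated dimension after the eval run; passing `require_applicable=False` lets a gate pass when a rubric happens not to include that dimension.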
Copilot's findings
- Files reviewed: 12/12 changed files
- Comments generated: 4
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…+ accept in evaluators= Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n helpers Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…elper Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses 4 Copilot review comments on PR microsoft#6101:
1. assert_dimension_score_at_least: drop the (not evaluator or found_any) guard so require_applicable=True correctly raises when the named evaluator produces no entries for the dimension. Adds TestRubricAssertions covering the regression.
2. GeneratedEvaluatorRef docstring: reword to describe actual behaviour (pinning recommended, not required) so it matches the dataclass default and FoundryEvals warning path.
3. _poll_generation_job: switch from asyncio.get_event_loop() to get_running_loop() and bound the per-iteration sleep by remaining time, matching _poll_eval_run.
4. generate_rubric: type category as Literal['quality','safety'] and validate at the entry point with a ValueError; drop the silent 'invalid -> quality' rewrite in _generation_job_to_ref. Adds a regression test.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
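The polling change in item 3 can be illustrated with a small, self-contained sketch. It is not the PR's `_poll_generation_job`: the function name, the `fetch_status` callback, and the set of terminal status strings are assumptions. Only the two techniques named in the commit message are shown, namely `asyncio.get_running_loop()` and a sleep bounded by the time remaining before the deadline.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed terminal states; the real service may use different names.
TERMINAL_STATUSES = frozenset({"succeeded", "failed", "canceled"})


async def poll_until_terminal(
    fetch_status: Callable[[], Awaitable[str]],
    *,
    timeout: float,
    interval: float,
) -> str:
    """Poll fetch_status until it reports a terminal status or time runs out."""
    # get_running_loop() fails fast if called outside a running event loop.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        # Never sleep past the deadline, so the timeout is honoured closely.
        await asyncio.sleep(min(interval, remaining))
```

Bounding the sleep matters when `interval` is large relative to `timeout`: without it, the final iteration could overshoot the deadline by up to one full interval.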
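Item 4 of the review-fix commit replaces a silent fallback with validation at the entry point. A minimal sketch of that pattern follows; the `Category` alias and `validate_category` helper are hypothetical names, and the PR's actual `generate_rubric` signature is not reproduced here.

```python
from typing import Literal, get_args

# The two category values named in the commit message.
Category = Literal["quality", "safety"]
_VALID_CATEGORIES = get_args(Category)


def validate_category(category: str) -> str:
    """Return the category unchanged, or raise ValueError if it is unknown."""
    if category not in _VALID_CATEGORIES:
        allowed = ", ".join(repr(c) for c in _VALID_CATEGORIES)
        raise ValueError(f"category must be one of {allowed}; got {category!r}")
    return category
```

Deriving the allowed values from the `Literal` with `get_args` keeps the static type and the runtime check in sync, so adding a category later needs only one edit.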
Motivation and Context
Integrates Foundry Adaptive Evals (rubric-generation) into Agent Framework's Python eval surface, strictly additively on top of the existing `FoundryEvals` integration (ADR 0023). Adds:
- `GeneratedEvaluatorRef`, `RubricDimension`, `RubricScore`, `EvalGenerationSource` (all `@experimental(EVALS)`).
- `FoundryEvals` accepts `GeneratedEvaluatorRef` mixed into the existing `evaluators=` sequence and emits the correct `azure_ai_evaluator` testing-criteria.
- Parses `output_item.results[*].properties.rubric_scores` into `EvalScoreResult.dimensions`.
- Assertion helpers on `EvalResults` (`assert_score_at_least`, `assert_dimension_score_at_least`, `assert_no_failed_items`) for CI gating.
- `BaseAgent.as_eval_source()` / `Workflow.as_eval_source()` to package the richest source available (instructions, tool defs, context-provider classes, topology) for rubric generation, with conservative privacy defaults.
- `FoundryEvals.generate_rubric(...)`: orchestrates the `beta.evaluators.create_generation_job` LRO, polls to terminal status, returns a pinned `GeneratedEvaluatorRef`.
- YAML config loader (`load_evaluators_from_yaml`) + end-to-end sample under `python/samples/05-end-to-end/evaluation/foundry_evals/`.

The testing-criterion side is unchanged on the wire: we already emit `azure_ai_evaluator` for `builtin.*` names, so the new path just supplies a custom `evaluator_name` + pinned `evaluator_version`.

The `generate_rubric` helper gracefully degrades when the installed `azure-ai-projects` version pre-dates the rubric generation APIs (raises a clear `NotImplementedError` with install guidance).

Description
6 commits, one per logical phase. Reviewers can step through them in order:
1. `feat(evals): GeneratedEvaluatorRef + RubricDimension/RubricScore types`: core only, no Foundry coupling.
2. `feat(foundry-evals): accept GeneratedEvaluatorRef in evaluators=`: wires phase 1 into `_build_testing_criteria` and preserves refs through `_filter_tool_evaluators`.
3. `feat(evals): parse rubric_scores from output items + assertion helpers`: adds `EvalScoreResult.dimensions`, the three `assert_*` helpers, and `_extract_rubric_scores` in `_foundry_evals.py`.
4. `feat(evals): agent.as_eval_source / workflow.as_eval_source`: new `EvalGenerationSource` core type plus `BaseAgent.as_eval_source(...)` / `Workflow.as_eval_source(...)` source-export helpers.
5. `feat(foundry-evals): generate_rubric helper`: LRO orchestrator. Imports `azure.ai.projects.models.EvaluatorGenerationInputs` etc. lazily; raises `NotImplementedError` with install guidance when unavailable.
6. `feat(foundry-evals): YAML config loader + sample`: `load_evaluators_from_yaml` and `evaluate_with_generated_rubric_sample.py`.

.NET parity
Not in this PR. Planned as a follow-up 6-commit stack against `Microsoft.Agents.AI` + `Microsoft.Agents.AI.Foundry`, unblocked today (the rubric APIs are in `Azure.AI.Projects 2.1.0-beta.2`, already pinned).

Contribution Checklist