
Python: Use AI Foundry evaluators for self-reflection#2250

Merged
YusakuNo1 merged 21 commits into main from users/daviwu/self_reflection
Nov 19, 2025

Conversation

@YusakuNo1
Contributor

@YusakuNo1 YusakuNo1 commented Nov 15, 2025

Motivation and Context

Description

Demonstrates how to implement self-reflection with the agent framework.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Copilot AI review requested due to automatic review settings November 15, 2025 17:04
@markwallace-microsoft markwallace-microsoft added documentation Improvements or additions to documentation python labels Nov 15, 2025
@github-actions github-actions bot changed the title Use AI Foundry evaluators for self-reflection Python: Use AI Foundry evaluators for self-reflection Nov 15, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds a new Python sample demonstrating self-reflection capabilities for LLM responses using AI Foundry's groundedness evaluators. The sample shows how to iteratively improve LLM responses by evaluating them and providing feedback for refinement.

  • New self-reflection sample using groundedness evaluation
  • Batch processing capability for evaluating multiple prompts
  • Integration with Azure OpenAI and AI Foundry evaluators
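The improve-and-evaluate loop the overview describes can be sketched generically. This is a minimal illustration, not the sample's actual code: `generate` and `evaluate` are hypothetical stand-ins for the Azure OpenAI chat call and the AI Foundry groundedness evaluator, and the score threshold, scale, and feedback prompt wording are assumptions.

```python
# Minimal sketch of a self-reflection loop: generate a response, score it,
# and feed the score back into the prompt until it clears a threshold.
from typing import Callable


def self_reflect(
    prompt: str,
    generate: Callable[[str], str],       # stand-in for the LLM call
    evaluate: Callable[[str, str], float],  # stand-in for the groundedness evaluator
    threshold: float = 4.0,               # assumed 1-5 groundedness scale
    max_iterations: int = 3,
) -> tuple[str, float]:
    """Iteratively regenerate a response until it scores at or above threshold."""
    best_response, best_score = "", float("-inf")
    current_prompt = prompt
    for _ in range(max_iterations):
        response = generate(current_prompt)
        score = evaluate(prompt, response)
        if score > best_score:
            best_response, best_score = response, score
        if score >= threshold:
            break
        # Fold the evaluator's verdict back in so the next attempt can improve.
        current_prompt = (
            f"{prompt}\n\nYour previous answer scored {score:.1f}/5 on "
            f"groundedness. Previous answer: {response}\nPlease improve it."
        )
    return best_response, best_score
```

On each pass the evaluator's score is folded back into the prompt, so later generations can correct the grounding issues the evaluator flagged while the best-scoring response so far is retained.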

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

File Description

  • python/samples/getting_started/observability/self_reflection.py: New sample implementing a self-reflection loop with groundedness evaluation for LLM responses
  • python/samples/getting_started/observability/resources/suboptimal_groundedness_prompts.parquet: Test data file containing prompts for self-reflection evaluation
  • python/samples/getting_started/observability/.env.example: Added Azure OpenAI configuration variables for the self-reflection sample
  • python/samples/README.md: Added reference to the new self-reflection sample
Comments suppressed due to low confidence (2)

python/samples/getting_started/observability/.env.example:21

  • The model name "gpt-4.1" in the .env.example file doesn't appear to be a valid Azure OpenAI model deployment name. This should be updated to a valid model name like "gpt-4", "gpt-4o", or "gpt-4-turbo".

python/samples/getting_started/observability/self_reflection.py:161

  • This statement is unreachable.
        best_response = raw_response.choices[0].message.content

@YusakuNo1 YusakuNo1 enabled auto-merge November 16, 2025 03:11
@YusakuNo1 YusakuNo1 disabled auto-merge November 16, 2025 04:23
@YusakuNo1 YusakuNo1 enabled auto-merge November 16, 2025 04:38
Member

@eavanvalkenburg eavanvalkenburg left a comment


Love the idea of this, but could you have a look at our ChatMiddleware and implement it using that? That way it becomes a much more native way of working, and if you put the actual middleware function inside a class, then all the setup, best_score, etc. can be captured and maintained. This sample shows class-based middleware; the same works for ChatMiddleware, but this could also be implemented with AgentMiddleware: https://github.com/microsoft/agent-framework/blob/main/python/samples/getting_started/middleware/class_based_middleware.py
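The class-based middleware pattern suggested in this comment can be illustrated generically. This sketch does not use the real agent-framework ChatMiddleware API; the class name and the `__call__` signature are hypothetical, showing only how instance attributes can capture setup and running state such as `best_score` across calls.

```python
# Generic sketch of class-based middleware: the instance holds configuration
# and mutable state, and __call__ wraps the inner handler.
from typing import Callable


class SelfReflectionMiddleware:
    """Intercepts a chat call and tracks the best evaluation score seen."""

    def __init__(self, threshold: float = 4.0):
        self.threshold = threshold
        self.best_score = float("-inf")  # state survives across calls

    def __call__(self, request: str, next_handler: Callable[[str], str]) -> str:
        # Run the wrapped handler, then score and record its output.
        response = next_handler(request)
        score = self._evaluate(request, response)
        if score > self.best_score:
            self.best_score = score
        return response

    def _evaluate(self, request: str, response: str) -> float:
        # Stand-in for a real evaluator; returns 1.0 for any non-empty response.
        return float(len(response) > 0)
```

Because the state lives on the instance, one middleware object can accumulate results over many chat invocations, which is the "captured and maintained" property the comment describes.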

@YusakuNo1
Contributor Author

Love the idea of this, but could you have a look at our ChatMiddleware and implement it using that? That way it becomes a much more native way of working, and if you put the actual middleware function inside a class, then all the setup, best_score, etc. can be captured and maintained. This sample shows class-based middleware; the same works for ChatMiddleware, but this could also be implemented with AgentMiddleware: https://github.com/microsoft/agent-framework/blob/main/python/samples/getting_started/middleware/class_based_middleware.py

Hi @eavanvalkenburg, thanks for the suggestion! This approach is not implemented as middleware because it's slow and consumes a lot of tokens; we would like AI developers to try it as an optional feature rather than putting it in the core path. What do you think?

@YusakuNo1
Contributor Author

Love the idea of this, but could you have a look at our ChatMiddleware and implement it using that? That way it becomes a much more native way of working, and if you put the actual middleware function inside a class, then all the setup, best_score, etc. can be captured and maintained. This sample shows class-based middleware; the same works for ChatMiddleware, but this could also be implemented with AgentMiddleware: https://github.com/microsoft/agent-framework/blob/main/python/samples/getting_started/middleware/class_based_middleware.py

Within the team, we discussed the middleware approach offline. Middleware is meant to intercept traffic and do something extra, but in our current use case the self-reflection code also modifies the original user input and then runs again, so it may not be a good fit for middleware. Also, since this is an observability use case, maybe we can keep it simple for the user for now. What do you think?

Contributor

@TaoChenOSU TaoChenOSU left a comment


I probably wouldn't put this sample in the observability samples because it does not have anything to do with observability.

Maybe a new evaluation folder?

@YusakuNo1
Contributor Author

I probably wouldn't put this sample in the observability samples because it does not have anything to do with observability.

Maybe a new evaluation folder?

By our organization's definition, "observability" includes both tracing and evaluation... I think we can host the sample under observability for now?

@TaoChenOSU
Contributor

I probably wouldn't put this sample in the observability samples because it does not have anything to do with observability.
Maybe a new evaluation folder?

By our organization's definition, "observability" includes both tracing and evaluation... I think we can host the sample under observability for now?

What do you think @eavanvalkenburg? If we want to keep evaluation under observability, could we change the name of the sample to evaluation_with_self_reflection?

Member

@eavanvalkenburg eavanvalkenburg left a comment


Some small notes, but looks good overall

YusakuNo1 and others added 2 commits November 19, 2025 08:57
…luation/self_reflection.py

Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>
@YusakuNo1 YusakuNo1 added this pull request to the merge queue Nov 19, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 19, 2025
@YusakuNo1 YusakuNo1 added this pull request to the merge queue Nov 19, 2025
Merged via the queue into main with commit b3e96b8 Nov 19, 2025
23 checks passed
@YusakuNo1 YusakuNo1 deleted the users/daviwu/self_reflection branch November 19, 2025 18:47
arisng pushed a commit to arisng/agent-framework that referenced this pull request Feb 2, 2026
* First working version

* Simplify the implementations

* Remove unused env var

* Update Python syntax

* Address feedbacks

* Fix a typo

* Update names as review suggestions

* Citation for self-reflection

* Move to independent folder

* Update python/samples/getting_started/evaluation/azure_ai_foundry/evaluation/README.md

Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

* Updated from parquet to JSONL and hide the default environment variables

* As review feedback, remove the purpose of using `run_self_reflection_batch` as a library, only use it as sample code

* Update python/samples/getting_started/evaluation/azure_ai_foundry/evaluation/self_reflection.py

Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

---------

Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

Labels

documentation (Improvements or additions to documentation), python

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants