1 change: 1 addition & 0 deletions python/samples/README.md
@@ -187,6 +187,7 @@ This directory contains samples demonstrating the capabilities of Microsoft Agen
|------|-------------|
| [`getting_started/evaluation/red_teaming/red_team_agent_sample.py`](./getting_started/evaluation/red_teaming/red_team_agent_sample.py) | Red team agent evaluation sample for Azure AI Foundry |
| [`getting_started/evaluation/self_reflection/self_reflection.py`](./getting_started/evaluation/self_reflection/self_reflection.py) | LLM self-reflection with AI Foundry graders example |
| [`demos/workflow_evaluation/run_evaluation.py`](./demos/workflow_evaluation/run_evaluation.py) | Multi-agent workflow evaluation demo with travel planning agents evaluated using Azure AI Foundry evaluators |

## MCP (Model Context Protocol)

2 changes: 2 additions & 0 deletions python/samples/demos/workflow_evaluation/.env.example
@@ -0,0 +1,2 @@
AZURE_AI_PROJECT_ENDPOINT="<your-project-endpoint>"
AZURE_AI_MODEL_DEPLOYMENT_NAME="<your-model-deployment>"
30 changes: 30 additions & 0 deletions python/samples/demos/workflow_evaluation/README.md
@@ -0,0 +1,30 @@
# Multi-Agent Travel Planning Workflow Evaluation

This sample demonstrates how to evaluate a multi-agent workflow using Azure AI's built-in evaluators. The workflow processes travel planning requests through seven specialized agents in a fan-out/fan-in pattern: a travel request handler, hotel/flight/activity search agents, a booking aggregator, booking confirmation, and payment processing.

## Evaluation Metrics

The evaluation uses four Azure AI built-in evaluators:

- **Relevance** - How well responses address the user query
- **Groundedness** - Whether responses are grounded in available context
- **Tool Call Accuracy** - Correct tool selection and parameter usage
- **Tool Output Utilization** - Effective use of tool outputs in responses

## Setup

Create a `.env` file based on the `.env.example` file in this folder.
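As a rough sketch of how a script might pick up this configuration (the variable names come from `.env.example`; the loading helper itself is an assumption, and the actual sample may instead use `python-dotenv` or a settings library):

```python
import os

def load_config() -> dict:
    """Read the two required settings from the environment.

    Fails with a clear error when either variable is missing, since the
    evaluation cannot run without a project endpoint and a model deployment.
    """
    required = ["AZURE_AI_PROJECT_ENDPOINT", "AZURE_AI_MODEL_DEPLOYMENT_NAME"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {name: os.environ[name] for name in required}
```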

## Running the Evaluation

Execute the complete workflow and evaluation:

```bash
python run_evaluation.py
```

The script will:
1. Execute the multi-agent travel planning workflow
2. Display a response summary for each agent
3. Create and run an evaluation of the hotel, flight, and activity search agents
4. Monitor progress and display the evaluation report URL
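The per-agent reporting in the steps above could be summarized along these lines. This is a sketch only: the 1-5 score scale with a pass mark of 3 follows the usual Azure AI evaluator convention, but the helper, metric names, and data shapes here are hypothetical and not the actual script's API:

```python
from statistics import mean

PASS_THRESHOLD = 3.0  # assumed pass mark on the 1-5 evaluator scale

def summarize(results: dict[str, dict[str, float]]) -> dict[str, str]:
    """Collapse per-metric scores into a pass/fail verdict per agent.

    An agent passes only if every metric meets the threshold; the mean
    score is included for quick comparison across agents.
    """
    summary = {}
    for agent, scores in results.items():
        avg = mean(scores.values())
        verdict = "pass" if all(s >= PASS_THRESHOLD for s in scores.values()) else "fail"
        summary[agent] = f"{verdict} (mean {avg:.1f})"
    return summary

# Hypothetical scores for the three search agents the script evaluates.
example = {
    "hotel_search":    {"relevance": 4.0, "groundedness": 5.0, "tool_call_accuracy": 4.0},
    "flight_search":   {"relevance": 5.0, "groundedness": 4.0, "tool_call_accuracy": 5.0},
    "activity_search": {"relevance": 3.0, "groundedness": 2.0, "tool_call_accuracy": 4.0},
}
print(summarize(example))
```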