Add AML agent evaluation script #60

Merged

fcogidi merged 4 commits into main from fco/add_aml_eval_script on Feb 18, 2026

Conversation

@fcogidi (Collaborator) commented Feb 18, 2026

Summary

Add evaluation script for the AML investigation agent.

Clickup Ticket(s): N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Update agent's system prompt.
  • Add timeout to agent runs.
  • Update description of flagged_transaction_ids to minimize confusion between list and string output.
  • Remove unused run_evaluations from TraceEvalResult.
  • Add evaluation script and rubric.
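One change above is adding a timeout to agent runs. As a minimal sketch of that idea (not the PR's actual implementation, which uses HttpOptions on the agent configuration), each run can be bounded with `asyncio.wait_for` so a stuck investigation cannot stall the whole evaluation; all names here are hypothetical:

```python
import asyncio


async def run_agent(case_id: str, delay_s: float) -> str:
    """Stand-in for the real AML agent invocation (hypothetical)."""
    await asyncio.sleep(delay_s)  # simulates model/tool-call latency
    return f"report for {case_id}"


async def run_with_timeout(case_id: str, delay_s: float, timeout_s: float) -> str:
    """Bound a single agent run; return a sentinel instead of hanging forever."""
    try:
        return await asyncio.wait_for(run_agent(case_id, delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return f"TIMEOUT: {case_id}"


fast = asyncio.run(run_with_timeout("case-1", delay_s=0.01, timeout_s=1.0))
slow = asyncio.run(run_with_timeout("case-2", delay_s=1.0, timeout_s=0.05))
```

Here `fast` completes normally while `slow` is cut off by the timeout, which is the behavior an evaluation harness needs when a single case misbehaves.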

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:
N/A

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

N/A

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

@fcogidi fcogidi self-assigned this Feb 18, 2026
@fcogidi fcogidi added enhancement New feature or request refactor Refactor or clean up code structure labels Feb 18, 2026
Copilot AI left a comment


Pull request overview

This PR adds a comprehensive evaluation framework for the AML investigation agent, including a new evaluation script, rubric for assessing narrative quality, system prompt improvements, and timeout configuration.

Changes:

  • Added evaluation script (evaluate.py) with CLI interface for running experiments with item-level, trace-level, and run-level evaluators
  • Enhanced agent system prompt with clearer investigation workflow, detailed typology definitions, and strategic query guidance
  • Added timeout support to agent configuration using HttpOptions
  • Created narrative quality rubric for LLM-based evaluation of investigation reasoning
  • Updated field descriptions for clarity and removed unused run_evaluations field from TraceEvalResult
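The last bullet removes the unused `run_evaluations` field from `TraceEvalResult`. A minimal sketch of the resulting shape, with field names other than `run_evaluations` assumed for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class TraceEvalResult:
    """Hypothetical shape after the PR; only run_evaluations' removal is from the source."""

    trace_id: str
    scores: dict[str, float] = field(default_factory=dict)
    # run_evaluations was dropped because nothing consumed it downstream.


result = TraceEvalResult(trace_id="t-123", scores={"narrative_quality": 0.8})
```

Keeping dead fields out of result dataclasses avoids implying to callers that a value is populated when it never is.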

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • implementations/aml_investigation/evaluate.py — New evaluation script with comprehensive CLI options for dataset upload, agent evaluation, and results display
  • implementations/aml_investigation/rubrics/narrative_pattern_quality.md — New rubric defining scoring criteria for narrative quality and pattern description assessments
  • aieng-eval-agents/aieng/agent_evals/aml_investigation/agent.py — Updated system prompt with enhanced workflow guidance, added timeout parameter and HttpOptions integration
  • aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py — Updated flagged_transaction_ids field description for clarity (contains a minor grammar error)
  • aieng-eval-agents/aieng/agent_evals/evaluation/types.py — Removed unused run_evaluations field from TraceEvalResult dataclass
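
The per-file summary describes evaluate.py as exposing CLI options for dataset upload, agent evaluation, and results display. A sketch of such a CLI surface using `argparse`; every flag name and default below is an assumption, not the script's actual interface:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI surface suggested by the review summary; flag names are hypothetical."""
    parser = argparse.ArgumentParser(description="Evaluate the AML investigation agent")
    parser.add_argument("--upload-dataset", action="store_true",
                        help="Upload the case dataset before running the experiment")
    parser.add_argument("--experiment-name", default="aml-eval",
                        help="Name used to group item-, trace-, and run-level results")
    parser.add_argument("--timeout", type=float, default=300.0,
                        help="Per-run timeout in seconds for agent invocations")
    parser.add_argument("--show-results", action="store_true",
                        help="Print a results summary after evaluation")
    return parser


# Example invocation with explicit argv, so the sketch runs outside a shell.
args = build_parser().parse_args(["--experiment-name", "smoke", "--timeout", "60"])
```

Keeping dataset upload and results display behind separate boolean flags lets the same script serve one-off experiments and repeated smoke runs.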


@fcogidi fcogidi merged commit 96e8f59 into main Feb 18, 2026
3 checks passed
@fcogidi fcogidi deleted the fco/add_aml_eval_script branch February 18, 2026 18:03