Add AML agent evaluation script #60

Merged

fcogidi merged 4 commits into main from fco/add_aml_eval_script on Feb 18, 2026

Conversation

@fcogidi (Collaborator) commented Feb 18, 2026

Summary

Add evaluation script for the AML investigation agent.

Clickup Ticket(s): N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Update agent's system prompt.
  • Add timeout to agent runs.
  • Update description of flagged_transaction_ids to minimize confusion between list and string output.
  • Remove unused run_evaluations from TraceEvalResult.
  • Add evaluation script and rubric.
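One change above is adding a timeout to agent runs. As a minimal sketch of that idea (not the PR's actual implementation, which uses HttpOptions on the agent configuration), each run can be bounded with `asyncio.wait_for` so a stuck investigation cannot stall the whole evaluation; all names here are hypothetical:

```python
import asyncio


async def run_agent(case_id: str, delay_s: float) -> str:
    """Stand-in for the real AML agent invocation (hypothetical)."""
    await asyncio.sleep(delay_s)  # simulates model/tool-call latency
    return f"report for {case_id}"


async def run_with_timeout(case_id: str, delay_s: float, timeout_s: float) -> str:
    """Bound a single agent run; return a sentinel instead of hanging forever."""
    try:
        return await asyncio.wait_for(run_agent(case_id, delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return f"TIMEOUT: {case_id}"


fast = asyncio.run(run_with_timeout("case-1", delay_s=0.01, timeout_s=1.0))
slow = asyncio.run(run_with_timeout("case-2", delay_s=1.0, timeout_s=0.05))
```

Here `fast` completes normally while `slow` is cut off by the timeout, which is the behavior an evaluation harness needs when a single case misbehaves.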

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:
N/A

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

N/A

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

@fcogidi fcogidi self-assigned this Feb 18, 2026
@fcogidi fcogidi added enhancement New feature or request refactor Refactor or clean up code structure labels Feb 18, 2026
Copilot AI left a comment


Pull request overview

This PR adds a comprehensive evaluation framework for the AML investigation agent, including a new evaluation script, rubric for assessing narrative quality, system prompt improvements, and timeout configuration.

Changes:

  • Added evaluation script (evaluate.py) with CLI interface for running experiments with item-level, trace-level, and run-level evaluators
  • Enhanced agent system prompt with clearer investigation workflow, detailed typology definitions, and strategic query guidance
  • Added timeout support to agent configuration using HttpOptions
  • Created narrative quality rubric for LLM-based evaluation of investigation reasoning
  • Updated field descriptions for clarity and removed unused run_evaluations field from TraceEvalResult
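The last bullet removes the unused `run_evaluations` field from `TraceEvalResult`. A minimal sketch of the resulting shape, with field names other than `run_evaluations` assumed for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class TraceEvalResult:
    """Hypothetical shape after the PR; only run_evaluations' removal is from the source."""

    trace_id: str
    scores: dict[str, float] = field(default_factory=dict)
    # run_evaluations was dropped because nothing consumed it downstream.


result = TraceEvalResult(trace_id="t-123", scores={"narrative_quality": 0.8})
```

Keeping dead fields out of result dataclasses avoids implying to callers that a value is populated when it never is.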

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • implementations/aml_investigation/evaluate.py — New evaluation script with comprehensive CLI options for dataset upload, agent evaluation, and results display
  • implementations/aml_investigation/rubrics/narrative_pattern_quality.md — New rubric defining scoring criteria for narrative quality and pattern description assessments
  • aieng-eval-agents/aieng/agent_evals/aml_investigation/agent.py — Updated system prompt with enhanced workflow guidance, added timeout parameter and HttpOptions integration
  • aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py — Updated flagged_transaction_ids field description for clarity (contains a minor grammar error)
  • aieng-eval-agents/aieng/agent_evals/evaluation/types.py — Removed unused run_evaluations field from TraceEvalResult dataclass
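
The per-file summary describes evaluate.py as exposing CLI options for dataset upload, agent evaluation, and results display. A sketch of such a CLI surface using `argparse`; every flag name and default below is an assumption, not the script's actual interface:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI surface suggested by the review summary; flag names are hypothetical."""
    parser = argparse.ArgumentParser(description="Evaluate the AML investigation agent")
    parser.add_argument("--upload-dataset", action="store_true",
                        help="Upload the case dataset before running the experiment")
    parser.add_argument("--experiment-name", default="aml-eval",
                        help="Name used to group item-, trace-, and run-level results")
    parser.add_argument("--timeout", type=float, default=300.0,
                        help="Per-run timeout in seconds for agent invocations")
    parser.add_argument("--show-results", action="store_true",
                        help="Print a results summary after evaluation")
    return parser


# Example invocation with explicit argv, so the sketch runs outside a shell.
args = build_parser().parse_args(["--experiment-name", "smoke", "--timeout", "60"])
```

Keeping dataset upload and results display behind separate boolean flags lets the same script serve one-off experiments and repeated smoke runs.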


@fcogidi fcogidi merged commit 96e8f59 into main Feb 18, 2026
3 checks passed
@fcogidi fcogidi deleted the fco/add_aml_eval_script branch February 18, 2026 18:03