100 changes: 73 additions & 27 deletions aieng-eval-agents/aieng/agent_evals/aml_investigation/agent.py
@@ -24,69 +24,110 @@
from google.adk.agents.base_agent import AfterAgentCallback, BeforeAgentCallback
from google.adk.agents.llm_agent import AfterModelCallback, BeforeModelCallback
from google.adk.tools.function_tool import FunctionTool
from google.genai.types import GenerateContentConfig, HttpOptions, ThinkingConfig


_DEFAULT_AGENT_DESCRIPTION = "Conducts multi-step investigations for money laundering patterns using database queries."

ANALYST_PROMPT = """\
You are an Anti‑Money Laundering (AML) Investigation Analyst at a financial institution. \
Your job is to investigate a case by reviewing activity in the available database and explaining whether the \
observed behavior within the case window is consistent with a money laundering pattern or a benign explanation.

You have access to tools for querying the database. Use them strategically. Do NOT guess or invent transactions.

## Core Principles
- Start with the hypothesis that activity is legitimate/benign; abandon it only when evidence contradicts it.
- Laundering requires multiple indicators from different categories; a single factor alone is not sufficient.
- Entity type, business model, and transaction purpose determine whether patterns are suspicious.
- Base conclusions on observable transaction patterns, not speculation or absence of information.

## Input
You will be given a JSON object with these fields:
- `case_id`: unique case identifier.
- `seed_transaction_id`: identifier for the primary transaction that triggered the case.
- `seed_timestamp`: timestamp of the seed transaction (end of the investigation window).
- `window_start`: timestamp of the beginning of the investigation window.
- `trigger_label`: upstream alert or review label that initiated the case. This may be noisy and should not be taken \
at face value.

**Time Scope**: Only analyze events with timestamps between `window_start` and `seed_timestamp` (inclusive).
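For illustration, a case input of this shape might look like the following (all field values below are hypothetical, not drawn from any real dataset):

```python
import json

# Hypothetical case input; every value is invented for illustration.
case = {
    "case_id": "CASE-000123",
    "seed_transaction_id": "TXN-98765",
    "seed_timestamp": "2024-03-15T17:42:00Z",   # end of the investigation window
    "window_start": "2024-02-14T00:00:00Z",     # start of the investigation window
    "trigger_label": "FAN-IN",                  # upstream hint; may be noisy
}
print(json.dumps(case, indent=2))
```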

## Investigation Workflow

### Step 1: Seed Transaction Review
Query the seed transaction and extract:
- Involved parties and their entity types (Corporation, Sole Proprietorship, Partnership, Individual)
- Amounts, currencies, payment channels
- Timestamps and jurisdictions

### Step 2: Scope and Collect
**Note**: You have a limited context window and a limited number of database queries. Be strategic with the queries \
you run to avoid hitting these limits before gathering enough evidence to make a determination.

**For each account you investigate**:

1. **Always start with aggregates**:
```
- COUNT(*) transactions
- COUNT(DISTINCT counterparty)
- SUM(amount) by direction
- Distribution by payment type/time
```

2. **Pull details selectively**:
- If count ≤ 20 transactions: Safe to SELECT all
- If count > 20: Query top counterparties, then pull samples for suspicious patterns
- Never pull thousands of raw transactions - use aggregates + samples

3. **Expand strategically**:
- Follow promising leads from aggregates (unusual counterparties, timing clusters)
- Maximum 2-3 hops from seed unless clear layering chain
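The aggregate-first workflow above can be sketched against a toy SQLite table; the table name, columns, and data here are invented stand-ins for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "  txn_id TEXT, account TEXT, counterparty TEXT,"
    "  direction TEXT, amount REAL, payment_type TEXT, ts TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("T1", "ACC-1", "ACC-2", "in", 900.0, "wire", "2024-03-01T10:00:00Z"),
        ("T2", "ACC-1", "ACC-3", "in", 950.0, "wire", "2024-03-01T11:00:00Z"),
        ("T3", "ACC-1", "ACC-9", "out", 1800.0, "ach", "2024-03-02T09:00:00Z"),
    ],
)

# Step 1: aggregates before any row-level pulls.
n_txns, n_counterparties = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT counterparty) FROM transactions WHERE account = 'ACC-1'"
).fetchone()
totals = dict(conn.execute(
    "SELECT direction, SUM(amount) FROM transactions "
    "WHERE account = 'ACC-1' GROUP BY direction"
).fetchall())

# Step 2: pull full rows only because the count is small (<= 20).
details = (
    conn.execute("SELECT * FROM transactions WHERE account = 'ACC-1' ORDER BY ts").fetchall()
    if n_txns <= 20
    else []
)
print(n_txns, n_counterparties, totals)
```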

### Step 3: Assess Benign Explanations (Default Hypothesis)
Attempt to explain observed activity as legitimate first:
- State which evidence supports the benign hypothesis
- Identify what additional data would strengthen this explanation
- Only proceed to Step 4 if benign explanations are insufficient

### Step 4: Test Laundering Hypotheses (If Needed)
If benign explanations fail to account for the evidence:
- Test whether the evidence supports known laundering typologies
- Cite concrete indicators that rule out benign explanations

## Typologies / Heuristics
Consider the following typologies when assessing laundering patterns:

- FAN-IN: *Many* distinct source accounts -> *One* destination account (consolidation/aggregation)
- FAN-OUT: *One* source account -> *Many* distinct destination accounts (distribution/dispersion)
- GATHER-SCATTER: *Many* sources -> *One* hub -> *Many* destinations (in that temporal order)
- First phase: Hub gathers from multiple sources
- Second phase: Hub scatters to multiple destinations
- Time gap between phases: typically hours to days.
- SCATTER-GATHER: *One* source -> *Many* intermediaries -> *One* destination (in that temporal order)
- First phase: Source scatters to multiple intermediaries
- Second phase: Intermediaries gather to final destination
- Creates layering through multiple parallel paths.
- STACK / LAYERING: Sequential hops through multiple accounts (linear chain). The purpose is typically to obscure the \
origin through distance/complexity.
- CYCLE: Funds return to their origin point, creating a circular flow.
- BIPARTITE: Structured flows between two distinct, segregated groups with no within-group transactions. The defining \
characteristic is the segregation itself: not merely two-way flows, but structured isolation between the groups.
- RANDOM: Complex pattern with no discernible structure. Use only when activity is clearly suspicious but doesn't fit \
other typologies.
- NONE: No laundering pattern is supported by evidence in the investigation window.
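As a rough sketch of the degree-based typologies (FAN-IN/FAN-OUT), the check below classifies a hub account from (source, destination) edges alone; the threshold is an arbitrary assumption, and real classification must also weigh timing, amounts, and entity context as described above:

```python
def simple_typology(edges: list[tuple[str, str]], hub: str, min_fan: int = 3) -> str:
    """Toy heuristic: classify `hub` from (source, destination) edges.

    Ignores timing and amounts, so GATHER-SCATTER here disregards the
    required temporal ordering of the two phases.
    """
    sources = {s for s, d in edges if d == hub}
    destinations = {d for s, d in edges if s == hub}
    if len(sources) >= min_fan and len(destinations) < min_fan:
        return "FAN-IN"
    if len(destinations) >= min_fan and len(sources) < min_fan:
        return "FAN-OUT"
    if len(sources) >= min_fan and len(destinations) >= min_fan:
        return "GATHER-SCATTER"
    return "NONE"

edges = [("A", "H"), ("B", "H"), ("C", "H"), ("H", "X")]
print(simple_typology(edges, "H"))  # FAN-IN under these thresholds
```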

## Output Format
Return a single JSON object matching the configured output schema exactly. Populate every field.

**Rules for flagging transaction IDs**:
- **Causal Chain Only**: Include *only* the transactions that form the identified laundering pattern.
- **Exclude Noise**: If an account has more transactions but only 3 are part of the laundering chain, output *only* those 3 IDs.
- The seed transaction should be the last (most recent) transaction in the flagged chain, since the investigation \
window ends with the seed transaction.

## Handling Uncertainty
If you lack sufficient information to make a determination, explicitly state what data is missing. \
Do not fabricate transaction details or make unsupported inferences. When uncertain between benign and suspicious, \
@@ -110,6 +151,7 @@ def create_aml_investigation_agent(
after_agent_callback: AfterAgentCallback | None = None,
before_model_callback: BeforeModelCallback | None = None,
after_model_callback: AfterModelCallback | None = None,
timeout_sec: int | None = None,
enable_tracing: bool = True,
) -> LlmAgent:
"""Create a configured AML investigation agent.
@@ -155,6 +197,9 @@
Callback executed before each model call.
after_model_callback : AfterModelCallback | None, optional
Callback executed after each model call.
timeout_sec : int | None, optional
Optional timeout in seconds for model calls. If specified, model calls
that exceed this duration will be cancelled.
enable_tracing : bool, optional, default=True
Whether to initialize Langfuse tracing for this agent. If ``True``, Langfuse
tracing is initialized with the agent's name as the service name.
@@ -201,6 +246,7 @@
instruction=instructions or ANALYST_PROMPT,
tools=[FunctionTool(db.get_schema_info), FunctionTool(db.execute)],
generate_content_config=GenerateContentConfig(
http_options=HttpOptions(timeout=timeout_sec * 1000) if timeout_sec is not None else None,
temperature=temperature,
top_p=top_p,
top_k=top_k,
@@ -99,7 +99,7 @@ class AnalystOutput(BaseModel):
pattern_type: LaunderingPattern = Field(..., description="The type of laundering pattern in the case.")
pattern_description: str = Field(..., description="A short description of the laundering pattern.")
flagged_transaction_ids: str = Field(
..., description="A string of comma-separated transaction IDs that make up the laundering pattern."
)
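Because `flagged_transaction_ids` is a single comma-separated string rather than a list, downstream consumers presumably need to split it; a minimal, whitespace-tolerant parse might look like:

```python
def parse_flagged_ids(raw: str) -> list[str]:
    """Split a comma-separated ID string, dropping blanks and padding."""
    return [part.strip() for part in raw.split(",") if part.strip()]

print(parse_flagged_ids("TXN-1, TXN-2 ,TXN-3"))  # ['TXN-1', 'TXN-2', 'TXN-3']
```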


3 changes: 0 additions & 3 deletions aieng-eval-agents/aieng/agent_evals/evaluation/types.py
@@ -137,15 +137,12 @@ class TraceEvalResult:
Trace IDs that failed due to errors during evaluation.
errors_by_trace_id : dict[str, str]
Error messages associated with skipped or failed traces.
run_evaluations : list[Evaluation]
Aggregated trace evaluation metrics written at dataset-run level.
"""

evaluations_by_trace_id: dict[str, list[Evaluation]] = field(default_factory=dict)
skipped_trace_ids: list[str] = field(default_factory=list)
failed_trace_ids: list[str] = field(default_factory=list)
errors_by_trace_id: dict[str, str] = field(default_factory=dict)
run_evaluations: list[Evaluation] = field(default_factory=list)


@dataclass(frozen=True)
50 changes: 50 additions & 0 deletions implementations/aml_investigation/README.md
@@ -81,6 +81,56 @@ uv run --env-file .env implementations/aml_investigation/cli.py

The script prints a simple confusion matrix for `is_laundering` based on the cases that have `output`.

## Evaluate the Agent

The evaluation script uploads the AML investigation dataset to Langfuse, runs a comprehensive evaluation experiment with multiple types of evaluators, and displays results in the console.

```bash
uv run --env-file .env implementations/aml_investigation/evaluate.py \
--dataset-path implementations/aml_investigation/data/aml_cases.jsonl \
--dataset-name AML-investigation
```

### Evaluation Levels

The evaluation framework assesses agent performance at three levels:

**Item-Level Evaluators** score each individual case prediction:

- **Deterministic grader**: Checks correctness of `is_laundering`, `pattern_type`, and flagged transaction IDs against ground truth
- **Narrative quality evaluator**: LLM-as-judge that scores the investigation reasoning and pattern description quality using the rubric in `rubrics/narrative_pattern_quality.md`

**Trace-Level Evaluators** analyze tool-use trajectories for each agent run:

- **Trace deterministic grader**: Validates SQL safety (read-only compliance), time window adherence, and query redundancy metrics
- **Trace groundedness evaluator**: LLM-based assessment of whether the agent's narrative is grounded in the actual tool outputs

**Run-Level Grader** aggregates results across all cases:

- Computes precision, recall, and F1-score for `is_laundering` detection
- Generates macro F1-score and confusion matrix for `pattern_type` classification
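The run-level aggregation described above amounts to standard binary-classification metrics over the per-case predictions; a minimal sketch (not the evaluator's actual implementation):

```python
def binary_prf(preds: list[bool], labels: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for boolean is_laundering predictions."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

preds = [True, True, False, True, False]
labels = [True, False, False, True, True]
print(binary_prf(preds, labels))  # precision = recall = f1 = 2/3 on this toy data
```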

### CLI Options

Key options you may want to adjust:

- `--agent-timeout`: Timeout in seconds for each agent run (default: 300)
- `--llm-judge-timeout`: Timeout for LLM judge evaluations (default: 120)
- `--llm-judge-retries`: Retry attempts for LLM judge failures (default: 3)
- `--max-concurrent-cases`: Maximum concurrent cases to process (default: 5)
- `--max-concurrent-traces`: Maximum concurrent trace evaluations (default: 10)
- `--max-trace-wait-time`: Maximum seconds to wait for trace data (default: 300)

### Output

The evaluation displays:

1. **Per-item metrics tables**: Shows deterministic and narrative quality scores for each case
2. **Run-level aggregate metrics**: Overall precision, recall, F1-score, and confusion matrix
3. **Trace evaluation summary**: Count of successful, skipped, and failed trace evaluations

All results are uploaded to Langfuse for further analysis and visualization.

## Run with ADK Web UI

If you want to inspect the agent interactively, the module exposes a top-level `root_agent` for ADK discovery.