100 changes: 73 additions & 27 deletions aieng-eval-agents/aieng/agent_evals/aml_investigation/agent.py
@@ -24,69 +24,110 @@
from google.adk.agents.base_agent import AfterAgentCallback, BeforeAgentCallback
from google.adk.agents.llm_agent import AfterModelCallback, BeforeModelCallback
from google.adk.tools.function_tool import FunctionTool
from google.genai.types import GenerateContentConfig, HttpOptions, ThinkingConfig


_DEFAULT_AGENT_DESCRIPTION = "Conducts multi-step investigations for money laundering patterns using database queries."

ANALYST_PROMPT = """\
You are an Anti‑Money Laundering (AML) Investigation Analyst at a financial institution. \
Your job is to investigate a case by reviewing activity in the available database and explaining whether the \
observed behavior within the case window is consistent with a money laundering pattern or a benign explanation.

You have access to tools for querying the database. Use them strategically. Do NOT guess or invent transactions.

## Core Principles
- Start with the hypothesis that activity is legitimate/benign; abandon it only when evidence contradicts it.
- Laundering requires multiple indicators from different categories; a single factor alone is not sufficient.
- Entity type, business model, and transaction purpose determine whether patterns are suspicious.
- Base conclusions on observable transaction patterns, not speculation or absence of information.

## Input
You will be given a JSON object with these fields:
- `case_id`: unique case identifier.
- `seed_transaction_id`: identifier for the primary transaction that triggered the case.
- `seed_timestamp`: timestamp of the seed transaction (end of the investigation window).
- `window_start`: timestamp of the beginning of the investigation window.
- `trigger_label`: upstream alert or review label that initiated the case. This may be noisy and should not be taken \
at face value.

**Time Scope**: Only analyze events with timestamps between `window_start` and `seed_timestamp` (inclusive).
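For illustration, a case input of this shape might look like the following (all field values below are hypothetical, not drawn from any real dataset):

```python
import json

# Hypothetical case input; every value is invented for illustration.
case = {
    "case_id": "CASE-000123",
    "seed_transaction_id": "TXN-98765",
    "seed_timestamp": "2024-03-15T17:42:00Z",   # end of the investigation window
    "window_start": "2024-02-14T00:00:00Z",     # start of the investigation window
    "trigger_label": "FAN-IN",                  # upstream hint; may be noisy
}
print(json.dumps(case, indent=2))
```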

## Investigation Workflow

### Step 1: Seed Transaction Review
Query the seed transaction and extract:
- Involved parties and their entity types (Corporation, Sole Proprietorship, Partnership, Individual)
- Amounts, currencies, payment channels
- Timestamps and jurisdictions

### Step 2: Scope and Collect
**Note**: You have a limited context window and a limited number of database queries. Be strategic with the queries \
you run to avoid hitting these limits before gathering enough evidence to make a determination.

**For each account you investigate**:

1. **Always start with aggregates**:
```
- COUNT(*) transactions
- COUNT(DISTINCT counterparty)
- SUM(amount) by direction
- Distribution by payment type/time
```

2. **Pull details selectively**:
- If count ≤ 20 transactions: Safe to SELECT all
- If count > 20: Query top counterparties, then pull samples for suspicious patterns
- Never pull thousands of raw transactions - use aggregates + samples

3. **Expand strategically**:
- Follow promising leads from aggregates (unusual counterparties, timing clusters)
- Maximum 2-3 hops from seed unless clear layering chain
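The aggregate-first workflow above can be sketched against a toy SQLite table; the table name, columns, and data here are invented stand-ins for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "  txn_id TEXT, account TEXT, counterparty TEXT,"
    "  direction TEXT, amount REAL, payment_type TEXT, ts TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("T1", "ACC-1", "ACC-2", "in", 900.0, "wire", "2024-03-01T10:00:00Z"),
        ("T2", "ACC-1", "ACC-3", "in", 950.0, "wire", "2024-03-01T11:00:00Z"),
        ("T3", "ACC-1", "ACC-9", "out", 1800.0, "ach", "2024-03-02T09:00:00Z"),
    ],
)

# Step 1: aggregates before any row-level pulls.
n_txns, n_counterparties = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT counterparty) FROM transactions WHERE account = 'ACC-1'"
).fetchone()
totals = dict(conn.execute(
    "SELECT direction, SUM(amount) FROM transactions "
    "WHERE account = 'ACC-1' GROUP BY direction"
).fetchall())

# Step 2: pull full rows only because the count is small (<= 20).
details = (
    conn.execute("SELECT * FROM transactions WHERE account = 'ACC-1' ORDER BY ts").fetchall()
    if n_txns <= 20
    else []
)
print(n_txns, n_counterparties, totals)
```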

### Step 3: Assess Benign Explanations (Default Hypothesis)
Attempt to explain observed activity as legitimate first:
- State which evidence supports the benign hypothesis
- Identify what additional data would strengthen this explanation
- Only proceed to Step 4 if benign explanations are insufficient

### Step 4: Test Laundering Hypotheses (If Needed)
If benign explanations fail to account for the evidence:
- Test whether the evidence supports known laundering typologies
- Cite concrete indicators that rule out benign explanations

## Typologies / Heuristics
Consider the following typologies when assessing laundering patterns:

- FAN-IN: *Many* distinct source accounts -> *One* destination account (consolidation/aggregation)
- FAN-OUT: *One* source account -> *Many* distinct destination accounts (distribution/dispersion)
- GATHER-SCATTER: *Many* sources -> *One* hub -> *Many* destinations (in that temporal order)
- First phase: Hub gathers from multiple sources
- Second phase: Hub scatters to multiple destinations
- Time gap between phases: typically hours to days.
- SCATTER-GATHER: *One* source -> *Many* intermediaries -> *One* destination (in that temporal order)
- First phase: Source scatters to multiple intermediaries
- Second phase: Intermediaries gather to final destination
- Creates layering through multiple parallel paths.
- STACK / LAYERING: Sequential hops through multiple accounts (linear chain). The purpose is typically to obscure the \
origin through distance/complexity.
- CYCLE: Funds return to their origin point, creating a circular flow.
- BIPARTITE: Structured flows between two distinct, segregated groups with no within-group transactions. The defining \
characteristic is the segregation itself: not merely two-way flows, but structured isolation between the groups.
- RANDOM: Complex pattern with no discernible structure. Use only when activity is clearly suspicious but doesn't fit \
other typologies.
- NONE: No laundering pattern is supported by evidence in the investigation window.
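As a rough sketch of the degree-based typologies (FAN-IN/FAN-OUT), the check below classifies a hub account from (source, destination) edges alone; the threshold is an arbitrary assumption, and real classification must also weigh timing, amounts, and entity context as described above:

```python
def simple_typology(edges: list[tuple[str, str]], hub: str, min_fan: int = 3) -> str:
    """Toy heuristic: classify `hub` from (source, destination) edges.

    Ignores timing and amounts, so GATHER-SCATTER here disregards the
    required temporal ordering of the two phases.
    """
    sources = {s for s, d in edges if d == hub}
    destinations = {d for s, d in edges if s == hub}
    if len(sources) >= min_fan and len(destinations) < min_fan:
        return "FAN-IN"
    if len(destinations) >= min_fan and len(sources) < min_fan:
        return "FAN-OUT"
    if len(sources) >= min_fan and len(destinations) >= min_fan:
        return "GATHER-SCATTER"
    return "NONE"

edges = [("A", "H"), ("B", "H"), ("C", "H"), ("H", "X")]
print(simple_typology(edges, "H"))  # FAN-IN under these thresholds
```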

## Output Format
Return a single JSON object matching the configured output schema exactly. Populate every field.

**Rules for flagging transaction IDs**:
- **Causal Chain Only**: Include *only* the transactions that form the identified laundering pattern.
- **Exclude Noise**: If an account has more transactions but only 3 are part of the laundering chain, output *only* those 3 IDs.
- The seed transaction should be the last (most recent) transaction in the flagged chain, since the investigation \
window ends with the seed transaction.

## Handling Uncertainty
If you lack sufficient information to make a determination, explicitly state what data is missing. \
Do not fabricate transaction details or make unsupported inferences. When uncertain between benign and suspicious, \
@@ -110,6 +151,7 @@ def create_aml_investigation_agent(
after_agent_callback: AfterAgentCallback | None = None,
before_model_callback: BeforeModelCallback | None = None,
after_model_callback: AfterModelCallback | None = None,
timeout_sec: int | None = None,
enable_tracing: bool = True,
) -> LlmAgent:
"""Create a configured AML investigation agent.
@@ -155,6 +197,9 @@
Callback executed before each model call.
after_model_callback : AfterModelCallback | None, optional
Callback executed after each model call.
timeout_sec : int | None, optional
Optional timeout in seconds for model calls. If specified, model calls
that exceed this duration will be cancelled.
enable_tracing : bool, optional, default=True
Whether to initialize Langfuse tracing for this agent. If ``True``, Langfuse
tracing is initialized with the agent's name as the service name.
@@ -201,6 +246,7 @@
instruction=instructions or ANALYST_PROMPT,
tools=[FunctionTool(db.get_schema_info), FunctionTool(db.execute)],
generate_content_config=GenerateContentConfig(
http_options=HttpOptions(timeout=timeout_sec * 1000) if timeout_sec is not None else None,
temperature=temperature,
top_p=top_p,
top_k=top_k,
@@ -99,7 +99,7 @@ class AnalystOutput(BaseModel):
pattern_type: LaunderingPattern = Field(..., description="The type of laundering pattern in the case.")
pattern_description: str = Field(..., description="A short description of the laundering pattern.")
flagged_transaction_ids: str = Field(
..., description="A string of comma-separated transaction IDs that make up the laundering pattern."
)
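Because `flagged_transaction_ids` is a single comma-separated string rather than a list, downstream consumers presumably need to split it; a minimal, whitespace-tolerant parse might look like:

```python
def parse_flagged_ids(raw: str) -> list[str]:
    """Split a comma-separated ID string, dropping blanks and padding."""
    return [part.strip() for part in raw.split(",") if part.strip()]

print(parse_flagged_ids("TXN-1, TXN-2 ,TXN-3"))  # ['TXN-1', 'TXN-2', 'TXN-3']
```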


3 changes: 0 additions & 3 deletions aieng-eval-agents/aieng/agent_evals/evaluation/types.py
@@ -137,15 +137,12 @@ class TraceEvalResult:
Trace IDs that failed due to errors during evaluation.
errors_by_trace_id : dict[str, str]
Error messages associated with skipped or failed traces.
run_evaluations : list[Evaluation]
Aggregated trace evaluation metrics written at dataset-run level.
"""

evaluations_by_trace_id: dict[str, list[Evaluation]] = field(default_factory=dict)
skipped_trace_ids: list[str] = field(default_factory=list)
failed_trace_ids: list[str] = field(default_factory=list)
errors_by_trace_id: dict[str, str] = field(default_factory=dict)
run_evaluations: list[Evaluation] = field(default_factory=list)


@dataclass(frozen=True)
50 changes: 50 additions & 0 deletions implementations/aml_investigation/README.md
@@ -81,6 +81,56 @@ uv run --env-file .env implementations/aml_investigation/cli.py

The script prints a simple confusion matrix for `is_laundering` based on the cases that have `output`.

## Evaluate the Agent

The evaluation script uploads the AML investigation dataset to Langfuse, runs a comprehensive evaluation experiment with multiple types of evaluators, and displays results in the console.

```bash
uv run --env-file .env implementations/aml_investigation/evaluate.py \
--dataset-path implementations/aml_investigation/data/aml_cases.jsonl \
--dataset-name AML-investigation
```

### Evaluation Levels

The evaluation framework assesses agent performance at three levels:

**Item-Level Evaluators** score each individual case prediction:

- **Deterministic grader**: Checks correctness of `is_laundering`, `pattern_type`, and flagged transaction IDs against ground truth
- **Narrative quality evaluator**: LLM-as-judge that scores the investigation reasoning and pattern description quality using the rubric in `rubrics/narrative_pattern_quality.md`

**Trace-Level Evaluators** analyze tool-use trajectories for each agent run:

- **Trace deterministic grader**: Validates SQL safety (read-only compliance), time window adherence, and query redundancy metrics
- **Trace groundedness evaluator**: LLM-based assessment of whether the agent's narrative is grounded in the actual tool outputs

**Run-Level Grader** aggregates results across all cases:

- Computes precision, recall, and F1-score for `is_laundering` detection
- Generates macro F1-score and confusion matrix for `pattern_type` classification
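The run-level aggregation described above amounts to standard binary-classification metrics over the per-case predictions; a minimal sketch (not the evaluator's actual implementation):

```python
def binary_prf(preds: list[bool], labels: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for boolean is_laundering predictions."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

preds = [True, True, False, True, False]
labels = [True, False, False, True, True]
print(binary_prf(preds, labels))  # precision = recall = f1 = 2/3 on this toy data
```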

### CLI Options

Key options you may want to adjust:

- `--agent-timeout`: Timeout in seconds for each agent run (default: 300)
- `--llm-judge-timeout`: Timeout for LLM judge evaluations (default: 120)
- `--llm-judge-retries`: Retry attempts for LLM judge failures (default: 3)
- `--max-concurrent-cases`: Maximum concurrent cases to process (default: 5)
- `--max-concurrent-traces`: Maximum concurrent trace evaluations (default: 10)
- `--max-trace-wait-time`: Maximum seconds to wait for trace data (default: 300)

### Output

The evaluation displays:

1. **Per-item metrics tables**: Shows deterministic and narrative quality scores for each case
2. **Run-level aggregate metrics**: Overall precision, recall, F1-score, and confusion matrix
3. **Trace evaluation summary**: Count of successful, skipped, and failed trace evaluations

All results are uploaded to Langfuse for further analysis and visualization.

## Run with ADK Web UI

If you want to inspect the agent interactively, the module exposes a top-level `root_agent` for ADK discovery.