Rename fields in `CaseRecord` for consistency with langfuse evaluators by fcogidi · Pull Request #38 · VectorInstitute/eval-agents

fcogidi · 2026-02-06T15:17:32Z

Summary

Rename fields in CaseRecord for consistency with langfuse evaluators

Clickup Ticket(s): N/A

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test improvements
🔒 Security fix

Changes Made

Rename case to input.
Rename groundtruth to expected_output.
Rename analysis to output.

Testing

Tests pass locally (uv run pytest tests/)
Type checking passes (uv run mypy <src_dir>)
Linting passes (uv run ruff check src_dir/)
Manual testing performed (describe below)

Manual testing details:
N/A

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

N/A

Checklist

Code follows the project's style guidelines
Self-review of code completed
Documentation updated (if applicable)
No sensitive information (API keys, credentials) exposed

Copilot

Pull request overview

Renames CaseRecord fields to align with langfuse evaluator conventions (input, expected_output, output) and updates agent + tests accordingly.

Changes:

Renamed CaseRecord fields (case → input, groundtruth → expected_output, analysis → output).
Updated AML investigation agent logic to read/write/analyze using the new field names.
Updated AML investigation test cases to assert against the renamed fields.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
implementations/aml_investigation/agent.py	Updates analysis pipeline, resume logic, and metrics computation to use `input/expected_output/output`.
aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py	Renames the `CaseRecord` schema and updates case-building helpers accordingly.
aieng-eval-agents/tests/aieng/agent_evals/aml_investigation/data/test_cases.py	Updates tests to reference the renamed `CaseRecord` fields.

Copilot · 2026-02-06T15:19:49Z

aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py

+    input: CaseFile = Field(..., description="Metadata for the laundering case.")
+    expected_output: GroundTruth = Field(..., description="Ground truth information for the laundering case.")
+    output: AnalystOutput | None = Field(
        default=None,
        description="Optional analyst output for the laundering case. Typically populated after investigation.",
    )


Renaming these serialized fields will break parsing of any existing JSON/JSONL persisted with the old keys (case, groundtruth, analysis). Since the agent reads prior results via CaseRecord.model_validate_json(...) for resume behavior, legacy lines will fail validation and be skipped. Consider adding Pydantic v2 compatibility via validation_alias (e.g., AliasChoices('input', 'case'), AliasChoices('expected_output', 'groundtruth'), AliasChoices('output', 'analysis')) so existing artifacts remain readable while emitting the new field names.
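
The alias suggestion above could look roughly like the following. This is a minimal sketch, assuming Pydantic v2 is installed; the plain `dict` fields are simplified stand-ins for the real `CaseFile`, `GroundTruth`, and `AnalystOutput` models, and the `is_laundering` key is a hypothetical payload used only for illustration.

```python
import json
from typing import Optional

from pydantic import AliasChoices, BaseModel, Field


class CaseRecord(BaseModel):
    """Simplified stand-in for the real CaseRecord (nested models replaced by dicts)."""

    # Each field accepts the new key first, then falls back to the legacy key.
    input: dict = Field(..., validation_alias=AliasChoices("input", "case"))
    expected_output: dict = Field(
        ..., validation_alias=AliasChoices("expected_output", "groundtruth")
    )
    output: Optional[dict] = Field(
        default=None, validation_alias=AliasChoices("output", "analysis")
    )


# Hypothetical JSONL rows: one written with the old schema, one with the new.
legacy_line = '{"case": {"case_id": "c1"}, "groundtruth": {"is_laundering": true}, "analysis": null}'
current_line = '{"input": {"case_id": "c1"}, "expected_output": {"is_laundering": true}, "output": null}'

legacy = CaseRecord.model_validate_json(legacy_line)
current = CaseRecord.model_validate_json(current_line)

# Serialization uses the field names, so only the new keys are emitted.
emitted_keys = set(json.loads(legacy.model_dump_json()))
```

Because `validation_alias` only affects parsing, rows written after the rename would carry the new keys regardless of which schema they were read from.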

Copilot · 2026-02-06T15:19:49Z

implementations/aml_investigation/agent.py

    input_records = _load_records(input_path)
-    existing_results = {record.case.case_id: record for record in _load_records(output_path)}
-    to_run = [r for r in input_records if existing_results.get(r.case.case_id, r).analysis is None]
+    existing_results = {record.input.case_id: record for record in _load_records(output_path)}


With the CaseRecord field rename, _load_records(output_path) will skip any legacy JSONL rows (old schema) as invalid, which can cause existing_results to be incomplete and make the resume logic re-run cases unnecessarily (and potentially overwrite/duplicate outputs). After adding schema aliases (or a migration path), it would also help to make legacy parsing failures more explicit here (e.g., fail fast or log a clear warning/count) so users don’t silently lose resume behavior.

Suggested change

-    existing_results = {record.input.case_id: record for record in _load_records(output_path)}
+    # Load existing results from the output file, and detect any lines that could not be parsed
+    existing_records = list(_load_records(output_path))
+    existing_results = {record.input.case_id: record for record in existing_records}
+    # Compare parsed records against total lines to surface potential legacy/invalid rows
+    if output_path.exists():
+        try:
+            with output_path.open("r", encoding="utf-8") as f:
+                total_lines = sum(1 for _ in f)
+        except OSError:
+            total_lines = None
+        if total_lines is not None and total_lines > len(existing_records):
+            logger.warning(
+                "Detected %d/%d records in %s that could not be parsed. "
+                "These may be legacy or invalid rows, and resume behavior may be affected.",
+                total_lines - len(existing_records),
+                total_lines,
+                output_path,
+            )
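
The comment above also mentions a migration path as an alternative to schema aliases. A minimal sketch of that idea, using only the standard library, is shown below. The helper names `migrate_line` and `migrate_jsonl` are hypothetical and not part of this PR; the key mapping comes from the "Changes Made" list, and the sketch assumes the renamed keys only appear at the top level of each JSONL row.

```python
import json

# Old top-level key -> new top-level key, per the "Changes Made" list.
LEGACY_KEYS = {"case": "input", "groundtruth": "expected_output", "analysis": "output"}


def migrate_line(line: str) -> str:
    """Rename legacy top-level keys in one JSONL row; new-style rows pass through unchanged."""
    row = json.loads(line)
    migrated = {LEGACY_KEYS.get(key, key): value for key, value in row.items()}
    return json.dumps(migrated)


def migrate_jsonl(text: str) -> str:
    """Migrate every non-blank line of a JSONL document, preserving row order."""
    return "\n".join(migrate_line(line) for line in text.splitlines() if line.strip())
```

Running such a one-off rewrite over existing output files before resuming would let the unmodified `CaseRecord` schema parse them, at the cost of touching the persisted artifacts.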

aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py

Rename fields in CaseRecord for consistency with langfuse evaluators

809e001

fcogidi requested review from amrit110 and lotif February 6, 2026 15:17

fcogidi self-assigned this Feb 6, 2026

fcogidi added the refactor label (Refactor or clean up code structure) Feb 6, 2026

fcogidi requested a review from Copilot February 6, 2026 15:18

Copilot AI reviewed Feb 6, 2026

fcogidi wants to merge 1 commit into main from fco/rename_fields
