feat(eval): code grader multimodal — structured Content in CodeGraderInput #841

Closed
christso wants to merge 3 commits into feat/825-template-vars from feat/821-code-grader-mm

@christso commented Mar 29, 2026

Closes #821

Changes

Schema: typed Content blocks in @agentv/eval

  • Added ContentTextSchema, ContentImageSchema, ContentFileSchema, ContentSchema (Zod discriminated union on type)
  • Updated MessageSchema.content from loose string | Record | Record[] to typed string | Content[]
  • ContentImage uses path (file path), never inline base64 — matches wire format contract
  • Exported Content schemas and inferred types from @agentv/eval
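
A minimal sketch of the content shapes described above, written as plain TypeScript types with a hand-rolled guard instead of Zod so that it runs without dependencies. It is not the code from this PR: only the `type` discriminator and the image `path` field come from the description, while `text`, `media_type`, the file block's `path`, and the helper names `isContent` and `isMessageContent` are assumptions made for illustration. Rejecting `data:` URIs inside the guard is also an assumption; the description only states that images travel as paths.

```typescript
// Hypothetical block shapes. Only `type` and the image `path` are taken from the PR text.
type ContentText = { type: "text"; text: string };
type ContentImage = { type: "image"; path: string; media_type?: string };
type ContentFile = { type: "file"; path: string };
type Content = ContentText | ContentImage | ContentFile;

// Message content is either a plain string or an array of typed blocks.
type MessageContent = string | Content[];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Dependency-free stand-in for a discriminated union on `type`.
function isContent(v: unknown): v is Content {
  if (!isRecord(v)) return false;
  const kind = v["type"];
  const text = v["text"];
  const path = v["path"];
  switch (kind) {
    case "text":
      return typeof text === "string";
    case "image":
      // Assumption: inline base64 (a data: URI) is not a valid path.
      return typeof path === "string" && !path.startsWith("data:");
    case "file":
      return typeof path === "string";
    default:
      return false;
  }
}

function isMessageContent(v: unknown): v is MessageContent {
  return typeof v === "string" || (Array.isArray(v) && v.every(isContent));
}
```

Under these assumptions a bare record such as `{ type: "text", text: "x" }` is rejected as message content, which mirrors the move away from the loose `string | Record | Record[]` form.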

Payload builder: image materialization in code-evaluator.ts

  • Added materializeContentForGrader() — converts ContentImage blocks for code grader consumption:
    • Data URI images (data:image/png;base64,...) → decoded to temp file, replaced with file path
    • Path/URL images → source carried through as path field
    • Text/file blocks → passed through unchanged
    • String content → passed through unchanged (zero-copy fast path)
  • Lazy temp dir creation (agentv-img-*) — only allocated when images exist
  • Temp dir cleaned up in finally block alongside file-backed output cleanup
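
The materialization steps above could look roughly like the following sketch, which uses only Node's standard library. It is an illustration under stated assumptions rather than the PR's `materializeContentForGrader()`: the names `materializeSketch` and `withGraderPayload`, the `media_type` field, the `image-N` file naming, and the `jpeg` to `jpg` extension mapping are all hypothetical. The `source` input field, the `path` output field, and the `agentv-img-` temp dir prefix follow the description.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shapes: images arrive with `source` and leave with `path`.
type InputBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: string; media_type?: string }
  | { type: "file"; path: string };
type GraderBlock =
  | { type: "text"; text: string }
  | { type: "image"; path: string; media_type?: string }
  | { type: "file"; path: string };

const DATA_URI = /^data:image\/([a-z0-9.+-]+);base64,(.+)$/i;

function materializeSketch(
  content: string | InputBlock[],
  getTempDir: () => string,
): string | GraderBlock[] {
  if (typeof content === "string") return content; // string fast path, returned as-is
  let counter = 0;
  return content.map((block): GraderBlock => {
    if (block.type !== "image") return block; // text and file blocks pass through
    const { source, ...rest } = block;
    const match = DATA_URI.exec(source);
    if (!match) return { ...rest, path: source }; // path or URL: carried through as `path`
    const [, subtype = "png", base64 = ""] = match;
    // Assumption: the extension is derived from the MIME subtype.
    const ext = subtype.toLowerCase() === "jpeg" ? "jpg" : subtype.toLowerCase();
    const file = path.join(getTempDir(), `image-${counter++}.${ext}`);
    fs.writeFileSync(file, Buffer.from(base64, "base64"));
    return { ...rest, path: file };
  });
}

// Creates the temp dir only when an image needs it, and removes it in `finally`.
function withGraderPayload<T>(
  content: string | InputBlock[],
  run: (payload: string | GraderBlock[]) => T,
): T {
  let tempDir: string | undefined;
  const getTempDir = (): string =>
    (tempDir ??= fs.mkdtempSync(path.join(os.tmpdir(), "agentv-img-")));
  try {
    return run(materializeSketch(content, getTempDir));
  } finally {
    if (tempDir) fs.rmSync(tempDir, { recursive: true, force: true });
  }
}
```

Passing the temp dir as a callback keeps the allocation lazy: text-only and path-only content never touches the filesystem, and the `finally` block has nothing to remove in that case.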

Tests

  • 11 unit tests for materializeContentForGrader (null/undefined, text-only, data URIs, paths, JPEG extension, multiple images, ContentFile preservation, field preservation)
  • 3 integration tests for CodeEvaluator multimodal flow (text-only, image materialization, temp cleanup)
  • 7 schema validation tests (ContentSchema, MessageSchema content variants, CodeGraderInput with Content[])

Depends on

christso and others added 3 commits March 29, 2026 02:34
Update Claude and Pi providers to preserve non-text content blocks
(images) in Message.content instead of discarding them via
extractTextContent(). This enables multimodal content to flow from
provider response through to evaluators.

Changes:
- Create shared claude-content.ts with toContentArray() and
  extractTextContent() used by all 3 Claude providers
- Update claude-cli, claude-sdk, claude providers to use
  structuredContent ?? textContent pattern
- Add toPiContentArray() to pi-utils.ts for Pi provider
- Update pi-coding-agent convertAgentMessage() to preserve
  structured content
- Add 23 unit tests covering content preservation, backward
  compat, and end-to-end multimodal flow

Text-only responses still produce plain strings (no unnecessary
wrapping). extractTextContent() remains available for backward
compatibility.

Closes #818

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
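
The `structuredContent ?? textContent` pattern named in the commit message above can be sketched as follows. This is a guess at the shape of the idea, not the contents of `claude-content.ts`: the block types, the `Sketch` function names, and the choice to join text blocks with an empty string are hypothetical. The key assumption is that the array helper returns `undefined` for text-only responses, so the nullish-coalescing fallback yields a plain string.

```typescript
// Hypothetical provider block shapes for illustration only.
type ProviderBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: string };
type TextBlock = Extract<ProviderBlock, { type: "text" }>;

// Assumption: returns undefined when every block is text, so callers
// fall back to a plain string instead of wrapping text in an array.
function toContentArraySketch(blocks: ProviderBlock[]): ProviderBlock[] | undefined {
  return blocks.some((b) => b.type !== "text") ? blocks : undefined;
}

// Concatenates the text blocks and ignores everything else.
function extractTextSketch(blocks: ProviderBlock[]): string {
  return blocks
    .filter((b): b is TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
}

function toMessageContent(blocks: ProviderBlock[]): string | ProviderBlock[] {
  const structuredContent = toContentArraySketch(blocks);
  const textContent = extractTextSketch(blocks);
  return structuredContent ?? textContent;
}
```

With this arrangement, existing consumers that expect a string keep working for text-only responses, while responses containing an image keep their non-text blocks.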
…Input

- Add ContentTextSchema, ContentImageSchema, ContentFileSchema, ContentSchema
  as Zod discriminated union in packages/eval/src/schemas.ts
- Update MessageSchema.content to accept string | Content[] (typed blocks)
- Add materializeContentForGrader() in code-evaluator.ts:
  - Data URI images decoded and written to temp files (path, not base64)
  - Non-URI images pass source through as path field
  - Text/file blocks unchanged; string content unchanged
- Lazy temp dir creation for image files, cleaned up in finally block
- Export Content schemas and types from @agentv/eval
- Add comprehensive unit tests for schema validation and materialization
- Add integration tests for CodeEvaluator with multimodal output

Closes #821

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@christso closed this Mar 29, 2026
@christso changed the title from "feat(eval): LLM grader multimodal — auto-append images to judge message" to "feat(eval): code grader multimodal — structured Content in CodeGraderInput" Mar 29, 2026