feat(eval): LLM grader multimodal — auto-append images to judge message #842

Merged
christso merged 3 commits into main from feat/820-llm-grader-mm on Mar 29, 2026

christso commented Mar 29, 2026

Closes #820
Closes #822

Summary

Adds multimodal support to the LLM grader evaluator. When agent output contains ContentImage blocks in assistant messages, they are automatically extracted and appended as image content parts to the judge model message, enabling vision-capable models to grade multimodal output.

Design

Follows Inspect AI's model_scoring_prompt() pattern:

  1. Extract text completion as before → populate template variables
  2. Extract image blocks from context.output assistant messages
  3. Build a multi-part user message: [{type: 'text', text: rendered_prompt}, ...image_content_parts]
  4. Send to vision-capable judge model via Vercel AI SDK

No template syntax changes needed. Existing templates work identically. Images are transparently appended after the rendered text prompt.
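
The steps above could be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `buildJudgeRequest` is a hypothetical helper, and `TextPart` / `ImagePart` are local stand-ins for the Vercel AI SDK's message part shapes. The returned object is meant to be spread into a `generateText` call next to the model.

```typescript
// Local stand-ins for the SDK's part types (assumption: simplified shapes).
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string | URL };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };
type JudgeRequest = { prompt: string } | { messages: UserMessage[] };

// Hypothetical helper: with no images, keep the plain text prompt;
// otherwise send one user message with the text first and the images after it.
function buildJudgeRequest(renderedPrompt: string, imageParts: ImagePart[]): JudgeRequest {
  if (imageParts.length === 0) {
    return { prompt: renderedPrompt };
  }
  return {
    messages: [
      {
        role: 'user',
        content: [{ type: 'text', text: renderedPrompt }, ...imageParts],
      },
    ],
  };
}
```

Keeping the text-only branch separate means existing evaluations send exactly the same request as before.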

Changes

  • extractImageBlocks() — scans assistant messages for ContentImage blocks
  • toAiSdkImageParts() — converts ContentImage to Vercel AI SDK ImagePart format
  • runWithRetry() — accepts optional images; uses multi-part messages array when images are present, plain text prompt when not (fully backward compatible)
  • All three LLM evaluation paths (evaluateFreeform, evaluateWithRubrics, evaluateWithScoreRanges) extract images from context.output and pass them to runWithRetry
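
A rough sketch of what the two helpers listed above might look like. The `Message` and `Content*` shapes here are simplified assumptions (in particular the `source` field on `ContentImage`), and the `ImagePart` type is a local stand-in for the SDK's; the actual implementations in this PR may differ.

```typescript
// Assumed, simplified content block shapes for illustration only.
type ContentText = { type: 'text'; text: string };
type ContentImage = { type: 'image'; source: string };
type ContentFile = { type: 'file'; path: string };
type Content = ContentText | ContentImage | ContentFile;
type Message = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string | Content[] };

// Collect image blocks from assistant messages only, in their original order.
function extractImageBlocks(messages: readonly Message[]): ContentImage[] {
  const images: ContentImage[] = [];
  for (const message of messages) {
    // Skip non-assistant roles and plain string content.
    if (message.role !== 'assistant' || typeof message.content === 'string') continue;
    for (const block of message.content) {
      if (block.type === 'image') images.push(block);
    }
  }
  return images;
}

// Local stand-in for the SDK image part (assumption: string image payload).
type ImagePart = { type: 'image'; image: string };

// Map each image block to an image part for the judge message.
function toAiSdkImageParts(images: readonly ContentImage[]): ImagePart[] {
  return images.map((image): ImagePart => ({ type: 'image', image: image.source }));
}
```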

Tests

10 new tests covering:

  • extractImageBlocks unit tests (empty, text-only, single image, multiple images, non-assistant filtered, file blocks filtered)
  • Integration tests via mocked generateText (text-only → plain prompt, images → multi-part messages, multiple images, user/tool images ignored)

All 353 existing tests pass.

Depends on

christso force-pushed the feat/825-template-vars branch from e822d86 to f29d165 March 29, 2026 04:34
Base automatically changed from feat/825-template-vars to main March 29, 2026 04:34
christso and others added 3 commits March 29, 2026 04:36
…th industry patterns

- {{output}}, {{input}}, {{expected_output}} now resolve to human-readable
  text instead of JSON.stringify'd message arrays
- Deprecated _text aliases ({{input_text}}, {{output_text}},
  {{expected_output_text}}) still work but emit a stderr warning
- Removed outputText, inputText, expectedOutputText from CodeGraderInput
  schema — code graders should extract text from Message.content using
  getTextContent() from @agentv/core
- Removed EnrichedCodeGraderInput type (no longer needed)
- Updated default evaluator template to use new variable names
- Updated prompt-validator to accept both new and deprecated variable names

Closes #825

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
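
To illustrate the deprecated-alias behaviour described in the commit above, here is a minimal sketch of a template renderer. `renderTemplate` and `DEPRECATED_ALIASES` are hypothetical names, and the warning text is made up; only the variable names and the "still works, warns on stderr" behaviour come from the commit message.

```typescript
// Deprecated _text aliases mapped to their replacement variable names.
const DEPRECATED_ALIASES = new Map<string, string>([
  ['input_text', 'input'],
  ['output_text', 'output'],
  ['expected_output_text', 'expected_output'],
]);

// Hypothetical renderer: substitutes {{name}} placeholders, resolving
// deprecated aliases to the new names and warning (stderr by default).
function renderTemplate(
  template: string,
  vars: Record<string, string>,
  warn: (message: string) => void = (message) => console.error(message),
): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (placeholder: string, name: string) => {
    const canonical = DEPRECATED_ALIASES.get(name);
    if (canonical !== undefined) {
      warn(`{{${name}}} is deprecated; use {{${canonical}}} instead`);
    }
    const key = canonical ?? name;
    // Unknown variables are left in place rather than replaced with an empty string.
    return Object.prototype.hasOwnProperty.call(vars, key) ? String(vars[key]) : placeholder;
  });
}
```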
Add multimodal support to the LLM grader evaluator. When agent output
contains ContentImage blocks in assistant messages, they are automatically
extracted and appended as image content parts to the judge model message.

Changes:
- extractImageBlocks(): scans assistant messages for ContentImage blocks
- toAiSdkImageParts(): converts ContentImage to Vercel AI SDK ImagePart
- runWithRetry(): accepts optional images; uses multi-part messages array
  when images are present, plain text prompt when not (backward compatible)
- evaluateFreeform/evaluateWithRubrics/evaluateWithScoreRanges: extract
  images from context.output and pass to runWithRetry

Follows Inspect AI's model_scoring_prompt() pattern: no template syntax
changes needed — images are transparently appended after the rendered text.

Closes #820

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Input

- Add ContentTextSchema, ContentImageSchema, ContentFileSchema, ContentSchema
  as Zod discriminated union in packages/eval/src/schemas.ts
- Update MessageSchema.content to accept string | Content[] (typed blocks)
- Add materializeContentForGrader() in code-evaluator.ts:
  - Data URI images decoded and written to temp files (path, not base64)
  - Non-URI images pass source through as path field
  - Text/file blocks unchanged; string content unchanged
- Lazy temp dir creation for image files, cleaned up in finally block
- Export Content schemas and types from @agentv/eval
- Add comprehensive unit tests for schema validation and materialization
- Add integration tests for CodeEvaluator with multimodal output

Closes #821

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
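
The materialization behaviour described in the commit above (data URI images written to temp files, other sources passed through as paths, lazy temp dir creation, cleanup in a finally block) might look roughly like this. It is a sketch under assumptions: `withMaterializedImages` is a hypothetical name, the block shapes are simplified, and the real materializeContentForGrader() also handles text, file, and string content.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Assumed, simplified shapes for illustration only.
type ContentImage = { type: 'image'; source: string };
type MaterializedImage = { type: 'image'; path: string };

const DATA_URI = /^data:([^;,]+);base64,(.+)$/;

// Hypothetical wrapper: materializes images, runs the grader, then cleans up.
function withMaterializedImages<T>(
  images: readonly ContentImage[],
  grade: (materialized: MaterializedImage[]) => T,
): T {
  let tempDir: string | undefined;
  // Create the temp dir only when the first data URI image is seen.
  const ensureTempDir = (): string => {
    if (tempDir === undefined) tempDir = mkdtempSync(join(tmpdir(), 'grader-images-'));
    return tempDir;
  };
  try {
    const materialized = images.map((image, index): MaterializedImage => {
      const match = DATA_URI.exec(image.source);
      // Non-data-URI sources pass through unchanged as the path.
      if (match === null) return { type: 'image', path: image.source };
      const [, mediaType = '', payload = ''] = match;
      const ext = (mediaType.split('/')[1] ?? 'bin').replace(/[^a-z0-9]/gi, '') || 'bin';
      const filePath = join(ensureTempDir(), `image-${index}.${ext}`);
      writeFileSync(filePath, Buffer.from(payload, 'base64'));
      return { type: 'image', path: filePath };
    });
    return grade(materialized);
  } finally {
    // Remove the temp dir even if grading throws.
    if (tempDir !== undefined) rmSync(tempDir, { recursive: true, force: true });
  }
}

// Usage: record whether each path existed while the grader callback was running.
const demo = withMaterializedImages(
  [
    { type: 'image', source: `data:image/png;base64,${Buffer.from('hello').toString('base64')}` },
    { type: 'image', source: 'screenshots/existing.png' },
  ],
  (blocks) => blocks.map((block) => ({ path: block.path, existed: existsSync(block.path) })),
);
```

Passing a file path instead of base64 keeps the grader input small; the trade-off is that the path is only valid while the grader is running.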
christso force-pushed the feat/820-llm-grader-mm branch from ce3d35c to edb4178 March 29, 2026 04:36
christso merged commit adaf10e into main Mar 29, 2026
cloudflare-workers-and-pages commented

Deploying agentv with Cloudflare Pages

Latest commit: edb4178
Status: ⚡️ Build in progress...

christso deleted the feat/820-llm-grader-mm branch March 29, 2026 04:36