feat(eval): LLM grader multimodal — auto-append images to judge message by christso · Pull Request #842 · EntityProcess/agentv

christso · 2026-03-29T03:10:20Z

Closes #820
Closes #822

Summary

Adds multimodal support to the LLM grader evaluator. When agent output contains ContentImage blocks in assistant messages, they are automatically extracted and appended as image content parts to the judge model message, enabling vision-capable models to grade multimodal output.

Design

Follows Inspect AI's model_scoring_prompt() pattern:

1. Extract text completion as before → populate template variables
2. Extract image blocks from context.output assistant messages
3. Build a multi-part user message: [{type: 'text', text: rendered_prompt}, ...image_content_parts]
4. Send to vision-capable judge model via Vercel AI SDK

No template syntax changes needed. Existing templates work identically. Images are transparently appended after the rendered text prompt.
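
The message-building step can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `buildJudgeRequest`, `JudgeRequest`, and the local `TextPart`/`ImagePart` types are hypothetical stand-ins that only mirror the content-part shape described above, and nothing here is imported from the AI SDK.

```typescript
// Local stand-ins for the content-part shapes (assumed; not imported from 'ai').
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Hypothetical request shape: a plain prompt, or a messages array when images exist.
type JudgeRequest = { prompt: string } | { messages: UserMessage[] };

function buildJudgeRequest(renderedPrompt: string, images: ImagePart[] = []): JudgeRequest {
  // No images: keep the original text-only request untouched.
  if (images.length === 0) {
    return { prompt: renderedPrompt };
  }
  // Images present: rendered text first, image parts appended after it.
  return {
    messages: [
      {
        role: 'user',
        content: [{ type: 'text', text: renderedPrompt }, ...images],
      },
    ],
  };
}
```

Because the text part always comes first and carries the fully rendered template, existing templates need no awareness of the images that follow.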

Changes

- extractImageBlocks(): scans assistant messages for ContentImage blocks
- toAiSdkImageParts(): converts ContentImage to Vercel AI SDK ImagePart format
- runWithRetry(): accepts optional images; uses multi-part messages array when images are present, plain text prompt when not (fully backward compatible)
- All three LLM evaluation paths (evaluateFreeform, evaluateWithRubrics, evaluateWithScoreRanges) extract images from context.output and pass to runWithRetry
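
As a rough sketch of the two helpers, assuming simplified shapes: the `Content*` and `Message` types below are illustrative guesses (only a `source` field on image blocks is implied elsewhere on this page, and the `path` field on file blocks is invented for the example), and `ImagePartLike` is a local stand-in rather than the SDK's own type. The real implementations may differ.

```typescript
// Assumed, simplified shapes; the real types live in the agentv codebase.
type ContentText = { type: 'text'; text: string };
type ContentImage = { type: 'image'; source: string };
type ContentFile = { type: 'file'; path: string }; // `path` is a hypothetical field
type Content = ContentText | ContentImage | ContentFile;
type Message = { role: 'user' | 'assistant' | 'tool'; content: string | Content[] };

// Local stand-in for an AI SDK image part.
type ImagePartLike = { type: 'image'; image: string };

function extractImageBlocks(messages: readonly Message[]): ContentImage[] {
  const images: ContentImage[] = [];
  for (const message of messages) {
    // Only assistant messages with typed content blocks are considered.
    if (message.role !== 'assistant' || typeof message.content === 'string') continue;
    for (const block of message.content) {
      // Text and file blocks are skipped; only image blocks are collected.
      if (block.type === 'image') images.push(block);
    }
  }
  return images;
}

function toAiSdkImageParts(images: readonly ContentImage[]): ImagePartLike[] {
  // Pass the source (URL or data URI) through as the image payload.
  return images.map((img): ImagePartLike => ({ type: 'image', image: img.source }));
}
```

Filtering by role keeps images supplied by the user or returned by tools out of the judge message, so only what the agent itself produced is graded.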

Tests

10 new tests covering:

- extractImageBlocks unit tests (empty, text-only, single image, multiple images, non-assistant filtered, file blocks filtered)
- Integration tests via mocked generateText (text-only → plain prompt, images → multi-part messages, multiple images, user/tool images ignored)

All 353 existing tests pass.

Depends on

feat(eval): simplify template variables — drop _text suffix, align with industry patterns #839 (feat/825-template-vars)
feat(core): preserve multimodal content blocks in provider responses #833 (feat/818-provider-preserve)

…th industry patterns

- {{output}}, {{input}}, {{expected_output}} now resolve to human-readable text instead of JSON.stringify'd message arrays
- Deprecated _text aliases ({{input_text}}, {{output_text}}, {{expected_output_text}}) still work but emit a stderr warning
- Removed outputText, inputText, expectedOutputText from CodeGraderInput schema; code graders should extract text from Message.content using getTextContent() from @agentv/core
- Removed EnrichedCodeGraderInput type (no longer needed)
- Updated default evaluator template to use new variable names
- Updated prompt-validator to accept both new and deprecated variable names

Closes #825

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
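
To illustrate the alias behaviour described in this commit, here is a minimal sketch of a renderer that resolves the deprecated `_text` names to the new ones and warns once per use. `renderTemplate`, `DEPRECATED_ALIASES`, and the warning text are hypothetical; the actual renderer and its message wording are not shown on this page.

```typescript
// Deprecated variable name -> current name (names taken from the commit message).
const DEPRECATED_ALIASES = new Map<string, string>([
  ['input_text', 'input'],
  ['output_text', 'output'],
  ['expected_output_text', 'expected_output'],
]);

function renderTemplate(
  template: string,
  vars: Record<string, string>,
  // Defaults to stderr via console.error; injectable so callers can capture warnings.
  warn: (msg: string) => void = (msg) => console.error(msg),
): string {
  // Own-property lookup so names like "constructor" never hit the prototype chain.
  const lookup = (key: string): string | undefined =>
    Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : undefined;

  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) => {
    const canonical = DEPRECATED_ALIASES.get(name);
    if (canonical !== undefined) {
      warn(`{{${name}}} is deprecated; use {{${canonical}}}`);
      return lookup(canonical) ?? match;
    }
    // Unknown variables are left in place rather than replaced with an empty string.
    return lookup(name) ?? match;
  });
}
```

For example, `renderTemplate('A: {{output_text}}', { output: 'yo' })` would render the same text as `{{output}}` while emitting one deprecation warning.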

Add multimodal support to the LLM grader evaluator. When agent output contains ContentImage blocks in assistant messages, they are automatically extracted and appended as image content parts to the judge model message.

Changes:

- extractImageBlocks(): scans assistant messages for ContentImage blocks
- toAiSdkImageParts(): converts ContentImage to Vercel AI SDK ImagePart
- runWithRetry(): accepts optional images; uses multi-part messages array when images are present, plain text prompt when not (backward compatible)
- evaluateFreeform/evaluateWithRubrics/evaluateWithScoreRanges: extract images from context.output and pass to runWithRetry

Follows Inspect AI's model_scoring_prompt() pattern: no template syntax changes needed; images are transparently appended after the rendered text.

Closes #820

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…Input

- Add ContentTextSchema, ContentImageSchema, ContentFileSchema, ContentSchema as Zod discriminated union in packages/eval/src/schemas.ts
- Update MessageSchema.content to accept string | Content[] (typed blocks)
- Add materializeContentForGrader() in code-evaluator.ts:
  - Data URI images decoded and written to temp files (path, not base64)
  - Non-URI images pass source through as path field
  - Text/file blocks unchanged; string content unchanged
- Lazy temp dir creation for image files, cleaned up in finally block
- Export Content schemas and types from @agentv/eval
- Add comprehensive unit tests for schema validation and materialization
- Add integration tests for CodeEvaluator with multimodal output

Closes #821

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
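
The data-URI rule in this commit could look something like the sketch below. It covers only the per-image decision (data URI versus pass-through); `materializeImage`, the `writeTemp` callback, and both type shapes are hypothetical, and the file writing and cleanup described above are left to the injected callback.

```typescript
// Assumed shapes: an image block with a `source`, and its grader-facing form with a `path`.
type ContentImage = { type: 'image'; source: string };
type MaterializedImage = { type: 'image'; path: string };

// Matches base64 data URIs such as "data:image/png;base64,AAAA".
const DATA_URI = /^data:([\w.+-]+\/[\w.+-]+);base64,([A-Za-z0-9+\/=]*)$/;

function materializeImage(
  block: ContentImage,
  // Hypothetical callback: persists the payload to a temp file and returns its path.
  writeTemp: (base64: string, mediaType: string) => string,
): MaterializedImage {
  const match = DATA_URI.exec(block.source);
  if (match === null) {
    // Not a data URI: pass the source through unchanged as the path.
    return { type: 'image', path: block.source };
  }
  const [, mediaType = '', base64 = ''] = match;
  // Data URI: hand the payload to the writer so the grader receives a path, not base64.
  return { type: 'image', path: writeTemp(base64, mediaType) };
}
```

Injecting the writer keeps the decision logic testable without touching the filesystem, and lets the caller create the temp directory only when a data URI is actually encountered.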

cloudflare-workers-and-pages · 2026-03-29T04:36:29Z

Deploying agentv with Cloudflare Pages

Latest commit:	`edb4178`
Status:	⚡️ Build in progress...

christso force-pushed the feat/825-template-vars branch from e822d86 to f29d165 on March 29, 2026 04:34

Base automatically changed from feat/825-template-vars to main March 29, 2026 04:34

christso and others added 3 commits March 29, 2026 04:36

christso force-pushed the feat/820-llm-grader-mm branch from ce3d35c to edb4178 on March 29, 2026 04:36

christso merged commit adaf10e into main Mar 29, 2026

christso deleted the feat/820-llm-grader-mm branch March 29, 2026 04:36

christso merged 3 commits into main from feat/820-llm-grader-mm
