feat(eval): LLM grader multimodal — auto-append images to judge message#842
Merged
…th industry patterns
- {{output}}, {{input}}, {{expected_output}} now resolve to human-readable
text instead of JSON.stringify'd message arrays
- Deprecated _text aliases ({{input_text}}, {{output_text}},
{{expected_output_text}}) still work but emit a stderr warning
- Removed outputText, inputText, expectedOutputText from CodeGraderInput
schema — code graders should extract text from Message.content using
getTextContent() from @agentv/core
- Removed EnrichedCodeGraderInput type (no longer needed)
- Updated default evaluator template to use new variable names
- Updated prompt-validator to accept both new and deprecated variable names
Closes #825
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
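For code graders migrating away from the removed outputText field, the extraction step might look roughly like the sketch below. The real helper is getTextContent() from @agentv/core; its signature is not shown in this PR, so textOf and lastAssistantText are hypothetical stand-ins, and the ContentBlock and Message shapes are assumptions made only for illustration.

```typescript
// Assumed shapes; the real types live in @agentv/core and may differ.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'image'; source: string }
  | { type: 'file'; path: string };

interface Message {
  role: 'user' | 'assistant' | 'tool' | 'system';
  content: string | ContentBlock[];
}

// Hypothetical stand-in for getTextContent(): joins the text blocks of a
// message's content and ignores image and file blocks.
function textOf(content: Message['content']): string {
  if (typeof content === 'string') return content;
  return content
    .filter((b): b is Extract<ContentBlock, { type: 'text' }> => b.type === 'text')
    .map((b) => b.text)
    .join('\n');
}

// Hypothetical helper: text of the last assistant message, or '' if none.
function lastAssistantText(output: Message[]): string {
  const last = [...output].reverse().find((m) => m.role === 'assistant');
  return last ? textOf(last.content) : '';
}
```

A grader that previously read outputText could call lastAssistantText(input.output) instead, assuming the grader input exposes the output messages under that name.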
Add multimodal support to the LLM grader evaluator. When agent output contains ContentImage blocks in assistant messages, they are automatically extracted and appended as image content parts to the judge model message.
Changes:
- extractImageBlocks(): scans assistant messages for ContentImage blocks
- toAiSdkImageParts(): converts ContentImage to Vercel AI SDK ImagePart
- runWithRetry(): accepts optional images; uses multi-part messages array when images are present, plain text prompt when not (backward compatible)
- evaluateFreeform/evaluateWithRubrics/evaluateWithScoreRanges: extract images from context.output and pass to runWithRetry
Follows Inspect AI's model_scoring_prompt() pattern: no template syntax changes needed; images are transparently appended after the rendered text.
Closes #820
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
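The extraction and conversion steps listed in this commit can be sketched as follows. This is not the PR's code: the ContentImage shape (a source string) and the local ImagePartLike type are assumptions that only mirror the type and image fields of the AI SDK's ImagePart, and the real conversion may treat URLs, data URIs, and media types differently.

```typescript
// Assumed shapes for illustration; the real types may differ.
type ContentImage = { type: 'image'; source: string };
type ContentBlock =
  | { type: 'text'; text: string }
  | ContentImage
  | { type: 'file'; path: string };

interface Message {
  role: 'user' | 'assistant' | 'tool' | 'system';
  content: string | ContentBlock[];
}

// Local stand-in for the AI SDK image part (assumption).
type ImagePartLike = { type: 'image'; image: string };

// Collect image blocks from assistant messages only; user and tool
// messages, string content, and text/file blocks are skipped.
function extractImageBlocks(messages: Message[]): ContentImage[] {
  const images: ContentImage[] = [];
  for (const message of messages) {
    if (message.role !== 'assistant' || typeof message.content === 'string') continue;
    for (const block of message.content) {
      if (block.type === 'image') images.push(block);
    }
  }
  return images;
}

// Map each extracted block to an image part, passing the source through.
function toAiSdkImageParts(images: ContentImage[]): ImagePartLike[] {
  return images.map((img): ImagePartLike => ({ type: 'image', image: img.source }));
}
```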
…Input
- Add ContentTextSchema, ContentImageSchema, ContentFileSchema, ContentSchema as Zod discriminated union in packages/eval/src/schemas.ts
- Update MessageSchema.content to accept string | Content[] (typed blocks)
- Add materializeContentForGrader() in code-evaluator.ts:
  - Data URI images decoded and written to temp files (path, not base64)
  - Non-URI images pass source through as path field
  - Text/file blocks unchanged; string content unchanged
  - Lazy temp dir creation for image files, cleaned up in finally block
- Export Content schemas and types from @agentv/eval
- Add comprehensive unit tests for schema validation and materialization
- Add integration tests for CodeEvaluator with multimodal output
Closes #821
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
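The materialization rules above can be illustrated with a small sketch. It is not the PR's materializeContentForGrader(): the block shapes are assumptions, and writing the decoded bytes to disk is delegated to a hypothetical writeTemp callback so the example stays free of filesystem code. The PR itself decodes into a lazily created temp directory and cleans it up in a finally block.

```typescript
// Assumed input and output shapes; the real schemas may differ.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'image'; source: string }
  | { type: 'file'; path: string };

type GraderBlock =
  | { type: 'text'; text: string }
  | { type: 'image'; path: string }
  | { type: 'file'; path: string };

// Matches base64 data URIs such as data:image/png;base64,AAAA
const DATA_URI = /^data:([^;,]+);base64,(.+)$/;

// writeTemp is a hypothetical callback: it receives the base64 payload and
// media type, persists the image somewhere, and returns the file path.
function materialize(
  content: string | ContentBlock[],
  writeTemp: (base64: string, mediaType: string) => string,
): string | GraderBlock[] {
  if (typeof content === 'string') return content; // string content unchanged
  return content.map((block): GraderBlock => {
    if (block.type !== 'image') return block; // text/file blocks unchanged
    const match = DATA_URI.exec(block.source);
    // Non-data-URI sources pass through as the path field.
    if (!match) return { type: 'image', path: block.source };
    const mediaType = match[1] ?? 'application/octet-stream';
    const base64 = match[2] ?? '';
    return { type: 'image', path: writeTemp(base64, mediaType) };
  });
}
```

Passing the writer in as a parameter is a choice made for this sketch only; it keeps the path-versus-base64 decision visible without committing to any particular temp-file layout.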
Closes #820
Closes #822
Summary
Adds multimodal support to the LLM grader evaluator. When agent output contains ContentImage blocks in assistant messages, they are automatically extracted and appended as image content parts to the judge model message, enabling vision-capable models to grade multimodal output.
Design
Follows Inspect AI's model_scoring_prompt() pattern:
- Images are extracted from context.output assistant messages
- The judge message content is sent as [{type: 'text', text: rendered_prompt}, ...image_content_parts]
No template syntax changes needed. Existing templates work identically. Images are transparently appended after the rendered text prompt.
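The branching between a plain prompt and a multi-part message can be sketched as below. buildJudgeRequest, JudgeRequest, and the part types are hypothetical names for illustration, and the 'user' role on the multi-part message is an assumption; the PR only states that images are appended to the judge model message after the rendered text.

```typescript
// Local stand-ins for the message part shapes (assumptions).
type TextPart = { type: 'text'; text: string };
type ImagePartLike = { type: 'image'; image: string };

// Either a plain text prompt or a single multi-part message.
type JudgeRequest =
  | { prompt: string }
  | { messages: Array<{ role: 'user'; content: Array<TextPart | ImagePartLike> }> };

// Hypothetical helper: keeps the text-only path untouched and appends
// image parts after the rendered prompt when any images are present.
function buildJudgeRequest(renderedPrompt: string, images: ImagePartLike[] = []): JudgeRequest {
  if (images.length === 0) {
    return { prompt: renderedPrompt };
  }
  return {
    messages: [
      {
        role: 'user',
        content: [{ type: 'text', text: renderedPrompt }, ...images],
      },
    ],
  };
}
```

Because the text-only branch returns the same shape as before, existing templates and text-only evaluations would be unaffected under this design.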
Changes
- extractImageBlocks(): scans assistant messages for ContentImage blocks
- toAiSdkImageParts(): converts ContentImage to Vercel AI SDK ImagePart format
- runWithRetry(): accepts optional images; uses multi-part messages array when images are present, plain text prompt when not (fully backward compatible)
- evaluateFreeform, evaluateWithRubrics, evaluateWithScoreRanges: extract images from context.output and pass to runWithRetry
Tests
10 new tests covering:
- extractImageBlocks unit tests (empty, text-only, single image, multiple images, non-assistant filtered, file blocks filtered)
- generateText calls (text-only → plain prompt, images → multi-part messages, multiple images, user/tool images ignored)
All 353 existing tests pass.
Depends on
- feat/825-template-vars
- feat/818-provider-preserve