
ROX-32888: Add evaluation docs and script to update it#99

Merged
mtodor merged 1 commit into main from mtodor/ROX-32888-add-eval-docs-and-update-script
Apr 2, 2026

Conversation


@mtodor mtodor commented Mar 31, 2026

Description

Create standardized documentation for LLM model evaluation results with the StackRox MCP server (docs/model-evaluation.md). The document covers evaluation methodology using the mcpchecker framework, per-model results tables with assertion pass/fail and token usage, and instructions for running evaluations and updating the documentation.

A companion automation script (scripts/update-model-evaluation.sh) parses mcpchecker JSON output using jq and generates/updates model result sections in the documentation using HTML comment markers for idempotent block replacement.
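The insert/update logic described above can be sketched as follows. This is a minimal illustration, not the actual script: the marker format, model ID, and table body are assumptions, and the real `scripts/update-model-evaluation.sh` may structure them differently.

```shell
#!/bin/sh
# Sketch: idempotent block replacement between HTML comment markers.
# Marker names and the demo document below are illustrative assumptions.
set -eu

DOC="$(mktemp)"
printf '# Model Evaluation\n\nIntro text.\n' > "$DOC"

MODEL_ID="example-model"
START="<!-- model-results:${MODEL_ID}:start -->"
END="<!-- model-results:${MODEL_ID}:end -->"

write_block() {
  # Replace the marked block if present, otherwise append it.
  body="$1"
  block="${START}
${body}
${END}"
  if grep -qF "$START" "$DOC"; then
    # Update path: swap everything between the markers (inclusive).
    awk -v start="$START" -v end="$END" -v block="$block" '
      $0 == start { print block; skip = 1; next }
      $0 == end   { skip = 0; next }
      !skip       { print }
    ' "$DOC" > "${DOC}.tmp" && mv "${DOC}.tmp" "$DOC"
  else
    # Insert path: append a fresh marked block to the document.
    printf '\n%s\n' "$block" >> "$DOC"
  fi
}

write_block "first version"   # insert path (new model)
write_block "second version"  # update path: replaces the first block
```

Because the replacement is keyed on the exact marker lines, running the script repeatedly for the same model converges to a single block rather than accumulating duplicates.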

New files:

  • docs/model-evaluation.md — Evaluation results and methodology documentation
  • scripts/update-model-evaluation.sh — Script to auto-update docs from mcpchecker JSON output

Validation

  • shellcheck scripts/update-model-evaluation.sh passes
  • Script tested: insert path (new model) and update path (existing model) both work
  • Script generates correct markdown table from mcpchecker JSON
  • make shell-lint passes
  • Documentation renders correctly on GitHub

AI-assisted development prompts

This PR was developed with AI assistance. Below are the prompts used:

Prompt: "Research the following task: Create documentation for LLM model evaluation results with the MCP server. Establish a standardized format for documenting which LLM models work well with the MCP tools. Provide instructions for updating evaluation results. Optionally, create a script to automate the documentation update process after evaluation runs. -> Create bead(s) required to do this task."
Response: Researched the codebase (MCP tools, eval infrastructure, mcpchecker setup), created 4 beads: 1 feature epic + 3 tasks (create docs, add README link, create automation script) with dependencies.

Prompt: "Outline stackrox-mcp-h9nu plan"
Response: Outlined document structure, key decisions (reference mcpchecker, map metrics to assertion fields, per-model sections), and implementation steps.

Prompt: "Show me an example of document. Just keep headers, no content."
Response: Provided skeleton with headers: Overview, Evaluation Methodology, MCP Tools Under Evaluation, Test Scenarios, Evaluation Metrics, Model Results, How to Run Evaluations, Recommendations.

Prompt: "Here is evaluation results example: mt_nogit_tmp/mcpchecker-stackrox-mcp-e2e-out.json - please investigate it and consider if doc should be formulated differently. We are lacking some advanced metrics. And I would avoid having listing additionally tools, etc."
Response: Analyzed actual JSON output, identified that metrics are simple pass/fail (not percentages), revised skeleton to be results-table-centric with tools shown inline, dropped standalone tools/metrics/scenarios sections.

Prompt: "this looks good. you can close beads and outline plan for the next one"
Response: Closed docs bead, outlined plans for README link task and automation script task.

Prompt: "Let's refine: stackrox-mcp-bvl4 - are there other alternatives instead of using python3? i.e. jq. mcpchecker also support outputting summary in JSON format. Check .github/workflows/e2e.yml for inspiration. Output to stdout - it would be good if script could automatically update docs using comment markers."
Response: Researched e2e.yml workflow and mcpchecker summary, confirmed jq works for all needed fields, refined plan to use jq + HTML comment markers for idempotent block replacement.

Prompt: "Actually I think we don't need --model-name. Update task"
Response: Removed --model-name parameter, simplified to just --model-id and --results.

Prompt: "Ye start with the implementation"
Response: Created scripts/update-model-evaluation.sh using jq, passed shellcheck, tested insert and update paths, updated docs to reference the script.

Prompt: "Remove: Tools Used and Calls columns. We also want Input tokens and Output tokens. Total tokens are not relevant. Configure the agent model - you can change it to export MODEL_NAME=gpt-5-nano"
Response: Updated script table format to remove Tools Used/Calls, added Input/Output Tokens columns, changed agent model config to use environment variable.
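The jq step that the refinements above converge on can be sketched like this. The JSON field names (`.summary.*`, `.agent_tokens.*`) are hypothetical stand-ins, not the real mcpchecker schema; the point is emitting a markdown row with the final columns (tasks, assertions, input tokens, output tokens).

```shell
# Hypothetical sketch: build one markdown results row from a
# mcpchecker-style JSON summary with jq. All field names below are
# illustrative assumptions about the summary format.
SUMMARY="$(mktemp)"
cat > "$SUMMARY" <<'EOF'
{
  "summary": {"tasks_passed": 11, "tasks_total": 11,
              "assertions_passed": 31, "assertions_total": 32},
  "agent_tokens": {"input": 13687, "output": 21224}
}
EOF

jq -r '
  "| \(.summary.tasks_passed)/\(.summary.tasks_total) " +
  "| \(.summary.assertions_passed)/\(.summary.assertions_total) " +
  "| \(.agent_tokens.input) | \(.agent_tokens.output) |"
' "$SUMMARY"
# → | 11/11 | 31/32 | 13687 | 21224 |
```

Using jq string interpolation keeps the whole transformation in one dependency-free pipeline, which is why it was preferred over python3 in the refinement above.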

@mtodor mtodor requested a review from janisz as a code owner March 31, 2026 16:32

codecov-commenter commented Mar 31, 2026

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|-----------------|--------|--------|---------|
| 361             | 2      | 359    | 12      |
2 flaky test(s):
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 6 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 6 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3



github-actions bot commented Mar 31, 2026

E2E Test Results

Commit: b8366e6
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ~ cve-cluster-does-not-exist (assertions: 2/3)
      - ToolsUsed: Required tool not called: server=stackrox-mcp, tool=, pattern=list_clusters
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)

Tasks:      11/11 passed (100.00%)
Assertions: 31/32 passed (96.88%)
Tokens:     ~55135 (estimate - excludes system prompt & cache)
MCP schemas: ~12738 (included in token total)
Agent used tokens:
  Input:  13687 tokens
  Output: 21224 tokens
Judge used tokens:
  Input:  40350 tokens
  Output: 38542 tokens


@janisz janisz left a comment


Maybe we can run it on CI with a label or manual job dispatch?

@mtodor mtodor force-pushed the mtodor/ROX-32888-add-eval-docs-and-update-script branch from 52677d3 to b8366e6 Compare April 1, 2026 16:41
@mtodor mtodor merged commit 077457b into main Apr 2, 2026
11 checks passed
@mtodor mtodor deleted the mtodor/ROX-32888-add-eval-docs-and-update-script branch April 2, 2026 08:17