
ROX-32888: Add evaluation docs and script to update it#99

Merged
mtodor merged 1 commit into main from mtodor/ROX-32888-add-eval-docs-and-update-script
Apr 2, 2026

Conversation


@mtodor mtodor commented Mar 31, 2026

Description

Create standardized documentation for LLM model evaluation results with the StackRox MCP server (docs/model-evaluation.md). The document covers evaluation methodology using the mcpchecker framework, per-model results tables with assertion pass/fail and token usage, and instructions for running evaluations and updating the documentation.

A companion automation script (scripts/update-model-evaluation.sh) parses mcpchecker JSON output using jq and generates/updates model result sections in the documentation using HTML comment markers for idempotent block replacement.
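The insert/update logic described above can be sketched as follows. This is a minimal illustration, not the actual script: the marker format, model ID, and table body are assumptions, and the real `scripts/update-model-evaluation.sh` may structure them differently.

```shell
#!/bin/sh
# Sketch: idempotent block replacement between HTML comment markers.
# Marker names and the demo document below are illustrative assumptions.
set -eu

DOC="$(mktemp)"
printf '# Model Evaluation\n\nIntro text.\n' > "$DOC"

MODEL_ID="example-model"
START="<!-- model-results:${MODEL_ID}:start -->"
END="<!-- model-results:${MODEL_ID}:end -->"

write_block() {
  # Replace the marked block if present, otherwise append it.
  body="$1"
  block="${START}
${body}
${END}"
  if grep -qF "$START" "$DOC"; then
    # Update path: swap everything between the markers (inclusive).
    awk -v start="$START" -v end="$END" -v block="$block" '
      $0 == start { print block; skip = 1; next }
      $0 == end   { skip = 0; next }
      !skip       { print }
    ' "$DOC" > "${DOC}.tmp" && mv "${DOC}.tmp" "$DOC"
  else
    # Insert path: append a fresh marked block to the document.
    printf '\n%s\n' "$block" >> "$DOC"
  fi
}

write_block "first version"   # insert path (new model)
write_block "second version"  # update path: replaces the first block
```

Because the replacement is keyed on the exact marker lines, running the script repeatedly for the same model converges to a single block rather than accumulating duplicates.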

New files:

  • docs/model-evaluation.md — Evaluation results and methodology documentation
  • scripts/update-model-evaluation.sh — Script to auto-update docs from mcpchecker JSON output

Validation

  • shellcheck scripts/update-model-evaluation.sh passes
  • Script tested: insert path (new model) and update path (existing model) both work
  • Script generates correct markdown table from mcpchecker JSON
  • make shell-lint passes
  • Documentation renders correctly on GitHub

AI-assisted development prompts

This PR was developed with AI assistance. Below are the prompts used:

Prompt: "Research the following task: Create documentation for LLM model evaluation results with the MCP server. Establish a standardized format for documenting which LLM models work well with the MCP tools. Provide instructions for updating evaluation results. Optionally, create a script to automate the documentation update process after evaluation runs. -> Create bead(s) required to do this task."
Response: Researched the codebase (MCP tools, eval infrastructure, mcpchecker setup), created 4 beads: 1 feature epic + 3 tasks (create docs, add README link, create automation script) with dependencies.

Prompt: "Outline stackrox-mcp-h9nu plan"
Response: Outlined document structure, key decisions (reference mcpchecker, map metrics to assertion fields, per-model sections), and implementation steps.

Prompt: "Show me an example of document. Just keep headers, no content."
Response: Provided skeleton with headers: Overview, Evaluation Methodology, MCP Tools Under Evaluation, Test Scenarios, Evaluation Metrics, Model Results, How to Run Evaluations, Recommendations.

Prompt: "Here is evaluation results example: mt_nogit_tmp/mcpchecker-stackrox-mcp-e2e-out.json - please investigate it and consider if doc should be formulated differently. We are lacking some advanced metrics. And I would avoid having listing additionally tools, etc."
Response: Analyzed actual JSON output, identified that metrics are simple pass/fail (not percentages), revised skeleton to be results-table-centric with tools shown inline, dropped standalone tools/metrics/scenarios sections.

Prompt: "this looks good. you can close beads and outline plan for the next one"
Response: Closed docs bead, outlined plans for README link task and automation script task.

Prompt: "Let's refine: stackrox-mcp-bvl4 - are there other alternatives instead of using python3? i.e. jq. mcpchecker also support outputting summary in JSON format. Check .github/workflows/e2e.yml for inspiration. Output to stdout - it would be good if script could automatically update docs using comment markers."
Response: Researched e2e.yml workflow and mcpchecker summary, confirmed jq works for all needed fields, refined plan to use jq + HTML comment markers for idempotent block replacement.

Prompt: "Actually I think we don't need --model-name. Update task"
Response: Removed --model-name parameter, simplified to just --model-id and --results.

Prompt: "Ye start with the implementation"
Response: Created scripts/update-model-evaluation.sh using jq, passed shellcheck, tested insert and update paths, updated docs to reference the script.

Prompt: "Remove: Tools Used and Calls columns. We also want Input tokens and Output tokens. Total tokens are not relevant. Configure the agent model - you can change it to export MODEL_NAME=gpt-5-nano"
Response: Updated script table format to remove Tools Used/Calls, added Input/Output Tokens columns, changed agent model config to use environment variable.
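The jq step that the refinements above converge on can be sketched like this. The JSON field names (`.summary.*`, `.agent_tokens.*`) are hypothetical stand-ins, not the real mcpchecker schema; the point is emitting a markdown row with the final columns (tasks, assertions, input tokens, output tokens).

```shell
# Hypothetical sketch: build one markdown results row from a
# mcpchecker-style JSON summary with jq. All field names below are
# illustrative assumptions about the summary format.
SUMMARY="$(mktemp)"
cat > "$SUMMARY" <<'EOF'
{
  "summary": {"tasks_passed": 11, "tasks_total": 11,
              "assertions_passed": 31, "assertions_total": 32},
  "agent_tokens": {"input": 13687, "output": 21224}
}
EOF

jq -r '
  "| \(.summary.tasks_passed)/\(.summary.tasks_total) " +
  "| \(.summary.assertions_passed)/\(.summary.assertions_total) " +
  "| \(.agent_tokens.input) | \(.agent_tokens.output) |"
' "$SUMMARY"
# → | 11/11 | 31/32 | 13687 | 21224 |
```

Using jq string interpolation keeps the whole transformation in one dependency-free pipeline, which is why it was preferred over python3 in the refinement above.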

@mtodor mtodor requested a review from janisz as a code owner March 31, 2026 16:32

codecov-commenter commented Mar 31, 2026

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|-----------------|--------|--------|---------|
| 361             | 2      | 359    | 12      |
2 flaky test(s):
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 6 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 6 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3



github-actions bot commented Mar 31, 2026

E2E Test Results

Commit: b8366e6
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ~ cve-cluster-does-not-exist (assertions: 2/3)
      - ToolsUsed: Required tool not called: server=stackrox-mcp, tool=, pattern=list_clusters
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)

Tasks:      11/11 passed (100.00%)
Assertions: 31/32 passed (96.88%)
Tokens:     ~55135 (estimate - excludes system prompt & cache)
MCP schemas: ~12738 (included in token total)
Agent used tokens:
  Input:  13687 tokens
  Output: 21224 tokens
Judge used tokens:
  Input:  40350 tokens
  Output: 38542 tokens


@janisz janisz left a comment


Maybe we can run it on CI with a label or manual job dispatch?

@mtodor mtodor force-pushed the mtodor/ROX-32888-add-eval-docs-and-update-script branch from 52677d3 to b8366e6 Compare April 1, 2026 16:41
@mtodor mtodor merged commit 077457b into main Apr 2, 2026
11 checks passed
@mtodor mtodor deleted the mtodor/ROX-32888-add-eval-docs-and-update-script branch April 2, 2026 08:17