Commit 077457b (1 parent: 89fdf89)

ROX-32888: Add evaluation docs and script to update it (#99)
File tree: 2 files changed (+296, −0 lines)

docs/model-evaluation.md

Lines changed: 133 additions & 0 deletions
# LLM Model Evaluation Results

## Overview

This document tracks evaluation results of LLM models used with the StackRox MCP server. Evaluations measure how well a model selects the correct MCP tools, passes appropriate parameters, stays within expected tool call bounds, and produces accurate responses.

All evaluations use the [mcpchecker](https://github.com/mcpchecker/mcpchecker) framework against a deterministic WireMock-based mock backend, ensuring reproducible results across runs.

## Evaluation Methodology

### Test Framework

Evaluations are run using **mcpchecker**, configured in [`e2e-tests/mcpchecker/eval.yaml`](../e2e-tests/mcpchecker/eval.yaml). The framework:

1. Sends a natural language prompt to the model under test
2. Lets the model interact with the MCP server (tool calls, parameter selection)
3. Validates tool usage against expected behavior via assertions
4. Has an LLM judge evaluate response quality against reference answers

### Test Environment

- **Backend**: WireMock mock server with deterministic fixtures (no live StackRox Central required)
- **MCP Config**: [`e2e-tests/mcpchecker/mcp-config-mock.yaml`](../e2e-tests/mcpchecker/mcp-config-mock.yaml)
- **Task definitions**: [`e2e-tests/mcpchecker/tasks/`](../e2e-tests/mcpchecker/tasks/)

### Assertions

Each task defines assertions from the following set:

| Assertion | Description |
|-----------|-------------|
| `toolsUsed` | Required tool(s) must be called, optionally with matching arguments (`argumentsMatch`) |
| `minToolCalls` | Minimum total tool calls across all tools |
| `maxToolCalls` | Maximum total tool calls (prevents runaway tool usage) |

A task passes when **all** its assertions pass **and** the LLM judge approves the response.
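The overall pass percentage in the result summaries comes from shell integer arithmetic in the update script (`scripts/update-model-evaluation.sh`), so the value is truncated rather than rounded: 10/11 is reported as 90%, not 91%. A minimal sketch of that arithmetic:

```shell
# Pass-rate arithmetic as used by scripts/update-model-evaluation.sh.
# Bash $(( )) performs integer division, so the percentage is truncated.
passed=10
total=11
pct=$((100 * passed / total))
echo "${passed}/${total} tasks passed (${pct}%)"   # -> 10/11 tasks passed (90%)
```

This also explains why 9/11 appears as 81% rather than 82% in the tables below.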
## Evaluation Results

<!-- model:gpt-5-mini start -->

### gpt-5-mini — 2026-03-31

**Overall: 10/11 tasks passed (90%)**

#### Task Results

| # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
|---|------|--------|-----------|----------|----------|--------------|---------------|
| 1 | list-clusters | Pass | Pass | Pass | Pass | 1728 | 962 |
| 2 | cve-detected-workloads | Pass | Pass | Pass | Pass | 565 | 1187 |
| 3 | cve-detected-clusters | Pass | **Fail** | Pass | Pass | 640 | 1998 |
| 4 | cve-nonexistent | Pass | Pass | Pass | Pass | 1077 | 2605 |
| 5 | cve-cluster-does-exist | **Fail** | Pass | Pass | Pass | 539 | 1285 |
| 6 | cve-cluster-does-not-exist | Pass | **Fail** | Pass | Pass | 1528 | 1324 |
| 7 | cve-clusters-general | Pass | Pass | Pass | Pass | 796 | 2304 |
| 8 | cve-cluster-list | Pass | Pass | Pass | Pass | 488 | 1917 |
| 9 | cve-log4shell | Pass | Pass | Pass | Pass | 1008 | 2936 |
| 10 | cve-multiple | Pass | Pass | Pass | Pass | 1142 | 2493 |
| 11 | rhsa-not-supported | Pass | — | Pass | Pass | 650 | 2488 |

**Total input tokens**: 10161 | **Total output tokens**: 21499

<!-- model:gpt-5-mini end -->
<!-- model:gpt-5 start -->

### gpt-5 — 2026-03-31

**Overall: 9/11 tasks passed (81%)**

#### Task Results

| # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
|---|------|--------|-----------|----------|----------|--------------|---------------|
| 1 | list-clusters | Pass | Pass | Pass | Pass | 1720 | 552 |
| 2 | cve-detected-workloads | Pass | Pass | Pass | Pass | 1589 | 1003 |
| 3 | cve-detected-clusters | Pass | Pass | Pass | Pass | 521 | 1702 |
| 4 | cve-nonexistent | **Fail** | Pass | Pass | Pass | 2406 | 2085 |
| 5 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 1563 | 1682 |
| 6 | cve-cluster-does-not-exist | **Fail** | **Fail** | Pass | Pass | 504 | 1868 |
| 7 | cve-clusters-general | Pass | Pass | Pass | Pass | 516 | 1477 |
| 8 | cve-cluster-list | Pass | Pass | Pass | Pass | 706 | 1964 |
| 9 | cve-log4shell | Pass | Pass | Pass | Pass | 1008 | 2304 |
| 10 | cve-multiple | Pass | Pass | Pass | Pass | 2166 | 2492 |
| 11 | rhsa-not-supported | Pass | — | Pass | Pass | 818 | 2187 |

**Total input tokens**: 13517 | **Total output tokens**: 19316

<!-- model:gpt-5 end -->
## How to Run Evaluations

### Prerequisites

- Go 1.25+
- LLM judge credentials configured via environment variables (see below)

### Running an Evaluation

1. **Configure the agent model** via environment variable or in `e2e-tests/mcpchecker/eval.yaml`:

   ```bash
   export MODEL_NAME=gpt-5-nano
   ```

2. **Set judge environment variables**:

   ```bash
   export JUDGE_TYPE=openai
   export JUDGE_API_KEY=<your-key>
   export JUDGE_MODEL_NAME=<judge-model>
   ```

3. **Run the evaluation**:

   ```bash
   make e2e-test
   ```

4. **Update this document** with the results:

   ```bash
   ./scripts/update-model-evaluation.sh \
     --model-id <model-id> \
     --results e2e-tests/mcpchecker/mcpchecker-stackrox-mcp-e2e-out.json
   ```

The script generates a markdown section with the task results table and inserts or updates it in this document using HTML comment markers. If results for the given `--model-id` already exist, the script replaces the existing section. Otherwise, it appends a new section.
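For repeated runs, the steps above can be chained in a small wrapper. The script below is a sketch, not part of the repository: the hypothetical `run` helper only prints each command (a dry run) so the sequence can be reviewed; replace its body with `"$@"` to actually execute, and export the judge variables from step 2 before a real run.

```shell
#!/bin/bash
# Hypothetical dry-run wrapper for the evaluation workflow described above.
set -euo pipefail

# Step 1: agent model, defaulting to the example from the instructions.
MODEL_NAME="${MODEL_NAME:-gpt-5-nano}"
RESULTS="e2e-tests/mcpchecker/mcpchecker-stackrox-mcp-e2e-out.json"

# Print instead of execute; swap the echo for "$@" to run the commands.
run() { echo "would run: $*"; }

# Steps 3 and 4: run the evaluation, then update the results document.
run make e2e-test
run ./scripts/update-model-evaluation.sh --model-id "${MODEL_NAME}" --results "${RESULTS}"
```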

scripts/update-model-evaluation.sh

Lines changed: 163 additions & 0 deletions
```bash
#!/bin/bash
set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT_DIR="$(dirname "${SCRIPT_DIR}")"
DOC_FILE="${ROOT_DIR}/docs/model-evaluation.md"

# Validate required tools first
if ! command -v jq &> /dev/null; then
    echo "Error: jq is required but not installed"
    exit 1
fi

usage() {
    echo "Usage: $0 --model-id <id> --results <json-file>"
    echo ""
    echo "Update docs/model-evaluation.md with evaluation results from mcpchecker JSON output."
    echo ""
    echo "Options:"
    echo "  --model-id    Model identifier (e.g. gpt-5-mini)"
    echo "  --results     Path to mcpchecker JSON results file"
    echo "  -h, --help    Show this help message"
    echo ""
    echo "Examples:"
    echo "  $0 --model-id gpt-5 --results e2e-tests/mcpchecker/mcpchecker-stackrox-mcp-e2e-out.json"
    exit 1
}

MODEL_ID=""
RESULTS_FILE=""

while [[ $# -gt 0 ]]; do
    case "$1" in
        --model-id)
            MODEL_ID="$2"
            shift 2
            ;;
        --results)
            RESULTS_FILE="$2"
            shift 2
            ;;
        -h|--help)
            usage
            ;;
        *)
            echo "Error: unknown option '$1'"
            usage
            ;;
    esac
done

if [[ -z "${MODEL_ID}" ]]; then
    echo "Error: --model-id is required"
    usage
fi

if [[ -z "${RESULTS_FILE}" ]]; then
    echo "Error: --results is required"
    usage
fi

if [[ ! -f "${RESULTS_FILE}" ]]; then
    echo "Error: results file not found: ${RESULTS_FILE}"
    exit 1
fi

if [[ ! -f "${DOC_FILE}" ]]; then
    echo "Error: documentation file not found: ${DOC_FILE}"
    exit 1
fi

TODAY=$(date +%Y-%m-%d)
START_MARKER="<!-- model:${MODEL_ID} start -->"
END_MARKER="<!-- model:${MODEL_ID} end -->"

# Generate the markdown block
generate_block() {
    local total passed
    total=$(jq 'length' "${RESULTS_FILE}")
    passed=$(jq '[.[] | select(.taskPassed == true)] | length' "${RESULTS_FILE}")
    local pct=$((100 * passed / total))

    echo "${START_MARKER}"
    echo ""
    # Separator matches the "### <model> — <date>" heading format used in the doc.
    echo "### ${MODEL_ID} — ${TODAY}"
    echo ""
    echo "**Overall: ${passed}/${total} tasks passed (${pct}%)**"
    echo ""
    echo "#### Task Results"
    echo ""
    echo "| # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |"
    echo "|---|------|--------|-----------|----------|----------|--------------|---------------|"

    # Generate table rows
    jq -r '
        to_entries[] |
        .key as $i |
        .value |
        ($i + 1) as $num |
        .taskName as $name |
        (if .taskPassed then "Pass" else "**Fail**" end) as $result |
        (.assertionResults.toolsUsed // null) as $tu |
        (.assertionResults.minToolCalls // null) as $min |
        (.assertionResults.maxToolCalls // null) as $max |
        (if $tu == null then "\u2014"
         elif $tu.passed then "Pass"
         else "**Fail**"
         end) as $tuStr |
        (if $min == null then "\u2014"
         elif $min.passed then "Pass"
         else "**Fail**"
         end) as $minStr |
        (if $max == null then "\u2014"
         elif $max.passed then "Pass"
         else "**Fail**"
         end) as $maxStr |
        (.tokenEstimate.inputTokens) as $inputTokens |
        (.tokenEstimate.outputTokens) as $outputTokens |
        "| \($num) | \($name) | \($result) | \($tuStr) | \($minStr) | \($maxStr) | \($inputTokens) | \($outputTokens) |"
    ' "${RESULTS_FILE}"

    echo ""

    # Token totals
    local input_tokens output_tokens
    input_tokens=$(jq '[.[].tokenEstimate.inputTokens] | add' "${RESULTS_FILE}")
    output_tokens=$(jq '[.[].tokenEstimate.outputTokens] | add' "${RESULTS_FILE}")
    echo "**Total input tokens**: ${input_tokens} | **Total output tokens**: ${output_tokens}"
    echo ""
    echo "${END_MARKER}"
}

BLOCKFILE=$(mktemp)
TMPFILE=$(mktemp)
cleanup() { rm -f "${BLOCKFILE}" "${TMPFILE}"; }
trap cleanup EXIT

# shellcheck disable=SC2311
generate_block > "${BLOCKFILE}"

if grep -qF "${START_MARKER}" "${DOC_FILE}"; then
    # Update existing block: replace lines between markers (inclusive) with new block
    awk -v start="${START_MARKER}" -v end="${END_MARKER}" -v blockfile="${BLOCKFILE}" '
        $0 == start { skip=1; while ((getline line < blockfile) > 0) print line; next }
        $0 == end   { skip=0; next }
        !skip       { print }
    ' "${DOC_FILE}" > "${TMPFILE}"
    mv "${TMPFILE}" "${DOC_FILE}"

    echo "Updated existing results for ${MODEL_ID} in ${DOC_FILE}"
else
    # Insert new block before "## How to Run Evaluations"
    awk -v blockfile="${BLOCKFILE}" '
        /^## How to Run Evaluations/ {
            while ((getline line < blockfile) > 0) print line
            print ""
        }
        { print }
    ' "${DOC_FILE}" > "${TMPFILE}"
    mv "${TMPFILE}" "${DOC_FILE}"

    echo "Added new results for ${MODEL_ID} to ${DOC_FILE}"
fi
```
