Commit bed5094
Overhaul judge and criteria for E2E testing with CLI agent reviewers
Major changes:
Judge: Replaced CodebuffClient SDK-based LLM judges with real CLI coding
agents (Claude Code, Codex, Gemini) that run IN the repo. Reviewer agents
can build, run tests, start the dev server, use browser tools, curl
endpoints, check logs — actual E2E verification, not just diff reading.
Structured output via result file (evalbuff-review-result.json) with
fallback to stdout JSON extraction.
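A minimal sketch of the result-file-then-stdout fallback described above, assuming the reviewer writes a single JSON object. Only the file name evalbuff-review-result.json comes from this commit; the `ReviewResult` shape and the helper names `readReviewResult` and `extractJsonFromStdout` are hypothetical and may not match the actual implementation.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical result shape; the real schema is not shown in this commit.
interface ReviewResult {
  [key: string]: unknown;
}

// File name taken from the commit message.
const RESULT_FILE = "evalbuff-review-result.json";

// Scan stdout for the first "{" that begins a parseable JSON object,
// preferring the longest candidate span for each starting position.
function extractJsonFromStdout(stdout: string): ReviewResult | null {
  let start = stdout.indexOf("{");
  while (start !== -1) {
    let end = stdout.lastIndexOf("}");
    while (end > start) {
      try {
        return JSON.parse(stdout.slice(start, end + 1)) as ReviewResult;
      } catch {
        // This span is not valid JSON; try a shorter one.
      }
      end = stdout.lastIndexOf("}", end - 1);
    }
    start = stdout.indexOf("{", start + 1);
  }
  return null;
}

// Prefer the result file written inside the repo; fall back to stdout
// when the file is missing or malformed.
function readReviewResult(repoDir: string, stdout: string): ReviewResult | null {
  try {
    const raw = fs.readFileSync(path.join(repoDir, RESULT_FILE), "utf8");
    return JSON.parse(raw) as ReviewResult;
  } catch {
    return extractJsonFromStdout(stdout);
  }
}
```

Reading the file first keeps the structured path independent of whatever else the agent prints; the stdout scan is only a best-effort recovery.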
Criteria: Shifted from code style (correctness, completeness, pattern
consistency, fluency) to E2E verification levels:
- L1: Builds, existing tests pass, basic completeness
- L2: Feature works E2E (browser/curl/client), logs clean
- L3: Edge cases & error states tested E2E, UI verification
- L4: Cross-component integration, performance, no regressions
- L5: Production readiness (migrations, env vars, error recovery)
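One way the five levels above could be represented as data, as a sketch only. The level descriptions are copied from this commit message; the names `VERIFICATION_LEVELS` and `formatCriteria`, and the idea of rendering levels up to a chosen ceiling, are assumptions for illustration and are not taken from the diff.

```typescript
type Level = 1 | 2 | 3 | 4 | 5;

// Descriptions copied from the commit message.
const VERIFICATION_LEVELS: Record<Level, string> = {
  1: "Builds, existing tests pass, basic completeness",
  2: "Feature works E2E (browser/curl/client), logs clean",
  3: "Edge cases & error states tested E2E, UI verification",
  4: "Cross-component integration, performance, no regressions",
  5: "Production readiness (migrations, env vars, error recovery)",
};

// Hypothetical helper: render every level from L1 up to `upTo`
// as one line each, e.g. for inclusion in a reviewer prompt.
function formatCriteria(upTo: Level): string {
  const lines: string[] = [];
  for (let n = 1; n <= upTo; n++) {
    lines.push(`L${n}: ${VERIFICATION_LEVELS[n as Level]}`);
  }
  return lines.join("\n");
}
```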
Orchestrator: Judge now runs inside the withTestRepo callback so reviewer
agents have access to the live repo. CodebuffClient is now used only for
the doc writer (analyzeFailure). Added a --reviewers CLI flag.
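A sketch of how a --reviewers flag might be parsed. The flag name comes from this commit, but the value format (a comma-separated list), the identifiers "claude", "codex" and "gemini", the default, and the function name `parseReviewers` are all assumptions; the actual CLI may differ.

```typescript
// Hypothetical reviewer identifiers for Claude Code, Codex and Gemini.
const KNOWN_REVIEWERS = ["claude", "codex", "gemini"] as const;
type Reviewer = (typeof KNOWN_REVIEWERS)[number];

// Parse `--reviewers a,b,c` from an argv-style array.
// Returns a copy of `fallback` when the flag or its value is absent.
function parseReviewers(
  argv: readonly string[],
  fallback: readonly Reviewer[] = ["claude"],
): Reviewer[] {
  const i = argv.indexOf("--reviewers");
  const raw = i === -1 ? undefined : argv[i + 1];
  if (raw === undefined) return [...fallback];

  const names = raw
    .split(",")
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s.length > 0);

  const known: readonly string[] = KNOWN_REVIEWERS;
  const unknown = names.filter((n) => known.indexOf(n) === -1);
  if (unknown.length > 0) {
    throw new Error(`Unknown reviewer(s): ${unknown.join(", ")}`);
  }

  // Drop duplicates while preserving the order given on the command line.
  return names.filter((n, idx) => names.indexOf(n) === idx) as Reviewer[];
}
```

Rejecting unknown names up front avoids spawning a reviewer process that cannot run.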
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1a754ce, commit bed5094
File tree
7 files changed, +507 -291 lines changed
- evals/evalbuff
- __tests__