Skip to content

Commit c4a7500

Browse files
committed
Update benchmark methodology and baselines for accuracy
Previous baselines (5 runs) were noisy in the 4-10ms range, causing false regression signals. Changes: - record.sh defaults: 5→10 runs, 2→5 warmup - baselines.md: re-measured with individual 10-run benchmarks, added measurement methodology section - CLAUDE.md: commit gate uses screening + individual verification - bench/README.md: updated cross-language comparison (2026-02-25)
1 parent b4c7077 commit c4a7500

File tree

6 files changed

+311
-252
lines changed

6 files changed

+311
-252
lines changed

.claude/CLAUDE.md

Lines changed: 12 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -172,11 +172,14 @@ Run before every commit:
172172
- Wasm engine changes go in zwasm repo (`../zwasm/`), not CW
173173
- `bash bench/wasm_bench.sh --quick` — verify wasm benchmarks still work
174174
8. **Non-functional regression** (when changing execution code: src/ core files):
175-
- **Binary size**: `stat -f%z zig-out/bin/cljw` — ≤ 4.8MB
176-
- **Startup**: `hyperfine -N --warmup 3 --runs 5 './zig-out/bin/cljw -e nil'` — ≤ 6ms
175+
- **Binary size**: `ls -la zig-out/bin/cljw` — ≤ 5.0MB
176+
- **Startup**: `hyperfine -N --warmup 5 --runs 10 './zig-out/bin/cljw -e nil'` — ≤ 6ms
177177
- **RSS**: `/usr/bin/time -l ./zig-out/bin/cljw -e nil 2>&1 | grep 'maximum resident'` — ≤ 10MB
178-
- **Benchmarks**: `bash bench/run_bench.sh --quick` — no CW benchmark > 1.2x baseline
179-
- **Hard block**: Do NOT commit if any threshold exceeded.
178+
- **Benchmarks (screening)**: `bash bench/run_bench.sh` — quick sequential check
179+
- **Benchmarks (verify)**: If screening shows >1.2x, re-measure individually:
180+
`bash bench/run_bench.sh --bench=NAME --runs=10 --warmup=5`
181+
Only the individual measurement is authoritative (sequential runs suffer thermal throttling).
182+
- **Hard block**: Do NOT commit if any individual benchmark > 1.2x baseline.
180183
Benchmark regression → stop, profile, fix in place or insert optimization phase first.
181184
- Baselines & policy: `.dev/baselines.md`.
182185
9. **Zone check** (when modifying src/**/*.zig):
@@ -214,9 +217,10 @@ zig build test -- "X" # Specific test only
214217
All measurement uses hyperfine (warmup + multiple runs).
215218

216219
```bash
217-
bash bench/run_bench.sh # All benchmarks (3 runs + 1 warmup)
220+
bash bench/run_bench.sh # All benchmarks (3 runs + 1 warmup) — screening only
218221
bash bench/run_bench.sh --quick # Fast check (1 run, no warmup)
219-
bash bench/record.sh --id="X" --reason="description" # Record to history
222+
bash bench/run_bench.sh --bench=NAME --runs=10 --warmup=5 # Individual (accurate)
223+
bash bench/record.sh --id="X" --reason="description" # Record to history (10 runs)
220224
bash bench/compare_langs.sh --bench=fib_recursive --lang=cw,c,bb # Cross-language
221225
bash bench/wasm_bench.sh --quick # CW interpreter vs wasmtime JIT
222226
```
@@ -225,6 +229,8 @@ History: `bench/history.yaml` — CW native benchmark progression.
225229
Wasm history: `bench/wasm_history.yaml` — CW vs wasmtime wasm benchmark progression.
226230
**Record after every optimization task.** Use task ID as entry id (e.g. "36.7").
227231
**Regression check on execution code changes.** See Commit Gate #8 and `.dev/baselines.md`.
232+
**Baseline accuracy**: Sequential full-suite runs cause thermal throttling.
233+
For accurate baselines, measure each benchmark individually with 10+ runs.
228234

229235
## Notice
230236

.dev/baselines.md

Lines changed: 48 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,14 @@
11
# Non-Functional Baselines
22

3-
Measured on: 2026-02-21 (post All-Zig Migration, Phase B.16 + C.1)
3+
Measured on: 2026-02-25 (v0.4.0 + GPA leak fix + JIT register fix)
44
Platform: macOS ARM64 (Apple M4 Pro), Zig 0.15.2
55
Binary: ReleaseSafe
66

77
## Profiles
88

99
| Profile | Binary | Startup | RSS | Notes |
1010
|---------|--------|---------|-----|-------|
11-
| wasm=true (default) | 4.52MB | 4.2ms | 7.6MB | Full feature set |
11+
| wasm=true (default) | 4.76MB | 4.5ms | 7.9MB | Full feature set |
1212
| wasm=false | (not measured) ||| No zwasm dependency |
1313

1414
## Thresholds
@@ -19,10 +19,10 @@ Phase E optimization target: reduce back toward 4.3MB.
1919

2020
| Metric | Baseline | Threshold | Margin | How to measure |
2121
|---------------------|------------|------------|--------|---------------------------------------------|
22-
| Binary size | 4.52 MB | 4.8 MB | +6% | `ls -la zig-out/bin/cljw` (after ReleaseSafe build) |
23-
| Startup time | 4.2 ms | 6.0 ms | 1.4x | `hyperfine -N --warmup 5 --runs 10 './zig-out/bin/cljw -e nil'` |
24-
| RSS (light) | 7.6 MB | 10 MB | +32% | `/usr/bin/time -l ./zig-out/bin/cljw -e nil 2>&1 \| grep 'maximum resident'` |
25-
| Benchmark (any) | see below | 1.2x | +20% | `bash bench/run_bench.sh --quick` |
22+
| Binary size | 4.76 MB | 5.0 MB | +5% | `ls -la zig-out/bin/cljw` (after ReleaseSafe build) |
23+
| Startup time | 4.5 ms | 6.0 ms | 1.3x | `hyperfine -N --warmup 5 --runs 10 './zig-out/bin/cljw -e nil'` |
24+
| RSS (light) | 7.9 MB | 10 MB | +27% | `/usr/bin/time -l ./zig-out/bin/cljw -e nil 2>&1 \| grep 'maximum resident'` |
25+
| Benchmark (any) | see below | 1.2x | +20% | Per-benchmark: `bash bench/run_bench.sh --bench=NAME --runs=10 --warmup=5` |
2626

2727
## `cljw build` Artifact Baselines (2026-02-20)
2828

@@ -50,35 +50,60 @@ If any benchmark exceeds 1.2x baseline:
5050

5151
Never accept "this feature needs to be slower" — find a way to keep it fast.
5252

53-
## Benchmark Baselines (2026-02-21, post All-Zig, hyperfine 5 runs)
53+
## Benchmark Baselines (2026-02-25, individual 10 runs + 5 warmup)
5454

55-
Source: `bench/history.yaml` entry `B.16`.
55+
Source: `bench/history.yaml` entry `v0.4.0-fix`.
5656

5757
| Benchmark | Time (ms) | Ceiling (ms) |
5858
|------------------------|-----------|--------------|
5959
| fib_recursive | 17 | 20 |
6060
| fib_loop | 4 | 5 |
61-
| tak | 7 | 8 |
62-
| arith_loop | 4 | 5 |
63-
| map_filter_reduce | 6 | 7 |
64-
| vector_ops | 6 | 7 |
65-
| map_ops | 5 | 6 |
66-
| list_build | 7 | 8 |
61+
| tak | 8 | 10 |
62+
| arith_loop | 5 | 6 |
63+
| map_filter_reduce | 7 | 8 |
64+
| vector_ops | 7 | 8 |
65+
| map_ops | 6 | 7 |
66+
| list_build | 6 | 7 |
6767
| sieve | 6 | 7 |
68-
| nqueens | 15 | 18 |
69-
| atom_swap | 4 | 5 |
70-
| gc_stress | 30 | 36 |
71-
| lazy_chain | 7 | 8 |
72-
| transduce | 6 | 7 |
73-
| keyword_lookup | 12 | 14 |
74-
| protocol_dispatch | 4 | 5 |
68+
| nqueens | 14 | 17 |
69+
| atom_swap | 6 | 7 |
70+
| gc_stress | 32 | 38 |
71+
| lazy_chain | 6 | 7 |
72+
| transduce | 7 | 8 |
73+
| keyword_lookup | 13 | 16 |
74+
| protocol_dispatch | 5 | 6 |
7575
| nested_update | 10 | 12 |
76-
| string_ops | 26 | 31 |
77-
| multimethod_dispatch | 7 | 8 |
76+
| string_ops | 27 | 32 |
77+
| multimethod_dispatch | 6 | 7 |
7878
| real_workload | 12 | 14 |
7979

8080
Wasm benchmarks excluded from regression gate (higher variance, dominated by zwasm).
8181

82+
## Measurement Methodology
83+
84+
**Baselines must be measured per-benchmark individually** to avoid thermal throttling.
85+
Sequential full-suite runs (`run_bench.sh` without `--bench`) are for quick regression
86+
screening only — do NOT use them to establish or update baselines.
87+
88+
For baseline establishment or suspected regression investigation:
89+
```bash
90+
# Per-benchmark, 10 runs + 5 warmup (accurate)
91+
bash bench/run_bench.sh --bench=NAME --runs=10 --warmup=5
92+
93+
# Or direct hyperfine for raw data with σ
94+
hyperfine -N --warmup 5 --runs 10 './zig-out/bin/cljw bench/benchmarks/NN_NAME/bench.clj'
95+
```
96+
97+
For commit gate regression screening:
98+
```bash
99+
# Quick sequential check (3 runs + 1 warmup) — OK for detecting gross regressions
100+
bash bench/run_bench.sh
101+
```
102+
103+
**Key insight**: In the 4-10ms range, 1-2ms of noise is 20-50% variance. 5 runs is
104+
insufficient — use 10+ runs for baselines. The 1.2x ceiling accounts for normal
105+
measurement noise, not for inaccurate baselines.
106+
82107
## Updating Baselines
83108

84109
Baselines improve (get faster/smaller) → update freely after measurement.

bench/README.md

Lines changed: 29 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -54,8 +54,8 @@ bash bench/wasm_bench.sh --bench=fib
5454
| `--id=ID` | Entry identifier (required) |
5555
| `--reason=TEXT` | Reason for measurement (required) |
5656
| `--bench=NAME` | Single benchmark |
57-
| `--runs=N` | Hyperfine runs (default: 5) |
58-
| `--warmup=N` | Warmup runs (default: 2) |
57+
| `--runs=N` | Hyperfine runs (default: 10) |
58+
| `--warmup=N` | Warmup runs (default: 5) |
5959
| `--overwrite` | Replace existing entry |
6060
| `--delete=ID` | Delete entry |
6161

@@ -188,41 +188,42 @@ bench/
188188
simd/ # SIMD benchmark programs
189189
```
190190

191-
## Latest Clojure Results (2026-02-14)
191+
## Latest Clojure Results (2026-02-25)
192192

193193
Apple M4 Pro, 48GB RAM, macOS 15. hyperfine 5 runs + 2 warmup.
194194
All times in milliseconds. These are **cold start** measurements (process
195195
launch to exit) — languages with heavy runtimes (JVM, V8) pay startup cost.
196196

197-
| Benchmark | CW | Python | Ruby | Node | Java* | C | Zig | TinyGo |
198-
|----------------------|------|--------|------|------|-------|-----|-----|--------|
199-
| fib_recursive | 19 | 17.1 | 37.7 | 25.2 | 20.3 | 1.2 | 1.7 | 3.3 |
200-
| fib_loop | 5 | 12.7 | 37.9 | 21.8 | 20.6 | 3.8 | 0.6 | 2.6 |
201-
| tak | 8 | 13.2 | 33.6 | 24.2 | 21.0 | 1.7 | 2.9 | 2.0 |
202-
| arith_loop | 5 | 60.7 | 54.5 | 25.2 | 21.4 | 1.7 | 1.2 | 1.7 |
203-
| map_filter_reduce | 6 | 13.0 | 35.9 | 23.7 | 21.4 | 1.4 | 1.4 | 2.6 |
204-
| vector_ops | 6 | 13.5 | 31.6 | 22.7 | 24.1 | 1.3 | 1.4 | 2.3 |
205-
| map_ops | 6 | 12.8 | 30.8 | 22.3 | 18.7 | 1.0 | 1.7 | 2.4 |
206-
| list_build | 6 | 14.6 | 34.6 | 25.7 | 21.9 | 1.5 | 1.8 | 2.5 |
207-
| sieve | 6 | 12.2 | 35.8 | 24.4 | 23.8 | 1.4 | 1.2 | 1.6 |
208-
| nqueens | 15 | 16.2 | 51.6 | 23.5 | 20.9 | 0.5 | 0.9 | 1.9 |
209-
| atom_swap | 5 | 11.7 | 36.2 | 24.0 | 20.9 | 1.4 | 2.9 | 3.5 |
210-
| gc_stress | 26 | 30.5 | 41.4 | 26.7 | 30.5 | 2.6 | --- | 20.2 |
211-
| lazy_chain | 7 | 15.4 | 33.0 | 26.1 | 22.2 | 2.6 | 1.6 | 2.2 |
212-
| transduce | 6 | 12.6 | 36.2 | 23.5 | 23.7 | 1.3 | 1.7 | 1.9 |
213-
| keyword_lookup | 11 | 19.4 | 37.0 | 27.4 | 23.9 | 1.6 | 0.0 | 4.9 |
214-
| protocol_dispatch | 6 | 12.7 | 32.8 | 24.3 | 22.0 | 2.3 | 1.7 | 2.2 |
215-
| nested_update | 10 | 12.6 | 32.9 | 24.0 | 23.7 | 0.2 | 1.3 | 3.1 |
216-
| string_ops | 25 | 25.2 | 38.0 | 24.5 | 24.8 | 4.3 | 2.0 | 1.5 |
217-
| multimethod_dispatch | 6 | 13.3 | 33.8 | 24.6 | 20.0 | 2.6 | 0.9 | 2.1 |
218-
| real_workload | 10 | 13.6 | 37.1 | 24.7 | 26.6 | 0.9 | 1.0 | 1.7 |
219-
220-
CW wins vs Java: 20/20, vs Python: 18/20, vs Ruby: 20/20, vs Node: 20/20.
197+
| Benchmark | CW | Python | Ruby | Node | Java* | C | Zig | TinyGo | BB |
198+
|----------------------|------|--------|------|------|-------|-----|-----|--------|------|
199+
| fib_recursive | 16 | 20.1 | 42.9 | 23.5 | 21.2 | 2.5 | 1.9 | 1.8 | 39.7 |
200+
| fib_loop | 5 | 12.5 | 29.1 | 21.5 | 21.0 | 1.4 | 2.9 | 0.9 | 12.7 |
201+
| tak | 8 | 14.1 | 31.8 | 25.3 | 20.5 | 0.6 | 2.8 | 2.9 | 20.9 |
202+
| arith_loop | 5 | 61.5 | 53.3 | 25.2 | 22.3 | 2.1 | 1.5 | 1.9 | 76.7 |
203+
| map_filter_reduce | 6 | 12.9 | 35.4 | 23.8 | 20.8 | 1.9 | 1.7 | 2.4 | 18.8 |
204+
| vector_ops | 7 | 14.9 | 31.5 | 22.6 | 20.5 | 0.3 | 1.7 | 2.6 | 18.1 |
205+
| map_ops | 7 | 12.5 | 31.8 | 26.4 | 21.9 | 2.4 | 2.1 | 1.3 | 12.7 |
206+
| list_build | 8 | 16.2 | 33.8 | 24.9 | 22.2 | 1.0 | 0.2 | 2.2 | 12.4 |
207+
| sieve | 9 | 13.1 | 35.5 | 26.2 | 24.0 | 0.9 | 2.3 | 2.7 | 18.5 |
208+
| nqueens | 15 | 15.9 | 50.7 | 21.1 | 19.5 | 4.6 | 2.2 | 2.5 | 24.5 |
209+
| atom_swap | 8 | 12.2 | 32.5 | 25.8 | 21.5 | 2.1 | 1.6 | 2.2 | 16.6 |
210+
| gc_stress | 35 | 27.3 | 39.1 | 25.6 | 32.9 | 2.4 | --- | 18.8 | 37.1 |
211+
| lazy_chain | 7 | 104.0 | 33.8 | 24.9 | 21.5 | 1.3 | 1.7 | 1.9 | 16.9 |
212+
| transduce | 5 | 13.2 | 34.5 | 26.6 | 21.3 | 1.8 | 1.5 | 1.0 | 16.7 |
213+
| keyword_lookup | 13 | 17.3 | 36.3 | 23.7 | 22.9 | 1.3 | 2.3 | 4.6 | 21.0 |
214+
| protocol_dispatch | 7 | 12.4 | 34.2 | 24.3 | 20.7 | 1.4 | 1.5 | 0.7 | --- |
215+
| nested_update | 12 | 13.6 | 29.0 | 26.5 | 22.9 | 0.8 | 2.2 | 3.8 | 18.4 |
216+
| string_ops | 30 | 24.9 | 39.2 | 27.4 | 23.3 | 8.5 | 2.6 | 1.6 | 21.3 |
217+
| multimethod_dispatch | 8 | 14.9 | 34.5 | 23.2 | 21.1 | 0.9 | 1.8 | 2.3 | 17.7 |
218+
| real_workload | 15 | 13.4 | 36.9 | 23.7 | 31.2 | 1.3 | 1.4 | 5.2 | 18.0 |
219+
220+
CW wins vs Java: 20/20, vs Python: 17/20, vs Ruby: 20/20, vs Node: 20/20, vs BB: 18/19.
221221

222222
\* Java times are dominated by JVM startup (~20ms). Warm JVM execution
223223
is significantly faster. C/Zig/TinyGo are native-compiled (AOT) baselines.
224+
BB = Babashka (GraalVM native-image Clojure).
224225

225-
Note: gc_stress Zig value (462.7ms) omitted — Zig benchmark uses
226+
Note: gc_stress Zig value (493ms) omitted — Zig benchmark uses
226227
`std.AutoArrayHashMap` which is not comparable to GC-managed collections.
227228

228229
## Binary Size Comparison

0 commit comments

Comments
 (0)