fix(bench): warmup + median for queryTimeMs to remove cold-start noise #1133

carlos-alm wants to merge 3 commits into
Conversation
The build-benchmark queryTimeMs metric was a single-shot cold call to
fnDepsData('buildGraph', dbPath) with no warmup, no median, and
noTests:false. This conflated steady-state query latency with
NAPI/rusqlite/OS-page-cache init (~65ms on macOS) and let fixture-file
growth from new native extractors inflate the measurement.
Local verification on the same 757-file corpus:
|        | single-shot | warmed median |
|--------|-------------|---------------|
| v3.9.6 | 78.8ms      | 4.0ms         |
| HEAD   | 67.5ms      | 2.8ms         |
HEAD is faster than v3.9.6 on both metrics — the +110% spike that
tripped the publish gate during the R + Solidity merge (#1100, #1102)
was single-shot cold-start variance, not a real regression.
Switch queryTimeMs to 3 warmup runs + median of 5 with noTests:true,
matching the methodology already used by query-benchmark.ts and the
per-target queries.*Ms block in the same script. Update the
KNOWN_REGRESSIONS comment for 3.10.0:Query time to record the
methodology fix; keep the entry in place until 3.11.0+ data captures
the new steady-state.
Closes #1113
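The methodology described above — discard `QUERY_WARMUP_RUNS` warmup calls, then report the median of `QUERY_RUNS` timed samples — can be sketched as follows. This is a minimal illustration under assumed names, not the repo's actual `benchmark.ts` code; the real helper's signature and rounding may differ.

```typescript
import { performance } from "node:perf_hooks";

const QUERY_WARMUP_RUNS = 3; // discarded: primes NAPI/rusqlite/OS page caches
const QUERY_RUNS = 5;        // timed samples; median damps outliers

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Time `fn` with warmup + median; returns milliseconds rounded to 1 decimal.
function benchQuery(fn: () => unknown): number {
  for (let i = 0; i < QUERY_WARMUP_RUNS; i++) fn(); // warmup, results discarded
  const timings: number[] = [];
  for (let i = 0; i < QUERY_RUNS; i++) {
    const t0 = performance.now();
    fn();
    timings.push(performance.now() - t0);
  }
  return Math.round(median(timings) * 10) / 10; // round to 1 decimal place
}
```

A single cold call folds one-time init (~65ms in the PR's measurements) into the reported latency; the median of warmed runs reports steady-state instead.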
Claude finished @carlos-alm's task — View job

PR Review Complete — reviewing PR #1133: fix(bench): warmup + median for queryTimeMs to remove cold-start noise

Root Cause Analysis

Problem Being Solved: the queryTimeMs metric was a single-shot cold call that conflated steady-state query latency with cold-start init costs and let fixture-file growth inflate the measurement.

Why This Approach: this fixes the actual measurement methodology rather than masking symptoms. The author demonstrated that HEAD is actually faster than v3.9.6 when measured properly (warmed median: 2.8ms vs 4.0ms), proving the +110% spike was measurement noise, not a real regression.

Risk Assessment: ✅ LOW RISK — a legitimate methodology fix that aligns the metric with the methodology already used elsewhere in the script.

Backlog Compliance
Critical Concerns

None identified. This is a textbook example of fixing measurement methodology rather than masking problems.

Technical Review

✅ Configuration Change Detection: CLEAN — no configuration rules relaxed or disabled. Changes improve measurement accuracy without hiding issues.

✅ Test Integrity: SOLID — no test assertions weakened. The only test change updates the 3.10.0:Query time comment in regression-guard.test.ts.

✅ Code Quality: EXCELLENT — well-implemented.
✅ Implementation Details
Final Recommendation
Greptile Summary

This PR fixes the queryTimeMs benchmark metric by replacing a single-shot cold call with a warmed median, removing cold-start noise from the measurement.
Confidence Score: 5/5

Safe to merge — changes are confined to the benchmark script and a test comment; no production code paths are affected. The warmup + median methodology is already proven in benchQuery for the queries.*Ms block; the new code simply routes queryTimeMs through the same path. QUERY_WARMUP_RUNS is now correctly grouped with the other run-count constants, and the warmup loop inside benchQuery makes each call site independent of call ordering. The regression-guard comment update is accurate and the exemption is appropriately kept in place until new baseline data is gathered. No files require special attention.

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant B as benchmark.ts
    participant bQ as benchQuery()
    participant fn as fnDepsData()
    Note over B: Full build completes
    B->>bQ: "benchQuery(fnDepsData, 'buildGraph', dbPath, {depth:3, noTests:true})"
    rect rgb(240, 240, 255)
        Note over bQ,fn: Warmup phase (QUERY_WARMUP_RUNS = 3)
        loop 3 warmup calls
            bQ->>fn: "fn('buildGraph', dbPath, {depth:3, noTests:true})"
            fn-->>bQ: result (discarded)
        end
    end
    rect rgb(240, 255, 240)
        Note over bQ,fn: Timed phase (QUERY_RUNS = 5)
        loop 5 timed calls
            bQ->>fn: "fn('buildGraph', dbPath, {depth:3, noTests:true})"
            fn-->>bQ: result + timing recorded
        end
    end
    bQ-->>B: round1(median(timings)) → queryTimeMs
```
Reviews (4) — last reviewed commit: "Merge branch 'main' into fix/1113-query-..."
```ts
// Warmed median of QUERY_RUNS samples with `noTests: true` to match the
// methodology used by query-benchmark.ts and the per-target `queries.*Ms`
// block below. Earlier versions of this script measured a single cold call,
// which conflated steady-state query latency with NAPI/rusqlite/OS-page-cache
// init costs (~65ms on macOS) and inflated growth from test-fixture files
// pulled in by new native extractors. See #1113 for the methodology rationale.
const QUERY_WARMUP_RUNS = 3;
for (let i = 0; i < QUERY_WARMUP_RUNS; i++) {
  fnDepsData('buildGraph', dbPath, { depth: 3, noTests: true });
}
```
Comment slightly overstates benchQuery parity

The comment says the new methodology matches "the per-target queries.*Ms block below", but benchQuery (used for all queries.*Ms values) has no explicit warmup loop of its own. The implicit warming works because the queryTimeMs warmup runs execute before benchQuery is ever called, but that coupling is fragile — if the order of operations ever changes, queries.*Ms loses its warm-up silently. Consider either adding a warmup inside benchQuery, or rephrasing the comment to clarify that the warmup runs here also prime the DB/NAPI cache for the subsequent queries.*Ms block.
Fixed in de2212c. Added an explicit warmup loop inside benchQuery itself (3 runs, gated by QUERY_WARMUP_RUNS), so each call site primes its own NAPI/rusqlite/page-cache state. The methodology no longer depends on call ordering, and queryTimeMs now uses benchQuery directly — the comment describing parity with the queries.*Ms block is now accurate.
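The call-ordering independence described in this fix can be sketched with a self-warming benchQuery: the warmup runs live inside the helper, so every call site is measured warm regardless of where it appears in the script. The variadic signature below is an assumption for illustration, not the repo's actual helper.

```typescript
import { performance } from "node:perf_hooks";

const QUERY_WARMUP_RUNS = 3;
const QUERY_RUNS = 5;

// Self-warming benchmark helper (hypothetical shape): each call site runs
// its own warmup, so no measurement depends on a previous call having
// primed the NAPI/rusqlite/page-cache state.
function benchQuery<A extends unknown[]>(
  fn: (...args: A) => unknown,
  ...args: A
): number {
  for (let i = 0; i < QUERY_WARMUP_RUNS; i++) fn(...args); // discarded warmup
  const timings: number[] = [];
  for (let i = 0; i < QUERY_RUNS; i++) {
    const t0 = performance.now();
    fn(...args);
    timings.push(performance.now() - t0);
  }
  timings.sort((a, b) => a - b);
  return timings[Math.floor(timings.length / 2)]; // median (QUERY_RUNS is odd)
}
```

With this shape, the top-level queryTimeMs measurement and the per-target queries.*Ms block can each call benchQuery directly, and reordering them does not silently drop anyone's warmup.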
…1133)

Addresses Greptile review feedback:

- Move QUERY_WARMUP_RUNS to the run-constants block at the top of the worker section, alongside INCREMENTAL_RUNS and QUERY_RUNS, so all run counts are tunable in one place.
- Add the warmup loop inside benchQuery itself instead of relying on the queryTimeMs warmup to implicitly prime caches for later calls. Each call site now warms independently, so the methodology no longer depends on call ordering.
- Use benchQuery for the top-level queryTimeMs measurement to eliminate duplicated warmup/timing logic, and update the comment to describe the actual parity (benchQuery is also warmed) rather than implicit coupling.
Addressed Greptile feedback in de2212c:
Codegraph Impact Analysis

1 function changed → 2 callers affected across 1 file
Summary
The build-benchmark `queryTimeMs` metric was a single-shot cold call to `fnDepsData('buildGraph', dbPath)` with no warmup, no median, and `noTests: false`. This conflated steady-state query latency with NAPI / rusqlite statement-cache / OS-page-cache init (~65ms on macOS) and let fixture-file growth from new native extractors inflate the measurement.

Local verification on the same 757-file corpus (HEAD source tree):
|        | queryTimeMs (old single-shot, noTests:false) | queries.fnDepsMs (warmed median, noTests:true, depth 3) |
|--------|----------------------------------------------|---------------------------------------------------------|
| v3.9.6 | 78.8ms                                       | 4.0ms                                                   |
| HEAD   | 67.5ms                                       | 2.8ms                                                   |

HEAD is faster than v3.9.6 on both metrics — the +110% spike (49.6→104.4) that tripped the publish gate during the R + Solidity merge (#1100, #1102) was single-shot cold-start variance on a ~70ms cold call, not a real regression. The exemption `3.10.0:Query time` added to `KNOWN_REGRESSIONS` was the right defensive move at merge time, but the metric itself was the bug.

Changes
- `scripts/benchmark.ts`: `queryTimeMs` now uses 3 warmup runs + median of 5 with `noTests: true`, matching the methodology already used by `query-benchmark.ts` and the per-target `queries.*Ms` block in the same script.
- `tests/benchmarks/regression-guard.test.ts`: update the `3.10.0:Query time` comment to describe the methodology fix. Entry kept in place — per the issue's action item #3, remove only after 3.11.0+ data captures the new steady-state under the updated methodology (expected ~36ms native on this corpus, no longer cold-start-dominated).

Backwards-compat note: published 3.10.0 has `queryTimeMs ≈ 50ms` (old methodology); after this fix, dev/3.11.0+ will land at ~36ms. The per-PR regression gate will see this as an improvement, not a regression — no exemption changes needed.

Closes #1113
Test plan
- `node scripts/benchmark.ts` — produces a sensible `queryTimeMs` (36.3 ms native, 36.3 ms wasm on local HEAD); the field is still present and in ms units.
- `node scripts/benchmark.ts --npm --version 3.9.6` — to confirm no real regression.