
fix(bench): warmup + median for queryTimeMs to remove cold-start noise #1133

Open

carlos-alm wants to merge 3 commits into main from fix/1113-query-time-methodology

Conversation

@carlos-alm
Contributor

Summary

The build-benchmark queryTimeMs metric was a single-shot cold call to fnDepsData('buildGraph', dbPath) with no warmup, no median, and noTests:false. This conflated steady-state query latency with NAPI / rusqlite statement-cache / OS-page-cache init (~65ms on macOS) and let fixture-file growth from new native extractors inflate the measurement.

Local verification on the same 757-file corpus (HEAD source tree):

| Metric | v3.9.6 | HEAD | Δ |
| --- | --- | --- | --- |
| queryTimeMs (old single-shot, noTests:false) | 78.8 ms | 67.5 ms | −14% |
| queries.fnDepsMs (warmed median, noTests:true, depth 3) | 4.0 ms | 2.8 ms | −30% |
| files / nodes / edges | 757 / 18,796 / 39,549 | 757 / 18,828 / 39,682 | flat |

HEAD is faster than v3.9.6 on both metrics — the +110% spike (49.6→104.4) that tripped the publish gate during the R + Solidity merge (#1100, #1102) was single-shot cold-start variance on a ~70ms cold call, not a real regression. The exemption 3.10.0:Query time added to KNOWN_REGRESSIONS was the right defensive move at merge time, but the metric itself was the bug.

Changes

  • scripts/benchmark.ts: queryTimeMs now uses 3 warmup runs + median of 5 with noTests: true, matching the methodology already used by query-benchmark.ts and the per-target queries.*Ms block in the same script.
  • tests/benchmarks/regression-guard.test.ts: update the 3.10.0:Query time comment to describe the methodology fix. Entry kept in place — per the issue's action item #3, remove it only after 3.11.0+ data captures the new steady-state under the updated methodology (expected ~36ms native on this corpus, no longer cold-start-dominated).
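
The warmed-median shape can be sketched as follows. This is a minimal sketch of the methodology, not the PR's actual code: the real scripts/benchmark.ts times fnDepsData('buildGraph', dbPath, { depth: 3, noTests: true }), and its helper names and signatures may differ; a no-op closure stands in for the query under test.

```typescript
const QUERY_WARMUP_RUNS = 3; // discard NAPI / statement-cache / page-cache init
const QUERY_RUNS = 5;        // timed samples; median discards outliers

// Median of a numeric sample (does not mutate its input).
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Stand-in for fnDepsData('buildGraph', dbPath, { depth: 3, noTests: true }).
const runQuery = () => {};

// Warmup phase: results discarded, caches primed.
for (let i = 0; i < QUERY_WARMUP_RUNS; i++) runQuery();

// Timed phase: record QUERY_RUNS wall-clock samples, report the median.
const timings: number[] = [];
for (let i = 0; i < QUERY_RUNS; i++) {
  const t0 = performance.now();
  runQuery();
  timings.push(performance.now() - t0);
}
const queryTimeMs = median(timings);
```

The median (rather than mean or min) keeps the metric robust to one-off GC pauses or scheduler hiccups while still tracking steady-state latency.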

Backwards-compat note: published 3.10.0 has queryTimeMs ≈ 50ms (old methodology); after this fix dev/3.11.0+ will land at ~36ms. The per-PR regression gate will see this as an improvement, not a regression — no exemption changes needed.

Closes #1113

Test plan

  • node scripts/benchmark.ts — produces sensible queryTimeMs (36.3 ms native, 36.3 ms wasm on local HEAD); the field is still present and in ms units.
  • Verified v3.9.6 baseline numbers locally via node scripts/benchmark.ts --npm --version 3.9.6 to confirm no real regression.
  • CI: the pre-publish benchmark gate compares dev vs 3.10.0 (≈50ms old methodology) — should pass cleanly as an improvement.
  • Lint check on changed files (no new warnings introduced).

The build-benchmark queryTimeMs metric was a single-shot cold call to
fnDepsData('buildGraph', dbPath) with no warmup, no median, and
noTests:false. This conflated steady-state query latency with
NAPI/rusqlite/OS-page-cache init (~65ms on macOS) and let fixture-file
growth from new native extractors inflate the measurement.

Local verification on the same 757-file corpus:
  v3.9.6:  single-shot 78.8ms  warmed median 4.0ms
  HEAD:    single-shot 67.5ms  warmed median 2.8ms

HEAD is faster than v3.9.6 on both metrics — the +110% spike that
tripped the publish gate during the R + Solidity merge (#1100, #1102)
was single-shot cold-start variance, not a real regression.

Switch queryTimeMs to 3 warmup runs + median of 5 with noTests:true,
matching the methodology already used by query-benchmark.ts and the
per-target queries.*Ms block in the same script. Update the
KNOWN_REGRESSIONS comment for 3.10.0:Query time to record the
methodology fix; keep the entry in place until 3.11.0+ data captures
the new steady-state.

Closes #1113
@claude

claude Bot commented May 15, 2026

Claude finished @carlos-alm's task


PR Review Complete

Reviewing PR #1133: fix(bench): warmup + median for queryTimeMs to remove cold-start noise

Root Cause Analysis

Problem Being Solved: The queryTimeMs benchmark metric was showing false regression alerts due to cold-start noise conflating NAPI/rusqlite/OS-page-cache initialization (~65ms) with actual query performance.

Why This Approach: This fixes the actual measurement methodology rather than masking symptoms. The author demonstrated that HEAD is actually faster than v3.9.6 when measured properly (warmed median: 2.8ms vs 4.0ms), proving the +110% spike was measurement noise, not real regression.

Risk Assessment: ✅ LOW RISK - This is a legitimate methodology fix that aligns the problematic queryTimeMs metric with existing proven methodologies already used in query-benchmark.ts and the same script's per-target queries.*Ms measurements.

Backlog Compliance

  • Zero-dep: ✓ — No new dependencies added
  • Foundation-aligned: ✓ — Improves measurement accuracy, aligns with zero-infrastructure principle
  • Problem-fit: 4/5 — Improves benchmark reliability for AI agents by preventing false regression alerts that could block legitimate improvements
  • Breaking: No — Pure internal methodology improvement, same output format
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. This is a textbook example of fixing measurement methodology rather than masking problems.

Technical Review

✅ Configuration Change Detection

CLEAN - No configuration rules relaxed or disabled. Changes improve measurement accuracy without hiding issues.

✅ Test Integrity

SOLID - No test assertions weakened. The only change is updating the KNOWN_REGRESSIONS comment to document the methodology fix, keeping the exemption in place until 3.11.0+ data validates the new baseline.

✅ Code Quality

EXCELLENT - Well-implemented with:

  • Proper warmup methodology (3 warmup runs) at scripts/benchmark.ts:142-145
  • Statistical robustness (median of 5 samples) at scripts/benchmark.ts:146-152
  • Consistent with existing patterns (matches QUERY_RUNS constant, uses existing median() function)
  • Thorough documentation explaining the rationale at scripts/benchmark.ts:136-141
  • Appropriate parameter alignment (noTests: true, depth: 3) matching per-target benchmarks

✅ Implementation Details

  • Uses existing QUERY_RUNS = 5 constant and median() helper for consistency
  • Aligns parameters with proven methodology: { depth: 3, noTests: true }
  • Maintains backward compatibility - same metric name, same units (ms)
  • Documents expected new baseline (~36ms) for future validation

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: This is an exemplary fix that addresses the root cause of false regression alerts through proper statistical methodology. The implementation is thorough, well-documented, and aligns with existing patterns. The author provided compelling evidence that this reveals the true performance (HEAD is faster), rather than hiding regressions.

@greptile-apps
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR fixes the queryTimeMs benchmark metric in scripts/benchmark.ts by replacing a single cold call to fnDepsData with a warmed median (3 warmup runs + median of 5 samples) using noTests: true, eliminating cold-start noise (~65 ms on macOS) that was causing false regression alerts.

  • scripts/benchmark.ts: QUERY_WARMUP_RUNS = 3 constant added alongside existing run-count constants; a warmup loop is added inside benchQuery itself so every call site primes its own NAPI/rusqlite/page-cache state independently; queryTimeMs now delegates to benchQuery with { depth: 3, noTests: true }.
  • tests/benchmarks/regression-guard.test.ts: The 3.10.0:Query time exemption comment is updated to accurately describe the methodology artifact rather than the previously-believed graph-growth cause; the exemption entry is intentionally retained until 3.11.0+ data under the new methodology is captured.

Confidence Score: 5/5

Safe to merge — changes are confined to the benchmark script and a test comment; no production code paths are affected.

The warmup + median methodology is already proven in benchQuery for the queries.*Ms block; the new code simply routes queryTimeMs through the same path. QUERY_WARMUP_RUNS is now correctly grouped with the other run-count constants, and the warmup loop inside benchQuery makes each call site independent of call ordering. The regression-guard comment update is accurate and the exemption is appropriately kept in place until new baseline data is gathered.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/benchmark.ts | Adds the QUERY_WARMUP_RUNS constant and a warmup loop inside benchQuery, and delegates queryTimeMs to benchQuery — all changes are logically consistent and address the cold-start measurement problem. |
| tests/benchmarks/regression-guard.test.ts | Comment-only update clarifying that the 3.10.0:Query time exemption was a methodology artifact; no logic changes, exemption entry correctly retained pending new baseline data. |

Sequence Diagram

sequenceDiagram
    participant B as benchmark.ts
    participant bQ as benchQuery()
    participant fn as fnDepsData()

    Note over B: Full build completes

    B->>bQ: "benchQuery(fnDepsData, 'buildGraph', dbPath, {depth:3, noTests:true})"
    
    rect rgb(240, 240, 255)
        Note over bQ,fn: Warmup phase (QUERY_WARMUP_RUNS = 3)
        loop 3 warmup calls
            bQ->>fn: "fn('buildGraph', dbPath, {depth:3, noTests:true})"
            fn-->>bQ: result (discarded)
        end
    end

    rect rgb(240, 255, 240)
        Note over bQ,fn: Timed phase (QUERY_RUNS = 5)
        loop 5 timed calls
            bQ->>fn: "fn('buildGraph', dbPath, {depth:3, noTests:true})"
            fn-->>bQ: result + timing recorded
        end
    end

    bQ-->>B: round1(median(timings)) → queryTimeMs

Reviews (4): Last reviewed commit: "Merge branch 'main' into fix/1113-query-..."

Comment thread scripts/benchmark.ts Outdated
Comment on lines +136 to +145
// Warmed median of QUERY_RUNS samples with `noTests: true` to match the
// methodology used by query-benchmark.ts and the per-target `queries.*Ms`
// block below. Earlier versions of this script measured a single cold call,
// which conflated steady-state query latency with NAPI/rusqlite/OS-page-cache
// init costs (~65ms on macOS) and inflated growth from test-fixture files
// pulled in by new native extractors. See #1113 for the methodology rationale.
const QUERY_WARMUP_RUNS = 3;
for (let i = 0; i < QUERY_WARMUP_RUNS; i++) {
  fnDepsData('buildGraph', dbPath, { depth: 3, noTests: true });
}
Contributor


P2 Comment slightly overstates benchQuery parity

The comment says the new methodology matches "the per-target queries.*Ms block below", but benchQuery (used for all queries.*Ms values) has no explicit warmup loop of its own. The implicit warming works because the queryTimeMs warmup runs execute before benchQuery is ever called, but that coupling is fragile — if the order of operations ever changes, queries.*Ms loses its warm-up silently. Consider either adding a warmup inside benchQuery, or rephrasing the comment to clarify that the warmup runs here also prime the DB/NAPI cache for the subsequent queries.*Ms block.


Contributor Author


Fixed in de2212c. Added an explicit warmup loop inside benchQuery itself (3 runs, gated by QUERY_WARMUP_RUNS), so each call site primes its own NAPI/rusqlite/page-cache state. The methodology no longer depends on call ordering, and queryTimeMs now uses benchQuery directly — the comment describing parity with the queries.*Ms block is now accurate.
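
The order-independence point can be sketched as a self-warming helper. This is a hypothetical sketch assuming the shape described in the thread (names QUERY_WARMUP_RUNS, QUERY_RUNS, median, benchQuery come from the review; the real signature in scripts/benchmark.ts may differ):

```typescript
const QUERY_WARMUP_RUNS = 3;
const QUERY_RUNS = 5;

// Median of a numeric sample (does not mutate its input).
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Warmup runs inside the helper, so each call site primes its own
// NAPI/rusqlite/page-cache state regardless of call ordering.
function benchQuery<A extends unknown[]>(
  fn: (...args: A) => unknown,
  ...args: A
): number {
  for (let i = 0; i < QUERY_WARMUP_RUNS; i++) fn(...args); // results discarded
  const samples: number[] = [];
  for (let i = 0; i < QUERY_RUNS; i++) {
    const t0 = performance.now();
    fn(...args);
    samples.push(performance.now() - t0);
  }
  return median(samples); // steady-state estimate, robust to outliers
}
```

With this shape, the top-level queryTimeMs measurement and every queries.*Ms entry call the same helper, so removing or reordering any one call site cannot silently strip the warmup from another.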

…1133)

Addresses Greptile review feedback:

- Move QUERY_WARMUP_RUNS to the run-constants block at the top of the worker
  section, alongside INCREMENTAL_RUNS and QUERY_RUNS, so all run counts are
  tunable in one place.
- Add the warmup loop inside benchQuery itself instead of relying on the
  queryTimeMs warmup to implicitly prime caches for later calls. Each call
  site now warms independently, so the methodology no longer depends on
  call ordering.
- Use benchQuery for the top-level queryTimeMs measurement to eliminate
  duplicated warmup/timing logic and update the comment to describe the
  actual parity (benchQuery is also warmed) rather than implicit coupling.
@carlos-alm
Contributor Author

Addressed Greptile feedback in de2212c:

  • QUERY_WARMUP_RUNS location (scripts/benchmark.ts:93-95): moved to the run-constants block at the top of the worker section, alongside INCREMENTAL_RUNS and QUERY_RUNS, so all run counts are tunable in one place.
  • benchQuery warmup parity (scripts/benchmark.ts:136-145): added a warmup loop inside benchQuery itself, and reused benchQuery for the top-level queryTimeMs measurement. Each call site now warms independently — no implicit coupling on call ordering. Updated the comment to describe the actual parity rather than the fragile coupling.

@carlos-alm
Contributor Author

@greptileai

@github-actions
Contributor

github-actions Bot commented May 15, 2026

Codegraph Impact Analysis

1 function changed · 2 callers affected across 1 file

  • benchQuery in scripts/benchmark.ts:199 (2 transitive callers)


Development

Successfully merging this pull request may close these issues.

follow-up: investigate native Query time regression after R + Solidity native ports
