
fix(bench): warmup + median for queryTimeMs to remove cold-start noise #1133

Open

carlos-alm wants to merge 3 commits into main from fix/1113-query-time-methodology

Conversation

@carlos-alm
Contributor

Summary

The build-benchmark queryTimeMs metric was a single-shot cold call to fnDepsData('buildGraph', dbPath) with no warmup, no median, and noTests:false. This conflated steady-state query latency with NAPI / rusqlite statement-cache / OS-page-cache init (~65ms on macOS) and let fixture-file growth from new native extractors inflate the measurement.

Local verification on the same 757-file corpus (HEAD source tree):

| Metric | v3.9.6 | HEAD | Δ |
| --- | --- | --- | --- |
| queryTimeMs (old single-shot, noTests:false) | 78.8 ms | 67.5 ms | −14% |
| queries.fnDepsMs (warmed median, noTests:true, depth 3) | 4.0 ms | 2.8 ms | −30% |
| files / nodes / edges | 757 / 18,796 / 39,549 | 757 / 18,828 / 39,682 | flat |

HEAD is faster than v3.9.6 on both metrics — the +110% spike (49.6→104.4) that tripped the publish gate during the R + Solidity merge (#1100, #1102) was single-shot cold-start variance on a ~70ms cold call, not a real regression. The exemption 3.10.0:Query time added to KNOWN_REGRESSIONS was the right defensive move at merge time, but the metric itself was the bug.

Changes

  • scripts/benchmark.ts: queryTimeMs now uses 3 warmup runs + median of 5 with noTests: true, matching the methodology already used by query-benchmark.ts and the per-target queries.*Ms block in the same script.
  • tests/benchmarks/regression-guard.test.ts: update the 3.10.0:Query time comment to describe the methodology fix. Entry kept in place — per the issue's action item #3, remove it only after 3.11.0+ data captures the new steady-state under the updated methodology (expected ~36ms native on this corpus, no longer cold-start-dominated).
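
The warmed-median shape can be sketched as follows. This is a minimal sketch of the methodology, not the PR's actual code: the real scripts/benchmark.ts times fnDepsData('buildGraph', dbPath, { depth: 3, noTests: true }), and its helper names and signatures may differ; a no-op closure stands in for the query under test.

```typescript
const QUERY_WARMUP_RUNS = 3; // discard NAPI / statement-cache / page-cache init
const QUERY_RUNS = 5;        // timed samples; median discards outliers

// Median of a numeric sample (does not mutate its input).
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Stand-in for fnDepsData('buildGraph', dbPath, { depth: 3, noTests: true }).
const runQuery = () => {};

// Warmup phase: results discarded, caches primed.
for (let i = 0; i < QUERY_WARMUP_RUNS; i++) runQuery();

// Timed phase: record QUERY_RUNS wall-clock samples, report the median.
const timings: number[] = [];
for (let i = 0; i < QUERY_RUNS; i++) {
  const t0 = performance.now();
  runQuery();
  timings.push(performance.now() - t0);
}
const queryTimeMs = median(timings);
```

The median (rather than mean or min) keeps the metric robust to one-off GC pauses or scheduler hiccups while still tracking steady-state latency.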

Backwards-compat note: published 3.10.0 has queryTimeMs ≈ 50ms (old methodology); after this fix dev/3.11.0+ will land at ~36ms. The per-PR regression gate will see this as an improvement, not a regression — no exemption changes needed.

Closes #1113

Test plan

  • node scripts/benchmark.ts — produces sensible queryTimeMs (36.3 ms native, 36.3 ms wasm on local HEAD); the field is still present and in ms units.
  • Verified v3.9.6 baseline numbers locally via node scripts/benchmark.ts --npm --version 3.9.6 to confirm no real regression.
  • CI: the pre-publish benchmark gate compares dev vs 3.10.0 (≈50ms old methodology) — should pass cleanly as an improvement.
  • Lint check on changed files (no new warnings introduced).

The build-benchmark queryTimeMs metric was a single-shot cold call to
fnDepsData('buildGraph', dbPath) with no warmup, no median, and
noTests:false. This conflated steady-state query latency with
NAPI/rusqlite/OS-page-cache init (~65ms on macOS) and let fixture-file
growth from new native extractors inflate the measurement.

Local verification on the same 757-file corpus:
  v3.9.6:  single-shot 78.8ms  warmed median 4.0ms
  HEAD:    single-shot 67.5ms  warmed median 2.8ms

HEAD is faster than v3.9.6 on both metrics — the +110% spike that
tripped the publish gate during the R + Solidity merge (#1100, #1102)
was single-shot cold-start variance, not a real regression.

Switch queryTimeMs to 3 warmup runs + median of 5 with noTests:true,
matching the methodology already used by query-benchmark.ts and the
per-target queries.*Ms block in the same script. Update the
KNOWN_REGRESSIONS comment for 3.10.0:Query time to record the
methodology fix; keep the entry in place until 3.11.0+ data captures
the new steady-state.

Closes #1113
@claude

claude Bot commented May 15, 2026

Claude finished @carlos-alm's task


PR Review Complete

Reviewing PR #1133: fix(bench): warmup + median for queryTimeMs to remove cold-start noise

Root Cause Analysis

Problem Being Solved: The queryTimeMs benchmark metric was showing false regression alerts due to cold-start noise conflating NAPI/rusqlite/OS-page-cache initialization (~65ms) with actual query performance.

Why This Approach: This fixes the actual measurement methodology rather than masking symptoms. The author demonstrated that HEAD is actually faster than v3.9.6 when measured properly (warmed median: 2.8ms vs 4.0ms), proving the +110% spike was measurement noise, not real regression.

Risk Assessment: ✅ LOW RISK - This is a legitimate methodology fix that aligns the problematic queryTimeMs metric with existing proven methodologies already used in query-benchmark.ts and the same script's per-target queries.*Ms measurements.

Backlog Compliance

  • Zero-dep: ✓ — No new dependencies added
  • Foundation-aligned: ✓ — Improves measurement accuracy, aligns with zero-infrastructure principle
  • Problem-fit: 4/5 — Improves benchmark reliability for AI agents by preventing false regression alerts that could block legitimate improvements
  • Breaking: No — Pure internal methodology improvement, same output format
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. This is a textbook example of fixing measurement methodology rather than masking problems.

Technical Review

✅ Configuration Change Detection

CLEAN - No configuration rules relaxed or disabled. Changes improve measurement accuracy without hiding issues.

✅ Test Integrity

SOLID - No test assertions weakened. The only change is updating the KNOWN_REGRESSIONS comment to document the methodology fix, keeping the exemption in place until 3.11.0+ data validates the new baseline.

✅ Code Quality

EXCELLENT - Well-implemented with:

  • Proper warmup methodology (3 warmup runs) at scripts/benchmark.ts:142-145
  • Statistical robustness (median of 5 samples) at scripts/benchmark.ts:146-152
  • Consistent with existing patterns (matches QUERY_RUNS constant, uses existing median() function)
  • Thorough documentation explaining the rationale at scripts/benchmark.ts:136-141
  • Appropriate parameter alignment (noTests: true, depth: 3) matching per-target benchmarks

✅ Implementation Details

  • Uses existing QUERY_RUNS = 5 constant and median() helper for consistency
  • Aligns parameters with proven methodology: { depth: 3, noTests: true }
  • Maintains backward compatibility - same metric name, same units (ms)
  • Documents expected new baseline (~36ms) for future validation

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: This is an exemplary fix that addresses the root cause of false regression alerts through proper statistical methodology. The implementation is thorough, well-documented, and aligns with existing patterns. The author provided compelling evidence that this reveals the true performance (HEAD is faster), rather than hiding regressions.

@greptile-apps
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR fixes the queryTimeMs benchmark metric in scripts/benchmark.ts by replacing a single cold call to fnDepsData with a warmed median (3 warmup runs + median of 5 samples) using noTests: true, eliminating cold-start noise (~65 ms on macOS) that was causing false regression alerts.

  • scripts/benchmark.ts: QUERY_WARMUP_RUNS = 3 constant added alongside existing run-count constants; a warmup loop is added inside benchQuery itself so every call site primes its own NAPI/rusqlite/page-cache state independently; queryTimeMs now delegates to benchQuery with { depth: 3, noTests: true }.
  • tests/benchmarks/regression-guard.test.ts: The 3.10.0:Query time exemption comment is updated to accurately describe the methodology artifact rather than the previously-believed graph-growth cause; the exemption entry is intentionally retained until 3.11.0+ data under the new methodology is captured.

Confidence Score: 5/5

Safe to merge — changes are confined to the benchmark script and a test comment; no production code paths are affected.

The warmup + median methodology is already proven in benchQuery for the queries.*Ms block; the new code simply routes queryTimeMs through the same path. QUERY_WARMUP_RUNS is now correctly grouped with the other run-count constants, and the warmup loop inside benchQuery makes each call site independent of call ordering. The regression-guard comment update is accurate and the exemption is appropriately kept in place until new baseline data is gathered.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/benchmark.ts | Adds the QUERY_WARMUP_RUNS constant and a warmup loop inside benchQuery, and delegates queryTimeMs to benchQuery — all changes are logically consistent and address the cold-start measurement problem. |
| tests/benchmarks/regression-guard.test.ts | Comment-only update clarifying that the 3.10.0:Query time exemption was a methodology artifact; no logic changes, exemption entry correctly retained pending new baseline data. |

Sequence Diagram

sequenceDiagram
    participant B as benchmark.ts
    participant bQ as benchQuery()
    participant fn as fnDepsData()

    Note over B: Full build completes

    B->>bQ: "benchQuery(fnDepsData, 'buildGraph', dbPath, {depth:3, noTests:true})"
    
    rect rgb(240, 240, 255)
        Note over bQ,fn: Warmup phase (QUERY_WARMUP_RUNS = 3)
        loop 3 warmup calls
            bQ->>fn: "fn('buildGraph', dbPath, {depth:3, noTests:true})"
            fn-->>bQ: result (discarded)
        end
    end

    rect rgb(240, 255, 240)
        Note over bQ,fn: Timed phase (QUERY_RUNS = 5)
        loop 5 timed calls
            bQ->>fn: "fn('buildGraph', dbPath, {depth:3, noTests:true})"
            fn-->>bQ: result + timing recorded
        end
    end

    bQ-->>B: round1(median(timings)) → queryTimeMs

Reviews (4): Last reviewed commit: "Merge branch 'main' into fix/1113-query-..."

Comment thread scripts/benchmark.ts Outdated
Comment on lines +136 to +145
// Warmed median of QUERY_RUNS samples with `noTests: true` to match the
// methodology used by query-benchmark.ts and the per-target `queries.*Ms`
// block below. Earlier versions of this script measured a single cold call,
// which conflated steady-state query latency with NAPI/rusqlite/OS-page-cache
// init costs (~65ms on macOS) and inflated growth from test-fixture files
// pulled in by new native extractors. See #1113 for the methodology rationale.
const QUERY_WARMUP_RUNS = 3;
for (let i = 0; i < QUERY_WARMUP_RUNS; i++) {
  fnDepsData('buildGraph', dbPath, { depth: 3, noTests: true });
}
Contributor


P2 Comment slightly overstates benchQuery parity

The comment says the new methodology matches "the per-target queries.*Ms block below", but benchQuery (used for all queries.*Ms values) has no explicit warmup loop of its own. The implicit warming works because the queryTimeMs warmup runs execute before benchQuery is ever called, but that coupling is fragile — if the order of operations ever changes, queries.*Ms loses its warm-up silently. Consider either adding a warmup inside benchQuery, or rephrasing the comment to clarify that the warmup runs here also prime the DB/NAPI cache for the subsequent queries.*Ms block.


Contributor Author


Fixed in de2212c. Added an explicit warmup loop inside benchQuery itself (3 runs, gated by QUERY_WARMUP_RUNS), so each call site primes its own NAPI/rusqlite/page-cache state. The methodology no longer depends on call ordering, and queryTimeMs now uses benchQuery directly — the comment describing parity with the queries.*Ms block is now accurate.
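
The order-independence point can be sketched as a self-warming helper. This is a hypothetical sketch assuming the shape described in the thread (names QUERY_WARMUP_RUNS, QUERY_RUNS, median, benchQuery come from the review; the real signature in scripts/benchmark.ts may differ):

```typescript
const QUERY_WARMUP_RUNS = 3;
const QUERY_RUNS = 5;

// Median of a numeric sample (does not mutate its input).
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Warmup runs inside the helper, so each call site primes its own
// NAPI/rusqlite/page-cache state regardless of call ordering.
function benchQuery<A extends unknown[]>(
  fn: (...args: A) => unknown,
  ...args: A
): number {
  for (let i = 0; i < QUERY_WARMUP_RUNS; i++) fn(...args); // results discarded
  const samples: number[] = [];
  for (let i = 0; i < QUERY_RUNS; i++) {
    const t0 = performance.now();
    fn(...args);
    samples.push(performance.now() - t0);
  }
  return median(samples); // steady-state estimate, robust to outliers
}
```

With this shape, the top-level queryTimeMs measurement and every queries.*Ms entry call the same helper, so removing or reordering any one call site cannot silently strip the warmup from another.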

…1133)

Addresses Greptile review feedback:

- Move QUERY_WARMUP_RUNS to the run-constants block at the top of the worker
  section, alongside INCREMENTAL_RUNS and QUERY_RUNS, so all run counts are
  tunable in one place.
- Add the warmup loop inside benchQuery itself instead of relying on the
  queryTimeMs warmup to implicitly prime caches for later calls. Each call
  site now warms independently, so the methodology no longer depends on
  call ordering.
- Use benchQuery for the top-level queryTimeMs measurement to eliminate
  duplicated warmup/timing logic and update the comment to describe the
  actual parity (benchQuery is also warmed) rather than implicit coupling.
@carlos-alm
Contributor Author

Addressed Greptile feedback in de2212c:

  • QUERY_WARMUP_RUNS location (scripts/benchmark.ts:93-95): moved to the run-constants block at the top of the worker section, alongside INCREMENTAL_RUNS and QUERY_RUNS, so all run counts are tunable in one place.
  • benchQuery warmup parity (scripts/benchmark.ts:136-145): added a warmup loop inside benchQuery itself, and reused benchQuery for the top-level queryTimeMs measurement. Each call site now warms independently — no implicit coupling on call ordering. Updated the comment to describe the actual parity rather than the fragile coupling.

@carlos-alm
Contributor Author

@greptileai

@github-actions
Contributor

github-actions Bot commented May 15, 2026

Codegraph Impact Analysis

1 function changed · 2 callers affected across 1 file

  • benchQuery in scripts/benchmark.ts:199 (2 transitive callers)


Development

Successfully merging this pull request may close these issues.

follow-up: investigate native Query time regression after R + Solidity native ports
