Skip to content

Incorporate statistical significance testing to benchmarks#614

Open
Micky774 wants to merge 2 commits into
devfrom
zain/bench/stats
Open

Incorporate statistical significance testing to benchmarks#614
Micky774 wants to merge 2 commits into
devfrom
zain/bench/stats

Conversation

@Micky774

@Micky774 Micky774 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Description

This PR adds support for @Arech8's benchstats repo for general significance testing (primarily using the Brunner Munzel test). This is important for distinguishing large-but-noisy and small-but-significant changes.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

New: compare_results.py --stats mode

  • Compares two per-sample CSVs (from --csv-samples) using a statistical test (Brunner-Munzel by default) via the benchstats package, instead of point-estimate ratios.
  • Marks each (config, label) as faster/slower/not-significant; exits 1 on a significant timing regression for CI gating.
  • New flags: --alpha (default 0.001), --method, --always-show-pvalues, --export-to (.txt/.svg/.html).
  • Reports time (main metric, drives exit code) alongside throughput/bandwidth (TFLOPS / GB/s) as a secondary metric.

New: parser_TEsamples.py

  • benchstats parser turning the samples CSV into the {benchmark: {metric: ndarray}} structure.
  • Exposes time_s (always) and a throughput metric keyed by unit; warns when <10 samples.
  • Drops throughput-less composite rows (e.g. Forward+Backward) so the metric set stays uniform.

utils.py — sampling & throughput

  • New --min-samples N flag (default 12) wired globally via run_benchmarks so all benchmark scripts inherit it without edits.
  • time_func tops up torch autorange shortfalls with additional equal-sized timing blocks (new _RawSamples measurement stand-in) instead of re-averaging.
  • Samples CSV gains throughput and unit columns; per-sample throughput derived as thr_mean * ms_mean / sample_ms.
  • Fix: time_ms now uses already-per-run sample times (removed the erroneous extra / number_per_run division).

Supporting

  • New requirements.txt pinning benchstats>=3.4.
  • README.md documents --stats, --min-samples, and the samples-CSV columns.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Micky774 and others added 2 commits June 5, 2026 16:31
Adds a statistically-rigorous comparison path to the microbenchmark suite,
on top of the existing throughput-ratio comparison.

- utils.py: time_func gains a repetitions param; --repetitions flag (default
  15) threaded through run_benchmarks so all bench scripts collect enough
  per-sample data with no per-script edits. repetitions<=1 preserves the
  original single-measurement behavior.
- parser_TEsamples.py: benchstats ParserBase reading the --csv-samples CSV
  into {bench_name: {time_ms: ndarray}}.
- compare_results.py: new --stats mode running a Brunner-Munzel test (via
  benchstats) on two samples CSVs; exits 1 on a significant regression for
  CI gating. Default ratio path unchanged; benchstats lazy-imported.
- requirements.txt: benchstats>=3.4.
- README: document the repetitions knob and --stats workflow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refines the benchstats integration:

- compare_results.py --stats now reports calculated throughput/bandwidth
  (TFLOPS / GB/s) alongside timing. Time (seconds) is the main metric and
  drives the exit code; throughput is a secondary metric. The throughput
  unit suffix is blanked (the column header names the unit) and time is fed
  in seconds so benchstats' auto-scaling (ms/us/ns) is correct.
- utils.py: --csv-samples now also writes per-sample throughput + unit
  columns, derived from the sample time (blank for samples-only records).
- parser_TEsamples.py exposes time_s (main) and a unit-keyed throughput
  metric, dropping throughput-less composite rows so the metric set stays
  uniform for benchstats' renderer.
- utils.py: replace --repetitions with --min-samples (default 12). Instead
  of re-running autorange and averaging, time_func keeps the raw per-block
  samples and tops up any shortfall with additional equal-sized blocks.
  Also fix the samples writer to not re-divide already-per-run times.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Arech8

Arech8 commented Jun 9, 2026

Copy link
Copy Markdown

Wow! I'm super happy to see benchstats is used in the wild, especially in such a deep and well done integration! Thanks for implementing that and pinging me, Meekail!

I'm still a long way to write some blog post/presentation about reliable benchmarking, but can share some things to consider for the future, that are very helpful to combat the noise. Noise is going to be a primary thing affecting comparison results after the correct results comparison procedure, like you did, is in place:

  • unless the code to benchmark runs inside a tight loop in the production setting, it's very beneficial to randomly interleave timed executions of different benchmarks. I.e, if A, B and C are single benchmarked runs of different functions, then if in production setting it doesn't run as A-A-A-... and so on, i.e. in a tight loop, it's also not correct to measure it as A-A-A-..., B-B-B..., C-C-C... First of all, data and instruction caches affect results quite heavily and second - a concurrent noise occuring during one of A-A-A... and not occuring during B-B-B (which is what typically happens even on local machines without other tenants) will skew the results heavily in favor of B. A better approach is to run individual functions in randomly interleaved order, like C-A-B-B-A-B-C-B-.... this helps a lot to even out the noise. The problem is that none of the benchmark runners I know of are doing that, except of my own from benchstats.qbench module... The runner you're using IIUC is Triton's benchmark module, which essentially unfolds to the same quite biased A-A-A-..., B-B-B..., C-C-C...

    The benchmark runner in benchstats.qbench module does interleaved running by default, however, it might be a bit problematic to reuse it directly for your setting, since that runner currently doesn't have a mode to let a caller chose how to time execution. Instead it uses perf_counter_ns() wall clock time, which isn't always what you want when measuring kernels performance (and have to carefully enforce synchronization for inputs and outputs in an async setting). I have a long standing plan to fix that, but since I primarily work with JAX, there's a whole load of other issues leading me to postpone this plan until better times... PyTorch/Triton combination is more flexible and it's easy to extract a kernel runtime from it, I should expedite implementation of that plan...

  • If there should be no serial dependency in run times of code A (i.e. if there are no internal caches, non-deterministic behaviour and so on, and if A1 is the first run and A2 is the second run, then all differences in runtimes should only happen due to noise), then any run Ai is expected to be no better or no worse then any other run Aj, right? Then it's beneficial to randomly permute obtained run-times and compute PValue for it, then repeat the permutation and compute PValue again, and so on - to gather a set of P-Values. Then judge a statistical significance by "any/none" criteria, i.e. "if none of PValues show significance, assume there's really no significance, otherwise there's significance or results aren't reliable". This helps to strengthen outcomes and make them way less prone to noise again (provided that the assumption about absence of dependencies, which are greatly diminished by the prev. bullet point, holds). This "P-Value bootstrapping" is also implemented in the benchmark runner in benchstats.qbench by default... Damn... The ability to reuse it could shrink your integration code & effort in the order of magnitude and turn it into a very concise integration like here I use for benchmarking kernels in jax-triton.... I need to fix it...

And just a note about process: for a non-parametric test that make no assumptions about underlying data distribution, such as Brunner Munzel Test (and in practice this also generally works with Mann–Whitney U test, but never with T-Test or similar), it's safe to use a tiny number of iterations, like 2-3-4 (as long as times are aggregated within a repetition with mean/average, which is the only correct choice almost always). Maybe even 1 would work nicely, I just only tested them up to 2 😁 . Then even with 100-200 repetitions, a total execution budget would be still rather small, but sensitivity of the test will be very high.

@Micky774

Micky774 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Wow! I'm super happy to see benchstats is used in the wild, especially in such a deep and well done integration! Thanks for implementing that and pinging me, Meekail!

It's a great project, and I appreciate your advocacy for statistical testing of benchmark results -- I think it's important for a healthy ecosystem!

  • unless the code to benchmark runs inside a tight loop in the production setting, it's very beneficial to randomly interleave timed executions of different benchmarks. I.e, if A, B and C are single benchmarked runs of different functions, then if in production setting it doesn't run as A-A-A-... and so on, i.e. in a tight loop, it's also not correct to measure it as A-A-A-..., B-B-B..., C-C-C... First of all, data and instruction caches affect results quite heavily and second - a concurrent noise occuring during one of A-A-A... and not occuring during B-B-B (which is what typically happens even on local machines without other tenants) will skew the results heavily in favor of B. A better approach is to run individual functions in randomly interleaved order, like C-A-B-B-A-B-C-B-.... this helps a lot to even out the noise. The problem is that none of the benchmark runners I know of are doing that, except of my own from benchstats.qbench module... The runner you're using IIUC is Triton's benchmark module, which essentially unfolds to the same quite biased A-A-A-..., B-B-B..., C-C-C...

Right, specifically the cache invalidation was a feature that I planned on introducing afterwards, but the interleaving having the advantage of mitigating local machine-level noise is a great point that isn't easily solved without interleaving. Unfortunately interleaving is a non-trivial addition and breaks our current benchmarking flow, but IMO it is worth trying to figure out how to implement on our end.

The benchmark runner in benchstats.qbench module does interleaved running by default, however, it might be a bit problematic to reuse it directly for your setting, since that runner currently doesn't have a mode to let a caller chose how to time execution. Instead it uses perf_counter_ns() wall clock time, which isn't always what you want when measuring kernels performance (and have to carefully enforce synchronization for inputs and outputs in an async setting). I have a long standing plan to fix that, but since I primarily work with JAX, there's a whole load of other issues leading me to postpone this plan until better times... PyTorch/Triton combination is more flexible and it's easy to extract a kernel runtime from it, I should expedite implementation of that plan...

I'd love to see such an update, but as you mentioned right now we're soft-locked onto the pytorch timing solution.

  • If there should be no serial dependency in run times of code A (i.e. if there are no internal caches, non-deterministic behaviour and so on, and if A1 is the first run and A2 is the second run, then all differences in runtimes should only happen due to noise), then any run Ai is expected to be no better or no worse then any other run Aj, right? Then it's beneficial to randomly permute obtained run-times and compute PValue for it, then repeat the permutation and compute PValue again, and so on - to gather a set of P-Values. Then judge a statistical significance by "any/none" criteria, i.e. "if none of PValues show significance, assume there's really no significance, otherwise there's significance or results aren't reliable". This helps to strengthen outcomes and make them way less prone to noise again (provided that the assumption about absence of dependencies, which are greatly diminished by the prev. bullet point, holds). This "P-Value bootstrapping" is also implemented in the benchmark runner in benchstats.qbench by default... Damn... The ability to reuse it could shrink your integration code & effort in the order of magnitude and turn it into a very concise integration like here I use for benchmarking kernels in jax-triton.... I need to fix it...

Just wanted to mention that bootstrapping and permutation testing are such miracles and are part of what I adore of statistics 💯 .

And just a note about process: for a non-parametric test that make no assumptions about underlying data distribution, such as Brunner Munzel Test (and in practice this also generally works with Mann–Whitney U test, but never with T-Test or similar), it's safe to use a tiny number of iterations, like 2-3-4 (as long as times are aggregated within a repetition with mean/average, which is the only correct choice almost always). Maybe even 1 would work nicely, I just only tested them up to 2 😁 . Then even with 100-200 repetitions, a total execution budget would be still rather small, but sensitivity of the test will be very high.

That's a good thing to note. I was wondering what the stability would look like empirically, since eventually we'll want to optimize for runtime in a CI environment. Thanks for the additional info!

Overall, thank you very much for your project, and for taking the time to speak on our current benchmarking implementation. This PR is still much of a prototype PR, but I'm hopeful we can get it integrated and move to some more meaningful results. You've keenly spotted a few challenges we'll have even w/ or w/o this specific approach. Now all that's left is for me to figure out the least painful way to address them 😄

@Arech8

Arech8 commented Jun 9, 2026

Copy link
Copy Markdown

advocacy for statistical testing of benchmark results -- I think it's important for a healthy ecosystem!

Yes, absolutely. Using unreliable benchmarks is like building on top of ever-shifting sands: it's only a matter of time when such benchmarks misfire. Using statistical tools for comparison is an important step towards repeatable results and reliable conclusions.

Right, specifically the cache invalidation was a feature that I planned on introducing afterwards ...

I'm not familiar with Torch's benchmarks runner, so can't tell yet exactly, however a few things to note here: Triton's benchmarking method do_bench() is also based on Torch and it utilizes cache clearing. On both platforms the implementation is identical, - it just allocates and zeros a 256MiB chunk of data, so assuming it's a correct way to clear cache, it's super easy to port. I wish clearing caches on CPUs were so simple! 😁 (for the context - it's incredibly hard and costly to ensure CPU caches are clear, as CPUs are latency focused and their caching algorithms are much smarter)

the interleaving having the advantage of mitigating local machine-level noise is a great point that isn't easily solved without interleaving.

Precisely that!

Unfortunately interleaving is a non-trivial addition and breaks our current benchmarking flow, but IMO it is worth trying to figure out how to implement on our end.

One of the advantages of Torch over JAX is relative openness and maturity of internals. In Torch it's very simple to implement a benchmarks runner that measures precisely a kernel run time. For example, here's how Triton does it with CUDA/HIP events, exported by Torch. Doing that isn't fun in JAX, since JAX (currently) just doesn't expose anything needed for that 😁

If things go right, somewhat soon I might return back to Silo to a team working with Torch and then, if @jcaraban agrees, I could pay much more attention to Torch benchmarking issues, and will definitely implement support in benchstats for custom timers (still in my spare time, sorry, as benchstats is my personal passion project and I'm not going to give it away by working on it in during AMD work time). Then all of that would be achievable with just a few lines of configuration code...

I was wondering what the stability would look like empirically, since eventually we'll want to optimize for runtime in a CI environment.

So, I was using the Jax-Triton benchmarking script I linked above for maybe...at least several hundred times already on a shared machine with multiple tenants working in parallel. Once I implemented interleaving (and this happened pretty early) and later pvalue bootstrapping, I've always used them and, especially after interleaving, I'm getting very consistent benchmarking results: if the results were found significant once, they are basically always significant. Basically - because it still depends on the background noise level and the difference in magnitude of runtimes. But I don't think I recall any failures for differences from 1-2% and higher, even though individual runtimes may fluctuate wildly, and depending on the machine load, even though average runtimes (across all executions of a benchmark in a session) might fluctuate x2+ times between some session - results are always the same. So I'm very happy with it. I've never seen before so solid reproducibility of complex benchmarks even on quiesced machines. I basically even ceased to worry about parallel tenants producing noise, even though this is typically the first thing you have to do before even starting to work on the machine quiescing... But please don't get me wrong - there are very many variables to care about to get solid results - and as long as you aren't using precisely the same in all minor details approach, "your mileage might vary". Hence the only thing I'm confidently expecting is that your results will be definitely more reliable, especially with interleaving.

Now all that's left is for me to figure out the least painful way to address them 😄

You might want to have a look at links above to Triton benchmarks runner for inspiration. This should be not hard to (re-)implement independently in Torch, though I hope I'll deliver custom timers to benchstats.qbench sooner and you just won't need to write a custom runner (but I really have no idea about requirements to your runner and might be wildly off guessing here, you know better).

@Micky774

Micky774 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@matthiasdiener here's an example simulation displaying the effect of transient noise. Results:

Results

==============================================================================
EXPT 0  CONTROL (no transient)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.048     0.002 |       0.9996  0.9842..1.0154
  ctrl | interleaved |    0.056     0.002 |       1.0003  0.9846..1.0158
-------------------------------------------------------------------------

==============================================================================
EXPT 1  WARM-UP RAMP (monotonic drift)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.049     0.000 |       0.9999  0.9844..1.0156
  ctrl | interleaved |    0.051     0.002 |       1.0002  0.9846..1.0157
-------------------------------------------------------------------------
  0.05 | sequential  |    0.864     0.414 |       1.0250  1.0093..1.0414
  0.05 | interleaved |    0.031     0.001 |       1.0009  0.9851..1.0172
-------------------------------------------------------------------------
  0.10 | sequential  |    1.000     0.993 |       1.0496  1.0332..1.0674
  0.10 | interleaved |    0.009     0.000 |       1.0015  0.9840..1.0206
-------------------------------------------------------------------------
  0.20 | sequential  |    1.000     1.000 |       1.0970  1.0768..1.1171
  0.20 | interleaved |    0.000     0.000 |       1.0027  0.9805..1.0253
-------------------------------------------------------------------------

==============================================================================
EXPT 2  STEP THROTTLE (random onset, 50% duration)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.052     0.002 |       1.0002  0.9843..1.0159
  ctrl | interleaved |    0.057     0.003 |       0.9999  0.9842..1.0156
-------------------------------------------------------------------------
  0.05 | sequential  |    0.625     0.405 |       1.0007  0.9518..1.0508
  0.05 | interleaved |    0.012     0.000 |       1.0001  0.9825..1.0176
-------------------------------------------------------------------------
  0.10 | sequential  |    0.698     0.530 |       1.0000  0.9080..1.1005
  0.10 | interleaved |    0.001     0.000 |       1.0003  0.9776..1.0235
-------------------------------------------------------------------------
  0.20 | sequential  |    0.646     0.467 |       0.9925  0.8323..1.2006
  0.20 | interleaved |    0.000     0.000 |       1.0004  0.9728..1.0269
-------------------------------------------------------------------------
  0.40 | sequential  |    0.652     0.482 |       1.0055  0.7138..1.4004
  0.40 | interleaved |    0.000     0.000 |       0.9996  0.9731..1.0269
-------------------------------------------------------------------------

==============================================================================
EXPT 3  REAL 5% SPEEDUP (B truly 5% FASTER) under warm-up ramp
        rate = fraction calling B SLOWER *with significance* (false regression)   (true_ratio B/A = 0.950, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.000     0.000 |       0.9500  0.9352..0.9650
  ctrl | interleaved |    0.000     0.000 |       0.9501  0.9350..0.9652
-------------------------------------------------------------------------
  0.05 | sequential  |    0.000     0.000 |       0.9737  0.9585..0.9893
  0.05 | interleaved |    0.000     0.000 |       0.9508  0.9353..0.9661
-------------------------------------------------------------------------
  0.10 | sequential  |    0.005     0.000 |       0.9974  0.9815..1.0133
  0.10 | interleaved |    0.000     0.000 |       0.9514  0.9340..0.9689
-------------------------------------------------------------------------
  0.20 | sequential  |    0.993     0.767 |       1.0421  1.0236..1.0608
  0.20 | interleaved |    0.000     0.000 |       0.9531  0.9319..0.9745
-------------------------------------------------------------------------

==============================================================================
DISTRIBUTION OF MEASURED RATIO  (median_B/median_A), warm-up ramp amp=0.20, null (true=1.0)
==============================================================================
  Truth = 1.000.  Sequential is shifted (BIAS); interleaved is centered.
  sequential  (biased: B always runs hotter)
    +0.860 | 
    +0.879 | 
    +0.898 | 
    +0.917 | 
    +0.936 | 
    +0.955 | 
    +0.974 | 
    +0.993 | 
    +1.012 | 
    +1.031 | 
    +1.050 | 
    +1.069 | #######
    +1.088 | ##############################################
    +1.107 | ###########################################
    +1.126 | ######
    +1.145 | 
    +1.164 | 
    +1.183 | 
    +1.202 | 
    +1.221 | 
    +1.240 | 
  interleaved (centered on truth)
    +0.860 | 
    +0.879 | 
    +0.898 | 
    +0.917 | 
    +0.936 | 
    +0.955 | 
    +0.974 | ########
    +0.993 | #############################################
    +1.012 | ##############################################
    +1.031 | #########
    +1.050 | 
    +1.069 | 
    +1.088 | 
    +1.107 | 
    +1.126 | 
    +1.145 | 
    +1.164 | 
    +1.183 | 
    +1.202 | 
    +1.221 | 
    +1.240 | 

@github-actions

Copy link
Copy Markdown

Claude Walkthrough

Intent. Make microbenchmark comparisons distinguish real timing regressions from measurement noise. Adds a --stats mode to compare_results.py that runs a Brunner-Munzel test on per-sample CSVs via the benchstats package, so CI can gate on statistical significance instead of point-estimate ratios.

Key changes.

  • New --stats mode and CLI flags (--alpha, --method, --always-show-pvalues, --export-to) in benchmarks/microbenchmarks/compare_results.py:62 (run_stats) that exits 1 on a significant timing regression.
  • New benchstats parser benchmarks/microbenchmarks/parser_TEsamples.py:46 turning a samples CSV into the {benchmark: {metric: ndarray}} shape, exposing time_s plus a unit-keyed throughput metric.
  • New --min-samples N flag (default 12) and a top-up loop in benchmarks/microbenchmarks/utils.py:107 (time_func) that pads torch autorange's few blocks to enough samples for a meaningful statistical test.
  • Samples CSV gains throughput and unit columns, plus a fix to per-sample time_ms in benchmarks/microbenchmarks/utils.py:417.
  • New benchmarks/microbenchmarks/requirements.txt pinning benchstats>=3.4; README.md documents the new flags and CSV schema.

Walkthrough.

  • compare_results.py — keeps the existing ratio mode untouched and adds run_stats(args) as a separate top-level path. It imports benchstats lazily (so the dependency is only needed when --stats is used), uses parser_TEsamples to read both CSVs, pins time_s as the main metric (the one that drives the exit code), and blanks the default s suffix on secondary throughput metrics via style_overrides because their unit is already in the column header. An import rich.table # noqa: F401 is intentional — a comment notes benchstats 3.4.0 renders use rich.table.Table without importing it themselves.
  • parser_TEsamples.py (new) — subclasses benchstats.common.ParserBase. Benchmark names are built by joining all non-reserved parameter columns plus label with |. Two metrics are exposed: time_s (always; converted from ms because benchstats' renderer auto-scales assuming a seconds base unit) and the throughput metric, keyed by its unit string (TFLOPS, GB/s, …) so the column header reads naturally. The parser warns when a benchmark has <10 samples. Because benchstats requires every benchmark to expose the same metric set, throughput-less composite rows (e.g. Forward+Backward) are dropped from the comparison whenever any other benchmark in the file carries throughput — the raw samples stay in the CSV for other tooling. The class name matches the file name so it can also be loaded via the benchstats CLI's --files_parser option.
  • utils.pytime_func gains a min_samples arg; when torch autorange returns fewer raw blocks than requested, it calls timer.timeit(number) repeatedly to top up rather than re-running the whole adaptive measurement (which would re-average rather than give independent samples). Results are wrapped in a tiny _RawSamples stand-in that mimics the Measurement attributes (times, number_per_run, mean) downstream callers use. A module-global _ACTIVE_MIN_SAMPLES is set by run_benchmarks from the CLI flag, so every existing bench_fn inherits the knob without per-script edits. The samples-CSV writer now emits a throughput column derived as thr_mean * ms_mean / sample_ms (valid because throughput is a deterministic function of time at fixed config) and a unit column; samples_only rows leave both blank. Drive-by fix: sr["time_ms"] no longer divides by number_per_runmeasurement.times entries are already per-run, so the old column was off by a factor of number_per_run.
  • README.md — documents the new samples-CSV columns, --min-samples, the recommended-≥10-samples threshold, the --stats workflow end-to-end (install deps, run on both checkouts, compare), and how the main vs secondary metric distinction plays into exit codes.

Testing. No automated tests added. Validation is by running the new --stats mode by hand against samples CSVs from two benchmark checkouts; the README documents the expected workflow.

Notes for reviewers.

  • time_ms semantics in the samples CSV change: prior CSVs had the column divided by number_per_run, so they are not directly comparable to new ones. Any external consumers of the old column should be re-checked.
  • The default --alpha=0.001 is aggressive (low false-positive rate), which suits a CI gate but may mask small regressions; reviewers may want to confirm that fits the intended use.
  • --min-samples defaults to 12 across all benchmark scripts (set via the module-level global), which will lengthen runtimes for benchmarks that previously stopped at the autorange minimum. The flag can be lowered (e.g. 2) to effectively disable top-up.
  • The throughput-derivation thr = thr_mean * ms_mean / sample_ms only holds if throughput is a strict function of time at fixed config — true for the existing TFLOPS/GB/s metrics but worth keeping in mind if new metric types are added.
  • benchstats pulls in rich, scipy, and numpy as transitive deps; the import is gated behind --stats, but anyone running pip install -r benchmarks/microbenchmarks/requirements.txt will get them all.

Generated by Claude. To request a code review, comment /claude review.

def run_stats(args):
"""Compare two samples CSVs with a statistical test via benchstats.

Timing (``time_ms``) is the main metric and drives the exit code; throughput

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc/code mismatch: this docstring says the main metric is time_ms, but the code below sets main_metrics = ["time_s"] (line 80) and the parser exposes the time samples under the key time_s. Update the docstring (or the constant) so the two agree — otherwise readers grepping for time_ms as the gating metric will get the wrong picture.

import os

import rich.table # noqa: F401 benchstats 3.4.0 render uses rich.table.Table without importing it
from parser_TEsamples import parser_TEsamples

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from parser_TEsamples import parser_TEsamples is an unqualified import that only resolves when the current working directory contains parser_TEsamples.py. Running python /abs/path/to/compare_results.py --stats ... from anywhere else (e.g. a CI working dir, or from the repo root) will fail with ModuleNotFoundError. Consider inserting the script's own directory into sys.path at the top of run_stats (or module-load time), e.g.

import os, sys
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

so the script is invocable from any cwd — which matters for CI usage where the gate may not chdir into benchmarks/microbenchmarks/ first.

elif export_fmt == "html":
console.save_html(args.export_to)

if cr.at_least_one_differs:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at_least_one_differs does not distinguish regressions from improvements. From the benchstats source (compare.py), the flag is set whenever a main-metric comparison comes back as < or >:

if main_metrics is None or metric_name in main_metrics:
    at_least_one_differs = at_least_one_differs or less_positive or greater_positive

So a candidate that is significantly faster will also trigger return 1, failing CI on a real improvement. The PR description and the README both frame --stats as a CI regression gate ("exits 1 on a significant timing regression"), but this implementation exits 1 on either direction.

If the intent is true regression gating, walk cr.results and check the per-metric direction marker (> means candidate slower for a "lower is better" metric like time_s), and only set the failure bit on that. The inline warning text at line 124 is technically accurate ("at least one significant difference") but contradicts the surrounding documentation — pick one behavior and align both.

@github-actions

Copy link
Copy Markdown

Claude review summary

Reviewed the PR head (e592a1c) against HEAD^1 (6ac5195). Scope: 5 files under benchmarks/microbenchmarks/ adding a --stats mode to compare_results.py, a new parser_TEsamples.py benchstats parser, and a --min-samples knob in the shared runner.

Verdict: Implementation is clean and well-documented; the new parser and the autorange top-up are sensibly designed. No correctness issues in the timing math itself (the time_ms "remove the extra / number_per_run division" fix is correct — Measurement.times is already per-call). Three minor issues raised inline on compare_results.py:

  • L65 docstring says the main metric is time_ms, but the code uses time_s.
  • L75 unqualified from parser_TEsamples import … requires the script's directory to be cwd; fragile for CI invocation.
  • L123 cr.at_least_one_differs fires for any significant difference (including a significant speedup), which contradicts the "regression gate" framing in the PR description / README. Either narrow the gate to the regression direction, or adjust the docs.

Copyright headers: OK — all three modified/added Python files carry the required AMD 2026 header; README.md and requirements.txt are skipped per convention.

Comment on lines +319 to +320
"with additional equal-sized blocks. Use a small value (e.g. 2) to "
"effectively disable top-up. Default: 12."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, couldn't even a low top up number be effectively a top-up?

Comment on lines +319 to +320
"with additional equal-sized blocks. Use a small value (e.g. 2) to "
"effectively disable top-up. Default: 12."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why '12' as the default value?

"""
import os

import rich.table # noqa: F401 benchstats 3.4.0 render uses rich.table.Table without importing it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug in benchstats? If so, can you report it?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchstats imports rich and uses a fully qualified rich.table.Table() to instantiate the only table object instance it uses. Besides that, its code is tested and was used over "a million times" already in a spectra environments. So I'd also like to understand why the rationale is tied to benchstats.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, looks like I managed to mess this up. Thanks for noticing! Please use version 3.4.1 that has this imports fixed.

Things shouldn't be that way. In the future please don't hesitate to ping me if you find something strange.

"At least one significant timing difference was detected (exit 1)."
)
return 1
return 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the exit code is the best way to relay the information - non-0 exit codes usually indicate something has failed, while for benchmarks, a significant timing difference is often expected.

Also, the granularity is a bit coarse - the exit code doesn't indicate in which 'direction' the difference goes.

Comment on lines +131 to +132
if min_samples is None:
min_samples = _ACTIVE_MIN_SAMPLES

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest keeping the min_samples mechanism off by default, so we can avoid revalidating the benchmarks in the default case.

speedups, and emits a markdown <details> block (optionally appending a
summary row to --summary-file).

2. --stats mode: compares two *samples* CSVs (written with --csv-samples) using

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--stats sounds a bit too generic, how about --statistical-test or so?

Comment on lines +144 to +150
# averaged over number_per_run runs) until enough raw samples are collected.
times = list(m.times) # per-run seconds
number = m.number_per_run
while len(times) < min_samples:
times.append(timer.timeit(number).mean)
mean_s = sum(times) / len(times)
return mean_s * 1e3, _RawSamples(times=times, number_per_run=number, mean=mean_s)

@matthiasdiener matthiasdiener Jun 16, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could monkeypatch the measurement loop instead, something like:

original_measurement_loop = timer._threaded_measurement_loop

def _threaded_measurement_loop(...):
    def stop_hook_with_min_executions(times):
        return len(times) * number >= min_number_executions and stop_hook(times)

    return original_measurement_loop(..., stop_hook_with_min_executions, ...)

timer._threaded_measurement_loop = _threaded_measurement_loop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants