Incorporate statistical significance testing to benchmarks by Micky774 · Pull Request #614 · ROCm/TransformerEngine

Micky774 · 2026-06-08T14:58:07Z

Description

This PR adds support for @Arech8's benchstats repo for general significance testing (primarily using the Brunner Munzel test). This is important for distinguishing large-but-noisy and small-but-significant changes.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

New: compare_results.py --stats mode

Compares two per-sample CSVs (from --csv-samples) using a statistical test (Brunner-Munzel by default) via the benchstats package, instead of point-estimate ratios.
Marks each (config, label) as faster/slower/not-significant; exits 1 on a significant timing regression for CI gating.
New flags: --alpha (default 0.001), --method, --always-show-pvalues, --export-to (.txt/.svg/.html).
Reports time (main metric, drives exit code) alongside throughput/bandwidth (TFLOPS / GB/s) as a secondary metric.

New: parser_TEsamples.py

benchstats parser turning the samples CSV into the {benchmark: {metric: ndarray}} structure.
Exposes time_s (always) and a throughput metric keyed by unit; warns when <10 samples.
Drops throughput-less composite rows (e.g. Forward+Backward) so the metric set stays uniform.

utils.py — sampling & throughput

New --min-samples N flag (default 12) wired globally via run_benchmarks so all benchmark scripts inherit it without edits.
time_func tops up torch autorange shortfalls with additional equal-sized timing blocks (new _RawSamples measurement stand-in) instead of re-averaging.
Samples CSV gains throughput and unit columns; per-sample throughput derived as thr_mean * ms_mean / sample_ms.
Fix: time_ms now uses already-per-run sample times (removed the erroneous extra / number_per_run division).

Supporting

New requirements.txt pinning benchstats>=3.4.
README.md documents --stats, --min-samples, and the samples-CSV columns.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Adds a statistically-rigorous comparison path to the microbenchmark suite, on top of the existing throughput-ratio comparison. - utils.py: time_func gains a repetitions param; --repetitions flag (default 15) threaded through run_benchmarks so all bench scripts collect enough per-sample data with no per-script edits. repetitions<=1 preserves the original single-measurement behavior. - parser_TEsamples.py: benchstats ParserBase reading the --csv-samples CSV into {bench_name: {time_ms: ndarray}}. - compare_results.py: new --stats mode running a Brunner-Munzel test (via benchstats) on two samples CSVs; exits 1 on a significant regression for CI gating. Default ratio path unchanged; benchstats lazy-imported. - requirements.txt: benchstats>=3.4. - README: document the repetitions knob and --stats workflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Refines the benchstats integration: - compare_results.py --stats now reports calculated throughput/bandwidth (TFLOPS / GB/s) alongside timing. Time (seconds) is the main metric and drives the exit code; throughput is a secondary metric. The throughput unit suffix is blanked (the column header names the unit) and time is fed in seconds so benchstats' auto-scaling (ms/us/ns) is correct. - utils.py: --csv-samples now also writes per-sample throughput + unit columns, derived from the sample time (blank for samples-only records). - parser_TEsamples.py exposes time_s (main) and a unit-keyed throughput metric, dropping throughput-less composite rows so the metric set stays uniform for benchstats' renderer. - utils.py: replace --repetitions with --min-samples (default 12). Instead of re-running autorange and averaging, time_func keeps the raw per-block samples and tops up any shortfall with additional equal-sized blocks. Also fix the samples writer to not re-divide already-per-run times. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Arech8 · 2026-06-09T09:10:12Z

Wow! I'm super happy to see benchstats is used in the wild, especially in such a deep and well done integration! Thanks for implementing that and pinging me, Meekail!

I'm still a long way to write some blog post/presentation about reliable benchmarking, but can share some things to consider for the future, that are very helpful to combat the noise. Noise is going to be a primary thing affecting comparison results after the correct results comparison procedure, like you did, is in place:

unless the code to benchmark runs inside a tight loop in the production setting, it's very beneficial to randomly interleave timed executions of different benchmarks. I.e, if A, B and C are single benchmarked runs of different functions, then if in production setting it doesn't run as A-A-A-... and so on, i.e. in a tight loop, it's also not correct to measure it as A-A-A-..., B-B-B..., C-C-C... First of all, data and instruction caches affect results quite heavily and second - a concurrent noise occuring during one of A-A-A... and not occuring during B-B-B (which is what typically happens even on local machines without other tenants) will skew the results heavily in favor of B. A better approach is to run individual functions in randomly interleaved order, like C-A-B-B-A-B-C-B-.... this helps a lot to even out the noise. The problem is that none of the benchmark runners I know of are doing that, except of my own from benchstats.qbench module... The runner you're using IIUC is Triton's benchmark module, which essentially unfolds to the same quite biased A-A-A-..., B-B-B..., C-C-C...

The benchmark runner in benchstats.qbench module does interleaved running by default, however, it might be a bit problematic to reuse it directly for your setting, since that runner currently doesn't have a mode to let a caller chose how to time execution. Instead it uses perf_counter_ns() wall clock time, which isn't always what you want when measuring kernels performance (and have to carefully enforce synchronization for inputs and outputs in an async setting). I have a long standing plan to fix that, but since I primarily work with JAX, there's a whole load of other issues leading me to postpone this plan until better times... PyTorch/Triton combination is more flexible and it's easy to extract a kernel runtime from it, I should expedite implementation of that plan...
If there should be no serial dependency in run times of code A (i.e. if there are no internal caches, non-deterministic behaviour and so on, and if A1 is the first run and A2 is the second run, then all differences in runtimes should only happen due to noise), then any run Ai is expected to be no better or no worse then any other run Aj, right? Then it's beneficial to randomly permute obtained run-times and compute PValue for it, then repeat the permutation and compute PValue again, and so on - to gather a set of P-Values. Then judge a statistical significance by "any/none" criteria, i.e. "if none of PValues show significance, assume there's really no significance, otherwise there's significance or results aren't reliable". This helps to strengthen outcomes and make them way less prone to noise again (provided that the assumption about absence of dependencies, which are greatly diminished by the prev. bullet point, holds). This "P-Value bootstrapping" is also implemented in the benchmark runner in benchstats.qbench by default... Damn... The ability to reuse it could shrink your integration code & effort in the order of magnitude and turn it into a very concise integration like here I use for benchmarking kernels in jax-triton.... I need to fix it...

And just a note about process: for a non-parametric test that make no assumptions about underlying data distribution, such as Brunner Munzel Test (and in practice this also generally works with Mann–Whitney U test, but never with T-Test or similar), it's safe to use a tiny number of iterations, like 2-3-4 (as long as times are aggregated within a repetition with mean/average, which is the only correct choice almost always). Maybe even 1 would work nicely, I just only tested them up to 2 😁 . Then even with 100-200 repetitions, a total execution budget would be still rather small, but sensitivity of the test will be very high.

Micky774 · 2026-06-09T11:29:12Z

Wow! I'm super happy to see benchstats is used in the wild, especially in such a deep and well done integration! Thanks for implementing that and pinging me, Meekail!

It's a great project, and I appreciate your advocacy for statistical testing of benchmark results -- I think it's important for a healthy ecosystem!

unless the code to benchmark runs inside a tight loop in the production setting, it's very beneficial to randomly interleave timed executions of different benchmarks. I.e, if A, B and C are single benchmarked runs of different functions, then if in production setting it doesn't run as A-A-A-... and so on, i.e. in a tight loop, it's also not correct to measure it as A-A-A-..., B-B-B..., C-C-C... First of all, data and instruction caches affect results quite heavily and second - a concurrent noise occuring during one of A-A-A... and not occuring during B-B-B (which is what typically happens even on local machines without other tenants) will skew the results heavily in favor of B. A better approach is to run individual functions in randomly interleaved order, like C-A-B-B-A-B-C-B-.... this helps a lot to even out the noise. The problem is that none of the benchmark runners I know of are doing that, except of my own from benchstats.qbench module... The runner you're using IIUC is Triton's benchmark module, which essentially unfolds to the same quite biased A-A-A-..., B-B-B..., C-C-C...

Right, specifically the cache invalidation was a feature that I planned on introducing afterwards, but the interleaving having the advantage of mitigating local machine-level noise is a great point that isn't easily solved without interleaving. Unfortunately interleaving is a non-trivial addition and breaks our current benchmarking flow, but IMO it is worth trying to figure out how to implement on our end.

The benchmark runner in benchstats.qbench module does interleaved running by default, however, it might be a bit problematic to reuse it directly for your setting, since that runner currently doesn't have a mode to let a caller chose how to time execution. Instead it uses perf_counter_ns() wall clock time, which isn't always what you want when measuring kernels performance (and have to carefully enforce synchronization for inputs and outputs in an async setting). I have a long standing plan to fix that, but since I primarily work with JAX, there's a whole load of other issues leading me to postpone this plan until better times... PyTorch/Triton combination is more flexible and it's easy to extract a kernel runtime from it, I should expedite implementation of that plan...

I'd love to see such an update, but as you mentioned right now we're soft-locked onto the pytorch timing solution.

If there should be no serial dependency in run times of code A (i.e. if there are no internal caches, non-deterministic behaviour and so on, and if A1 is the first run and A2 is the second run, then all differences in runtimes should only happen due to noise), then any run Ai is expected to be no better or no worse then any other run Aj, right? Then it's beneficial to randomly permute obtained run-times and compute PValue for it, then repeat the permutation and compute PValue again, and so on - to gather a set of P-Values. Then judge a statistical significance by "any/none" criteria, i.e. "if none of PValues show significance, assume there's really no significance, otherwise there's significance or results aren't reliable". This helps to strengthen outcomes and make them way less prone to noise again (provided that the assumption about absence of dependencies, which are greatly diminished by the prev. bullet point, holds). This "P-Value bootstrapping" is also implemented in the benchmark runner in benchstats.qbench by default... Damn... The ability to reuse it could shrink your integration code & effort in the order of magnitude and turn it into a very concise integration like here I use for benchmarking kernels in jax-triton.... I need to fix it...

Just wanted to mention that bootstrapping and permutation testing are such miracles and are part of what I adore of statistics 💯 .

And just a note about process: for a non-parametric test that make no assumptions about underlying data distribution, such as Brunner Munzel Test (and in practice this also generally works with Mann–Whitney U test, but never with T-Test or similar), it's safe to use a tiny number of iterations, like 2-3-4 (as long as times are aggregated within a repetition with mean/average, which is the only correct choice almost always). Maybe even 1 would work nicely, I just only tested them up to 2 😁 . Then even with 100-200 repetitions, a total execution budget would be still rather small, but sensitivity of the test will be very high.

That's a good thing to note. I was wondering what the stability would look like empirically, since eventually we'll want to optimize for runtime in a CI environment. Thanks for the additional info!

Overall, thank you very much for your project, and for taking the time to speak on our current benchmarking implementation. This PR is still much of a prototype PR, but I'm hopeful we can get it integrated and move to some more meaningful results. You've keenly spotted a few challenges we'll have even w/ or w/o this specific approach. Now all that's left is for me to figure out the least painful way to address them 😄

Arech8 · 2026-06-09T12:41:04Z

advocacy for statistical testing of benchmark results -- I think it's important for a healthy ecosystem!

Yes, absolutely. Using unreliable benchmarks is like building on top of ever-shifting sands: it's only a matter of time when such benchmarks misfire. Using statistical tools for comparison is an important step towards repeatable results and reliable conclusions.

Right, specifically the cache invalidation was a feature that I planned on introducing afterwards ...

I'm not familiar with Torch's benchmarks runner, so can't tell yet exactly, however a few things to note here: Triton's benchmarking method do_bench() is also based on Torch and it utilizes cache clearing. On both platforms the implementation is identical, - it just allocates and zeros a 256MiB chunk of data, so assuming it's a correct way to clear cache, it's super easy to port. I wish clearing caches on CPUs were so simple! 😁 (for the context - it's incredibly hard and costly to ensure CPU caches are clear, as CPUs are latency focused and their caching algorithms are much smarter)

the interleaving having the advantage of mitigating local machine-level noise is a great point that isn't easily solved without interleaving.

Precisely that!

Unfortunately interleaving is a non-trivial addition and breaks our current benchmarking flow, but IMO it is worth trying to figure out how to implement on our end.

One of the advantages of Torch over JAX is relative openness and maturity of internals. In Torch it's very simple to implement a benchmarks runner that measures precisely a kernel run time. For example, here's how Triton does it with CUDA/HIP events, exported by Torch. Doing that isn't fun in JAX, since JAX (currently) just doesn't expose anything needed for that 😁

If things go right, somewhat soon I might return back to Silo to a team working with Torch and then, if @jcaraban agrees, I could pay much more attention to Torch benchmarking issues, and will definitely implement support in benchstats for custom timers (still in my spare time, sorry, as benchstats is my personal passion project and I'm not going to give it away by working on it in during AMD work time). Then all of that would be achievable with just a few lines of configuration code...

I was wondering what the stability would look like empirically, since eventually we'll want to optimize for runtime in a CI environment.

So, I was using the Jax-Triton benchmarking script I linked above for maybe...at least several hundred times already on a shared machine with multiple tenants working in parallel. Once I implemented interleaving (and this happened pretty early) and later pvalue bootstrapping, I've always used them and, especially after interleaving, I'm getting very consistent benchmarking results: if the results were found significant once, they are basically always significant. Basically - because it still depends on the background noise level and the difference in magnitude of runtimes. But I don't think I recall any failures for differences from 1-2% and higher, even though individual runtimes may fluctuate wildly, and depending on the machine load, even though average runtimes (across all executions of a benchmark in a session) might fluctuate x2+ times between some session - results are always the same. So I'm very happy with it. I've never seen before so solid reproducibility of complex benchmarks even on quiesced machines. I basically even ceased to worry about parallel tenants producing noise, even though this is typically the first thing you have to do before even starting to work on the machine quiescing... But please don't get me wrong - there are very many variables to care about to get solid results - and as long as you aren't using precisely the same in all minor details approach, "your mileage might vary". Hence the only thing I'm confidently expecting is that your results will be definitely more reliable, especially with interleaving.

Now all that's left is for me to figure out the least painful way to address them 😄

You might want to have a look at links above to Triton benchmarks runner for inspiration. This should be not hard to (re-)implement independently in Torch, though I hope I'll deliver custom timers to benchstats.qbench sooner and you just won't need to write a custom runner (but I really have no idea about requirements to your runner and might be wildly off guessing here, you know better).

Micky774 · 2026-06-09T19:54:17Z

@matthiasdiener here's an example simulation displaying the effect of transient noise. Results:

Results

==============================================================================
EXPT 0  CONTROL (no transient)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.048     0.002 |       0.9996  0.9842..1.0154
  ctrl | interleaved |    0.056     0.002 |       1.0003  0.9846..1.0158
-------------------------------------------------------------------------

==============================================================================
EXPT 1  WARM-UP RAMP (monotonic drift)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.049     0.000 |       0.9999  0.9844..1.0156
  ctrl | interleaved |    0.051     0.002 |       1.0002  0.9846..1.0157
-------------------------------------------------------------------------
  0.05 | sequential  |    0.864     0.414 |       1.0250  1.0093..1.0414
  0.05 | interleaved |    0.031     0.001 |       1.0009  0.9851..1.0172
-------------------------------------------------------------------------
  0.10 | sequential  |    1.000     0.993 |       1.0496  1.0332..1.0674
  0.10 | interleaved |    0.009     0.000 |       1.0015  0.9840..1.0206
-------------------------------------------------------------------------
  0.20 | sequential  |    1.000     1.000 |       1.0970  1.0768..1.1171
  0.20 | interleaved |    0.000     0.000 |       1.0027  0.9805..1.0253
-------------------------------------------------------------------------

==============================================================================
EXPT 2  STEP THROTTLE (random onset, 50% duration)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.052     0.002 |       1.0002  0.9843..1.0159
  ctrl | interleaved |    0.057     0.003 |       0.9999  0.9842..1.0156
-------------------------------------------------------------------------
  0.05 | sequential  |    0.625     0.405 |       1.0007  0.9518..1.0508
  0.05 | interleaved |    0.012     0.000 |       1.0001  0.9825..1.0176
-------------------------------------------------------------------------
  0.10 | sequential  |    0.698     0.530 |       1.0000  0.9080..1.1005
  0.10 | interleaved |    0.001     0.000 |       1.0003  0.9776..1.0235
-------------------------------------------------------------------------
  0.20 | sequential  |    0.646     0.467 |       0.9925  0.8323..1.2006
  0.20 | interleaved |    0.000     0.000 |       1.0004  0.9728..1.0269
-------------------------------------------------------------------------
  0.40 | sequential  |    0.652     0.482 |       1.0055  0.7138..1.4004
  0.40 | interleaved |    0.000     0.000 |       0.9996  0.9731..1.0269
-------------------------------------------------------------------------

==============================================================================
EXPT 3  REAL 5% SPEEDUP (B truly 5% FASTER) under warm-up ramp
        rate = fraction calling B SLOWER *with significance* (false regression)   (true_ratio B/A = 0.950, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.000     0.000 |       0.9500  0.9352..0.9650
  ctrl | interleaved |    0.000     0.000 |       0.9501  0.9350..0.9652
-------------------------------------------------------------------------
  0.05 | sequential  |    0.000     0.000 |       0.9737  0.9585..0.9893
  0.05 | interleaved |    0.000     0.000 |       0.9508  0.9353..0.9661
-------------------------------------------------------------------------
  0.10 | sequential  |    0.005     0.000 |       0.9974  0.9815..1.0133
  0.10 | interleaved |    0.000     0.000 |       0.9514  0.9340..0.9689
-------------------------------------------------------------------------
  0.20 | sequential  |    0.993     0.767 |       1.0421  1.0236..1.0608
  0.20 | interleaved |    0.000     0.000 |       0.9531  0.9319..0.9745
-------------------------------------------------------------------------

==============================================================================
DISTRIBUTION OF MEASURED RATIO  (median_B/median_A), warm-up ramp amp=0.20, null (true=1.0)
==============================================================================
  Truth = 1.000.  Sequential is shifted (BIAS); interleaved is centered.
  sequential  (biased: B always runs hotter)
    +0.860 | 
    +0.879 | 
    +0.898 | 
    +0.917 | 
    +0.936 | 
    +0.955 | 
    +0.974 | 
    +0.993 | 
    +1.012 | 
    +1.031 | 
    +1.050 | 
    +1.069 | #######
    +1.088 | ##############################################
    +1.107 | ###########################################
    +1.126 | ######
    +1.145 | 
    +1.164 | 
    +1.183 | 
    +1.202 | 
    +1.221 | 
    +1.240 | 
  interleaved (centered on truth)
    +0.860 | 
    +0.879 | 
    +0.898 | 
    +0.917 | 
    +0.936 | 
    +0.955 | 
    +0.974 | ########
    +0.993 | #############################################
    +1.012 | ##############################################
    +1.031 | #########
    +1.050 | 
    +1.069 | 
    +1.088 | 
    +1.107 | 
    +1.126 | 
    +1.145 | 
    +1.164 | 
    +1.183 | 
    +1.202 | 
    +1.221 | 
    +1.240 |

github-actions · 2026-06-16T18:43:54Z

Claude Walkthrough

Intent. Make microbenchmark comparisons distinguish real timing regressions from measurement noise. Adds a --stats mode to compare_results.py that runs a Brunner-Munzel test on per-sample CSVs via the benchstats package, so CI can gate on statistical significance instead of point-estimate ratios.

Key changes.

New --stats mode and CLI flags (--alpha, --method, --always-show-pvalues, --export-to) in benchmarks/microbenchmarks/compare_results.py:62 (run_stats) that exits 1 on a significant timing regression.
New benchstats parser benchmarks/microbenchmarks/parser_TEsamples.py:46 turning a samples CSV into the {benchmark: {metric: ndarray}} shape, exposing time_s plus a unit-keyed throughput metric.
New --min-samples N flag (default 12) and a top-up loop in benchmarks/microbenchmarks/utils.py:107 (time_func) that pads torch autorange's few blocks to enough samples for a meaningful statistical test.
Samples CSV gains throughput and unit columns, plus a fix to per-sample time_ms in benchmarks/microbenchmarks/utils.py:417.
New benchmarks/microbenchmarks/requirements.txt pinning benchstats>=3.4; README.md documents the new flags and CSV schema.

Walkthrough.

compare_results.py — keeps the existing ratio mode untouched and adds run_stats(args) as a separate top-level path. It imports benchstats lazily (so the dependency is only needed when --stats is used), uses parser_TEsamples to read both CSVs, pins time_s as the main metric (the one that drives the exit code), and blanks the default s suffix on secondary throughput metrics via style_overrides because their unit is already in the column header. An import rich.table # noqa: F401 is intentional — a comment notes benchstats 3.4.0 renders use rich.table.Table without importing it themselves.
parser_TEsamples.py (new) — subclasses benchstats.common.ParserBase. Benchmark names are built by joining all non-reserved parameter columns plus label with |. Two metrics are exposed: time_s (always; converted from ms because benchstats' renderer auto-scales assuming a seconds base unit) and the throughput metric, keyed by its unit string (TFLOPS, GB/s, …) so the column header reads naturally. The parser warns when a benchmark has <10 samples. Because benchstats requires every benchmark to expose the same metric set, throughput-less composite rows (e.g. Forward+Backward) are dropped from the comparison whenever any other benchmark in the file carries throughput — the raw samples stay in the CSV for other tooling. The class name matches the file name so it can also be loaded via the benchstats CLI's --files_parser option.
utils.py — time_func gains a min_samples arg; when torch autorange returns fewer raw blocks than requested, it calls timer.timeit(number) repeatedly to top up rather than re-running the whole adaptive measurement (which would re-average rather than give independent samples). Results are wrapped in a tiny _RawSamples stand-in that mimics the Measurement attributes (times, number_per_run, mean) downstream callers use. A module-global _ACTIVE_MIN_SAMPLES is set by run_benchmarks from the CLI flag, so every existing bench_fn inherits the knob without per-script edits. The samples-CSV writer now emits a throughput column derived as thr_mean * ms_mean / sample_ms (valid because throughput is a deterministic function of time at fixed config) and a unit column; samples_only rows leave both blank. Drive-by fix: sr["time_ms"] no longer divides by number_per_run — measurement.times entries are already per-run, so the old column was off by a factor of number_per_run.
README.md — documents the new samples-CSV columns, --min-samples, the recommended-≥10-samples threshold, the --stats workflow end-to-end (install deps, run on both checkouts, compare), and how the main vs secondary metric distinction plays into exit codes.

Testing. No automated tests added. Validation is by running the new --stats mode by hand against samples CSVs from two benchmark checkouts; the README documents the expected workflow.

Notes for reviewers.

time_ms semantics in the samples CSV change: prior CSVs had the column divided by number_per_run, so they are not directly comparable to new ones. Any external consumers of the old column should be re-checked.
The default --alpha=0.001 is aggressive (low false-positive rate), which suits a CI gate but may mask small regressions; reviewers may want to confirm that fits the intended use.
--min-samples defaults to 12 across all benchmark scripts (set via the module-level global), which will lengthen runtimes for benchmarks that previously stopped at the autorange minimum. The flag can be lowered (e.g. 2) to effectively disable top-up.
The throughput-derivation thr = thr_mean * ms_mean / sample_ms only holds if throughput is a strict function of time at fixed config — true for the existing TFLOPS/GB/s metrics but worth keeping in mind if new metric types are added.
benchstats pulls in rich, scipy, and numpy as transitive deps; the import is gated behind --stats, but anyone running pip install -r benchmarks/microbenchmarks/requirements.txt will get them all.

Generated by Claude. To request a code review, comment /claude review.

github-actions · 2026-06-16T18:48:17Z

+def run_stats(args):
+    """Compare two samples CSVs with a statistical test via benchstats.
+
+    Timing (``time_ms``) is the main metric and drives the exit code; throughput


Doc/code mismatch: this docstring says the main metric is time_ms, but the code below sets main_metrics = ["time_s"] (line 80) and the parser exposes the time samples under the key time_s. Update the docstring (or the constant) so the two agree — otherwise readers grepping for time_ms as the gating metric will get the wrong picture.

github-actions · 2026-06-16T18:48:26Z

+    import os
+
+    import rich.table  # noqa: F401  benchstats 3.4.0 render uses rich.table.Table without importing it
+    from parser_TEsamples import parser_TEsamples


from parser_TEsamples import parser_TEsamples is an unqualified import that only resolves when the current working directory contains parser_TEsamples.py. Running python /abs/path/to/compare_results.py --stats ... from anywhere else (e.g. a CI working dir, or from the repo root) will fail with ModuleNotFoundError. Consider inserting the script's own directory into sys.path at the top of run_stats (or module-load time), e.g.

import os, sys sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

so the script is invocable from any cwd — which matters for CI usage where the gate may not chdir into benchmarks/microbenchmarks/ first.

github-actions · 2026-06-16T18:48:31Z

+        elif export_fmt == "html":
+            console.save_html(args.export_to)
+
+    if cr.at_least_one_differs:


at_least_one_differs does not distinguish regressions from improvements. From the benchstats source (compare.py), the flag is set whenever a main-metric comparison comes back as < or >:

if main_metrics is None or metric_name in main_metrics: at_least_one_differs = at_least_one_differs or less_positive or greater_positive

So a candidate that is significantly faster will also trigger return 1, failing CI on a real improvement. The PR description and the README both frame --stats as a CI regression gate ("exits 1 on a significant timing regression"), but this implementation exits 1 on either direction.

If the intent is true regression gating, walk cr.results and check the per-metric direction marker (> means candidate slower for a "lower is better" metric like time_s), and only set the failure bit on that. The inline warning text at line 124 is technically accurate ("at least one significant difference") but contradicts the surrounding documentation — pick one behavior and align both.

github-actions · 2026-06-16T18:48:47Z

Claude review summary

Reviewed the PR head (e592a1c) against HEAD^1 (6ac5195). Scope: 5 files under benchmarks/microbenchmarks/ adding a --stats mode to compare_results.py, a new parser_TEsamples.py benchstats parser, and a --min-samples knob in the shared runner.

Verdict: Implementation is clean and well-documented; the new parser and the autorange top-up are sensibly designed. No correctness issues in the timing math itself (the time_ms "remove the extra / number_per_run division" fix is correct — Measurement.times is already per-call). Three minor issues raised inline on compare_results.py:

L65 docstring says the main metric is time_ms, but the code uses time_s.
L75 unqualified from parser_TEsamples import … requires the script's directory to be cwd; fragile for CI invocation.
L123 cr.at_least_one_differs fires for any significant difference (including a significant speedup), which contradicts the "regression gate" framing in the PR description / README. Either narrow the gate to the regression direction, or adjust the docs.

Copyright headers: OK — all three modified/added Python files carry the required AMD 2026 header; README.md and requirements.txt are skipped per convention.

matthiasdiener · 2026-06-16T20:30:12Z

+            "with additional equal-sized blocks. Use a small value (e.g. 2) to "
+            "effectively disable top-up. Default: 12."


In the current implementation, couldn't even a low top up number be effectively a top-up?

matthiasdiener · 2026-06-16T20:30:22Z

+            "with additional equal-sized blocks. Use a small value (e.g. 2) to "
+            "effectively disable top-up. Default: 12."


Why '12' as the default value?

matthiasdiener · 2026-06-16T20:34:41Z

+    """
+    import os
+
+    import rich.table  # noqa: F401  benchstats 3.4.0 render uses rich.table.Table without importing it


Is this a bug in benchstats? If so, can you report it?

Benchstats imports rich and uses a fully qualified rich.table.Table() to instantiate the only table object instance it uses. Besides that, its code is tested and was used over "a million times" already in a spectra environments. So I'd also like to understand why the rationale is tied to benchstats.

All right, looks like I managed to mess this up. Thanks for noticing! Please use version 3.4.1 that has this imports fixed.

Things shouldn't be that way. In the future please don't hesitate to ping me if you find something strange.

matthiasdiener · 2026-06-16T20:35:36Z

+            "At least one significant timing difference was detected (exit 1)."
+        )
+        return 1
+    return 0


Not sure the exit code is the best way to relay the information - non-0 exit codes usually indicate something has failed, while for benchmarks, a significant timing difference is often expected.

Also, the granularity is a bit coarse - the exit code doesn't indicate in which 'direction' the difference goes.

matthiasdiener · 2026-06-16T20:45:58Z

+    if min_samples is None:
+        min_samples = _ACTIVE_MIN_SAMPLES


I would suggest keeping the min_samples mechanism off by default, so we can avoid revalidating the benchmarks in the default case.

matthiasdiener · 2026-06-16T21:39:30Z

+   speedups, and emits a markdown <details> block (optionally appending a
+   summary row to --summary-file).
+
+2. --stats mode: compares two *samples* CSVs (written with --csv-samples) using


--stats sounds a bit too generic, how about --statistical-test or so?

matthiasdiener · 2026-06-16T21:43:21Z

+    # averaged over number_per_run runs) until enough raw samples are collected.
+    times = list(m.times)  # per-run seconds
+    number = m.number_per_run
+    while len(times) < min_samples:
+        times.append(timer.timeit(number).mean)
+    mean_s = sum(times) / len(times)
+    return mean_s * 1e3, _RawSamples(times=times, number_per_run=number, mean=mean_s)


We could monkeypatch the measurement loop instead, something like:

original_measurement_loop = timer._threaded_measurement_loop def _threaded_measurement_loop(...): def stop_hook_with_min_executions(times): return len(times) * number >= min_number_executions and stop_hook(times) return original_measurement_loop(..., stop_hook_with_min_executions, ...) timer._threaded_measurement_loop = _threaded_measurement_loop

Micky774 and others added 2 commits June 5, 2026 16:31

Micky774 requested a review from matthiasdiener June 8, 2026 14:58

Micky774 requested review from ipanfilo, wangye805 and wenchenvincent as code owners June 8, 2026 14:58

wenchenvincent added claude-review claude-summary labels Jun 16, 2026

github-actions Bot removed the claude-summary label Jun 16, 2026

github-actions Bot reviewed Jun 16, 2026

View reviewed changes

github-actions Bot removed the claude-review label Jun 16, 2026

matthiasdiener reviewed Jun 16, 2026

View reviewed changes

		"with additional equal-sized blocks. Use a small value (e.g. 2) to "
		"effectively disable top-up. Default: 12."

Conversation

Micky774 commented Jun 8, 2026

Description

Type of change

Changes

Checklist:

Uh oh!

Arech8 commented Jun 9, 2026

Uh oh!

Micky774 commented Jun 9, 2026

Uh oh!

Arech8 commented Jun 9, 2026

Uh oh!

Micky774 commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 16, 2026

Claude Walkthrough

Uh oh!

github-actions Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 16, 2026

Claude review summary

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matthiasdiener Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

matthiasdiener Jun 16, 2026 •

edited

Loading