perf(bb/msm): cache dx + idx in ba_fused_super pref_scratch #23520

Draft
AztecBot wants to merge 1 commit into zw/msm-webgpu-experiments-v2 from cb/a8b61d54bddd


Summary

Adds a fusedSuperOpt knob (default off) to the ba_fused_super_bench kernel — the bin-packed pair-tree bucket-accumulate that runs for every level of the MSM, with level 0 being the largest single dispatch in the pipeline.

The optimisation has two parts:

  1. Cache dx in pref_scratch. Forward stores dx[k] alongside prefix[k] (slot grows 8 → 16 u32, PREF_STRIDE 2 → 4). The inverse pass then loads dx_back from the cache instead of re-running 2 × load_active_x + 1 fr_sub.
  2. Pre-load chunk_plan indices into a function-local array<u32, S> at kernel entry; forward, inverse, and backward all reuse them instead of re-reading the index buffer.
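As a rough illustration of part 1, the forward and inverse passes can be modelled on the CPU as a batched inversion with an optional dx cache. This is a sketch under stated assumptions, not the WGSL kernel: the field is a small stand-in prime, values are not in Montgomery form, the inverse uses Fermat's little theorem rather than safegcd, and the names `batchInvDx`, `modInv`, and `cacheDx` are hypothetical.

```typescript
// Toy CPU model of the forward/inverse passes; NOT the actual kernel.
// Assumption: a small prime (65521) stands in for the real field modulus.
const P = 65521;

const mod = (a: number): number => ((a % P) + P) % P;
const mul = (a: number, b: number): number => (a * b) % P;

// Modular inverse via Fermat's little theorem (the kernel uses safegcd).
function modInv(a: number): number {
  let result = 1;
  let base = mod(a);
  let e = P - 2;
  while (e > 0) {
    if (e & 1) result = mul(result, base);
    base = mul(base, base);
    e >>= 1;
  }
  return result;
}

// Returns 1 / (x2[k] - x1[k]) for every k using a single field inversion.
function batchInvDx(x1: number[], x2: number[], cacheDx: boolean): number[] {
  const n = x1.length;
  const prefix = new Array<number>(n);
  const dxCache = new Array<number>(n); // only written when cacheDx is on

  // Forward: running product of dx, optionally storing dx next to prefix.
  let acc = 1;
  for (let k = 0; k < n; k++) {
    const dx = mod(x2[k] - x1[k]);
    acc = mul(acc, dx);
    prefix[k] = acc;
    if (cacheDx) dxCache[k] = dx;
  }

  // Inverse: walk backwards, peeling one dx off the running inverse per step.
  let running = modInv(acc);
  const out = new Array<number>(n);
  for (let k = n - 1; k >= 0; k--) {
    if (k === 0) {
      out[0] = running;
    } else {
      out[k] = mul(running, prefix[k - 1]);
      // With the cache, dx is loaded; without it, dx is recomputed from x1/x2.
      const dxBack = cacheDx ? dxCache[k] : mod(x2[k] - x1[k]);
      running = mul(running, dxBack);
    }
  }
  return out;
}
```

Both settings of `cacheDx` return the same inverses; the cached path trades extra scratch writes in the forward pass for fewer loads and one fewer subtraction per step in the inverse pass.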

Both are gated by one Mustache flag (super_opt) so the original code path is byte-identical when the knob is off. Wired through MsmConfig.fusedSuperOpt, the host prefScratchBuf / estimateMem sizing, the shader manager, and a ?superopt=1 URL knob on the dev bench page.
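A minimal sketch of how the host-side wiring might look, assuming a query-string knob feeding a config object that then drives scratch sizing. Only `fusedSuperOpt`, `?superopt=1`, and the 8 → 16 u32 slot sizes come from the description above; `MsmConfigSketch`, `configFromQuery`, `prefSlotU32`, and `prefScratchBytes` are hypothetical names, not the PR's actual API.

```typescript
// Hypothetical shape; the real MsmConfig has more fields than this.
interface MsmConfigSketch {
  fusedSuperOpt: boolean;
}

// Read the knob from a query string such as "?superopt=1".
// Any other value (or a missing parameter) leaves the knob off.
function configFromQuery(search: string): MsmConfigSketch {
  const params = new URLSearchParams(search);
  return { fusedSuperOpt: params.get("superopt") === "1" };
}

// u32 words per pref_scratch slot: 8 for prefix only, 16 with dx cached.
function prefSlotU32(cfg: MsmConfigSketch): number {
  return cfg.fusedSuperOpt ? 16 : 8;
}

// Bytes needed for `slots` scratch slots (4 bytes per u32).
function prefScratchBytes(cfg: MsmConfigSketch, slots: number): number {
  return slots * prefSlotU32(cfg) * 4;
}
```

Deriving the slot size from the same flag that selects the shader variant keeps the buffer allocation and the kernel's stride in agreement.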

Measurements (n=2^16, c=13, S=8)

| Platform | fused baseline | fused super_opt | Δ |
| --- | --- | --- | --- |
| macOS Sequoia / M-class | 31.3 ms | 25.0 ms | −20 % |
| Samsung S25 Ultra (Adreno 830) | 32.3 ms | 32.3 ms | ~0 % |
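The Δ column can be reproduced from the two timings as a relative change against the baseline. The helper name `relativeDeltaPercent` is hypothetical and is only meant as a quick arithmetic check.

```typescript
// Relative change of the optimised timing against the baseline, in percent.
// Negative values mean the optimised run is faster.
function relativeDeltaPercent(baselineMs: number, optMs: number): number {
  return ((optMs - baselineMs) / baselineMs) * 100;
}

// Timings taken from the measurements above.
const macDelta = relativeDeltaPercent(31.3, 25.0); // about -20.1
const s25Delta = relativeDeltaPercent(32.3, 32.3); // 0
```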

The macOS win lands in the higher levels (l1+), where the inverse-pass dx recomputation and chunk_plan re-reads are a meaningful fraction of per-thread time. Level 0 itself is unchanged on macOS (~12 ms before and after), so the pool-indexed loads are not the bottleneck there.

On Adreno 830, fused_l0 is compute-bound on the safegcd inverse and the Montgomery products: removing the indirect storage loads doesn't move the needle, and the doubled pref_scratch write traffic appears to offset the read savings. The default is left off so the macOS-specific savings stay opt-in until a per-device default lands. Enable with ?superopt=1 or config.fusedSuperOpt = true.

Why not split the inverse loop too?

A third optimisation, splitting k==0 out of the inverse loop into an epilogue so that the per-iteration divergent branch goes away, was implemented but reverted: Metal's WGSL compiler dropped the last inv = mont(inv, dx_back) update once it became a loop-tail statement whose result is only used outside the loop, producing a wrong inv_dx[0]. With the if/else branch kept, the update sits inside the else block and the compiler preserves it. The branch condition is uniform across the wave at runtime, so its cost is negligible.
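The two loop shapes can be compared in a toy CPU model. This is a sketch under assumptions, not the kernel source: a small prime replaces the real field, plain modular multiplication replaces mont, and `inverseBranchy` / `inverseSplit` are hypothetical names. On a correct compiler both shapes produce identical output, which is what makes the dropped update a miscompilation rather than a logic change.

```typescript
// Assumption: small stand-in prime; the kernel works in Montgomery form.
const Q = 65521;
const mulQ = (a: number, b: number): number => (a * b) % Q;

// Shape kept in the PR: a single loop, with k == 0 handled by a branch.
function inverseBranchy(prefix: number[], dx: number[], invTotal: number): number[] {
  const out = new Array<number>(dx.length);
  let inv = invTotal;
  for (let k = dx.length - 1; k >= 0; k--) {
    if (k === 0) {
      out[0] = inv;
    } else {
      out[k] = mulQ(inv, prefix[k - 1]);
      inv = mulQ(inv, dx[k]); // update sits inside the else block
    }
  }
  return out;
}

// Reverted shape: the loop stops at k == 1 and an epilogue consumes inv.
function inverseSplit(prefix: number[], dx: number[], invTotal: number): number[] {
  const out = new Array<number>(dx.length);
  let inv = invTotal;
  for (let k = dx.length - 1; k >= 1; k--) {
    out[k] = mulQ(inv, prefix[k - 1]);
    // On the final iteration this value is only read after the loop ends.
    inv = mulQ(inv, dx[k]);
  }
  if (dx.length > 0) out[0] = inv;
  return out;
}
```

Comparing the outputs of the two functions for the same inputs is one way a result cross-check could catch a dropped final update, since only `out[0]` would differ.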

Test plan

  • Open ?superopt=0 and ?superopt=1 on the dev bench page (macOS + S25), verify result x matches across runs
  • Confirm fused median drops ~20 % on macOS and doesn't regress on S25
  • Run the existing ?autorun=msm-cross-check flow on both platforms with ?superopt=1 to confirm the MSM result still cross-checks against the multi-threaded WASM path

Created by claudebox · group: slackbot

Adds a `fusedSuperOpt` knob (default off) to the ba_fused_super_bench
kernel that:

1. Caches the per-slot `dx` alongside the prefix product in
   pref_scratch. The inverse pass loads `dx_back` instead of re-running
   `2 × load_active_x + 1 fr_sub`. Doubles pref_scratch slot from 8 →
   16 u32.
2. Pre-loads the 2S chunk_plan indices into a function-local
   `array<u32, S>` at kernel entry; the forward, inverse, and backward
   passes all reuse them instead of re-reading the index buffer.

Wired through `MsmConfig.fusedSuperOpt`, the host pref_scratch /
estimateMem sizing, the shader manager, and a `?superopt=1` URL knob
on the dev bench page.

Measured on n=2^16 with the dev bench (older surface): macOS / M-class
GPU sees `fused` total drop 31.3 → 25.0 ms (−20 %), driven mostly by
the higher levels where the inverse + backward re-reads dominate.
Adreno 830 (Samsung S25 Ultra) sees no change — fused_l0 there is
compute-bound on the safegcd inverse + mont products, not on the
indirect storage loads that this opt eliminates. Default left off so
the macOS-specific savings stay opt-in until a per-device default lands.
AztecBot added the claudebox label (owned by claudebox; it can push to this PR) on May 22, 2026.