perf(bb/msm): cache dx + idx in ba_fused_super pref_scratch by AztecBot · Pull Request #23520 · AztecProtocol/aztec-packages

AztecBot · 2026-05-22T22:14:53Z

Summary

Adds a fusedSuperOpt knob (default off) to the ba_fused_super_bench kernel — the bin-packed pair-tree bucket-accumulate that runs for every level of the MSM, with level 0 being the largest single dispatch in the pipeline.

The optimisation has two parts:

1. Cache dx in pref_scratch. The forward pass stores dx[k] alongside prefix[k] (slot grows 8 → 16 u32, PREF_STRIDE 2 → 4). The inverse pass then loads dx_back from the cache instead of re-running 2 × load_active_x + 1 fr_sub.
2. Pre-load chunk_plan indices into a function-local array<u32, S> at kernel entry; the forward, inverse, and backward passes all reuse them instead of re-reading the index buffer.
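The dx cache can be illustrated with a CPU model of the batched inversion. This is a sketch under stated assumptions, not the WGSL kernel: the small modulus `FIELD_P`, the helpers `fmod`, `fmul`, `finv`, and the function `batchInvertDx` are hypothetical names, and plain arrays stand in for pref_scratch. It also assumes every dx is non-zero.

```typescript
// CPU model only: the real kernel is WGSL over a large prime field.
// FIELD_P, the helper names, and the array-based scratch are assumptions.
const FIELD_P = 1000003; // small prime so products stay below 2^53

const fmod = (a: number): number => ((a % FIELD_P) + FIELD_P) % FIELD_P;
const fmul = (a: number, b: number): number => (a * b) % FIELD_P;

// Inverse via Fermat's little theorem (the PR's kernel uses safegcd instead).
function finv(a: number): number {
  let result = 1;
  let base = fmod(a);
  for (let e = FIELD_P - 2; e > 0; e = Math.floor(e / 2)) {
    if (e % 2 === 1) result = fmul(result, base);
    base = fmul(base, base);
  }
  return result;
}

// pairs[k] = [xLeft, xRight]; dx[k] = xRight - xLeft (assumed non-zero).
function batchInvertDx(pairs: Array<[number, number]>, cacheDx: boolean): number[] {
  const n = pairs.length;
  const prefix = new Array<number>(n);
  const dxCache = new Array<number>(n);

  // Forward pass: running product of dx, optionally storing dx next to prefix.
  let acc = 1;
  for (let k = 0; k < n; k++) {
    const dx = fmod(pairs[k][1] - pairs[k][0]);
    acc = fmul(acc, dx);
    prefix[k] = acc;
    if (cacheDx) dxCache[k] = dx;
  }

  // A single field inversion covers the whole batch.
  let inv = finv(acc);

  // Inverse pass: walk backwards, peeling one dx off the running inverse.
  const out = new Array<number>(n);
  for (let k = n - 1; k >= 0; k--) {
    // With the cache, dx comes from scratch; without it, dx is recomputed
    // from two x loads and one subtraction.
    const dxBack = cacheDx ? dxCache[k] : fmod(pairs[k][1] - pairs[k][0]);
    if (k === 0) {
      out[0] = inv;
    } else {
      out[k] = fmul(inv, prefix[k - 1]);
      inv = fmul(inv, dxBack);
    }
  }
  return out;
}
```

Both settings of `cacheDx` return the same inverses; the cached path trades one extra store per slot in the forward pass for two loads and a subtraction saved in the inverse pass.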

Both are gated by one Mustache flag (super_opt) so the original code path is byte-identical when the knob is off. Wired through MsmConfig.fusedSuperOpt, the host prefScratchBuf / estimateMem sizing, the shader manager, and a ?superopt=1 URL knob on the dev bench page.
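The host-side sizing change can be sketched as follows. Only the slot sizes (8 and 16 u32) and the PREF_STRIDE values (2 and 4) come from the description above; the function names and the `slotCount` parameter are hypothetical, and the real prefScratchBuf / estimateMem code may compute the slot count differently.

```typescript
// Hypothetical host-side helper; only the 8/16 u32 slot sizes and the
// 2/4 PREF_STRIDE values are taken from the PR description.
interface PrefScratchLayout {
  slotU32: number;    // u32 words per pref_scratch slot
  prefStride: number; // PREF_STRIDE value passed to the shader
}

function prefScratchLayout(fusedSuperOpt: boolean): PrefScratchLayout {
  return fusedSuperOpt
    ? { slotU32: 16, prefStride: 4 } // prefix[k] plus cached dx[k]
    : { slotU32: 8, prefStride: 2 }; // prefix[k] only
}

// Buffer size in bytes for a given number of slots (a u32 is 4 bytes).
function prefScratchBytes(slotCount: number, fusedSuperOpt: boolean): number {
  return slotCount * prefScratchLayout(fusedSuperOpt).slotU32 * 4;
}
```

With the knob on, the buffer doubles for the same slot count, which is why the memory estimate has to follow the flag.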

Measurements (n=2^16, c=13, S=8)

| Platform | `fused` baseline | `fused` super_opt | Δ |
| --- | --- | --- | --- |
| macOS Sequoia / M-class | 31.3 ms | 25.0 ms | −20 % |
| Samsung S25 Ultra (Adreno 830) | 32.3 ms | 32.3 ms | ~0 % |

The macOS win lands across the higher levels (l1+), where the inverse-pass dx recomputation and chunk_plan re-reads are a meaningful fraction of per-thread time. Level 0 itself is unchanged on macOS (~12 ms before and after), so the pool-indexed loads are not the bottleneck there.

On Adreno 830, fused_l0 is compute-bound on the safegcd inverse and Montgomery products: removing the indirect storage loads produces no measurable change, and the doubled pref_scratch write traffic offsets whatever the reads save. The default is left off so the macOS-specific savings stay opt-in until a per-device default lands. Enable with ?superopt=1 or config.fusedSuperOpt = true.
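A minimal sketch of how the URL knob could map onto the config flag. `MsmConfigSketch` and `configFromQuery` are hypothetical names, and the dev bench page may parse its query string differently; only the `superopt` parameter, the `fusedSuperOpt` field, and the default-off behaviour come from the description.

```typescript
// Hypothetical subset of MsmConfig; only fusedSuperOpt is from the PR.
interface MsmConfigSketch {
  fusedSuperOpt: boolean;
}

// Maps "?superopt=1" to fusedSuperOpt = true; anything else keeps the
// default-off behaviour described above.
function configFromQuery(query: string): MsmConfigSketch {
  const params = new URLSearchParams(query);
  return { fusedSuperOpt: params.get("superopt") === "1" };
}
```

For example, `configFromQuery("?superopt=1")` enables the optimisation, while an empty query string or `?superopt=0` leaves it off.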

Why not split the inverse loop too?

A third optimisation was implemented but reverted: splitting k==0 out of the inverse loop into an epilogue so that the per-iteration divergent branch goes away. Metal's WGSL compiler dropped the last inv = mont(inv, dx_back) update once it became a loop-tail statement whose result is used only outside the loop, producing a wrong inv_dx[0]. With the if/else branch kept, the update sits inside the else block and the compiler preserves it. The if/else is uniform across the wave at runtime, so the cost is negligible.
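The two loop shapes can be compared in a CPU model. This is an illustration only: `Q`, `mulQ`, and both function names are hypothetical, the modulus is a small prime rather than the real field, and the inputs are assumed to be non-empty. On a correct compiler the two shapes are equivalent; the miscompile described above affected only the second one.

```typescript
// CPU model of the two inverse-pass shapes; all names here are hypothetical.
const Q = 1000003; // small prime so products stay below 2^53
const mulQ = (a: number, b: number): number => (a * b) % Q;

// Shape kept in the PR: the k == 0 case is a branch inside the loop.
function inversePassBranch(prefix: number[], dx: number[], invTotal: number): number[] {
  const out = new Array<number>(dx.length);
  let inv = invTotal;
  for (let k = dx.length - 1; k >= 0; k--) {
    if (k === 0) {
      out[0] = inv;
    } else {
      out[k] = mulQ(inv, prefix[k - 1]);
      inv = mulQ(inv, dx[k]);
    }
  }
  return out;
}

// Shape that was reverted: k == 0 is peeled into an epilogue.
function inversePassEpilogue(prefix: number[], dx: number[], invTotal: number): number[] {
  const out = new Array<number>(dx.length);
  let inv = invTotal;
  for (let k = dx.length - 1; k >= 1; k--) {
    out[k] = mulQ(inv, prefix[k - 1]);
    // On the final iteration (k == 1) this update is read only after the
    // loop; this is the statement the PR reports as being dropped.
    inv = mulQ(inv, dx[k]);
  }
  out[0] = inv;
  return out;
}
```

If that final update is lost, `out[0]` ends up holding the inverse of `dx[0] * dx[1]` instead of the inverse of `dx[0]`, which matches the wrong inv_dx[0] symptom.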

Test plan

Open ?superopt=0 and ?superopt=1 on the dev bench page (macOS + S25), verify result x matches across runs
Confirm fused median drops ~20 % on macOS and doesn't regress on S25
Run the existing ?autorun=msm-cross-check flow on both platforms with ?superopt=1 to confirm the MSM result still cross-checks against the multi-threaded WASM path
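The median comparison in the second step can be checked with a small helper. `median` and `relativeDelta` are hypothetical names, not part of the bench page; the sketch assumes per-run timings in milliseconds have been collected by hand.

```typescript
// Hypothetical helpers for comparing bench runs; not part of the dev bench page.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Relative change of the candidate against the baseline; negative means faster.
function relativeDelta(baselineMs: number, candidateMs: number): number {
  return (candidateMs - baselineMs) / baselineMs;
}
```

Applied to the macOS figures reported above, `relativeDelta(31.3, 25.0)` is about −0.201, which is the −20 % in the measurements.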

Created by claudebox · group: slackbot

Adds a `fusedSuperOpt` knob (default off) to the ba_fused_super_bench kernel that:

1. Caches the per-slot `dx` alongside the prefix product in pref_scratch. The inverse pass loads `dx_back` instead of re-running `2 × load_active_x + 1 fr_sub`. Doubles pref_scratch slot from 8 → 16 u32.
2. Pre-loads the 2S chunk_plan indices into a function-local `array<u32, S>` at kernel entry; the forward, inverse, and backward passes all reuse them instead of re-reading the index buffer.

Wired through `MsmConfig.fusedSuperOpt`, the host pref_scratch / estimateMem sizing, the shader manager, and a `?superopt=1` URL knob on the dev bench page.

Measured on n=2^16 with the dev bench (older surface): macOS / M-class GPU sees `fused` total drop 31.3 → 25.0 ms (−20 %), driven mostly by the higher levels where the inverse + backward re-reads dominate. Adreno 830 (Samsung S25 Ultra) sees no change: fused_l0 there is compute-bound on the safegcd inverse + mont products, not on the indirect storage loads that this opt eliminates.

Default left off so the macOS-specific savings stay opt-in until a per-device default lands.

AztecBot added the claudebox label ("Owned by claudebox. it can push to this PR.") on May 22, 2026

AztecBot wants to merge 1 commit into zw/msm-webgpu-experiments-v2 from cb/a8b61d54bddd
