perf(bb/msm): cache dx + idx in ba_fused_super pref_scratch#23520
Draft
AztecBot wants to merge 1 commit into
Draft
perf(bb/msm): cache dx + idx in ba_fused_super pref_scratch#23520AztecBot wants to merge 1 commit into
AztecBot wants to merge 1 commit into
Conversation
Adds a `fusedSuperOpt` knob (default off) to the ba_fused_super_bench kernel that: 1. Caches the per-slot `dx` alongside the prefix product in pref_scratch. The inverse pass loads `dx_back` instead of re-running `2 × load_active_x + 1 fr_sub`. Doubles pref_scratch slot from 8 → 16 u32. 2. Pre-loads the 2S chunk_plan indices into a function-local `array<u32, S>` at kernel entry; the forward, inverse, and backward passes all reuse them instead of re-reading the index buffer. Wired through `MsmConfig.fusedSuperOpt`, the host pref_scratch / estimateMem sizing, the shader manager, and a `?superopt=1` URL knob on the dev bench page. Measured on n=2^16 with the dev bench (older surface): macOS / M-class GPU sees `fused` total drop 31.3 → 25.0 ms (−20 %), driven mostly by the higher levels where the inverse + backward re-reads dominate. Adreno 830 (Samsung S25 Ultra) sees no change — fused_l0 there is compute-bound on the safegcd inverse + mont products, not on the indirect storage loads that this opt eliminates. Default left off so the macOS-specific savings stay opt-in until a per-device default lands.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a
fusedSuperOptknob (default off) to theba_fused_super_benchkernel — the bin-packed pair-tree bucket-accumulate that runs for every level of the MSM, with level 0 being the largest single dispatch in the pipeline.The optimisation has two parts:
dxin pref_scratch. Forward storesdx[k]alongsideprefix[k](slot grows 8 → 16 u32,PREF_STRIDE2 → 4). The inverse pass then loadsdx_backfrom the cache instead of re-running2 × load_active_x + 1 fr_sub.array<u32, S>at kernel entry; forward, inverse, and backward all reuse them instead of re-reading the index buffer.Both are gated by one Mustache flag (
super_opt) so the original code path is byte-identical when the knob is off. Wired throughMsmConfig.fusedSuperOpt, the hostprefScratchBuf/estimateMemsizing, the shader manager, and a?superopt=1URL knob on the dev bench page.Measurements (n=2^16, c=13, S=8)
fusedbaselinefusedsuper_optThe macOS win lands across the higher levels (l1+), where the inverse-pass
dxrecomputation and chunk_plan re-reads are a meaningful fraction of per-thread time. Level 0 itself is unchanged on macOS (~12 ms before/after) — the pool-indexed loads aren't the bottleneck there either.On Adreno 830
fused_l0is compute-bound on the safegcd inverse + montgomery products: removing the indirect storage loads doesn't move the needle, and the doubled pref_scratch write traffic exactly cancels the read savings. Default is left off so the macOS-specific savings stay opt-in until a per-device default lands. Enable with?superopt=1orconfig.fusedSuperOpt = true.Why not split the inverse loop too?
A third optimisation — splitting
k==0out of the inverse loop into an epilogue so the per-iter divergent branch goes away — was implemented but reverted: Metal's WGSL compiler dropped the lastinv = mont(inv, dx_back)update when it became loop-tail-with-only-out-of-loop-use, producing a wronginv_dx[0]. Keeping theif/elsebranch (where the update sits inside theelseblock) is treated as load-bearing by the compiler. The if/else is uniform across the wave at runtime so the cost is negligible.Test plan
?superopt=0and?superopt=1on the dev bench page (macOS + S25), verifyresult xmatches across runsfusedmedian drops ~20 % on macOS and doesn't regress on S25?autorun=msm-cross-checkflow on both platforms with?superopt=1to confirm the MSM result still cross-checks against the multi-threaded WASM pathCreated by claudebox · group:
slackbot