perf(bb/msm): streamed-Yuval karat-stream montmul knob #23521
Closed
AztecBot wants to merge 1 commit into
Add a `MsmConfig.streamedYuval` switch that renders the Karatsuba+Yuval
montgomery_product body with reduce iters 0..9 inlined between the `lo`
and `hi`/`cr` half-products. After `lo` closes, t[0..9] are final (lo
writes t[k]+= for k=0..18; hi and cr only touch t[10..38]), so iters
0..9 can run there and t[0..9] die before hi/cr's 27-schoolbook +
~15-limb live set lands — about 10 fewer accumulator GPRs alive
through the high-pressure half-products.
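The reordering argument above can be checked on a toy model. The following is a minimal sketch, not the shader's code: the 13-bit limb width, the function names, and the `rinv` table are assumptions made for illustration, and the cross term is computed as two schoolbook half-products rather than in the Karatsuba sum-product form. Each reduce iteration follows the shape described in this PR (consume `t[i]`, add into `t[i+1..i+20]`). Because `hi` and the cross term only add into `t[10..38]`, running iters 0..9 right after `lo` should leave every accumulator with the same final value as the baseline order.

```typescript
const N = 20;        // limbs per operand, so the accumulator is t[0..38]
const H = N / 2;     // limbs per Karatsuba half
const BASE = 8192;   // 2^13; the limb width is an assumption

type Limbs = number[];

// Schoolbook half-product: t[aOff + bOff + j + k] += a[aOff + j] * b[bOff + k].
function halfProduct(t: Limbs, a: Limbs, b: Limbs, aOff: number, bOff: number): void {
  for (let j = 0; j < H; j++) {
    for (let k = 0; k < H; k++) {
      t[aOff + bOff + j + k] += a[aOff + j] * b[bOff + k];
    }
  }
}

// One reduce iteration: split t[i] into low limb and carry, retire the slot,
// then add into t[i+1..i+20]. `rinv` is a stand-in constant table.
function reduceIter(t: Limbs, rinv: Limbs, i: number): void {
  const low = t[i] % BASE;
  const carry = Math.floor(t[i] / BASE);
  t[i] = 0; // t[i] is dead from here on
  t[i + 1] += carry;
  for (let j = 0; j < N; j++) {
    t[i + 1 + j] += low * rinv[j];
  }
}

// Multiply body in either order. The standard reduce and final drain that
// follow iter 18 are omitted from this sketch.
function mulBody(a: Limbs, b: Limbs, rinv: Limbs, streamed: boolean): Limbs {
  const t: Limbs = [];
  for (let k = 0; k < 2 * N - 1; k++) t.push(0);

  halfProduct(t, a, b, 0, 0);             // lo: writes t[0..18]
  if (streamed) {
    for (let i = 0; i < H; i++) reduceIter(t, rinv, i); // iters 0..9 inline
  }
  halfProduct(t, a, b, H, H);             // hi: writes t[20..38]
  halfProduct(t, a, b, 0, H);             // cross terms: write t[10..28]
  halfProduct(t, a, b, H, 0);
  for (let i = streamed ? H : 0; i < N - 1; i++) reduceIter(t, rinv, i);
  return t;
}

// Deterministic 13-bit test limbs from a small multiplicative generator.
function demoLimbs(seed: number): Limbs {
  const out: Limbs = [];
  let x = seed;
  for (let k = 0; k < N; k++) {
    x = (x * 48271) % 2147483647;
    out.push(x % BASE);
  }
  return out;
}

function sameLimbs(x: Limbs, y: Limbs): boolean {
  if (x.length !== y.length) return false;
  for (let k = 0; k < x.length; k++) {
    if (x[k] !== y[k]) return false;
  }
  return true;
}
```

Comparing `mulBody(a, b, rinv, false)` with `mulBody(a, b, rinv, true)` via `sameLimbs` exercises only the ordering claim. It does not validate the Montgomery arithmetic itself, since any constant `rinv` table gives the same equivalence.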
The knob is off by default (no behaviour change on existing callers).
Threaded through `ShaderManager` via a new optional `streamed_yuval`
ctor param; only the production MSM `ShaderManager` in `MsmV2.create`
flips it. Same arithmetic and same template — `multiply_body` carries
the inlined iters and the mustache `yuval_iters` section starts at
i=10 instead of 0.
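As a rough illustration of how the knob could split the iterations between the inlined body and the template section, here is a minimal sketch. `RenderOptions`, `YuvalView`, and `buildYuvalView` are hypothetical names, not the PR's code; only the `streamed` flag, the start index of 10, and the 0..9 / 10..18 split are taken from the description.

```typescript
// Hypothetical view-model builder; names are illustrative only.
interface RenderOptions {
  streamed?: boolean; // off by default, matching the knob's default
}

interface YuvalView {
  yuvalStart: number;          // first iter left to the template section
  inlineIters: number[];       // iters emitted inline after the lo half-product
  yuvalIters: { i: number }[]; // entries for the repeated template section
}

const STREAMED_SPLIT = 10;  // iters 0..9 are inlined when streamed
const LAST_YUVAL_ITER = 18; // the description lists the remaining iters as 10..18

function buildYuvalView(opts: RenderOptions = {}): YuvalView {
  const yuvalStart = opts.streamed ? STREAMED_SPLIT : 0;

  const inlineIters: number[] = [];
  for (let i = 0; i < yuvalStart; i++) inlineIters.push(i);

  const yuvalIters: { i: number }[] = [];
  for (let i = yuvalStart; i <= LAST_YUVAL_ITER; i++) yuvalIters.push({ i });

  return { yuvalStart, inlineIters, yuvalIters };
}
```

With the flag unset, the inline list is empty and the template section covers every iteration, which is one way to keep the default path unchanged.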
Measured on BrowserStack real devices at n=2^18 × k=100 (field-mul
microbench) and n=2^16 c=13 (full MSM via the dev page):
S25 Ultra field-mul: 48.3 ms -> 47.1 ms (-2.5 %)
S25 Ultra MSM 2^16: 65.4 ms -> 60.7 ms median (-7.2 %)
S25 Ultra MSM 2^16: 58.0 ms -> 59.0 ms min (within noise)
M2 / Chrome MSM 2^16: 40.4 ms streamed vs ~40 ms baseline (flat,
not register-bound, no regression)
Shader compile cost on Adreno is ~3 % lower with the streamed body
(register allocator's job is easier when t[0..9] are dead before
hi/cr). No correctness change: validated bit-exact against the
host BigInt reference at 64 random scalar/point pairs on both Metal
and Adreno paths.
Automatically closing this stale claudebox draft PR (no updates for 5+ days). Re-open if still needed.
Summary
Adds a `MsmConfig.streamedYuval` switch that renders the Karatsuba+Yuval `montgomery_product` body with reduce iters 0..9 inlined between the `lo` and `hi`/`cr` half-products. After `lo` closes, `t[0..9]` are final (lo writes `t[k]+=` for k=0..18; hi and cr only touch `t[10..38]`), so iters 0..9 can run there and `t[0..9]` die before hi/cr's 27-schoolbook + ~15-limb live set lands, leaving roughly 10 fewer accumulator GPRs alive through the high-pressure half-products.
Off by default. Threaded through `ShaderManager` via a new optional `streamed_yuval` ctor param; only the production MSM `ShaderManager` in `MsmV2.create` flips it. Same arithmetic and same template: `multiply_body` carries the inlined iters and the mustache `yuval_iters` section starts at `i=10` instead of 0.
Measurements
Adreno is the only target where the register-pressure win materialises — Metal is comfortably under the GPR cliff already, so the flat M2 result is expected.
Correctness
Validated bit-exact against the host BigInt reference at 64 random scalar/point pairs on both Metal (M2) and Adreno (S25) paths, both via the field-mul microbench (`bench-field-mul.html`) and via the dev-page MSM cross-check. Identical `x` coordinate, identical `y`.
Implementation notes
- `shader_manager.ts`: new optional 5th ctor arg `streamed_yuval = false`; passed through to `renderKaratYuvalMont({ streamed: streamed_yuval })`. The renderer now accepts `{ streamed?: boolean }` and, when streamed, emits 10 inline `{ t_mask, carry; t[i+1..i+20] += ... }` blocks after the `lo` half-product. The mustache template's `yuval_iters` section is rendered with `yuvalStart = 10` so the remaining iters (10..18), standard reduce, and final drain are unchanged.
- `msm_v2.ts`: `MsmConfig.streamedYuval` field; threaded to the single `MsmV2.create` `ShaderManager` instantiation that compiles the MSM pipelines (the `MsmV2Pool` shader manager only compiles the point-conversion shader and doesn't reference `mont_product_src`, so I left that call site alone).
Why on Adreno specifically
The S25's Adreno 750 has a tight GPR-per-wave budget; the karat (grouped Karatsuba) body sits in a regime where the cr block's ~67-GPR live set lives right at the edge of half-occupancy. Killing the 10 t[0..9] slots before cr runs drops the peak by ~10 GPRs, giving the register allocator slack the scheduler can use. M2's Metal driver allocates ~2× the registers per wave at this size and isn't binding on this kernel — the streamed body is a no-op there, which is exactly what the numbers show.
How to A/B locally
Production callers pick up the knob via `MsmConfig.streamedYuval = true`. The bench page wiring + `autorun=msm-bench` harness used to produce these numbers lives on `zw/msm-webgpu-experiments-v2` ahead of this commit; I can land it as a separate follow-up PR if you want the bench autorun + base64-`cfg` URL scheme in tree too.
Test plan
`zw/msm-webgpu-experiments-v2`; rebench n=2^18 / n=2^20 on S25 before flipping the production `MsmConfig.streamedYuval` default to `true` in `MsmV2.create` (separate PR)
Created by claudebox · group: slackbot