perf(bb/msm): streamed-Yuval karat-stream montmul knob#23521

Closed
AztecBot wants to merge 1 commit into zw/msm-webgpu-experiments-v2 from cb/760f89a7aba1

Conversation

@AztecBot
Collaborator

Summary

Adds a MsmConfig.streamedYuval switch that renders the Karatsuba+Yuval montgomery_product body with reduce iters 0..9 inlined between the lo and hi/cr half-products. After lo closes, t[0..9] are final (lo writes t[k]+= for k=0..18; hi and cr only touch t[10..38]), so iters 0..9 can run there and t[0..9] die before hi/cr's 27-schoolbook + ~15-limb live set lands — roughly 10 fewer accumulator GPRs alive through the high-pressure half-products.
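The liveness argument above comes down to index arithmetic on the accumulator limbs. The following is a minimal sketch of that arithmetic, assuming 20 limbs per operand split into two 10-limb halves (inferred from the quoted t[0..18] and t[10..38] ranges, not read from the shader). `writeRange` and `LimbRange` are hypothetical names used only for illustration.

```typescript
// Assumption: 20 limbs per operand, split into two 10-limb halves. This is
// inferred from the limb ranges quoted in the summary, not from the shader.
const LIMBS = 20;
const HALF = LIMBS / 2;

interface LimbRange {
  first: number;
  last: number;
}

// Multiplying a[i0..i1] by b[j0..j1] (inclusive) accumulates into
// t[i0 + j0 .. i1 + j1].
function writeRange(i0: number, i1: number, j0: number, j1: number): LimbRange {
  return { first: i0 + j0, last: i1 + j1 };
}

const lo = writeRange(0, HALF - 1, 0, HALF - 1);         // low x low
const hi = writeRange(HALF, LIMBS - 1, HALF, LIMBS - 1); // high x high
const cr = writeRange(0, HALF - 1, HALF, LIMBS - 1);     // cross terms

// Every limb below the lowest hi/cr write is final once lo has finished.
const lastFinalLimb = Math.min(hi.first, cr.first) - 1;

console.log({ lo, hi, cr, lastFinalLimb });
```

Under these assumptions `lo` covers t[0..18], `hi` and `cr` together cover t[10..38], and the last limb that is final after `lo` is t[9], which is why exactly ten reduce iterations can move forward.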

Off by default. Threaded through ShaderManager via a new optional streamed_yuval ctor param; only the production MSM ShaderManager in MsmV2.create flips it. Same arithmetic and same template — multiply_body carries the inlined iters and the mustache yuval_iters section starts at i=10 instead of 0.

Measurements

| Workload | Device | Baseline | Streamed | Δ |
| --- | --- | --- | --- | --- |
| Field-mul microbench (n=2^18 × k=100) | S25 Ultra / Chrome 148 / Adreno 750 | 48.3 ms | 47.1 ms | −2.5 % |
| Full MSM n=2^16, c=13 (5 reps median) | S25 Ultra / Chrome 148 / Adreno 750 | 65.4 ms | 60.7 ms | −7.2 % |
| Full MSM n=2^16, c=13 (5 reps min) | S25 Ultra / Chrome 148 / Adreno 750 | 58.0 ms | 59.0 ms | flat (noise) |
| Full MSM n=2^16, c=13 (3 reps median) | M2 / Sequoia / Chrome 148 / Metal | ~40.0 ms | 40.4 ms | flat |
| Adreno shader compile (all MSM pipelines) | S25 Ultra | 5712 ms | 5554 ms | −2.8 % |

Of the two devices measured, only Adreno shows the register-pressure win. Metal is comfortably under the GPR cliff already, so the flat M2 result is expected.
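For anyone re-deriving the Δ column, here is a small sketch of the arithmetic, assuming Δ is the relative change of the streamed time against the baseline. `median` and `pctDelta` are hypothetical helpers, not part of the bench harness, and only the timings reported above are used.

```typescript
// Median of a list of timings (average of the two middle values when the
// count is even). Hypothetical helper, not taken from the bench page.
function median(xs: number[]): number {
  if (xs.length === 0) throw new Error("need at least one sample");
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Relative change of `streamed` versus `baseline`, in percent.
// Negative means the streamed variant is faster.
function pctDelta(baseline: number, streamed: number): number {
  return ((streamed - baseline) / baseline) * 100;
}

// Timings from the table, in milliseconds.
console.log(pctDelta(48.3, 47.1).toFixed(1)); // field-mul microbench
console.log(pctDelta(65.4, 60.7).toFixed(1)); // full MSM, median of 5
console.log(pctDelta(58.0, 59.0).toFixed(1)); // full MSM, min of 5
console.log(pctDelta(5712, 5554).toFixed(1)); // Adreno shader compile
```

The median and min rows for the S25 MSM run point in opposite directions, so the larger n=2^18 / n=2^20 rebench in the test plan is what would settle whether the median gain is real.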

Correctness

Validated bit-exact against the host BigInt reference at 64 random scalar/point pairs on both Metal (M2) and Adreno (S25) paths, both via the field-mul microbench (bench-field-mul.html) and via the dev-page MSM cross-check. Identical x coordinate, identical y.
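The reason the reordering cannot change the result can be checked on the host in a few lines. The sketch below is not the shader's Yuval reduction: it swaps in textbook word-by-word Montgomery reduction and computes the cross terms by schoolbook rather than Karatsuba, and it assumes 13-bit limbs, 20 limbs per operand, and an arbitrary odd modulus. All names are hypothetical. It runs reduce iterations 0..9 either after all half-products (baseline) or right after `lo` (streamed) and compares both against a BigInt identity.

```typescript
// Assumptions (not from the PR): 13-bit limbs, 20 limbs per operand, an
// arbitrary odd modulus, and textbook word-by-word Montgomery reduction.
const W = 13n;
const LIMBS = 20;
const HALF = 10;
const MASK = (1n << W) - 1n;
const R = 1n << (W * BigInt(LIMBS));
const P = (1n << 254n) - 189n; // any odd modulus below R works here

function toLimbs(x: bigint): bigint[] {
  const out: bigint[] = [];
  for (let i = 0; i < LIMBS; i++) {
    out.push(x & MASK);
    x >>= W;
  }
  return out;
}

// -P^{-1} mod 2^W, found by brute force over the odd candidates.
function negInvModWord(p: bigint): bigint {
  for (let x = 1n; x <= MASK; x += 2n) {
    if (((p * x + 1n) & MASK) === 0n) return x;
  }
  throw new Error("modulus must be odd");
}

// t[i + j] += a[i] * b[j] for i in [i0, i1) and j in [j0, j1).
function mulAcc(t: bigint[], a: bigint[], b: bigint[],
                i0: number, i1: number, j0: number, j1: number): void {
  for (let i = i0; i < i1; i++) {
    for (let j = j0; j < j1; j++) t[i + j] += a[i] * b[j];
  }
}

// One reduce iteration: add a multiple of P so that t[i] becomes divisible
// by 2^W, push the quotient into t[i + 1], and retire t[i].
function reduceIter(t: bigint[], pL: bigint[], nPrime: bigint, i: number): void {
  const m = (t[i] * nPrime) & MASK;
  for (let j = 0; j < LIMBS; j++) t[i + j] += m * pL[j];
  t[i + 1] += t[i] >> W;
  t[i] = 0n;
}

// Returns a value congruent to a * b / R mod P (no final subtraction).
function montMul(a: bigint, b: bigint, streamed: boolean): bigint {
  const t: bigint[] = new Array<bigint>(2 * LIMBS).fill(0n);
  const A = toLimbs(a);
  const B = toLimbs(b);
  const pL = toLimbs(P);
  const nPrime = negInvModWord(P);

  mulAcc(t, A, B, 0, HALF, 0, HALF);             // lo: t[0..18]
  if (streamed) {
    for (let i = 0; i < HALF; i++) reduceIter(t, pL, nPrime, i);
  }
  mulAcc(t, A, B, HALF, LIMBS, HALF, LIMBS);     // hi: t[20..38]
  mulAcc(t, A, B, 0, HALF, HALF, LIMBS);         // cr: t[10..28]
  mulAcc(t, A, B, HALF, LIMBS, 0, HALF);         // cr: t[10..28]
  for (let i = streamed ? HALF : 0; i < LIMBS; i++) reduceIter(t, pL, nPrime, i);

  let acc = 0n;                                  // drain the upper half
  for (let k = 2 * LIMBS - 1; k >= LIMBS; k--) acc = (acc << W) + t[k];
  return acc;
}

const a = P - 1n;
const b = 98765432109876543210987654321n;
console.log(montMul(a, b, true) === montMul(a, b, false));
console.log((montMul(a, b, true) * R) % P === (a * b) % P);
```

The two orders agree because each reduce iteration only reads t[i], and for i < 10 that limb receives nothing from `hi` or `cr`; every other update is an addition, and additions commute.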

Implementation notes

  • shader_manager.ts: new optional 5th ctor arg streamed_yuval = false; passed through to renderKaratYuvalMont({ streamed: streamed_yuval }). The renderer now accepts { streamed?: boolean } and, when streamed, emits 10 inline { t_mask, carry; t[i+1..i+20] += ... } blocks after the lo half-product. The mustache template's yuval_iters section is rendered with yuvalStart = 10 so the remaining iters (10..18), standard reduce, and final drain are unchanged.
  • msm_v2.ts: MsmConfig.streamedYuval field; threaded to the single MsmV2.create ShaderManager instantiation that compiles the MSM pipelines (the MsmV2Pool shader manager only compiles the point-conversion shader and doesn't reference mont_product_src, so I left that call site alone).
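The plumbing described in these notes could look roughly like the sketch below. It is illustrative only: `ShaderManagerSketch` and `renderKaratYuvalMontSketch` are hypothetical stand-ins, the real constructor takes the flag as its fifth argument alongside parameters not shown here, and the emitted strings are placeholders rather than the actual shader source.

```typescript
// Hypothetical stand-ins for the real renderer and ShaderManager; the actual
// signatures and emitted shader text in shader_manager.ts may differ.
interface RenderOpts {
  streamed?: boolean;
}

interface RenderedMont {
  multiplyBody: string; // body handed to the template as multiply_body
  yuvalStart: number;   // first iteration left to the yuval_iters section
}

const INLINED_ITERS = 10;

function renderKaratYuvalMontSketch({ streamed = false }: RenderOpts = {}): RenderedMont {
  const parts: string[] = ["/* lo half-product */"];
  if (streamed) {
    // Reduce iters 0..9 are emitted inline, right after lo.
    for (let i = 0; i < INLINED_ITERS; i++) {
      parts.push(`{ /* reduce iter ${i}: consumes t[${i}], updates t[${i + 1}..${i + 20}] */ }`);
    }
  }
  parts.push("/* hi half-product */", "/* cr half-product */");
  return {
    multiplyBody: parts.join("\n"),
    yuvalStart: streamed ? INLINED_ITERS : 0,
  };
}

class ShaderManagerSketch {
  readonly montProduct: RenderedMont;

  // Off by default, so existing callers render the unchanged body.
  constructor(streamedYuval: boolean = false) {
    this.montProduct = renderKaratYuvalMontSketch({ streamed: streamedYuval });
  }
}

const baseline = new ShaderManagerSketch();
const streamed = new ShaderManagerSketch(true);
console.log(baseline.montProduct.yuvalStart, streamed.montProduct.yuvalStart);
```

Keeping the flag as a trailing optional argument with a `false` default is what lets the pool's shader manager call site stay untouched.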

Why on Adreno specifically

The S25's Adreno 750 has a tight GPR-per-wave budget; the karat (grouped Karatsuba) body sits in a regime where the cr block's ~67-GPR live set is right at the edge of half-occupancy. Killing the 10 t[0..9] slots before cr runs drops the peak by ~10 GPRs, which gives the register allocator and scheduler some slack. M2's Metal driver allocates ~2× the registers per wave at this size, so register pressure is not the binding constraint on this kernel; the streamed body is performance-neutral there, which is exactly what the numbers show.

How to A/B locally

Production callers pick up the knob via MsmConfig.streamedYuval = true. The bench page wiring + autorun=msm-bench harness used to produce these numbers lives on zw/msm-webgpu-experiments-v2 ahead of this commit; I can land it as a separate follow-up PR if you want the bench autorun + base64-cfg URL scheme in tree too.

Test plan

  • Land on zw/msm-webgpu-experiments-v2; rebench n=2^18 / n=2^20 on S25 before flipping the production default
  • Spot-check that the Metal/Intel paths show no regression (expected: flat, since M2 is not register-bound on this kernel)
  • If the n=2^18+ numbers also hold, flip MsmConfig.streamedYuval default to true in MsmV2.create (separate PR)

Created by claudebox · group: slackbot

Add a `MsmConfig.streamedYuval` switch that renders the Karatsuba+Yuval
montgomery_product body with reduce iters 0..9 inlined between the `lo`
and `hi`/`cr` half-products. After `lo` closes, t[0..9] are final (lo
writes t[k]+= for k=0..18; hi and cr only touch t[10..38]), so iters
0..9 can run there and t[0..9] die before hi/cr's 27-schoolbook +
~15-limb live set lands — about 10 fewer accumulator GPRs alive
through the high-pressure half-products.

The knob is off by default (no behaviour change on existing callers).
Threaded through `ShaderManager` via a new optional `streamed_yuval`
ctor param; only the production MSM `ShaderManager` in `MsmV2.create`
flips it. Same arithmetic and same template — `multiply_body` carries
the inlined iters and the mustache `yuval_iters` section starts at
i=10 instead of 0.

Measured on BrowserStack real devices at n=2^18 × k=100 (field-mul
microbench) and n=2^16 c=13 (full MSM via the dev page):

  S25 Ultra field-mul:  48.3 ms -> 47.1 ms   (-2.5 %)
  S25 Ultra MSM 2^16:   65.4 ms -> 60.7 ms median   (-7.2 %)
  S25 Ultra MSM 2^16:   58.0 ms -> 59.0 ms min      (within noise)
  M2 / Chrome MSM 2^16: 40.4 ms streamed vs ~40 ms baseline   (flat,
                        not register-bound, no regression)

Shader compile cost on Adreno is ~3 % lower with the streamed body
(register allocator's job is easier when t[0..9] are dead before
hi/cr). No correctness change: validated bit-exact against the
host BigInt reference at 64 random scalar/point pairs on both Metal
and Adreno paths.
@AztecBot added the labels ci-barretenberg (Run all barretenberg/cpp checks.) and claudebox (Owned by claudebox; it can push to this PR.) on May 22, 2026
@AztecBot
Collaborator Author

Automatically closing this stale claudebox draft PR (no updates for 5+ days). Re-open if still needed.

@AztecBot closed this on May 28, 2026