perf(bb/msm): streamed-Yuval karat-stream montmul knob by AztecBot · Pull Request #23521 · AztecProtocol/aztec-packages

AztecBot · 2026-05-22T22:16:41Z

Summary

Adds a MsmConfig.streamedYuval switch that renders the Karatsuba+Yuval montgomery_product body with reduce iters 0..9 inlined between the lo and hi/cr half-products. After lo closes, t[0..9] are final (lo writes t[k]+= for k=0..18; hi and cr only touch t[10..38]), so iters 0..9 can run there and t[0..9] die before hi/cr's 27-schoolbook + ~15-limb live set lands — roughly 10 fewer accumulator GPRs alive through the high-pressure half-products.

Off by default. Threaded through ShaderManager via a new optional streamed_yuval ctor param; only the production MSM ShaderManager in MsmV2.create flips it. Same arithmetic and same template — multiply_body carries the inlined iters and the mustache yuval_iters section starts at i=10 instead of 0.
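The reordering argument above can be checked on a toy model. The following is a minimal sketch, not the shader code: it assumes 4 limbs of 6 bits with the split at H = 2 (the PR's ranges imply a wider layout split at 10), uses a hypothetical odd modulus, and computes the cross terms by plain schoolbook instead of the Karatsuba middle term. All names here are illustrative. It shows only that running reduce iters 0..H-1 right after `lo` yields the same Montgomery product as running every iter after the full product.

```typescript
// Toy parameters (assumptions for illustration only).
const W = 6;                  // bits per limb
const BASE = 1 << W;          // 64
const N = 4;                  // limbs per element
const H = N / 2;              // lo/hi split point
const R = Math.pow(BASE, N);  // Montgomery radix, 2^24
const P_VAL = 16777213;       // a hypothetical odd modulus below R, not a curve prime

function toLimbs(x: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < N; i++) {
    out.push(x % BASE);
    x = Math.floor(x / BASE);
  }
  return out;
}

const P = toLimbs(P_VAL);

// N0 = -P^{-1} mod BASE, found by brute force since BASE is tiny.
let N0 = 0;
while ((N0 * P[0] + 1) % BASE !== 0) N0++;

// Accumulate one H x H block of limb products into t (no carries yet).
function addBlock(t: number[], a: number[], b: number[], aOff: number, bOff: number): void {
  for (let i = 0; i < H; i++) {
    for (let j = 0; j < H; j++) {
      t[aOff + i + bOff + j] += a[aOff + i] * b[bOff + j];
    }
  }
}

// One Montgomery reduce iteration: cancel t[i] mod BASE, then retire the slot.
function reduceIter(t: number[], i: number): void {
  const m = ((t[i] % BASE) * N0) % BASE;
  for (let j = 0; j < N; j++) t[i + j] += m * P[j];
  t[i + 1] += t[i] / BASE; // exact: t[i] is now a multiple of BASE
  t[i] = 0;                // slot i is dead from here on
}

function montmul(aVal: number, bVal: number, streamed: boolean): number {
  const a = toLimbs(aVal);
  const b = toLimbs(bVal);
  const t: number[] = [];
  for (let k = 0; k < 2 * N; k++) t.push(0);

  addBlock(t, a, b, 0, 0);                 // lo: writes t[0 .. 2H-2]
  if (streamed) {
    for (let i = 0; i < H; i++) reduceIter(t, i); // t[0 .. H-1] are already final
  }
  addBlock(t, a, b, 0, H);                 // cross terms: write t[H .. 3H-2]
  addBlock(t, a, b, H, 0);
  addBlock(t, a, b, H, H);                 // hi: writes t[2H .. 4H-2]
  for (let i = streamed ? H : 0; i < N; i++) reduceIter(t, i);

  // Evaluate the unnormalised upper limbs, then apply the conditional subtract.
  let r = 0;
  for (let k = 2 * N - 1; k >= N; k--) r = r * BASE + t[k];
  return r >= P_VAL ? r - P_VAL : r;
}
```

Calling `montmul(a, b, true)` and `montmul(a, b, false)` on the same inputs should return the same value, because each early iteration only reads a slot that `hi` and the cross terms never write.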

Measurements

| Workload | Device | Baseline | Streamed | Δ |
| --- | --- | --- | --- | --- |
| Field-mul microbench (n=2^18 × k=100) | S25 Ultra / Chrome 148 / Adreno 750 | 48.3 ms | 47.1 ms | −2.5 % |
| Full MSM n=2^16, c=13 (5 reps median) | S25 Ultra / Chrome 148 / Adreno 750 | 65.4 ms | 60.7 ms | −7.2 % |
| Full MSM n=2^16, c=13 (5 reps min) | S25 Ultra / Chrome 148 / Adreno 750 | 58.0 ms | 59.0 ms | +1.7 % (within noise) |
| Full MSM n=2^16, c=13 (3 reps median) | M2 / Sequoia / Chrome 148 / Metal | ~40.0 ms | 40.4 ms | flat |
| Adreno shader compile (all MSM pipelines) | S25 Ultra | 5712 ms | 5554 ms | −2.8 % |

Adreno is the only target where the register-pressure win materialises — Metal is comfortably under the GPR cliff already, so the flat M2 result is expected.
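For anyone re-deriving the Δ column, a small helper along these lines may be handy. It is a sketch with hypothetical names (`medianOf`, `deltaPercent`), not part of the bench harness; the only data it uses are the baseline and streamed timings from the table above.

```typescript
// Median of a list of per-rep timings (hypothetical helper, not from the harness).
function medianOf(samples: number[]): number {
  const s = samples.slice().sort((x, y) => x - y);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Relative change of `streamed` against `baseline`, in percent, to one decimal place.
function deltaPercent(baseline: number, streamed: number): number {
  return Math.round(((streamed - baseline) / baseline) * 1000) / 10;
}

// Timings copied from the table above (milliseconds).
const fieldMulDelta = deltaPercent(48.3, 47.1);   // field-mul microbench
const msmMedianDelta = deltaPercent(65.4, 60.7);  // full MSM, median of 5 reps
const msmMinDelta = deltaPercent(58.0, 59.0);     // full MSM, min of 5 reps
const compileDelta = deltaPercent(5712, 5554);    // Adreno shader compile
```

Negative values mean the streamed body was faster. The min-of-5 row comes out slightly positive, which is consistent with the median and min disagreeing at this sample size.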

Correctness

Validated bit-exact against the host BigInt reference at 64 random scalar/point pairs on both Metal (M2) and Adreno (S25) paths, both via the field-mul microbench (bench-field-mul.html) and via the dev-page MSM cross-check. Identical x coordinate, identical y.

Implementation notes

shader_manager.ts: new optional 5th ctor arg streamed_yuval = false; passed through to renderKaratYuvalMont({ streamed: streamed_yuval }). The renderer now accepts { streamed?: boolean } and, when streamed, emits 10 inline { t_mask, carry; t[i+1..i+20] += ... } blocks after the lo half-product. The mustache template's yuval_iters section is rendered with yuvalStart = 10 so the remaining iters (10..18), standard reduce, and final drain are unchanged.
msm_v2.ts: MsmConfig.streamedYuval field; threaded to the single MsmV2.create ShaderManager instantiation that compiles the MSM pipelines (the MsmV2Pool shader manager only compiles the point-conversion shader and doesn't reference mont_product_src, so I left that call site alone).
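The threading described above could look roughly like the sketch below. Every name ending in `Sketch` is a hypothetical stand-in: the real `ShaderManager` constructor takes four earlier arguments that are omitted here, the real renderer emits WGSL through a mustache template and not string labels, and the relative order of `hi` and `cr` is not stated in this description. The iteration counts (0..9 inlined, 10..18 left to the template) are taken from the notes above.

```typescript
interface RenderOptionsSketch {
  streamed?: boolean;
}

interface RenderedSketch {
  multiplyBody: string[]; // pieces emitted into the multiply body, in order
  yuvalIters: number[];   // reduce iterations left to the template section
}

const INLINE_ITERS = 10;  // iters 0..9 move into the multiply body when streamed
const TOTAL_ITERS = 19;   // iters 0..18, per the ranges quoted above

function renderKaratYuvalMontSketch(opts: RenderOptionsSketch = {}): RenderedSketch {
  const streamed = opts.streamed === true;
  const yuvalStart = streamed ? INLINE_ITERS : 0;

  const multiplyBody: string[] = ["lo"];
  if (streamed) {
    // Inline reduce blocks go directly after the lo half-product.
    for (let i = 0; i < INLINE_ITERS; i++) multiplyBody.push("reduce " + i);
  }
  multiplyBody.push("hi", "cr"); // order of these two is an assumption

  const yuvalIters: number[] = [];
  for (let i = yuvalStart; i < TOTAL_ITERS; i++) yuvalIters.push(i);
  return { multiplyBody, yuvalIters };
}

class ShaderManagerSketch {
  private readonly streamedYuval: boolean;

  // Defaulting to false keeps existing call sites unchanged.
  constructor(streamedYuval = false) {
    this.streamedYuval = streamedYuval;
  }

  renderMontProduct(): RenderedSketch {
    return renderKaratYuvalMontSketch({ streamed: this.streamedYuval });
  }
}
```

With this shape, `new ShaderManagerSketch()` renders the existing body, while `new ShaderManagerSketch(true)` moves the first ten iterations into the multiply body and starts the template section at 10, so each iteration is still emitted exactly once.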

Why on Adreno specifically

The S25's Adreno 750 has a tight GPR-per-wave budget; the karat (grouped Karatsuba) body runs in a regime where the cr block's ~67-GPR live set sits right at the edge of half-occupancy. Retiring the 10 t[0..9] slots before cr runs drops the peak by ~10 GPRs, giving the register allocator slack that the scheduler can use. M2's Metal driver allocates ~2× the registers per wave at this size, so register pressure is not the binding constraint on this kernel; the streamed body is performance-neutral there, which is exactly what the numbers show.

How to A/B locally

Production callers pick up the knob via MsmConfig.streamedYuval = true. The bench page wiring + autorun=msm-bench harness used to produce these numbers lives on zw/msm-webgpu-experiments-v2 ahead of this commit; I can land it as a separate follow-up PR if you want the bench autorun + base64-cfg URL scheme in tree too.

Test plan

Land on zw/msm-webgpu-experiments-v2; rebench n=2^18 / n=2^20 on S25 before flipping the production default
Spot-check Metal/Intel paths show no regression (expected: flat, M2 was already saturated)
If the n=2^18+ numbers also hold, flip MsmConfig.streamedYuval default to true in MsmV2.create (separate PR)

Created by claudebox · group: slackbot

Add a `MsmConfig.streamedYuval` switch that renders the Karatsuba+Yuval montgomery_product body with reduce iters 0..9 inlined between the `lo` and `hi`/`cr` half-products. After `lo` closes, t[0..9] are final (lo writes t[k]+= for k=0..18; hi and cr only touch t[10..38]), so iters 0..9 can run there and t[0..9] die before hi/cr's 27-schoolbook + ~15-limb live set lands: about 10 fewer accumulator GPRs alive through the high-pressure half-products.

The knob is off by default (no behaviour change on existing callers). Threaded through `ShaderManager` via a new optional `streamed_yuval` ctor param; only the production MSM `ShaderManager` in `MsmV2.create` flips it. Same arithmetic and same template: `multiply_body` carries the inlined iters and the mustache `yuval_iters` section starts at i=10 instead of 0.

Measured on BrowserStack real devices at n=2^18 × k=100 (field-mul microbench) and n=2^16 c=13 (full MSM via the dev page):

  S25 Ultra field-mul: 48.3 ms -> 47.1 ms (-2.5 %)
  S25 Ultra MSM 2^16: 65.4 ms -> 60.7 ms median (-7.2 %)
  S25 Ultra MSM 2^16: 58.0 ms -> 59.0 ms min (within noise)
  M2 / Chrome MSM 2^16: 40.4 ms streamed vs ~40 ms baseline (flat, not register-bound, no regression)

Shader compile cost on Adreno is ~3 % lower with the streamed body (register allocator's job is easier when t[0..9] are dead before hi/cr).

No correctness change: validated bit-exact against the host BigInt reference at 64 random scalar/point pairs on both Metal and Adreno paths.

AztecBot · 2026-05-28T07:27:16Z

Automatically closing this stale claudebox draft PR (no updates for 5+ days). Re-open if still needed.

AztecBot added the labels ci-barretenberg (Run all barretenberg/cpp checks) and claudebox (Owned by claudebox; it can push to this PR) on May 22, 2026

AztecBot closed this May 28, 2026

AztecBot wants to merge 1 commit into zw/msm-webgpu-experiments-v2 from cb/760f89a7aba1
