perf(bb/msm): stream-walker — amortize the field inversion via large S + private pref_scratch by AztecBot · Pull Request #23736 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T02:41:33Z

Goal & lever

Make the stream-walker MSM accumulator faster on laptop/mobile GPUs without regressing its memory, by attacking the cost the software profile flagged as ~47 % of the accumulate kernel (PR #23732): the safegcd field inversion. The walker does one inversion per batch of S affine adds, so cost-per-add = |inv| / S. Raise S to amortize, and lift PR #23726's per-invocation var<private> pref_scratch so the 16 KB workgroup budget no longer pins S small.
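A minimal sketch of the amortization arithmetic above, assuming one inversion per batch of S adds. The function names, the add count, and the cost unit are illustrative assumptions, not values from the walker code.

```typescript
// Amortized inversion cost for a batched affine-add walker.
// All names and constants here are hypothetical, chosen only for illustration.

/** Number of batched inversions needed for `totalAdds` adds at batch size `s`. */
function inversionCount(totalAdds: number, s: number): number {
  return Math.ceil(totalAdds / s);
}

/** Inversion cost charged to each add: |inv| / S. */
function inversionCostPerAdd(invCost: number, s: number): number {
  return invCost / s;
}

const totalAdds = 1 << 16; // assumption: roughly one add per point at logn = 16
const invCost = 1000; // assumption: cost of one safegcd inversion, arbitrary units

for (const s of [8, 16, 32]) {
  console.log(
    `S=${s}: inversions=${inversionCount(totalAdds, s)}, ` +
      `inversion cost per add=${inversionCostPerAdd(invCost, s)}`
  );
}
```

Under these assumed numbers, going from S=8 to S=32 cuts the inversion count from 8192 to 2048 and the per-add inversion share from 125 to 31.25 units. Whether that shows up in wall time depends on how large the inversion share really is on the target GPU.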

Change (correct, landed)

pref_scratch → var<private> (mustache {{#pref_private}}, default on; workgroup path preserved). Frees the 16 KB workgroup limiter (TPB 64→128), adds no storage binding (stays within the 10-per-stage floor Mali/Adreno/SwiftShader enforce), removes the cap that pinned S small.
walkerS / walkerMaxWg knobs (MsmConfig + URL). NUM_THREADS = walkerMaxWg·256; partials_buf ∝ NUM_THREADS·S. Holding walkerS·walkerMaxWg = 256 (default 8·32) keeps peak memory flat while raising S cuts the inversion count (= total_adds/S).
WGSL regenerated (_generated/shaders.ts in diff).
Correctness-harness fix (dev/msm-webgpu/msm-correctness.ts): the synthetic points were a random arithmetic progression on G, whose linear relation makes MSM partial sums collide in x (a division by zero in the affine add), producing an off-curve GPU result that PR #23726 misread as a pipeline correctness blocker. With independent random points (?indep=1) the GPU output matches @noble/curves exactly (on-curve, bit-identical) at logn 8/10/12/14/16. Also added WASM/SRS-free timed profiling (?reps=, ?ssweep=, ?sweepmaxwg=) that runs on real devices too.
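The arithmetic-progression collision can be illustrated at the level of scalars alone. If the points are P_i = (a + i·d)·G, then P_0 + P_3 and P_1 + P_2 are both (2a + 3d)·G, so two different partial sums are the same point and a chord-rule affine add between them divides by x2 − x1 = 0. The sketch below is a toy model, not the harness code: it works with discrete logs modulo a small stand-in prime, and `ORDER`, `progression`, and `hasPairSumCollision` are hypothetical names.

```typescript
const ORDER = 1009; // assumption: small prime standing in for the real group order

/** Scalars a, a+d, a+2d, ... (mod ORDER): discrete logs of an arithmetic progression on G. */
function progression(a: number, d: number, n: number): number[] {
  return Array.from({ length: n }, (_, i) => (a + i * d) % ORDER);
}

/** True if two disjoint pairs of scalars have the same sum mod ORDER. */
function hasPairSumCollision(k: number[]): boolean {
  for (let i = 0; i < k.length; i++)
    for (let j = i + 1; j < k.length; j++)
      for (let p = i + 1; p < k.length; p++)
        for (let q = p + 1; q < k.length; q++) {
          if (p === j || q === j) continue; // the two pairs must not share an index
          if ((k[i] + k[j]) % ORDER === (k[p] + k[q]) % ORDER) return true;
        }
  return false;
}

// Progression: k0 + k3 and k1 + k2 are both (2a + 3d) mod ORDER.
const progressionScalars = progression(123, 45, 4);
console.log(hasPairSumCollision(progressionScalars)); // true

// Hand-picked scalars standing in for independent random ones.
const independentScalars = [17, 402, 655, 930];
console.log(hasPairSumCollision(independentScalars)); // false
```

In a group of cryptographic size, independently sampled scalars make such coincidences negligibly unlikely, which is consistent with the ?indep=1 runs matching the reference. The toy modulus here is far too small to show that statistically; it only demonstrates the structural collision.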

Real-hardware results — the lever does NOT clear the bar on Apple

Apple M2 · Chrome 148 · logn=16 · independent points · reps=8 · coupled (memory-flat):

| walkerS | walkerMaxWg | `stream_walker` | wall | algo-buffers | noble cross-check |
|---|---|---|---|---|---|
| 8 (baseline) | 32 | 32.19 ms | 55.0 ms | 47.39 MiB | ✅ PASS (gpu == ref) |
| 16 | 16 | 33.33 ms | 57.0 ms | 47.39 MiB | ✅ PASS |
| 32 | 8 | 30.73 ms | 54.3 ms | 47.39 MiB | ✅ PASS |

Result: the walker phase is roughly flat (−4.5 % at best, S=32 vs. the S=8 baseline), which is inside run-to-run noise. Memory is flat (no regression). Not a significant win. (S=24/maxWg=11 returned an off-curve result on Metal; it is an odd, non-recommended config that passes under SwiftShader. Recommend clean S ∈ {8,16,32}.)
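The coupled (memory-flat) configurations can be sketched as follows, using the relations stated in this PR: NUM_THREADS = walkerMaxWg·256 and partials_buf ∝ NUM_THREADS·S. The `coupledConfig` helper, the `partialsUnits` field (a proportional element count, not bytes), and the round-up rule are assumptions; rounding up is chosen only because it reproduces the S=24/maxWg=11 pair.

```typescript
const THREADS_PER_MAXWG = 256; // from NUM_THREADS = walkerMaxWg * 256
const COUPLED_PRODUCT = 256; // walkerS * walkerMaxWg held at 256 (default 8 * 32)

interface WalkerConfig {
  walkerS: number;
  walkerMaxWg: number;
  numThreads: number;
  partialsUnits: number; // proportional to partials_buf size, not a byte count
}

/** Hypothetical helper: derive walkerMaxWg from walkerS so the product stays near 256. */
function coupledConfig(walkerS: number): WalkerConfig {
  const walkerMaxWg = Math.ceil(COUPLED_PRODUCT / walkerS); // assumption: round up
  const numThreads = walkerMaxWg * THREADS_PER_MAXWG;
  return { walkerS, walkerMaxWg, numThreads, partialsUnits: numThreads * walkerS };
}

for (const s of [8, 16, 24, 32]) {
  console.log(coupledConfig(s));
}
```

For S ∈ {8,16,32} the product is exactly 256 and `partialsUnits` stays at 65536 while the thread count drops from 8192 to 4096 to 2048. For S=24 the division is not exact, so the sketch yields maxWg=11 and a slightly larger 67584 units. This only shows that 24 is not a clean divisor; it does not explain the off-curve result on Metal.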

Decoupled (walkerMaxWg=32 fixed, 8192 threads), to test whether the coupled mode's parallelism loss masked the win: S=8 runs (logn=17 wall 88.7 ms), but S=16 loses the device or hangs on Metal at both logn=16 and 17 (8192 threads × large per-invocation private arrays at TPB=128 → register pressure). So the only stable high-S mode on Apple is the coupled one, which gives the marginal result above.

Adreno (Galaxy S25 Ultra, Snapdragon 8 Elite): the MsmV2 walker pipeline fails with "Device is lost" during the first build/run, at the S=8 baseline. This is a pre-existing pipeline-vs-Adreno incompatibility, not an effect of the S knob; neither baseline nor change is measurable there. (Also: BrowserStack's Android worker drops URL query params that the macOS worker honours.)

Mali (Pixel 9 Pro XL): not run. Given that the baseline pipeline loses the device on Adreno and Apple is already marginal, the strict bar can't be met this session; flagged for the mobile-stability follow-up below.

Local SwiftShader (software WebGPU, logn=14): showed stream_walker 529→272 ms (−49 %) for S=8→32 at flat memory. This is a software-emulation artifact — SwiftShader makes the integer-heavy safegcd disproportionately expensive, so amortizing it helped hugely there but barely on real Metal. Recorded here only to flag that SwiftShader timings do not predict hardware for this kernel.

Honest status — NOT done

The bar (significant time win on Apple and Adreno and Mali, no memory regression) is not met. The real-hardware evidence shows the inversion-amortization lever is marginal on Apple M2 (the safegcd inversion is not the ~47 % real-HW cost the software profile implied, and the only stable high-S mode trades away the parallelism that would amortize it), and the walker pipeline is unstable on Adreno (device-lost) independent of this change.
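As a rough cross-check of the claim about the inversion share, one can fit a deliberately simple model to the measurements: assume a fraction f of the S=8 kernel time is inversion work that scales as 1/S and the rest is fixed. This model is an assumption made for illustration. It ignores the thread-count reduction in coupled mode (8192 → 2048 threads), so the implied fractions are indicative at best. The function names are hypothetical.

```typescript
/** Predicted time ratio t(sNew) / t(sOld) if a fraction f of t(sOld) scales as 1/S. */
function predictedRatio(f: number, sOld: number, sNew: number): number {
  return 1 - f + f * (sOld / sNew);
}

/** Fraction f that would explain an observed time ratio under the same model. */
function impliedFraction(ratio: number, sOld: number, sNew: number): number {
  return (1 - ratio) / (1 - sOld / sNew);
}

// If inversion really were 47 % of the kernel, S=8 -> S=32 should give about 0.65x.
console.log(predictedRatio(0.47, 8, 32));

// Apple M2, logn=16, stream_walker: 32.19 ms -> 30.73 ms.
console.log(impliedFraction(30.73 / 32.19, 8, 32)); // about 0.06

// SwiftShader, logn=14, stream_walker: 529 ms -> 272 ms.
console.log(impliedFraction(272 / 529, 8, 32)); // about 0.65
```

Read under this simplification, the Apple numbers are compatible with an effective inversion share of a few percent, while the SwiftShader numbers are compatible with a share well above half. That matches the qualitative conclusion here, but it is not a substitute for a real hardware profile.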

What this PR does deliver: a correct, behaviour-preserving implementation (private pref_scratch + tunable walkerS/walkerMaxWg, validated against @noble/curves on real Apple M2 at logn=16 for S ∈ {8,16,32}), and a fix to the local correctness harness (?indep=1) that unblocks GPU-less cross-checking for the whole effort.

Suggested next directions (the inversion-amortization path is largely exhausted on Apple): (1) the operator's second lever — drop/trim Montgomery form in the batched-inversion path (untouched here, potentially larger); (2) mobile pipeline stability (resolve the Adreno device-lost) before any mobile perf claim is possible; (3) per-arch register-pressure tuning so high-S can keep more threads resident without device-lost.

Base: stream-walker-impl.

…mortize inversion Lift PR #23726's per-invocation var<private> pref_scratch (frees the 16 KB workgroup occupancy limiter, TPB 64→128) and add walkerS / walkerMaxWg knobs. NUM_THREADS is capped at walkerMaxWg*256 and partials_buf scales as NUM_THREADS*S, so holding walkerS*walkerMaxWg constant keeps peak memory flat while a larger S amortizes the per-batch safegcd inversion (cost/add=|inv|/S).

…knob sweep msm-correctness.ts: ?indep=1 (independent random points — fixes the spurious off-curve from the arithmetic-progression points), ?reps=N timed profiling (per-phase incl. stream_walker, no WASM/SRS so it profiles real devices too), walkerS/walkerMaxWg knobs, /progress heartbeats. run-browserstack.mjs: EXTRA_Q passthrough + --page correctness. Local SwiftShader logn=14: stream_walker 529->358->272ms for S=8/16/32 (memory flat 18.9->18.8 MiB), noble cross-check PASS at every S.

…l for BrowserStack ?ssweep=8,16,24,32 benchmarks each walkerS in one page load (coupled walkerMaxWg=256/S keeps partials flat), so one BrowserStack seat maps the whole curve. bs-serve.mjs keeps vite+cloudflared up to drive many MCP workers.

…sweeps

AztecBot added 2 commits May 30, 2026 02:23

AztecBot added the claudebox label May 30, 2026

AztecBot added 2 commits May 30, 2026 02:46

test(bb/msm): add ?sweepmaxwg= to isolate decoupled (fixed-thread) S sweeps (d74e087)

AztecBot wants to merge 4 commits into stream-walker-impl from cb/msm-walker-inv-amortize