perf(bb/msm): stream-walker — amortize the field inversion via large S + private pref_scratch#23736
Draft
AztecBot wants to merge 4 commits into
Draft
perf(bb/msm): stream-walker — amortize the field inversion via large S + private pref_scratch#23736AztecBot wants to merge 4 commits into
AztecBot wants to merge 4 commits into
Conversation
…mortize inversion Lift PR #23726's per-invocation var<private> pref_scratch (frees the 16 KB workgroup occupancy limiter, TPB 64→128) and add walkerS / walkerMaxWg knobs. NUM_THREADS is capped at walkerMaxWg*256 and partials_buf scales as NUM_THREADS*S, so holding walkerS*walkerMaxWg constant keeps peak memory flat while a larger S amortizes the per-batch safegcd inversion (cost/add=|inv|/S).
…knob sweep msm-correctness.ts: ?indep=1 (independent random points — fixes the spurious off-curve from the arithmetic-progression points), ?reps=N timed profiling (per-phase incl. stream_walker, no WASM/SRS so it profiles real devices too), walkerS/walkerMaxWg knobs, /progress heartbeats. run-browserstack.mjs: EXTRA_Q passthrough + --page correctness. Local SwiftShader logn=14: stream_walker 529->358->272ms for S=8/16/32 (memory flat 18.9->18.8 MiB), noble cross-check PASS at every S.
…l for BrowserStack ?ssweep=8,16,24,32 benchmarks each walkerS in one page load (coupled walkerMaxWg=256/S keeps partials flat), so one BrowserStack seat maps the whole curve. bs-serve.mjs keeps vite+cloudflared up to drive many MCP workers.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Goal & lever
Make the stream-walker MSM accumulator faster on laptop/mobile GPUs without regressing its memory, by attacking the cost the software profile flagged as ~47 % of the accumulate kernel (PR #23732): the safegcd field inversion. The walker does one inversion per batch of S affine adds, so cost-per-add
= |inv| / S. Raise S to amortize, and lift PR #23726's per-invocationvar<private>pref_scratchso the 16 KB workgroup budget no longer pins S small.Change (correct, landed)
pref_scratch→var<private>(mustache{{#pref_private}}, default on; workgroup path preserved). Frees the 16 KB workgroup limiter (TPB 64→128), adds no storage binding (stays within the 10-per-stage floor Mali/Adreno/SwiftShader enforce), removes the cap that pinned S small.walkerS/walkerMaxWgknobs (MsmConfig+ URL).NUM_THREADS = walkerMaxWg·256;partials_buf ∝ NUM_THREADS·S. HoldingwalkerS·walkerMaxWg = 256(default8·32) keeps peak memory flat while raising S cuts the inversion count (= total_adds/S)._generated/shaders.tsin diff).dev/msm-webgpu/msm-correctness.ts): the synthetic points were a random arithmetic progression on G, whose linear relation makes MSM partial sums collide in x (affine-add ÷0) → an off-curve GPU result that PR perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128 #23726 mis-read as a pipeline correctness blocker. With independent random points (?indep=1) the GPU output matches@noble/curvesexactly (on-curve, bit-identical) at logn 8/10/12/14/16. Added WASM/SRS-free timed profiling (?reps=,?ssweep=,?sweepmaxwg=) that runs on real devices too.Real-hardware results — the lever does NOT clear the bar on Apple
Apple M2 · Chrome 148 · logn=16 · independent points · reps=8 · coupled (memory-flat):
stream_walker→ walker phase ~flat (−4.5 % at best), inside run-to-run noise. Memory flat (no regression). Not a significant win. (S=24/maxWg=11 returned off-curve on Metal — an odd, non-recommended config that passes under SwiftShader; recommend clean S ∈ {8,16,32}.)
Decoupled (walkerMaxWg=32 fixed, 8192 threads) — to test if the coupled mode's parallelism loss masked the win: S=8 runs (logn=17 wall 88.7 ms), but S=16 device-losts / hangs on Metal at both logn=16 and 17 (8192 threads × large per-invocation private arrays at TPB=128 → register pressure). So the only stable high-S mode on Apple is the coupled one — which is the marginal result above.
Adreno (Galaxy S25 Ultra, Snapdragon 8 Elite): the MsmV2 walker pipeline
Device is lostduring the first build/run, at the S=8 baseline — a pre-existing pipeline-vs-Adreno incompatibility, not the S-knob; neither baseline nor change is measurable there. (Also: BrowserStack's Android worker drops URL query params that the macOS worker honours.)Mali (Pixel 9 Pro XL): not run — given Adreno device-losts the baseline pipeline and Apple is already marginal, the strict bar can't be met this session; flagged for the mobile-stability follow-up below.
Local SwiftShader (software WebGPU, logn=14): showed
stream_walker529→272 ms (−49 %) for S=8→32 at flat memory. This is a software-emulation artifact — SwiftShader makes the integer-heavy safegcd disproportionately expensive, so amortizing it helped hugely there but barely on real Metal. Recorded here only to flag that SwiftShader timings do not predict hardware for this kernel.Honest status — NOT done
The bar (significant time win on Apple and Adreno and Mali, no memory regression) is not met. The real-hardware evidence shows the inversion-amortization lever is marginal on Apple M2 (the safegcd inversion is not the ~47 % real-HW cost the software profile implied, and the only stable high-S mode trades away the parallelism that would amortize it), and the walker pipeline is unstable on Adreno (device-lost) independent of this change.
What this PR does deliver: a correct, behaviour-preserving implementation (private
pref_scratch+ tunablewalkerS/walkerMaxWg, validated against@noble/curveson real Apple M2 at logn=16 for S ∈ {8,16,32}), and a fix to the local correctness harness (?indep=1) that unblocks GPU-less cross-checking for the whole effort.Suggested next directions (the inversion-amortization path is largely exhausted on Apple): (1) the operator's second lever — drop/trim Montgomery form in the batched-inversion path (untouched here, potentially larger); (2) mobile pipeline stability (resolve the Adreno device-lost) before any mobile perf claim is possible; (3) per-arch register-pressure tuning so high-S can keep more threads resident without device-lost.
Base:
stream-walker-impl.