Skip to content

perf(bb/msm): stream-walker — amortize the field inversion via large S + private pref_scratch#23736

Draft
AztecBot wants to merge 4 commits into
stream-walker-implfrom
cb/msm-walker-inv-amortize
Draft

perf(bb/msm): stream-walker — amortize the field inversion via large S + private pref_scratch#23736
AztecBot wants to merge 4 commits into
stream-walker-implfrom
cb/msm-walker-inv-amortize

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

@AztecBot AztecBot commented May 30, 2026

Goal & lever

Make the stream-walker MSM accumulator faster on laptop/mobile GPUs without regressing its memory, by attacking the cost the software profile flagged as ~47 % of the accumulate kernel (PR #23732): the safegcd field inversion. The walker does one inversion per batch of S affine adds, so cost-per-add = |inv| / S. Raise S to amortize, and lift PR #23726's per-invocation var<private> pref_scratch so the 16 KB workgroup budget no longer pins S small.

Change (correct, landed)

  • pref_scratchvar<private> (mustache {{#pref_private}}, default on; workgroup path preserved). Frees the 16 KB workgroup limiter (TPB 64→128), adds no storage binding (stays within the 10-per-stage floor Mali/Adreno/SwiftShader enforce), removes the cap that pinned S small.
  • walkerS / walkerMaxWg knobs (MsmConfig + URL). NUM_THREADS = walkerMaxWg·256; partials_buf ∝ NUM_THREADS·S. Holding walkerS·walkerMaxWg = 256 (default 8·32) keeps peak memory flat while raising S cuts the inversion count (= total_adds/S).
  • WGSL regenerated (_generated/shaders.ts in diff).
  • Correctness-harness fix (dev/msm-webgpu/msm-correctness.ts): the synthetic points were a random arithmetic progression on G, whose linear relation makes MSM partial sums collide in x (affine-add ÷0) → an off-curve GPU result that PR perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128 #23726 mis-read as a pipeline correctness blocker. With independent random points (?indep=1) the GPU output matches @noble/curves exactly (on-curve, bit-identical) at logn 8/10/12/14/16. Added WASM/SRS-free timed profiling (?reps=, ?ssweep=, ?sweepmaxwg=) that runs on real devices too.

Real-hardware results — the lever does NOT clear the bar on Apple

Apple M2 · Chrome 148 · logn=16 · independent points · reps=8 · coupled (memory-flat):

walkerS walkerMaxWg stream_walker wall algo-buffers noble cross-check
8 (baseline) 32 32.19 ms 55.0 ms 47.39 MiB ✅ PASS (gpu == ref)
16 16 33.33 ms 57.0 ms 47.39 MiB ✅ PASS
32 8 30.73 ms 54.3 ms 47.39 MiB ✅ PASS

walker phase ~flat (−4.5 % at best), inside run-to-run noise. Memory flat (no regression). Not a significant win. (S=24/maxWg=11 returned off-curve on Metal — an odd, non-recommended config that passes under SwiftShader; recommend clean S ∈ {8,16,32}.)

Decoupled (walkerMaxWg=32 fixed, 8192 threads) — to test if the coupled mode's parallelism loss masked the win: S=8 runs (logn=17 wall 88.7 ms), but S=16 device-losts / hangs on Metal at both logn=16 and 17 (8192 threads × large per-invocation private arrays at TPB=128 → register pressure). So the only stable high-S mode on Apple is the coupled one — which is the marginal result above.

Adreno (Galaxy S25 Ultra, Snapdragon 8 Elite): the MsmV2 walker pipeline Device is lost during the first build/run, at the S=8 baseline — a pre-existing pipeline-vs-Adreno incompatibility, not the S-knob; neither baseline nor change is measurable there. (Also: BrowserStack's Android worker drops URL query params that the macOS worker honours.)

Mali (Pixel 9 Pro XL): not run — given Adreno device-losts the baseline pipeline and Apple is already marginal, the strict bar can't be met this session; flagged for the mobile-stability follow-up below.

Local SwiftShader (software WebGPU, logn=14): showed stream_walker 529→272 ms (−49 %) for S=8→32 at flat memory. This is a software-emulation artifact — SwiftShader makes the integer-heavy safegcd disproportionately expensive, so amortizing it helped hugely there but barely on real Metal. Recorded here only to flag that SwiftShader timings do not predict hardware for this kernel.

Honest status — NOT done

The bar (significant time win on Apple and Adreno and Mali, no memory regression) is not met. The real-hardware evidence shows the inversion-amortization lever is marginal on Apple M2 (the safegcd inversion is not the ~47 % real-HW cost the software profile implied, and the only stable high-S mode trades away the parallelism that would amortize it), and the walker pipeline is unstable on Adreno (device-lost) independent of this change.

What this PR does deliver: a correct, behaviour-preserving implementation (private pref_scratch + tunable walkerS/walkerMaxWg, validated against @noble/curves on real Apple M2 at logn=16 for S ∈ {8,16,32}), and a fix to the local correctness harness (?indep=1) that unblocks GPU-less cross-checking for the whole effort.

Suggested next directions (the inversion-amortization path is largely exhausted on Apple): (1) the operator's second lever — drop/trim Montgomery form in the batched-inversion path (untouched here, potentially larger); (2) mobile pipeline stability (resolve the Adreno device-lost) before any mobile perf claim is possible; (3) per-arch register-pressure tuning so high-S can keep more threads resident without device-lost.

Base: stream-walker-impl.

AztecBot added 2 commits May 30, 2026 02:23
…mortize inversion

Lift PR #23726's per-invocation var<private> pref_scratch (frees the 16 KB
workgroup occupancy limiter, TPB 64→128) and add walkerS / walkerMaxWg knobs.
NUM_THREADS is capped at walkerMaxWg*256 and partials_buf scales as
NUM_THREADS*S, so holding walkerS*walkerMaxWg constant keeps peak memory flat
while a larger S amortizes the per-batch safegcd inversion (cost/add=|inv|/S).
…knob sweep

msm-correctness.ts: ?indep=1 (independent random points — fixes the spurious
off-curve from the arithmetic-progression points), ?reps=N timed profiling
(per-phase incl. stream_walker, no WASM/SRS so it profiles real devices too),
walkerS/walkerMaxWg knobs, /progress heartbeats. run-browserstack.mjs: EXTRA_Q
passthrough + --page correctness. Local SwiftShader logn=14: stream_walker
529->358->272ms for S=8/16/32 (memory flat 18.9->18.8 MiB), noble cross-check
PASS at every S.
@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 30, 2026
AztecBot added 2 commits May 30, 2026 02:46
…l for BrowserStack

?ssweep=8,16,24,32 benchmarks each walkerS in one page load (coupled
walkerMaxWg=256/S keeps partials flat), so one BrowserStack seat maps the whole
curve. bs-serve.mjs keeps vite+cloudflared up to drive many MCP workers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant