perf(bb/msm): WebGPU MSM memory wins — pack chunks+signs, drop plan-ring ping-pong, alias reduction buffers#23532
Draft
AztecBot wants to merge 1 commit into
Three independent memory-reducing refactors on top of `zw/msm-webgpu-experiments-v2`. Together they take the GPU storage footprint at `logN=17, c=15` from 149.3 MiB → 130.1 MiB (−19.2 MiB / −13%) without runtime regression on M2 (Apple Silicon) or S25 Ultra (Adreno 8xx). Cross-check against WASM Pippenger passes on both devices.

1. Pack `chunksBuf` + `signsBuf` into one u32

`decompose_scalars_booth` already produces a `bucket` (≤ 2^14 for `c=15`, fits in 15 bits) and a 1-bit `neg` for every `(point, window)` slot. Until now those went to two separate `array<u32>` storage buffers of size `batchSlots × 4` each. They are now combined into a single u32 per slot, with the bucket in the low 15 bits and `neg` in bit 15, and the three downstream readers each pull the field they need:

- `transpose_count_tiled` / `transpose_scatter_tiled`: `let col = all_csr_col_idx[i] & 0x7fffu;`
- `csr_to_v2_active_sums` (both INDEX_MODE and non-INDEX_MODE): `let neg = (signs[...] >> 15u) & 1u;`

The host drops the separate `signsBuf` allocation entirely and removes the `signs` binding from the decompose layout (4 entries instead of 5). The `signs` symbol still appears in `csr_to_v2_active_sums`, where it is now bound to the packed `chunksBuf` (since `signsBuf = chunksBuf` in the host).

Saves `4 × batchSlots` bytes per MSM (≈ 6.3 MiB at N=2^18 c=15, batchWindows=6).

Important debug note: WGSL uniform-controlled shifts are miscompiled on at least Apple Silicon + Adreno
The first attempt used `bucket | (neg << c)`, where `c = params.z` (a `u32` uniform with value 15). The cross-check produced wrong, non-deterministic results on both M2 and S25, despite the encoding being semantically identical. Changing to the constant `bucket | (neg << 15u)` (same value, same bit position) made both devices pass. That's a real toolchain issue worth filing upstream against Dawn / Tint; for now this PR sticks to constant shift amounts.

2. Drop the plan-ring ping-pong
`chunkPlanRing` / `scatterPlanRing` / `carryPlanRing` were each allocated as 2-buffer rings indexed by `lv & 1`. They are written by `plannerB` and read by the same level's `fused` / `carry`; each level's WebGPU compute pass ends (with the implicit pass barrier) before the next level's `plannerB` writes. No cross-level read/write race exists, so the ping-pong is unnecessary.

Collapsed each ring to a single buffer (`chunkPlanRing.push(cp, cp); scatterPlanRing.push(sp, sp); carryPlanRing.push(yp, yp);`) so existing `[ring]` indexing keeps working but the three duplicate allocations vanish.

Saves `chunkPlanRing[1] + scatterPlanRing[1] + carryPlanRing[1]` (≈ 10 MiB at N=2^18, less at smaller N).

Note: `countsBufs[0/1]` and `offsetsBufs[0/1]` are NOT collapsed. `plannerA` reads `countsBufs[inIdx]` while writing `countsBufs[outIdx]`, so collapsing those would race within the same dispatch.

3. Alias reduction-only buffers into batch-loop buffers
`redBuf` / `isPresentBuf` / `reducePrefScratch` are only live during reduction, which runs strictly after the outer batch loop completes. `bufA` / `valIdxBuf` / `bufB` are live during the batch loop but dead by the time reduction runs. Aliased the reduction buffers as offset-0 slices of the batch-loop buffers via `{ buffer, offset, size }` bindings: same underlying GPU allocation, two non-overlapping logical lifetimes.

Sizes verified at prepare time:

- `bufA.size >= 64·RED_M` (= 17.8 MiB at N=2^18)
- `valIdxBuf.size >= 4·RED_M` (= 1.1 MiB)
- `bufB.size >= NUM_WINDOWS·REDUCE_WG·MAXC·2·16` (= 8.9 MiB)

Saves 3 separate allocations (`redBuf` + `isPresentBuf` + `reducePrefScratch`) entirely.

Diagnostics also included
Small things that are useful to keep:
- `__msm_mem_last` window global + `console.log` at end of `MsmV2.prepare` reporting `prepBuffers.length`, `totalBytes`, `numBatches`, `batchWindows`, `M1`. These are captured into the `mem` field of the `autorun=msm-cross-check` JSONL output so per-step memory accounting is grep-able from the bench harness.
- `coi=1%26autorun%3D...` URL-unpacking helper in `dev/msm-webgpu/main.ts` so BrowserStack mobile sessions (which truncate at the first literal `&` in the URL) can pass autorun + logn params through the `coi` value.

Measured at logN=17, c=15 on fresh BrowserStack workers
(Results table not recovered; the only surviving cell fragment references commit `0999593b2a6`.)

The S25 swing on step 3 (+5%) is within the BS-S25 per-run variance we measured for this workload; the final S25 number lands back at parity.
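For reference, the packed layout from refactor 1 can be mirrored on the host when inspecting read-back buffers. The following is a minimal sketch, not code from this PR: the helper names `packChunk`, `unpackBucket`, and `unpackNeg` and the two constants are hypothetical, and only the bit layout (bucket in bits 0..14, `neg` in bit 15, constant shift of 15) is taken from the description above.

```typescript
// Bit layout taken from the PR text: bucket in bits 0..14, neg in bit 15.
// All names in this block are hypothetical and exist only for illustration.
const SIGN_SHIFT = 15;      // compile-time constant, mirroring `neg << 15u`
const BUCKET_MASK = 0x7fff; // low 15 bits, mirroring `& 0x7fffu`

// Pack one (bucket, neg) pair into a single u32 value.
function packChunk(bucket: number, neg: number): number {
  if (!Number.isInteger(bucket) || bucket < 0 || bucket > BUCKET_MASK) {
    throw new RangeError(`bucket ${bucket} does not fit in 15 bits`);
  }
  if (neg !== 0 && neg !== 1) {
    throw new RangeError(`neg must be 0 or 1, got ${neg}`);
  }
  // `>>> 0` keeps the result in unsigned 32-bit range.
  return (bucket | (neg << SIGN_SHIFT)) >>> 0;
}

// Reader side, equivalent to `packed & 0x7fffu` in WGSL.
function unpackBucket(packed: number): number {
  return packed & BUCKET_MASK;
}

// Reader side, equivalent to `(packed >> 15u) & 1u` in WGSL.
function unpackNeg(packed: number): number {
  return (packed >>> SIGN_SHIFT) & 1;
}

// One u32 per slot now carries both fields.
const slots = new Uint32Array([packChunk(16384, 1), packChunk(3, 0)]);
```

Keeping the shift as a named constant rather than a runtime value follows the same choice the PR makes in WGSL, where the uniform-controlled shift produced wrong results.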
Out of scope (planned follow-ups)
These are real follow-up wins but were either invasive enough to defer or hit performance regressions in this round:
- `bufA` to lv-1 output width (≈ 25 MiB at N=2^18). Needs `ba_planner_v2_emit` + `ba_planner_v2_offsets` to accept distinct `M_in` and `M_out` (today they only take one wstride). The fused/carry kernels already accept distinct `M_old` / `M_new`, so the host plumbing is small once the planner is updated.
- `prefScratchBuf` (≈ 17 MiB at N=2^18). Mechanically straightforward, but at WGI=128 × S=8 the per-workgroup shared footprint hits the 32 KiB `maxComputeWorkgroupStorageSize` ceiling, forcing 1 workgroup per SM on Adreno and regressing S25 runtime by ≈ 23 %. Needs S=4 or a hybrid layout to bring shared usage under the budget.
- `ba_fused_super_bench` so the intermediate level output never lands in global memory.

Created by claudebox · group: slackbot