feat(bb/msm): coop-walker — workgroup-cooperative inversion bucket accumulator#23739
Draft
AztecBot wants to merge 6 commits into
Sweep walker vs coop across logns in one page load (one BrowserStack worker per device) for the real-hardware comparison.
Absorbs one-time driver JIT / cold GPU-clock cost so the first logN in the sweep is not anomalously slow.
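The two notes above (one untimed warm-up, then a sweep over logns reporting the minimum over reps) can be sketched roughly as follows. This is an illustrative harness, not the PR's code: `Kernel`, `SweepRow`, `sweep`, and the injected `now` clock are hypothetical names, and the kernel callback is assumed to resolve only once the GPU work has finished.

```typescript
// Hypothetical callback: runs one accumulate pass of size 2^logn and resolves
// when the GPU work is complete (assumption; not the PR's actual API).
type Kernel = (logn: number) => Promise<void>;

interface SweepRow {
  logn: number;
  minMs: number; // minimum wall-clock time over all reps
}

async function sweep(
  kernel: Kernel,
  logns: number[],
  reps: number,
  // Injected clock so the loop can be tested; in a browser,
  // performance.now() would be the higher-resolution choice.
  now: () => number = () => Date.now(),
): Promise<SweepRow[]> {
  const first = logns[0];
  if (first === undefined || reps < 1) return [];

  // Untimed warm-up: absorbs one-time driver JIT / cold GPU-clock cost so the
  // first timed size is not anomalously slow.
  await kernel(first);

  const rows: SweepRow[] = [];
  for (const logn of logns) {
    let minMs = Infinity;
    for (let r = 0; r < reps; r++) {
      const t0 = now();
      await kernel(logn);
      minMs = Math.min(minMs, now() - t0);
    }
    rows.push({ logn, minMs });
  }
  return rows;
}
```

Taking the minimum rather than the mean is a common choice when timing noise only ever makes a run slower; calling `sweep` once per kernel (walker, then coop) in the same page load keeps both under the same driver and clock state.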
Re-architecture: cooperative-inversion bucket accumulator (coop-walker)
The stream-walker accumulate kernel is memory-bound / occupancy-limited; its
occupancy is capped by two costs that scale with S (slots per thread): ~150+
live registers and a 16 KB var<workgroup> pref_scratch (one resident
workgroup on Mali).
coop-walker sets slots-per-thread=1 and shares the batched inversion across
the whole workgroup (interleaved prefix/suffix product scan + one safegcd
inversion per round): per-thread registers ~8× lower, workgroup memory
16 KB → ~4 KB, ~8× fewer inversions, still streaming each point once.
Drop-in for the stream_walker dispatch (accum: 'coop').
Details: COOP_WALKER_DESIGN.md.
Correctness: GPU vs @noble/curves, headless SwiftShader
coop GREEN (gpu == noble): logn 8/10/12/14, multiple seeds. Walker too.
Real-hardware timing (BrowserStack, min ms over reps)
Apple M2 (macOS, Chrome 148) — desktop-class, NOT memory-starved:
coop loses at logn≥14 on M2. Expected: M2 has ample shared memory + many
cores, so the walker is not occupancy-starved there, and coop's structural cost
(1 add/thread/round ⇒ ~8× more rounds, each with workgroup barriers) is not
repaid by latency-hiding it doesn't need. The design's premise — the 16 KB
scratch capping the walker to one resident workgroup — only bites on
mobile (Mali 16 KB), which is the regime the measured ground truth
describes. Mobile (Adreno/Mali) runs are in flight to test exactly this.
SwiftShader agrees directionally (coop wins logn≤12, loses at 16) but is
correctness-only — it models none of the occupancy/latency-hiding the design
targets, so neither it nor M2 is the design's intended regime.
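For readers unfamiliar with the batched inversion that each coop round shares across the workgroup, here is a serial CPU sketch of the underlying idea (Montgomery's trick: prefix products, a single inversion, then a backward sweep). It rests on several assumptions: a toy prime modulus instead of the real field, an extended-Euclid inverse standing in for safegcd, and a sequential loop instead of the interleaved workgroup scan. All names (`P`, `mulmod`, `inv`, `batchInverse`, `inversionCount`) are illustrative and not taken from the PR.

```typescript
// Toy prime modulus (assumption; the real kernel works over a much larger
// field). All products stay below 2^53, so plain numbers are exact here.
const P = 1000003;

const mulmod = (a: number, b: number): number => (a * b) % P;

// Counts calls to inv() so the "one inversion per batch" property is visible.
let inversionCount = 0;

// Modular inverse via extended Euclid (a stand-in for safegcd, not safegcd).
function inv(a: number): number {
  inversionCount++;
  let [r0, r1] = [P, ((a % P) + P) % P];
  let [t0, t1] = [0, 1];
  while (r1 !== 0) {
    const q = Math.floor(r0 / r1);
    [r0, r1] = [r1, r0 - q * r1];
    [t0, t1] = [t1, t0 - q * t1];
  }
  if (r0 !== 1) throw new Error("element is not invertible mod P");
  return ((t0 % P) + P) % P;
}

// Inverts every element of xs using exactly one call to inv().
// Throws if any element is 0 mod P, since the total product is then 0.
function batchInverse(xs: number[]): number[] {
  // Forward pass: record each element with the product of all earlier ones.
  const steps: Array<[number, number]> = [];
  let acc = 1;
  for (const x of xs) {
    steps.push([x, acc]);
    acc = mulmod(acc, x);
  }

  // Single inversion of the total product.
  let accInv = inv(acc);

  // Backward pass: accInv holds (x_0 * ... * x_i)^-1 on entry to element i.
  const invs: number[] = [];
  for (const [x, before] of steps.reverse()) {
    invs.push(mulmod(accInv, before)); // (x_0..x_i)^-1 * (x_0..x_{i-1}) = x_i^-1
    accInv = mulmod(accInv, x);        // now (x_0 * ... * x_{i-1})^-1
  }
  return invs.reverse();
}
```

The trade is roughly three multiplications per element in exchange for replacing n inversions with one, which is why widening the batch from a single thread's slots to the whole workgroup cuts the inversion count.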
Status
devices despite the 213-byte wasm stub.
Reporting mobile numbers as they land. If coop wins on mobile but not desktop,
that is the honest, useful result: the structural win is occupancy on
memory-starved GPUs, and it does not help GPUs that already have occupancy.