feat(bb/msm): coop-walker — workgroup-cooperative inversion bucket accumulator#23739

Draft
AztecBot wants to merge 6 commits into stream-walker-impl from cb/msm-coop-walker


Collaborator

@AztecBot AztecBot commented May 30, 2026

Re-architecture: cooperative-inversion bucket accumulator (coop-walker)

The stream-walker accumulate kernel is memory-bound and occupancy-limited. Two
costs that scale with S (slots per thread) cap its occupancy: ~150+ live
registers and a 16 KB var<workgroup> pref_scratch, which allows only one
resident workgroup on Mali. coop-walker sets slots-per-thread = 1 and shares the
batched inversion across the whole workgroup, using an interleaved
prefix/suffix product scan plus one safegcd inversion per round. The result:
per-thread registers ~8× lower, workgroup memory 16 KB → ~4 KB, ~8× fewer
inversions, and each point is still streamed only once. It is a drop-in for the
stream_walker dispatch (accum: 'coop'). Details: COOP_WALKER_DESIGN.md.
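
For readers unfamiliar with batched inversion, here is a minimal CPU-side sketch of the prefix/suffix product trick (often called Montgomery's trick). It is not the WGSL kernel from this PR. The modulus `P`, the function name `batch_inverse`, and the use of Python's built-in `pow(x, -1, p)` as a stand-in for the safegcd inversion are all assumptions made for illustration.

```python
# Illustrative prime modulus (assumption): 2**255 - 19.
# The PR's actual base field is not stated in this description.
P = 2**255 - 19


def batch_inverse(values, p=P):
    """Invert every element of `values` mod p using a single modular inversion."""
    n = len(values)
    if n == 0:
        return []

    # Forward pass: prefix[i] = values[0] * ... * values[i-1] (mod p), prefix[0] = 1.
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        if v % p == 0:
            raise ValueError("zero is not invertible")
        prefix[i + 1] = prefix[i] * v % p

    # The only inversion in the whole batch (stand-in for the safegcd inversion).
    inv_running = pow(prefix[n], -1, p)

    # Backward pass: peel off one factor at a time.
    out = [0] * n
    for i in range(n - 1, -1, -1):
        # Invariant: inv_running == (values[0] * ... * values[i])^-1 (mod p).
        out[i] = inv_running * prefix[i] % p
        inv_running = inv_running * values[i] % p
    return out


if __name__ == "__main__":
    vals = [3, 5, 7, 11, 123456789]
    invs = batch_inverse(vals)
    print(all(v * inv % P == 1 for v, inv in zip(vals, invs)))
```

The sequential version costs one inversion plus roughly three multiplications per element. Sharing the batch across a workgroup, as described above, presumably means each thread contributes one element and the products are combined with a workgroup scan rather than a serial loop.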

Correctness — GPU vs @noble/curves, headless SwiftShader

coop is GREEN (GPU output == noble) at logN 8/10/12/14 across multiple seeds.
The walker passes as well.

Real-hardware timing (BrowserStack, min ms over reps)

Apple M2 (macOS, Chrome 148) — desktop-class, NOT memory-starved:

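| logN | walker (ms) | coop (ms) | coop speedup |
|------|-------------|-----------|--------------|
| 12   | 15.0        | 14.6      | 1.03×        |
| 14   | 23.7        | 43.2      | 0.55×        |
| 16   | 49.4        | 107       | 0.46×        |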
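
The speedup column is the walker time divided by the coop time, so values below 1 mean coop is slower. A small sketch that recomputes it from the M2 timings above (the names `timings_ms` and `coop_speedup` are illustrative, not part of the harness):

```python
# M2 timings from the table above: logN -> (walker ms, coop ms).
timings_ms = {12: (15.0, 14.6), 14: (23.7, 43.2), 16: (49.4, 107.0)}


def coop_speedup(walker_ms, coop_ms):
    """Ratio > 1 means coop is faster than the walker; < 1 means slower."""
    return walker_ms / coop_ms


if __name__ == "__main__":
    for logn, (walker_ms, coop_ms) in sorted(timings_ms.items()):
        print(f"logN={logn}: {coop_speedup(walker_ms, coop_ms):.2f}x")
```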

coop loses at logN ≥ 14 on M2. This is expected: M2 has ample shared memory and
many cores, so the walker is not occupancy-starved there, and coop's structural
cost (1 add/thread/round ⇒ ~8× more rounds, each with workgroup barriers) is not
repaid by latency hiding that the M2 does not need. The design's premise, that
the 16 KB scratch caps the walker at one resident workgroup, only bites on
mobile (Mali 16 KB), which is the regime the measured ground truth
describes. Mobile (Adreno/Mali) runs are in flight to test exactly this.
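
The occupancy argument can be made concrete with a toy model that considers only workgroup memory. This is a simplification: real occupancy also depends on register pressure and other per-device limits, and the ~4 KB coop figure is approximate. The function name `resident_workgroups` is hypothetical.

```python
KB = 1024


def resident_workgroups(shared_mem_bytes, scratch_bytes_per_workgroup):
    """Upper bound on co-resident workgroups if workgroup memory were the only limit."""
    return shared_mem_bytes // scratch_bytes_per_workgroup


if __name__ == "__main__":
    mali_shared = 16 * KB  # Mali figure quoted in this description
    print("walker:", resident_workgroups(mali_shared, 16 * KB))  # 16 KB pref_scratch
    print("coop:  ", resident_workgroups(mali_shared, 4 * KB))   # ~4 KB scratch
```

Under this model the walker fits one workgroup on a 16 KB device while coop could fit up to four. A GPU with much more shared memory is not constrained by either kernel in this way, which is consistent with the M2 result.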

SwiftShader agrees directionally (coop wins at logN ≤ 12, loses at 16) but is
correctness-only: it models none of the occupancy or latency hiding the design
targets, so neither it nor M2 is the design's intended regime.

Status

  • ✅ coop-walker kernel, correct, drop-in, pushed.
  • ✅ harness: WASM-free GPU-vs-noble check + same-device multi-logN A/B sweep;
    runs on real devices despite the 213-byte wasm stub.
  • ✅ Apple M2 real-HW numbers (above).
  • ⏳ Adreno (S25 Ultra) + Mali (Pixel 9 Pro XL) — the memory-starved regime.

Reporting mobile numbers as they land. If coop wins on mobile but not desktop,
that is the honest, useful result: the structural win is occupancy on
memory-starved GPUs, and it does not help GPUs that already have occupancy.

@AztecBot AztecBot added the claudebox label ("Owned by claudebox. it can push to this PR.") May 30, 2026
AztecBot added 5 commits May 30, 2026 03:44
Sweep walker vs coop across logns in one page load (one BrowserStack worker
per device) for the real-hardware comparison.
Absorbs one-time driver JIT / cold GPU-clock cost so the first logN in the
sweep is not anomalously slow.