feat(bb/msm): coop-walker — workgroup-cooperative inversion bucket accumulator by AztecBot · Pull Request #23739 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T03:23:35Z

Re-architecture: cooperative-inversion bucket accumulator (`coop-walker`)

The stream-walker accumulate kernel is memory-bound and occupancy-limited. Its
occupancy is capped by two costs that scale with S (slots per thread): ~150+
live registers and a 16 KB `var<workgroup>` `pref_scratch` buffer, which allows
only one resident workgroup on Mali. coop-walker sets slots-per-thread to 1 and
shares the batched inversion across the whole workgroup (an interleaved
prefix/suffix product scan plus one safegcd inversion per round). The result:
per-thread registers ~8× lower, workgroup memory 16 KB → ~4 KB, and ~8× fewer
inversions, while still streaming each point once. It is a drop-in for the
stream_walker dispatch (`accum: 'coop'`).
Details: COOP_WALKER_DESIGN.md.
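
For readers unfamiliar with batched inversion, the following is a minimal sequential sketch of the prefix/suffix product trick that lets many field elements share a single inversion. It is not the kernel's code: the function names, the modulus `P` (the Mersenne prime 2^61 - 1, chosen only for illustration), and the Fermat-exponentiation inverse (used here in place of safegcd) are all assumptions, and the real kernel performs the scan cooperatively across a workgroup rather than in a loop.

```typescript
// Illustrative modulus (assumption): the Mersenne prime 2^61 - 1.
const P: bigint = (1n << 61n) - 1n;

// Modular exponentiation by square-and-multiply.
function modPow(base: bigint, exp: bigint, p: bigint): bigint {
  let result = 1n;
  let b = ((base % p) + p) % p;
  let e = exp;
  while (e > 0n) {
    if ((e & 1n) === 1n) result = (result * b) % p;
    b = (b * b) % p;
    e >>= 1n;
  }
  return result;
}

// Single inversion via Fermat's little theorem (stand-in for safegcd).
function modInv(a: bigint, p: bigint): bigint {
  return modPow(a, p - 2n, p);
}

// Inverts every element with one modInv call.
// Assumes every input is nonzero modulo p.
function batchInverse(values: bigint[], p: bigint): bigint[] {
  const n = values.length;
  if (n === 0) return [];

  // prefix[i] = product of values[0..i-1]
  const prefix: bigint[] = new Array<bigint>(n);
  let acc = 1n;
  for (let i = 0; i < n; i++) {
    prefix[i] = acc;
    acc = (acc * values[i]) % p;
  }

  // running = inverse of the product of values[0..i], walking backwards
  let running = modInv(acc, p);
  const out: bigint[] = new Array<bigint>(n);
  for (let i = n - 1; i >= 0; i--) {
    out[i] = (prefix[i] * running) % p; // inverse of values[i]
    running = (running * values[i]) % p; // drop values[i] from the product
  }
  return out;
}
```

The cost is one inversion plus roughly three multiplications per element, which is why widening the batch from one thread's slots to the whole workgroup reduces the number of inversions.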

Correctness — GPU vs `@noble/curves`, headless SwiftShader

coop is GREEN (GPU == noble) at logN 8/10/12/14 across multiple seeds. The walker is GREEN too.

Real-hardware timing (BrowserStack, min ms over reps)

Apple M2 (macOS, Chrome 148) — desktop-class, NOT memory-starved:

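| logN | walker (ms) | coop (ms) | coop speedup (walker / coop) |
|------|-------------|-----------|------------------------------|
| 12   | 15.0        | 14.6      | 1.03×                        |
| 14   | 23.7        | 43.2      | 0.55×                        |
| 16   | 49.4        | 107       | 0.46×                        |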
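
The speedup column is the walker time divided by the coop time, so values below 1 mean coop is slower. A small sketch that recomputes the column from the M2 timings above (the `TimingRow` type and `coopSpeedup` helper are illustrative names, not part of the harness):

```typescript
interface TimingRow {
  logN: number;
  walkerMs: number; // min ms over reps, walker kernel
  coopMs: number;   // min ms over reps, coop kernel
}

// Apple M2 numbers copied from the table above.
const m2Rows: TimingRow[] = [
  { logN: 12, walkerMs: 15.0, coopMs: 14.6 },
  { logN: 14, walkerMs: 23.7, coopMs: 43.2 },
  { logN: 16, walkerMs: 49.4, coopMs: 107 },
];

// Speedup > 1 means coop is faster than the walker.
function coopSpeedup(row: TimingRow): number {
  return row.walkerMs / row.coopMs;
}

for (const row of m2Rows) {
  console.log(`logN=${row.logN}: ${coopSpeedup(row).toFixed(2)}x`);
}
```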

coop loses at logn≥14 on M2. Expected: M2 has ample shared memory + many
cores, so the walker is not occupancy-starved there, and coop's structural cost
(1 add/thread/round ⇒ ~8× more rounds, each with workgroup barriers) is not
repaid by latency-hiding it doesn't need. The design's premise — the 16 KB
scratch capping the walker to one resident workgroup — only bites on
mobile (Mali 16 KB), which is the regime the measured ground truth
describes. Mobile (Adreno/Mali) runs are in flight to test exactly this.
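
The occupancy premise can be illustrated with back-of-the-envelope arithmetic. This sketch assumes workgroup memory is the only limiter (real occupancy also depends on registers and other driver limits), takes the 16 KB Mali budget and the 16 KB / ~4 KB scratch sizes from the description above, and uses a hypothetical helper name:

```typescript
const KB = 1024;

// Upper bound on concurrently resident workgroups when workgroup memory
// is the only constraint (a simplifying assumption).
function residentWorkgroups(
  sharedBytesPerCore: number,
  scratchBytesPerWorkgroup: number,
): number {
  if (scratchBytesPerWorkgroup <= 0) {
    throw new RangeError("scratch size must be positive");
  }
  return Math.floor(sharedBytesPerCore / scratchBytesPerWorkgroup);
}

const maliBudget = 16 * KB;

// Walker: 16 KB scratch fills the whole budget.
const walkerResident = residentWorkgroups(maliBudget, 16 * KB);

// coop: ~4 KB scratch (approximate figure from the description).
const coopResident = residentWorkgroups(maliBudget, 4 * KB);

console.log({ walkerResident, coopResident });
```

Under these assumptions the walker gets one resident workgroup and coop gets up to four, which is the extra latency hiding the design is betting on for memory-starved GPUs.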

SwiftShader agrees directionally (coop wins logn≤12, loses at 16) but is
correctness-only — it models none of the occupancy/latency-hiding the design
targets, so neither it nor M2 is the design's intended regime.

Status

✅ coop-walker kernel, correct, drop-in, pushed.
✅ harness: WASM-free GPU-vs-noble check + same-device multi-logN A/B sweep;
runs on real devices despite the 213-byte wasm stub.
✅ Apple M2 real-HW numbers (above).
⏳ Adreno (S25 Ultra) + Mali (Pixel 9 Pro XL) — the memory-starved regime.

Reporting mobile numbers as they land. If coop wins on mobile but not desktop,
that is the honest, useful result: the structural win is occupancy on
memory-starved GPUs, and it does not help GPUs that already have occupancy.


feat(bb/msm): coop-walker — workgroup-cooperative inversion bucket accumulator

868c887

AztecBot added the claudebox label (owned by claudebox; it can push to this PR) on May 30, 2026

AztecBot added 5 commits May 30, 2026 03:44

update PR #23739

b8a2bba

update PR #23739

c28816a

update PR #23739

8e6432c

feat(bb/msm): multi-logn same-device A/B autorun for coop-walker bench

04685ab

Sweep walker vs coop across logns in one page load (one BrowserStack worker per device) for the real-hardware comparison.

feat(bb/msm): pre-sweep GPU warmup in A/B autorun

cc462e2

Absorbs one-time driver JIT / cold GPU-clock cost so the first logN in the sweep is not anomalously slow.

AztecBot wants to merge 6 commits into stream-walker-impl from cb/msm-coop-walker