
fix(bb.js): stream large Android WebGPU MSM chunks#23523

Draft
AztecBot wants to merge 103 commits into next from cb/549298179daa

@AztecBot
Collaborator

Summary

  • Adds a bounded-footprint WebGPU MSM path for large Android SRS-prefix MSMs: split into sequential 2^17 chunks, stream a chunk-local Montgomery point pool, reuse one compiled MsmV2, add per-window chunk sums, and return the combined window sums.
  • Applies the same path in the dev BrowserStack harness and the production WebGPU bridge.
  • Removes the unsafe large decompose/transpose path from MsmV2 prepare by materializing level-0 CSR on the host, and keeps desktop/non-Android on the existing full-size path.
  • Fixes an out-of-bounds read in the kind-2 reduction padding lane.
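
A minimal sketch of the chunk-and-combine shape described in the first bullet, assuming each chunk yields one partial sum per window. The names `chunkRanges`, `streamedWindowSums`, `msmChunk`, and `add` are hypothetical and are not the PR's API; the point type is left generic.

```typescript
// Hypothetical sketch: names and types are illustrative, not the PR's code.

/** Split [0, n) into sequential half-open ranges of at most `chunkSize` points. */
function chunkRanges(n: number, chunkSize: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < n; start += chunkSize) {
    ranges.push([start, Math.min(start + chunkSize, n)]);
  }
  return ranges;
}

/** Run `msmChunk` on each range in order and add the per-window sums together. */
function streamedWindowSums<P>(
  n: number,
  chunkSize: number,
  msmChunk: (start: number, end: number) => P[],
  add: (a: P, b: P) => P,
): P[] {
  let acc: P[] | undefined;
  for (const [start, end] of chunkRanges(n, chunkSize)) {
    const sums = msmChunk(start, end);
    // First chunk seeds the accumulator; later chunks are added window by window.
    acc = acc === undefined ? sums.slice() : acc.map((p, w) => add(p, sums[w]));
  }
  if (acc === undefined) throw new Error('empty MSM');
  return acc;
}
```

Only one chunk's point pool needs to be resident at a time in this shape, which is the bounded-footprint property the summary relies on.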

Validation

  • yarn generate:wgsl
  • ./node_modules/.bin/prettier -w src/msm_webgpu/msm_v2.ts src/msm_webgpu/bridge/main.ts src/msm_webgpu/cuzk/shader_manager.ts dev/msm-webgpu/main.ts
  • ./node_modules/.bin/tsgo -p dev/msm-webgpu/tsconfig.json --noEmit
  • S25 BrowserStack 2^18, scalar_seed=2, streamed 2^17 chunks: WebGPU completed and cross-checked against WASM. GPU compute reported 403.4 ms; WASM reported 392.4 ms; prior full-size S25 path lost the device twice.

Known local check limitation

  • ./node_modules/.bin/tsgo -p tsconfig.json --noEmit still fails before reaching this change because generated cbind / aztec-wsdb files are absent in this checkout.

Created by claudebox · group: slackbot

suyash67 and others added 30 commits May 13, 2026 09:19
Plumbing for delegating BN254 batch MSMs from the WASM prover to a
WebGPU MSM running in the browser's main thread, plus a dev page for
running the GPU MSM against a noble Pippenger reference.

C++ side (BBERG_WEBGPU_MSM_HOOK, WASM-only):
- batch_multi_scalar_mul delegates to bb_external_msm_bn254 when the
  batch has at least one MSM hitting BBERG_WEBGPU_MSM_MIN_N (2^16).
  handle_edge_cases=true and Grumpkin stay on native Pippenger.
- Marshalling helpers (LE bytes, auto Montgomery strip/rewrap) split
  into a header that can be unit-tested without the WASM-only hook.

bb.js side (barretenberg/ts/src/msm_webgpu/):
- Port of tal-webgpu submission.ts / cuzk drivers / WGSL templates,
  BN254 only. BLS12-377, GLV, and atheonxyz entries removed.
- Bridge: SAB-based protocol + worker stub + main-thread host that
  owns the GpuContext and CachedBases.
- setExtraEnvImports on BarretenbergWasmBase so the bridge can inject
  bb_external_msm_bn254 into the WASM env without the base class
  knowing about WebGPU.

Dev page (barretenberg/ts/dev/msm-webgpu/):
- Vite-served HTML that generates random affine points + Fr scalars,
  runs compute_bn254_msm, cross-checks against noble's Pippenger MSM.
- WGSL unit tests for the decompose and parallel-transpose stages.
- Run via `yarn dev:msm-webgpu`.

Correctness fixes made while bringing the port up:
- SMVP dispatch math at chunk_size=15 (num_subtasks=18) was truncating
  to zero workgroups. Refactored to pick smvp_subtasks_per_iter as the
  largest divisor of num_subtasks <= 4 and dispatch the full pre-
  computed geometry per iter.
- SMVP shader skips the (j=1, id%h=0) chunk=0 row. At the top BN254
  subtask nearly all random Fr scalars decompose to chunk 0 there, so
  this one thread would otherwise iterate through ~n/2 mixed-adds and
  trip Chrome's GPU watchdog at n >= 2^17. The result was being
  discarded anyway (bucket_idx=0 sentinel).
- Dev page point layout switched to interleaved [x_0|y_0|x_1|y_1|...]
  per point, matching marshal_points and the convert shader's split.
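
The divisor selection in the first fix can be sketched as below. `largestDivisorAtMost` is a hypothetical helper name and assumes `n >= 1`; it is not the function in the port.

```typescript
// Hypothetical sketch of "largest divisor of num_subtasks <= 4".
function largestDivisorAtMost(n: number, cap: number): number {
  for (let d = Math.min(cap, n); d >= 1; d--) {
    if (n % d === 0) return d; // first hit from the top is the largest divisor
  }
  return 1;
}

// With num_subtasks = 18 and a cap of 4 the pick is 3, i.e. 6 iterations of 3 subtasks.
const subtasksPerIter = largestDivisorAtMost(18, 4);
const iterations = 18 / subtasksPerIter;
```

Choosing a divisor keeps every iteration the same size, so no iteration ends up with a truncated (zero) workgroup count.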

Validated on n = 2^16, 2^17 against noble's MSM reference.
Replaces per-Run random-point regen with the first 2^20 points of the
public BN254 G1 SRS (fetched from crs.aztec-cdn.foundation, decompressed
once, cached in IndexedDB). Each Run now slices the prefix and pairs it
with fresh random Fr scalars, exercising the same bases bb.js hands the
WASM prover. log2(n) range is 16..20.

Cold pull is slow (~2.5 min) because decompression is JS bigint sqrts. That is
fine for a one-off, and the real prove-flow demo replaces it with bb.js WASM
srsInitSrs. A tqdm-style progress bar reports download/decompress/cache
phases with rate + ETA so the multi-minute wait is observable.
C++:
- Extract MSM::batch_multi_scalar_mul into a hook-bypassing
  batch_multi_scalar_mul_native. The original entry now branches on
  BBERG_WEBGPU_MSM_HOOK and otherwise delegates to native, so callers
  that explicitly want the in-tree Pippenger (e.g. the in-browser
  comparison harness) don't recurse back through the JS bridge.
- Add a WASM_EXPORT bb_native_pippenger_bn254 that runs Pippenger
  on raw LE-32 byte inputs without going through cbindCall('bbapi', ...)
  msgpack. Intended for direct JS callers that want to measure the
  Pippenger cost without serialization overhead.

bb.js worker factory (only when BBERG_WEBGPU_MSM_HOOK is compiled in):
- main.worker.ts / thread.worker.ts install default stubs for the hook's
  env imports (bb_external_msm_bn254 throws, bb_publish_srs_bn254 no-op)
  before WASM instantiation. WebAssembly.instantiate refuses to link
  without these even when the bridge isn't wired up.
- Both worker entries use dynamic import() inside an async IIFE so the
  error/unhandledrejection listeners register synchronously at module
  evaluation, before any import resolves. Bootstrap failures get
  postMessaged back as 'worker-error' instead of vanishing into an
  opaque parent-side error event.
- createMainWorker races the readiness signal against a 50 ms grace
  window for these self-reported errors and surfaces the rich version
  when available.
Adds the bb WASM Pippenger paths (single-threaded + multi-threaded) to
the existing in-browser BN254 MSM dev page, alongside the WebGPU MSM
and the noble reference. Lets us sweep log₂(n)=16..20 and read out
median timings for all three implementations on the same SRS-backed
inputs.

- pippenger_wasm.ts: bb.js-style harness that boots the threaded
  barretenberg WASM in a Worker, calls bb_native_pippenger_bn254 via
  the WASM exports table (bypassing the bbapi msgpack path), and
  returns the affine result. Used twice — one worker booted with
  threads=1 for the ST row, one with threads=N for the MT row. Two
  workers are required because bb's parallel_for static pool is sized
  at first call and stuck thereafter; sharing one worker for both
  roles produced a slower MT time than ST.
- main.ts / index.html: sweep + benchmark UI, lazy WASM boot, Stop
  button to tear down workers, ?coi=1 toggle for cross-origin isolation
  (gated because the threaded WASM needs SharedArrayBuffer).
- vite.config.ts: dev server config — middleware that serves the
  freshly-built barretenberg.wasm.gz directly out of
  cpp/build-wasm-threads/bin (no copy needed), and conditional COOP/COEP
  headers. The COI middleware additionally attaches
  Cross-Origin-Embedder-Policy: require-corp to any response whose
  Sec-Fetch-Dest is worker/sharedworker/serviceworker — without this,
  a COI'd document can't spawn workers (the worker response must also
  carry COEP, and worker URLs don't carry the ?coi=1 query suffix the
  document does).
… threads to hwConcurrency

- Sweep result now renders two tables instead of one. A consistency
  table at log₂n = 16 (n = 65,536) cross-checks WebGPU / WASM-ST /
  WASM-MT / Noble pairwise — each cell pass/FAIL independently so a
  regression can be pinpointed to which pair disagrees. A perf table
  for log₂n = 16..20 reports WebGPU and WASM MT medians + ratio;
  WASM ST is omitted from the perf table (strictly slower at these
  sizes and not the production path), and the Noble column also
  drops from the perf table since it only runs at the reference size.
- MT threads input now defaults to navigator.hardwareConcurrency
  (capped at 32) instead of the fixed value of 4. Falls back to 4
  when the browser doesn't expose hardwareConcurrency. The HTML
  input no longer hard-codes value="4" — main.ts writes the
  computed default into the input at startup.
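
A sketch of the default described in the second bullet. `defaultMtThreads` is a hypothetical name; the value is passed in so the logic can be exercised outside a browser.

```typescript
// Hypothetical sketch: cap at 32, fall back to 4 when the value is unavailable.
function defaultMtThreads(
  hardwareConcurrency: number | undefined,
  cap = 32,
  fallback = 4,
): number {
  if (hardwareConcurrency === undefined || !(hardwareConcurrency > 0)) return fallback;
  return Math.min(hardwareConcurrency, cap);
}
```

In a page this would presumably be called with `navigator.hardwareConcurrency` and the result written into the threads input.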
…ches

The dev page measured WebGPU at ~200 ms for 2^16, while the same WGSL
running through tal-webgpu's Bn254Grid measured ~90 ms. The gap was
entry-point selection, not algorithm: the page called the cold
`compute_bn254_msm` (re-acquires device, recompiles every pipeline,
re-uploads SRS, runs Stage-1 Barrett/Montgomery convert, destroys
device) on the timed dispatch.

main.ts now mirrors the tal-webgpu measurement protocol: one persistent
GpuContext for the page lifetime, one CachedBases per logN (regenerated
when the slider changes), one warm-up dispatch, then time
`compute_bn254_msm_batch_affine` across reps. Setup costs are logged
under `[gpu-warm]` so the per-step amortisation is visible. Sweep at
2^16..2^20 now reports WebGPU 1.03x..2.06x faster than WASM MT (14t),
with 2^16 at ~75 ms median.

Two bugs surfaced once the persistent context was held across logN
transitions:

1. msm.ts: `bpr1_key` / `bpr2_key` omitted input_size. The cached
   bind groups bound bucket_sum_*_sb and g_points_*_sb from the first
   logN; SMVP at the next logN wrote into fresh ws_key-keyed buffers
   that BPR never read, and BPR wrote into stale buffers that Horner
   never read. Output was identity (gpu.x=0). Threaded input_size into
   bpr_1 / bpr_2 and appended it to the keys.

2. batch_affine.ts: `ws_key` (governs init/schedule/inverse/apply/
   finalize bind groups) omitted input_size. Those bind groups capture
   `point_x_sb` / `point_y_sb` from CachedBases plus all_csc_*_sb from
   the transpose — both recreated on a logN switch. Cached bind groups
   ended up referencing destroyed buffers; kernels dispatched fast
   (~4.5 ms total) writing nothing. Added input_size to the key.
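
Both bugs have the same shape: a cache key that does not vary with something the cached value depends on. A minimal sketch, with a hypothetical cache class and key helpers standing in for the real bind-group cache:

```typescript
// Hypothetical sketch: strings stand in for bind groups and their buffers.
class BindGroupCacheSketch<V> {
  private readonly entries = new Map<string, V>();

  getOrCreate(key: string, create: () => V): V {
    let v = this.entries.get(key);
    if (v === undefined) {
      v = create();
      this.entries.set(key, v);
    }
    return v;
  }
}

// Key that omits input_size: every logN collides on the same entry.
const staleKey = (stage: string): string => stage;
// Key that includes input_size: each logN gets its own bind group.
const sizedKey = (stage: string, inputSize: number): string => `${stage}:${inputSize}`;
```

With `staleKey`, the second logN silently reuses the first logN's bind group; with `sizedKey`, it builds a new one against the current buffers.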

Also replaced two `Buffer.alloc(0)` placeholders in compute_bn254_msm_cached
and compute_bn254_msm_batch_affine with `new Uint8Array(0) as unknown as
Buffer`. The placeholder is unread on the cached path; the change
removes the dependency on a Node `Buffer` global so the warm entry
points work in a Vite dev server / browser without a polyfill.
Builds on the warm cached-bases path (76d08ed) to surface a
per-pass breakdown of each WebGPU MSM rep.

- Re-export `type ProfileCapture` from src/msm_webgpu/index.ts.
- runWebGpuOnce passes a fresh ProfileCapture out-param to
  compute_bn254_msm_batch_affine on every call and threads the
  populated capture (per-pass GPU timestamps, CPU phase totals, GPU
  readback decomposition) up through BackendSample.
- New renderBreakdownTable aggregates captures across reps (subtasks
  and rounds summed within each rep, then medianised across reps)
  and renders a median-ms breakdown alongside the perf table on
  Sweep, and as the sole result panel on Run × 5 and Quick Sanity
  Check. STAGE_ORDER lays stages out in pipeline order
  (decompose → convert → transpose → ba_* → smvp → bpr → reduce)
  so rows read top-to-bottom along the dataflow.
Adds a WGSL compute shader (decompress_g1_bn254) that recovers BN254
G1 points from compressed CRS bytes via fr_pow sqrt in Montgomery
form, plus a TS runner wired into loadSrsPoints with a JS-bigint
fallback. A small sample cross-check against the JS reference runs
on cold load so a shader regression can't silently corrupt every
cached SRS.

Cold first-load on 2^20 points drops from ~29s of JS bigint sqrts
to ~0.7s of GPU compute; subsequent loads still hit IndexedDB.
…T table

Renders a small ✓ on the WebGPU and WASM MT median-ms cells when the
two backends produced the same MSM result for that size, ✗ on
disagreement. Stays blank until the row's first rep finishes. Makes
"is this comparison even valid" visible at a glance without scrolling
down to the consistency table.
The batch-affine SMVP path caches init/schedule/apply bind groups under
a key that includes input_size, and those bind groups reference
`cachedBases.point_x_sb` / `point_y_sb` directly. When a caller rotated
CachedBases across a logN sweep (16 → 17 → … → 20 → 16 again), the bind
group inserted on the first visit to N=2^16 outlived the buffers it
pointed at. The second visit to the same N hit the cache, dispatched
against destroyed/zero memory, and the MSM returned the identity
point — visible on the dev page as a ✗ on the per-row correctness tick.

`CachedBases.destroy()` now calls `context.invalidateAllBindGroups()`
so the next dispatch rebuilds bind groups against the live buffers.
The existing `invalidateBindGroupsReferencing` private helper is
reframed as a public clear-all method.
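
The revisit failure and the fix can be modelled without a GPU. All names below are hypothetical; `FakeBuffer` stands in for a GPU buffer and the cached value stands in for a bind group that references it.

```typescript
// Hypothetical model of "destroy() invalidates cached bind groups".
interface FakeBuffer { label: string; destroyed: boolean }

class BindGroupCacheWithInvalidate {
  private readonly groups = new Map<string, FakeBuffer>();

  get(key: string, bind: () => FakeBuffer): FakeBuffer {
    let g = this.groups.get(key);
    if (g === undefined) {
      g = bind();
      this.groups.set(key, g);
    }
    return g;
  }

  invalidateAll(): void {
    this.groups.clear();
  }
}

class BasesSketch {
  readonly pointX: FakeBuffer;
  private readonly cache: BindGroupCacheWithInvalidate;

  constructor(label: string, cache: BindGroupCacheWithInvalidate) {
    this.pointX = { label, destroyed: false };
    this.cache = cache;
  }

  destroy(): void {
    this.pointX.destroyed = true;
    // Without this call, a later lookup under the same key returns a
    // bind group that still points at the destroyed buffer.
    this.cache.invalidateAll();
  }
}
```

Clearing everything is coarser than tracking which bind groups reference which buffers, but it cannot miss a stale entry.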
ba_dispatch_args was previously skipped under the assumption it was
sub-millisecond; profiling it confirmed it is, but it also revealed
the bulk of the residual lives elsewhere. Profiler.report() now
derives an `encoder_all` row from min/max of existing stage timestamps
(empty-pass markers are unreliable on Dawn — slots stay at 0). The
dev page surfaces `inter_pass_overhead = encoder_all − Σ stages` and
`post_encoder_tail = wall − encoder_all`, which splits the previous
`untimestamped` row into Dawn barrier cost (small) and pre/post
queue work (dominant — driven by the scalars writeBuffer).
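
The two derived rows follow directly from the stage timestamps and the wall time. A sketch with hypothetical field names, assuming every stage carries a start and end timestamp in milliseconds:

```typescript
// Hypothetical sketch of the derived profiler rows.
interface StageSpan { name: string; startMs: number; endMs: number }

function splitResidual(stages: StageSpan[], wallMs: number) {
  if (stages.length === 0) throw new Error('no timestamped stages');
  const first = Math.min(...stages.map((s) => s.startMs));
  const last = Math.max(...stages.map((s) => s.endMs));
  const encoderAll = last - first; // min/max of existing stage timestamps
  const stageSum = stages.reduce((acc, s) => acc + (s.endMs - s.startMs), 0);
  return {
    encoderAll,
    interPassOverhead: encoderAll - stageSum, // encoder_all − Σ stages
    postEncoderTail: wallMs - encoderAll, // wall − encoder_all
  };
}
```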
…ults

Adds a standalone WebGPU micro-benchmark page comparing three BN254
Montgomery product implementations for chained-mul throughput:

- cios (u32): mitschabaude runtime-loop CIOS over 20×13-bit limbs.
  Baseline, ~109 ms at n=2^20, k=100.
- karat (u32): recursive Karatsuba + Yuval reduction. 9 5×5 schoolbook
  sub-sub-products are computed independently and combined via two
  Karatsuba levels; reduction uses precomputed r_inv = W^-1 mod p with
  zero drains in the multiply phase (unsigned wrap unwinds via
  subsequent subtraction). ~80 ms (~27% faster than cios).
- sos3uv3 (f32, reference): 22-bit f32 limbs with separate per-slot
  tlo/thi accumulators that break the inner-j carry chain. Single
  drain per outer iter via bias_split_f32_le4w. ~79 ms.

The bench harness:
- bench-field-mul.html is a standalone page; reads ?path=u32|f32
  &n=N&k=K&validate-n=N&reps=R&variant=V from the URL.
- bench-field-mul.ts runs k chained Mont mults per thread, validates
  the first `validate-n` outputs against a host BigInt reference, and
  writes timing into window.__bench.
- scripts/bench-field-mul.mjs is a Playwright driver for headless
  invocation from the CLI (added playwright-core as devDependency).
Routes the `montgomery_product_funcs` mustache partial through a
pre-rendered Karatsuba+Yuval body in every MSM shader that does a
base-field multiply (15 callsites: convert_points, smvp, horner,
batch_affine_{apply,schedule,finalize_*,init,apply_scatter},
batch_inverse{,_parallel}, bpr, decompress_g1, montgomery_parity).

The Karatsuba body benches ~27% faster than the mitschabaude
runtime-loop CIOS at n=2^20, k=100 (80 ms vs 109 ms). It exposes the
same `fn montgomery_product(x, y) -> BigInt` symbol plus the same
`get_p` / `conditional_reduce` helpers and uses the same 20×13-bit
limb layout, so the swap is a drop-in change with no callsite churn.

The field-mul bench retains both options (`?variant=cios` renders the
original template inline, `?variant=karat` reuses the class-level
default) so the two bodies can be compared side-by-side.
Phase 1 LANDED — BY safegcd inversion (fr_inv_by_a, Option A: 20×13-bit, BATCH=26, carry-free apply_matrix):
- Production swap-in: wgsl/cuzk/batch_inverse{,_parallel}.template.wgsl call fr_inv_by_a
- 1.5× faster than legacy fr_inv (Pornin K=12) at chained-inverse bench
- ~8% MSM wall reduction at logN=16 sanity check
- TS port (cuzk/bernstein_yang.ts, bernstein_yang_a.ts) + Jest tests (24 passing)
- WGSL impls: wgsl/field/by_inverse{,_a}.template.wgsl + wgsl/bigint/bigint_by.template.wgsl

Phase 2 EXPLORATORY — multi-window pooled batch_inverse + multi-window BPR:
- WPB plumbing in batch_inverse_parallel + dispatch_args + batch_affine.ts
- Default WPB=1 (= legacy behavior, no perf change)
- BPR_WINDOWS_PER_BATCH knob in bpr_bn254.template.wgsl
- Empirical: pooling without growing WG count gives 0% gain — design needs restructure

Standalone bench infrastructure:
- bench-divsteps, bench-apply-matrix, bench-fr-inv, bench-batch-affine
- Each with HTML page + TS dispatcher + Playwright runner under dev/msm-webgpu/scripts/
- profile-sanity.mjs for per-pass GPU time breakdown on the Quick Sanity Check

Tree-reduce design (Stage B) for autonomous remote execution:
- .claude/plans/msm-tree-reduce.md — full design (adaptive batch sizing, analytical slice partition, 2 distinct phase kernels)
- .claude/plans/remote-agent-brief.md — remote agent execution brief

Co-authored with Claude.
Adds a remote-device bench loop for the MSM-webgpu dev pages so the
tree-reduce work can validate against real WebGPU hardware (Apple M2,
Snapdragon 8 Elite, Tensor G4) from a workstation without a local GPU.

- vite.config.ts: results/progress POST endpoints write JSONL to files
  named by MSM_WEBGPU_RESULTS_FILE / MSM_WEBGPU_PROGRESS_FILE; allow
  .trycloudflare.com so the dev server is reachable via Cloudflare
  Quick Tunnel.
- results_post.ts: tiny in-page client used by bench/sanity pages to
  POST progress + final-state payloads (no keepalive — the page is
  alive when the bench completes).
- bench-batch-affine.ts: post per-batch progress and a terminal
  done/error row.
- scripts/run-browserstack.mjs: spawn vite + cloudflared, drive a BS
  worker through the REST API, watchdog-tail the JSONL with
  first-progress / stall / deadline budgets.
- scripts/bs-targets.mjs: macOS Sequoia Chrome, S25 Ultra, Pixel 9
  Pro XL presets (WebGPU stable). iPhone 15 Pro listed but flagged
  as needs-iOS-26-or-newer.

Validated against macOS Sequoia Chrome 148 (Apple M2, hc=8) on
?total=8192&sizes=64,256,1024:
  B=64  ns/pair=305.2  median=2.500ms
  B=256 ns/pair=146.5  median=1.200ms
  B=1024 ns/pair=219.7 median=1.800ms
Implements smvp_tree_partition.ts: the host computes per-WG slice
boundaries by binary search on bucketStart[], no GPU pre-pass. Uses
the analytical identity running_adds(i) = i - bucket_idx(i) from
msm-tree-reduce.md.

Documents a design ambiguity the plan didn't call out: the identity
under-counts when bucketStart contains empty buckets (bucket_idx
jumps faster than the entry count grows). Resolved by requiring
compacted input; compactBucketStart() + assertCompact() do the
one-pass cleanup and a side activeBucketIds[] map carries the
original bucket index for kernels that tag partials.
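
A sketch of the identity and of why compaction is needed. It assumes `bucketStart` is a CSR-style array with one trailing entry equal to the total entry count, and it counts the adds needed to fold entries `0..i` inclusive. The `*Sketch` names are hypothetical and do not reproduce the signatures in smvp_tree_partition.ts.

```typescript
// Hypothetical sketch: not the module's implementation.

/** Drop empty buckets; keep the original bucket index in a side map. */
function compactSketch(bucketStart: number[]): { starts: number[]; activeBucketIds: number[] } {
  const starts: number[] = [];
  const activeBucketIds: number[] = [];
  for (let b = 0; b + 1 < bucketStart.length; b++) {
    if (bucketStart[b + 1] > bucketStart[b]) {
      starts.push(bucketStart[b]);
      activeBucketIds.push(b);
    }
  }
  starts.push(bucketStart[bucketStart.length - 1]); // trailing total
  return { starts, activeBucketIds };
}

/** Largest b with starts[b] <= i, by binary search over the bucket starts. */
function bucketIndexSketch(starts: number[], i: number): number {
  let lo = 0;
  let hi = starts.length - 2; // last real bucket
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (starts[mid] <= i) lo = mid;
    else hi = mid - 1;
  }
  return lo;
}

/** running_adds(i) = i - bucket_idx(i); exact only on compacted input. */
function runningAddsSketch(starts: number[], i: number): number {
  return i - bucketIndexSketch(starts, i);
}
```

For `bucketStart = [0, 3, 3, 5]` (populations 3, 0, 2) the true add count through entry 4 is 3. The compacted form gives 3, while the uncompacted form gives 2 because the empty bucket advances the bucket index without adding an entry.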

Exports:
  - computeTotalAdds, bucketIdx, runningAdds, findAddsBoundary
  - compactBucketStart, assertCompact
  - buildSliceLayout(bucketStart, numWgs) -> SliceLayout
    { sliceStart, outputCount, outputOffset, totalAdds }

24 Jest tests pass — including the pair-detection brute-force walk
that catches the empty-bucket regression, the heavy-bucket-skew case
(7+ of 8 WGs fall inside a single 10k-population bucket), and the
pathological totalAdds < numWgs case.

No GPU code touched.
…validated)

Phase 1 of the tree-reduce SMVP: pair detection + cooperative batch-
affine + per-bucket-tagged write-out, one workgroup per slice.

Files:
- src/msm_webgpu/wgsl/cuzk/smvp_tree_phase1.template.wgsl — the kernel.
  Thread-0 serial pair-detection preamble fills a workgroup-shared
  pair_list (packed PAIR + UNPAIRED entries in slice walk order, which
  is already bucket-sorted so no reorder postlude is needed). Phase
  A/B/C/D batch-affine pattern from bench_batch_affine.template.wgsl,
  with rank-indexed chunks over the PAIR sub-stream so a single
  fr_inv_by_a amortises across the WG. UNPAIRED entries get a final
  cooperative copy pass with sign-flip. Loop bounds all `const`
  (MAX_PAIRS = MAX_SLICE_ENTRIES baked at compile time; v0 uses 128 to
  keep workgroup memory comfortable).

- src/msm_webgpu/cuzk/shader_manager.ts — gen_smvp_tree_phase1_shader
  generator + import wiring.

- dev/msm-webgpu/bench-smvp-tree-phase1.{html,ts} — standalone bench
  page with a CPU reference. The reference walks the slice with the
  same paired/unpaired state machine and computes Mont-form affine
  adds via BigInt mod-inverse; correctness is checked bit-for-bit
  against the GPU output.

Status: structure-complete but NOT yet correctness-validated on
hardware. The BS macOS Chrome 148 run hangs on the page before the
first log call lands (the previous BS run on the same tunnel for
bench-batch-affine worked fine, so the issue is page-specific not
infrastructure). Likely candidates: an early-eval import side effect
in smvp_tree_partition.ts, the buildSynthetic randomBelow loop
generating off the main thread, or a Mont-form-conversion stall.
Worth investigating with browser console access; the BS screenshot
API doesn't surface uncaught errors.

Documents a design decision in the shader header: Phase 1 does NOT
collapse same-bucket pair results sequentially into a single per-
bucket partial inside the slice (the plan's "merge consecutive same-
bucket results into running sum" wording). Sequential merging would
break batch-affine amortisation and would need (pop-1) sequential
adds per heavy bucket. Instead Phase 1 halves per bucket (ceil(p/2)
outputs per bucket per slice), letting the recursive Phase 2 dispatch
do the rest of the reduction in log layers.

The plan's wg_output_count[k] = "buckets touched" formula is
overridden here by the per-slice CPU pair-detection walk that
computes the actual output count.
The window.error / unhandledrejection listeners and skip_gpu URL flag
were added to narrow down a BS-side hang in the phase1 bench page;
they didn't surface the underlying issue and have been removed. Page
remains structurally the same as bench-batch-affine.ts plus the
buildSliceLayout import and the phase1-specific synthetic-data
generation + CPU reference.
Phase 1 of the tree-reduce SMVP now passes correctness on local
Chromium WebGPU (SwiftShader): 20/20 outputs match the CPU reference
bit-for-bit on the small-N smoke (num_wgs=2, slice_entries=16).

Three real bugs found and fixed by getting local WebGPU into the
debug loop (via Playwright + chrome-headless-shell, no GPU on the
dev container so SwiftShader is used):

1. randomBelow consumed only the LOW BYTE of each rng() output. For
   the 32-bit LCG the low 8 bits cycle every 256 outputs, so a 32-byte
   randomBelow draw cycles every 8 calls — fatal when the caller
   builds a Set of distinct values. Fixed to consume the full 32 bits.
   Latent bug in bench-batch-affine.ts too; harmless there because the
   only check is `pxMont !== qxMont` on adjacent calls.

2. WGSL `get_p()` redeclared in smvp_tree_phase1.template.wgsl.
   Already provided by the `montgomery_product_funcs` partial.
   Removed the local definition.

3. Shader needs 10 storage buffers per stage; WebGPU's default cap is
   8. Adapter actually exposes 10+. get_device now requests the
   adapter max for `maxStorageBuffersPerShaderStage` alongside
   `maxComputeWorkgroupStorageSize`.
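
The first bug can be reproduced with any power-of-two-modulus LCG. The sketch below uses the well-known Numerical Recipes constants as an assumption (the bench's actual generator parameters are not shown here), and `fillBytesFullWord` is a hypothetical stand-in for the fixed byte-filling loop.

```typescript
// Hypothetical sketch: a 32-bit LCG (Numerical Recipes constants, assumed).
function makeLcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state;
  };
}

/** Fixed behaviour: consume all four bytes of each 32-bit output. */
function fillBytesFullWord(rng: () => number, out: Uint8Array): void {
  for (let i = 0; i < out.length; i += 4) {
    const word = rng();
    for (let j = 0; j < 4 && i + j < out.length; j++) {
      out[i + j] = (word >>> (8 * j)) & 0xff;
    }
  }
}

/** Buggy behaviour: keep only the low byte of each output. */
function fillBytesLowByte(rng: () => number, out: Uint8Array): void {
  for (let i = 0; i < out.length; i++) out[i] = rng() & 0xff;
}
```

With the low-byte variant, a 32-byte draw consumes 32 outputs, so draws 256 / 32 = 8 apart are byte-identical.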

CPU reference rewritten to do all arithmetic in canonical (non-Mont)
form, then convert back to Mont for the diff against GPU output. The
prior Mont-form-in-place pass got the inverse semantics wrong:
fr_inv_by_a(dx_mont) returns inv_dx_canon * R^2 (a "double Mont"
form, picked because the subsequent montgomery_product strips one R
factor to give Mont-form slope), not inv_dx_canon * R as the original
reference assumed.
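
A toy-field check of that convention, assuming `inv_dx_canon` denotes the plain modular inverse of the stored integer `dx_mont`. The modulus 97 and R = 128 are illustrative only and unrelated to BN254; all helper names are hypothetical.

```typescript
// Hypothetical toy model: p = 97, R = 2^7. Not the shader's arithmetic.
const TOY_P = 97;
const TOY_R = 128 % TOY_P;

function toyModPow(base: number, exp: number, mod: number): number {
  let result = 1;
  let b = base % mod;
  let e = exp;
  while (e > 0) {
    if (e & 1) result = (result * b) % mod;
    b = (b * b) % mod;
    e >>= 1;
  }
  return result;
}

const toyInv = (a: number): number => toyModPow(a, TOY_P - 2, TOY_P); // Fermat inverse
const TOY_R_INV = toyInv(TOY_R);
const toyToMont = (a: number): number => (a * TOY_R) % TOY_P;
// Montgomery product: a * b * R^-1 mod p.
const toyMontMul = (a: number, b: number): number => (((a * b) % TOY_P) * TOY_R_INV) % TOY_P;
// Assumed convention: plain inverse of the stored integer, times R^2.
const toyInvTimesR2 = (xMont: number): number => (((toyInv(xMont) * TOY_R) % TOY_P) * TOY_R) % TOY_P;
// The convention the original reference assumed: plain inverse times R.
const toyInvTimesR = (xMont: number): number => (toyInv(xMont) * TOY_R) % TOY_P;

const dx = 5;
const dy = 7;
const expectedSlopeMont = toyToMont((dy * toyInv(dx)) % TOY_P);
const slopeFromR2 = toyMontMul(toyToMont(dy), toyInvTimesR2(toyToMont(dx)));
const slopeFromR = toyMontMul(toyToMont(dy), toyInvTimesR(toyToMont(dx)));
```

Under this reading, `slopeFromR2` lands in Mont form, while `slopeFromR` comes out as the canonical slope with the R factor missing.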

GPU bench wall-time: ~6.5ms for 32 entries / 20 outputs / 1 dispatch
on SwiftShader CPU-emulated WebGPU. Not a benchmark number — real
silicon will be 100× faster.
Phase 2 of the tree-reduce SMVP: recursive halving over partials.

Structurally identical to Phase 1 (same pair-detection state machine,
same Phase A/B/C/D batch-affine, same per-WG output write-out) but
takes `(bucket_id, AffinePoint)` tuples directly rather than
`(sign_bit | scalar_idx)` from the raw schedule + a separate
entry_bucket_id table. One less indirection, no sign flip.

Output schema matches Phase 1 so the recursion can rebind the same
buffers and just swap the input/output roles each layer.

Correctness gate: 19/19 outputs match CPU reference bit-for-bit on
the small smoke (num_wgs=2, slice_entries=16) on local SwiftShader.
GPU bench wall: 5.4ms (CPU-emulated WebGPU; M2 would be ~10× faster
based on Phase 1 readback).

Done definition for this step met.
…artial)

Drives Phase 1 → CPU sort → Phase 2 → CPU sort → Phase 2 → ... until
every bucket has one partial. CPU-side resort between phases (Step 4
is deferred to GPU follow-up — choice documented in module header).

Standalone bench-smvp-tree.{html,ts} compares the final per-bucket
partials against a CPU reference that computes the full sequential
sum per bucket directly.

Status:
  - Phase 1 alone: 1/1 buckets match (entries=2)
  - Phase 1 + 1× Phase 2 with mixed pair_result+unpaired input
    (entries=3): 1/1 buckets match
  - Phase 1 + 1× Phase 2 with two pair_result inputs (entries=4):
    1/1 MISMATCH

Repro: load `bench-smvp-tree.html?entries=4&buckets=1&seed=42` on
local SwiftShader Chromium. CPU reference matches the sequential-add
of 4 canonical points; orchestrator's Phase 2 output disagrees.
Phase 2 standalone test (against synthetic Mont-form pair-like
inputs) passes 19/19, so the bug must live in the boundary between
Phase 1's output buffers and Phase 2's input expectations — likely
a Mont-form / BigInt-stride mismatch that the standalone Phase 2
test wasn't hitting because its inputs are generated as fresh random
Mont values rather than the output of a previous batch-affine.

Next step in this debug path: instrument the orchestrator to print
the Phase 1 readback values and diff each (P_2k + P_2k+1) against
its corresponding CPU pair-add for entries=4. That narrows whether
Phase 1's emitted bytes are wrong vs. whether Phase 2 misreads them.

Step 6 (production swap) is unblocked from a structural standpoint
— if the Phase 1/2 chain is fed by the existing transpose +
bucket_start, the same bug surfaces and gives a concrete failing
Quick Sanity Check to triangulate with.
…5 validated)

The previous reference summed each bucket's points sequentially:
  ((P0+P1)+P2)+P3+...
which only matches the GPU's tree-reduce parenthesization
  (P0+P1)+(P2+P3)+...
when the inputs are on the EC group. The synthetic bench uses random
off-curve bigints (we test the algebraic affine-add formula, not the
group law), so the two orderings produce different bytes.
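
The two orderings can be made visible with a generic fold. This is a simplified sketch: it pairs adjacent items and carries an odd trailing item forward, which is not necessarily identical to the GPU's pair-detection state machine. Both function names are hypothetical and assume a non-empty input.

```typescript
// Hypothetical sketch of the two parenthesizations.
function sequentialFold<T>(items: T[], add: (a: T, b: T) => T): T {
  return items.reduce((acc, p) => add(acc, p));
}

function pairwiseTreeFold<T>(items: T[], add: (a: T, b: T) => T): T {
  let layer = items.slice();
  while (layer.length > 1) {
    const next: T[] = [];
    for (let i = 0; i + 1 < layer.length; i += 2) {
      next.push(add(layer[i], layer[i + 1]));
    }
    if (layer.length % 2 === 1) next.push(layer[layer.length - 1]); // odd one carries over
    layer = next;
  }
  return layer[0];
}

// Using string concatenation as `add` exposes the grouping.
const show = (a: string, b: string): string => `(${a}+${b})`;
const seqShape = sequentialFold(['P0', 'P1', 'P2', 'P3'], show);
const treeShape = pairwiseTreeFold(['P0', 'P1', 'P2', 'P3'], show);
```

For an associative `add` the two folds agree; for a non-associative one, such as the affine-add formula on off-curve inputs, they need not.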

Fixed by walking each bucket via the same pair-detection state
machine the GPU uses, recursing layer-by-layer until one partial
remains. Bench passes 5/5 buckets bit-for-bit on local SwiftShader
(entries=40, buckets=5, seed=99) — including bucket=4 which has
pop=9 and recurses through 4 layers.

This validates the full Phase 1 → CPU sort → Phase 2 → CPU sort →
... chain. Step 5 correctness gate met.
The tree-reduce orchestrator (cuzk/smvp_tree.ts) is correctness-validated
standalone but not yet integrated into the production MSM pipeline.
This marker documents the integration checklist at the swap site so a
follow-up session can wire it in without re-discovering the contract.
Raising the limit from 128 to 256 validated correctly at 200 entries /
12 buckets on local SwiftShader (5 layers, 0 mismatches, 140 ms wall), but
BS macOS Chrome 148 fails to compile the resulting shader within the worker's
initial-load window: either maxComputeWorkgroupStorageSize is exceeded
or the static-bound pair_list loops blow out the WGSL compile budget.

Keeping 128 for the validated path (5/5 buckets bit-for-bit on M2
at entries=40). Scaling further is a follow-up that needs pair_list
hoisted to global memory + per-WG pair_count uniform sized for the
runtime count instead of MAX_PAIRS-bounded loop iterations.
…SWEET_B=1024

Phase 1/2 shaders rearchitected for thread utilization at the plan's
target SWEET_B=1024 batch-affine size. v1's two main flaws:

1. Per-thread O(MAX_PAIRS) scans for rank → raw_slot lookup AND
   backward search for prev PAIR's raw_slot in Phase D. At
   MAX_PAIRS=1024 that's 1024 idle iterations per thread per phase.

2. `pair_bucket` in workgroup memory inflated per-WG storage past the
   32 KiB cap, forcing MAX_SLICE_ENTRIES=128 and 8× more WGs than the
   plan called for.

v2 fixes both. Thread-0 preamble builds 4 workgroup-shared arrays in
ONE sequential pass:
- pair_idx_a, pair_idx_b: per-raw-slot (PAIR or UNPAIRED) input entry indices
- prev_raw_for_pair: per-raw-slot pointer to immediate prior PAIR's
  raw_slot (O(1) lookup in Phase D, no backward scan)
- rank_to_raw: per-PAIR-rank pointer to raw_slot (O(1) Phase A/D
  iteration over PER_THREAD_PAIRS, not MAX_PAIRS)
pair_bucket writes go straight to global `output_bucket_id` from the
preamble — never in workgroup memory.

Workgroup memory at MAX_PAIRS=1024 / TPB=64:
  4 × 4 KB (pair arrays) + 2 × 5.12 KB (wg_fwd/bwd) + ~80 B = 26.4 KB
fits in M2's 32 KiB cap.

Phase A/D inner loops now iterate exactly PER_THREAD_PAIRS = 16
times each (down from MAX_PAIRS = 1024 in v1). 64× fewer idle
iterations per thread per phase.

Validation on local SwiftShader (Chromium headless, no GPU on dev
container):
- Phase 1 standalone at 4096 entries / 8 WGs × 512 entries: 2057
  outputs, 0 mismatches, 6.5 ms median.
- Orchestrator at 2048 entries / 64 buckets: 64/64 buckets match
  full-reduce CPU reference bit-for-bit. 3 layers, 18.8 ms total GPU
  wall (10.0 + 5.5 + 3.3 across phase1 + phase2 layer2 + layer3).

Apple M2 should be ~10× faster (SwiftShader is CPU-emulated WebGPU).
Pending BS validation.
…y bucket-sorted

First-principles observation: Phase 1 / Phase 2 outputs are ALREADY
globally bucket-sorted. Input entry_bucket_id is monotone non-
decreasing (CSR layout); each WG walks its non-overlapping
contiguous slice left-to-right emitting in walk order; WG outputs
concatenated preserve monotonicity. No sort needed.

Removes the readback-of-points + JS sort + upload between every
phase. Saves O(N × NUM_LIMBS_U32 × 4) bytes of bus traffic + the
O(N log N) JS sort per layer × log layers.

Still does a small (4 B / partial) bucket-id readback to compute
per-WG pair-count + output offsets host-side. Asserts global sort
on the readback as a debug guard — cheap and catches partition
regressions.

Termination changed from "no more pair-adds possible" (required full
bucket-id scan) to "count equals input num_active_buckets" (known
from initial input). One bucket-id readback per phase, point data
never moves between phases.
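
The debug guard and the new termination test are both cheap host-side checks. A sketch with hypothetical names, assuming the readback is an array of bucket ids in output order:

```typescript
// Hypothetical sketch of the host-side checks on the bucket-id readback.
function assertBucketSorted(bucketIds: ArrayLike<number>): void {
  for (let i = 1; i < bucketIds.length; i++) {
    if (bucketIds[i] < bucketIds[i - 1]) {
      throw new Error(`bucket ids out of order at index ${i}`);
    }
  }
}

/** Done when exactly one partial remains per active bucket. */
function isFullyReduced(partialCount: number, numActiveBuckets: number): boolean {
  return partialCount === numActiveBuckets;
}
```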

Bench at 8192 entries / 256 buckets / 5 layers on local SwiftShader:
- 256/256 buckets match full-reduce CPU reference bit-for-bit
- GPU wall: 21.9 + 9.9 + 8.7 + 8.8 + 5.5 = 54.8 ms total

For comparison the prior CPU-sort version at 2048 entries / 64
buckets / 3 layers was 140 ms total. 4× scale, 0.4× time — ~10×
speedup from this change plus the v2 thread-utilization fix.

Bench entry cap raised from 512 → 2^18 (1 << 18) and bucket cap
from 64 → 2^14 so we can run real production-scale workloads.
…to finalize pipeline

Two small kernels that turn the orchestrator's sparse
(bucket_id, AffinePoint) outputs into the dense
(running_x, running_y, bucket_active) arrays the existing
finalize_collect → finalize_inverse → finalize_apply pipeline
expects. With these in place the production swap in msm.ts is
mechanical: replace the round-loop dispatch with
runTreeReduce + scatter_init + scatter, and re-use the finalize
chain unchanged for the affine→Jacobian + magnitude-bucket fold.

scatter_init: one thread per bucket slot, zeros running_x/y +
bucket_active across the full T*num_columns dense layout.

scatter: one thread per orchestrator output, writes
running_x[bucket_id]=P.x, running_y[bucket_id]=P.y,
bucket_active[bucket_id]=1.

Both kernels are trivially parallel (no atomics, no synchronisation
beyond the bucket_active write which is the only output ever
written by any thread for that bucket_id since the orchestrator's
output is unique-per-bucket).
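
A CPU model of the two kernels, with one number per coordinate standing in for the multi-limb field elements. The interfaces and function names are hypothetical; the duplicate check is a host-side guard added for illustration and is not claimed to exist in the shader.

```typescript
// Hypothetical CPU model of scatter_init + scatter.
interface TreeOutput { bucketId: number; x: number; y: number }
interface DenseWorkspace { runningX: number[]; runningY: number[]; bucketActive: number[] }

/** One "thread" per bucket slot: zero everything. */
function scatterInitSketch(numSlots: number): DenseWorkspace {
  return {
    runningX: new Array<number>(numSlots).fill(0),
    runningY: new Array<number>(numSlots).fill(0),
    bucketActive: new Array<number>(numSlots).fill(0),
  };
}

/** One "thread" per orchestrator output: write its bucket's slot. */
function scatterSketch(ws: DenseWorkspace, outputs: TreeOutput[]): void {
  for (const o of outputs) {
    // Outputs are assumed unique per bucket, so no two writers share a slot.
    if (ws.bucketActive[o.bucketId] === 1) throw new Error(`duplicate bucket ${o.bucketId}`);
    ws.runningX[o.bucketId] = o.x;
    ws.runningY[o.bucketId] = o.y;
    ws.bucketActive[o.bucketId] = 1;
  }
}
```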
…alize pipeline

`smvp_batch_affine_gpu_tree` is the production adapter that:
  1. Reads CSR row pointers from `all_csc_col_ptr_sb`, computes
     per-entry bucket id, uploads.
  2. Runs the v2 tree-reduce orchestrator (`runTreeReduce`).
  3. Inits the dense workspace (`running_x/y_sb`, `bucket_active_sb`)
     via `scatter_init` (one thread per bucket slot).
  4. Scatters the tree-reduce output (sparse, one per active bucket)
     into the dense workspace via `scatter` (one thread per output).
  5. Returns. Caller continues with the existing `finalize_collect` →
     `finalize_inverse` → `finalize_apply` chain unchanged for the
     affine→Jacobian conversion and the magnitude-bucket fold.

`buildTreeAdapterPipelines` compiles all four pipelines (phase1,
phase2, scatter, scatter_init) once per (num_words, max_slice_entries)
shape; cache the handle for the warm bench loop.

ShaderManager wiring for `gen_smvp_tree_scatter_shader` +
`gen_smvp_tree_scatter_init_shader` added alongside the existing
phase1/phase2 generators.

The actual msm.ts call-site swap is one more edit: replace the
current `smvp_batch_affine_gpu(...)` call with two calls — first
`smvp_batch_affine_gpu_tree(...)` to populate running_x/y +
bucket_active via tree-reduce, then the existing finalize chain.
That swap is mechanical now that the adapter is in place; pending
the Quick Sanity Check correctness gate.
Validates the tree-reduce's main perf claim from the plan: a heavily
skewed input (one bucket with pop = entries/2, the rest uniform) is
handled in O(log heavy_pop) layers regardless of skew.

Measured on Apple M2 via BS at entries=65536 / buckets=512 / skew=heavy
(heavy bucket pop = 32 832):
  layers: 16
  total GPU wall: 34.6 ms

For comparison the same input at skew=uniform (max pop ~256):
  layers: 6
  total GPU wall: 24.3 ms

Heavy skew → only 1.4× more time despite a bucket that the current
round-loop MSM would need ~32 832 sequential rounds to reduce. The
plan's "5–10× faster on heavy-bucket workloads" claim looks
conservative.

Bench page now accepts `?skew=heavy` and abbreviates the pops log
for runs with > 16 buckets.
zac-williamson and others added 28 commits May 21, 2026 12:54
bench-c-sweep fed MsmV2 off-curve random points and raw 256-bit scalars:
- run()'s host window-combine is real elliptic-curve arithmetic, so
  off-curve points drove it to a non-invertible Z.
- MsmV2 gives the GPU `s·R mod p` (de-Montgomery'd to `s mod p`) but the
  host planner Booth-decodes the raw scalar; a scalar >= p decodes
  differently on the two sides and corrupts the level plan.

Load real on-curve SRS points and reduce every scalar mod p. run-bench.mjs
forces HTTP/1.1 so headless Chrome can fetch the SRS.
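
The scalar fix amounts to one reduction before the scalar reaches either the GPU or the host planner. A minimal sketch, assuming scalars arrive as little-endian byte arrays and that "p" is the BN254 scalar-field order; the helper names are hypothetical:

```typescript
// Assumption: "p" is the BN254 scalar-field order. The modulus is a
// parameter so the sketch does not depend on that reading.
const BN254_FR_MODULUS =
  21888242871839275222246405745257275088548364400416034343698204186575808495617n;

// Decode a little-endian byte array into a bigint.
function bytesToBigIntLE(bytes: Uint8Array): bigint {
  let v = 0n;
  for (let i = bytes.length - 1; i >= 0; i--) {
    v = (v << 8n) | BigInt(bytes[i]);
  }
  return v;
}

// Reduce a raw scalar so the GPU side and the host planner see the same value.
function reduceScalar(raw: Uint8Array, modulus: bigint = BN254_FR_MODULUS): bigint {
  return bytesToBigIntLE(raw) % modulus;
}
```

With every scalar already below the modulus, the Booth decode on the host and the de-Montgomery'd value on the GPU refer to the same integer.
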
bench-c-sweep measured the fastest Pippenger window size c per MSM size;
MsmV2.pickC now uses that table (logN 10..20). The index.html GPU-vs-WASM
sweep extends down to 2^10 (was 2^16) — the v2 pipeline has no size floor,
and the small sizes show where the GPU overtakes the WASM Pippenger.
Phase 1 of the optimisation-knob plan: make MsmV2's frozen pipeline
constants tunable so the benchmark suite can A/B them, and add per-pass
GPU profiling to see where a knob lands.

- MsmV2.create takes an MsmConfig { c, s, wgi, reduceWg, l0Log,
  invVariant, profile, jacobianCrossover }; every field defaults to the
  current constant, so an unset config reproduces today's behaviour.
- run() optionally records per-pass GPU timestamps (one query set reused
  across runs) and returns a {demont..redFused, wall} breakdown.
- bench-msm-v2.{html,ts}: an A/B harness — drives MsmV2 at one config,
  reports median wall + the per-pass breakdown. Open twice with one URL
  param changed to compare.
- bench-c-sweep and index.html (main.ts) forward the knob URL params, so
  index.html?...&s=4 cross-checks a knob against the WASM Pippenger.

jacobianCrossover is accepted but inert — the Phase-2 hook.
bench-msm-v2 sweeps (2^7..2^16, this GPU) gave clear per-size optima;
adopt them as the defaults so an unconfigured MsmV2 is the tuned config.

- pickC extended down to 2^7 (7->4, 8->4, 9->5) — tiny n wants a small c
  to shrink the bucket-reduction floor.
- pickS(n): fused chunk size — 2 for n<=2^11, 4 for 2^12-2^13, 8 above.
  Small n is occupancy-starved and wants more, smaller chunks.
- pickReduceWg(c): reduction workgroup size — 32 for c<=9, 64 for c<=12,
  128 for c>=13; tracks the reduction stride / GPU subgroup width.
- Default wgi 64->128 and invVariant 'a'->'loop' (>= old everywhere).
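
The two new pickers are simple threshold functions. A sketch using only the thresholds listed above; how sizes between the listed powers of two are bucketed is an assumption, and the bodies are not copied from MsmV2:

```typescript
// Fused chunk size by MSM size n (thresholds from the list above).
function pickS(n: number): number {
  if (n <= 1 << 11) return 2; // small n: more, smaller chunks
  if (n <= 1 << 13) return 4;
  return 8;
}

// Reduction workgroup size by Pippenger window size c.
function pickReduceWg(c: number): number {
  if (c <= 9) return 32;
  if (c <= 12) return 64;
  return 128;
}
```
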

Net: ~12-30% faster at 2^10-2^14, neutral at 2^16 (already saturated).
Cross-check + noble still agree at every size; bench-msm-v2-check now
shares MsmV2's pickC instead of a stale local copy.
fr_sub was three limb-passes — a+p, then -b, then fr_reduce. Rewrite it
as bigint_sub(a,b) plus a borrow-masked +p correction folded into one
add pass: branch-free, two passes. Verified bit-identical against the
GPU-vs-noble cross-check; -4.6% wall at 2^16 (28.2 -> 26.9 ms).
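
The two-pass shape can be sketched on the CPU as follows. This assumes little-endian 13-bit limbs and inputs already below p; the function name and signature are illustrative, not the WGSL source:

```typescript
const LIMB_BITS = 13;
const LIMB_MASK = (1 << LIMB_BITS) - 1;

// r = (a - b) mod p, for a, b < p, on little-endian 13-bit limbs.
function frSub(a: Uint32Array, b: Uint32Array, p: Uint32Array): Uint32Array {
  const n = a.length;
  const diff = new Uint32Array(n);

  // Pass 1: plain multi-limb subtract, tracking the borrow.
  let borrow = 0;
  for (let i = 0; i < n; i++) {
    const d = a[i] - b[i] - borrow; // in [-2^13, 2^13)
    borrow = (d >> 31) & 1; // 1 when d is negative, no branch
    diff[i] = d & LIMB_MASK; // wraps a negative limb by +2^13
  }

  // Pass 2: add p masked by the final borrow (adds 0 when a >= b).
  const mask = -borrow & LIMB_MASK;
  const out = new Uint32Array(n);
  let carry = 0;
  for (let i = 0; i < n; i++) {
    const s = diff[i] + (p[i] & mask) + carry;
    carry = s >>> LIMB_BITS;
    out[i] = s & LIMB_MASK;
  }
  return out; // the final carry cancels the wrap from pass 1
}
```

For example, with p = 10007 (limbs `[1815, 1]`), `frSub([5, 0], [7, 0], p)` yields `[1813, 1]`, which is 10005.
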

Fold lambda^2 - x1 - x2 to lambda^2 - (x1 + x2) in the affine point-add
across ba_fused_super, ba_reduce_fused and ba_window_combine — one
fr_add is a limb-pass cheaper than the fr_sub it replaces.

Add ba_window_combine.template.wgsl: an on-GPU single-workgroup Jacobian
tree fold that collapses the per-window sums into one affine point. Not
yet wired into MsmV2 — the host window-combine still runs.
'active' is a WGSL reserved keyword; the tree-fold flag is renamed to
is_active so the shader compiles. The shader is not yet wired into
MsmV2 — wiring waits on a cooperative-arithmetic rewrite.
The on-GPU window-combine is a maximally-serial doubling chain; a
single-workgroup kernel runs it ~50x slower than the host Horner fold
(one GPU lane loses to a CPU core on serial work). The host
window-combine stays, so the dead shader is removed.

Three register-pressure cuts in the canonical MsmV2 hot kernels, targeting
small-register-file GPUs (Adreno/Mali) where the kernels spill to DRAM and
collapse occupancy:

- montgomery_product: the modulus p is now individual P_LIMB_* constants
  rather than a function-wide `var p` live across the whole multiply
  (follows the existing R_INV precedent). conditional_reduce materialises
  its own p after the register peak.
- ba_fused_super: the backward peel is split into an inverse pass and an
  affine-add pass, so the running batched inverse is no longer a
  register-resident value live across the affine add.
- byl_divsteps: rewritten branchless (select over the swap / add / shift
  cases) to remove data-dependent divergence on wide-wave GPUs.
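
To illustrate the select structure only, here is one Bernstein-Yang divstep on small int32 values. The real kernel works on 13-bit limb arrays and batches steps into a transition matrix; everything below (names, scalar types, the gcd driver) is an assumption for illustration:

```typescript
// Branch-free select: returns a when cond is 1, b when cond is 0.
function select(cond: number, a: number, b: number): number {
  const mask = -cond; // 0 or all ones as int32
  return (a & mask) | (b & ~mask);
}

// One divstep on (delta, f, g); f must be odd.
//   swap case (delta > 0 and g odd): (1 - delta, g, (g - f) / 2)
//   otherwise:                       (1 + delta, f, (g + (g & 1) * f) / 2)
function divstep(delta: number, f: number, g: number): [number, number, number] {
  const gOdd = g & 1;
  const deltaPos = -delta >>> 31; // 1 when delta > 0
  const swap = gOdd & deltaPos;
  const newDelta = 1 + select(swap, -delta, delta);
  const newF = select(swap, g, f);
  const addend = select(swap, -f, f & -gOdd); // -f, +f, or 0
  const newG = (g + addend) >> 1; // g + addend is always even
  return [newDelta, newF, newG];
}

// Driver: after enough steps g reaches 0 and |f| is gcd(f, g).
function gcdOdd(f: number, g: number, steps = 64): number {
  let delta = 1;
  for (let i = 0; i < steps; i++) {
    [delta, f, g] = divstep(delta, f, g);
  }
  return Math.abs(f);
}
```

Every lane executes the same instruction sequence regardless of the values of delta and g, which is the property the rewrite is after.
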

Bit-for-bit identical results, verified against noble at n=4096 and
n=65536. On M4 a net 10% speedup at n=2^16 (27.4 -> 24.7 ms): the lower
register footprint also lifts M4 occupancy.
fr_inv_by_loop_pk stores the safegcd working set (f,g,d,e,p) as two
13-bit limbs per u32 word — 10 words instead of 20 — halving the
inverse's per-thread private-memory footprint and apply_matrix memory
traffic. Same Bernstein-Yang algorithm and i32/13-bit arithmetic; only
the storage is packed. The smaller footprint keeps more inverse threads
resident, lifting occupancy on small-register-file GPUs (Adreno).
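
The storage change can be pictured as a pack/unpack pair. In this sketch the high limb sits at bit 16 and limbs are treated as unsigned; both are assumptions, since the commit does not state the bit layout:

```typescript
const LIMB_BITS = 13;
const LIMB_MASK = (1 << LIMB_BITS) - 1;
const HIGH_SHIFT = 16; // assumed position of the second limb in each word

// 20 limbs -> 10 words: two 13-bit limbs per u32.
function packLimbs(limbs: Uint32Array): Uint32Array {
  const words = new Uint32Array(limbs.length >> 1);
  for (let i = 0; i < words.length; i++) {
    const lo = limbs[2 * i] & LIMB_MASK;
    const hi = limbs[2 * i + 1] & LIMB_MASK;
    words[i] = lo | (hi << HIGH_SHIFT);
  }
  return words;
}

// 10 words -> 20 limbs: the shifts paid at each use site.
function unpackLimbs(words: Uint32Array): Uint32Array {
  const limbs = new Uint32Array(words.length * 2);
  for (let i = 0; i < words.length; i++) {
    limbs[2 * i] = words[i] & LIMB_MASK;
    limbs[2 * i + 1] = (words[i] >>> HIGH_SHIFT) & LIMB_MASK;
  }
  return limbs;
}
```
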

Wired as the `pk` inverse variant in the ba_fused_super and
ba_reduce_fused shader generators; MsmV2 defaults to it. Byte-exact vs
noble at n=4096 and n=65536. On M4 (not register-bound) the pack/unpack
shifts cost ~1 ms — pk's payoff is on the Adreno.
Make the 2x13-packed safegcd inverse (fr_inv_by_loop_pk) the default at
every entry point, not just MsmV2's DEFAULT_INV_VARIANT:

- gen_ba_fused_super_bench_shader / gen_ba_reduce_fused_bench_shader
  function-default 'a' -> 'pk' (covers any caller that omits the arg,
  e.g. bench-msm-tree-v3)
- bench-msm-tree-v2's INV_VARIANT now defaults to 'pk'
- the ?inv= parsers in main.ts, bench-msm-v2 and bench-c-sweep accept
  a|loop|pk; with no ?inv they fall through to the 'pk' default

bench-msm-tree-v2 ?validate=1 host-replay passes byte-exact with pk.
The Karatsuba+Yuval montmul was emitted flat: all 9 schoolbook 5x5
sub-products (81 output limbs), then all 3 inner combines, then the
outer combine. Every schoolbook output is a function-scope value
consumed only later, so all 81 sit live at once — a ~120-u32 register
peak that spills on small-register-file GPUs (Adreno).

Re-emit grouped by half-product: each of P_lo / P_hi / P_cr gets one
scoped block that computes its 3 schoolbooks, runs the inner+outer
Karatsuba combine, and folds the result straight into the 40-limb
accumulator t. Only one group's 27 schoolbook outputs are live at a
time. Identical arithmetic — same 225 multiplies, same combine adds,
the p_lo/p_hi/p_cr common-subexpression sharing preserved — purely a
tighter live-range schedule.

montmul-internal register peak ~120 -> ~80. Byte-exact vs noble at
n=4096 and n=65536. On M4 the lower footprint also lifts occupancy:
n=2^16 wall 26.7 -> 24.4 ms.
…er 2)

ba_fused_super now carries field elements in the canonical 8x u32
packed form — which is also the storage form — instead of the
20x13-limb BigInt. The affine add's live values cost 8 registers each
instead of 20; loads/stores are plain 8-word copies (the unpack256 /
pack256 at every load/store boundary is gone).

- fr_add / fr_sub run natively on 8x u32 (8-word modular add/sub; the
  carry out of each word is `u32(sum < operand)` — one compare, no
  branch). An unpack-repack variant is kept behind the `addsub` knob.
- montgomery_product_f8 wraps the grouped Karatsuba: expand both
  operands 8 -> 20x13, multiply, contract 20x13 -> 8.
- get_r is now a constant derived from get_r_f8 (no var-builder).
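
The compare-based carry can be checked on the CPU. A sketch of the raw 8-word add only (the modular correction against p is omitted); the function name is hypothetical:

```typescript
// 256-bit add on 8 little-endian u32 words. The carry out of each word is
// detected with a compare, mirroring `u32(sum < operand)` in the shader.
function add256(a: Uint32Array, b: Uint32Array): { sum: Uint32Array; carry: number } {
  const sum = new Uint32Array(8);
  let carry = 0;
  for (let i = 0; i < 8; i++) {
    const s1 = (a[i] + b[i]) >>> 0; // wraps mod 2^32
    const c1 = Number(s1 < a[i]); // wrapped iff the sum fell below an operand
    const s2 = (s1 + carry) >>> 0; // fold in the carry from the previous word
    const c2 = Number(s2 < s1);
    sum[i] = s2;
    carry = c1 | c2; // at most one of the two can be set
  }
  return { sum, carry };
}
```
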

Byte-exact vs noble at n=4096 and n=65536. A/B (n=2^16, fused pass):
native fr 10.84 ms vs unpack-repack 11.80 ms — native ~9% faster, so it
stays the default. On M4 the fused pass drops 14.30 -> 10.84 ms (cheaper
8-word fr ops + free loads/stores + the lower footprint lifts occupancy);
wall 24.4 -> 21.0 ms. On the Adreno the ~2.5x smaller live state is the
register-spill fix.
Extend Lever 2 (8x u32 live field representation) from the accumulate
kernel to ba_reduce_fused, the 4-phase recursive bucket reduction.
Field elements were 20x13-bit BigInts — 20 registers each, and with
~13 live values across the affine-add peel that kernel was the
reduction's register-pressure hotspot.

ba_reduce_fused now carries field elements as array<u32,8> (the
canonical 256-bit packed form, which is also the red_buf storage form):
8 registers each, and load_x/load_y/store_x/store_y become plain
8-word copies with no unpack256/pack256 at the boundary.

The 8x u32 field ops (montgomery_product_f8, native fr_add_f8/fr_sub_f8,
get_r_f8, is_zero_f8) move out of ba_fused_super's inline block into a
shared partial, wgsl/field/field8.template.wgsl, included by both hot
kernels via the field8_funcs partial. shader_manager gains f8Context()
for the mustache context and threads the addsub knob through the
reduce-fused generator.

M4 n=2^16: redFused 7.5 -> 6.2 ms, wall 21.0 -> 19.6 ms. Byte-exact vs
noble at n=4096 and n=65536.
Add a `reduceVariant` knob to MsmConfig — 'fused' (default, unchanged)
or 'unfused'. The fused ba_reduce_fused runs the whole 4-phase
reduction in one dispatch with a level loop and storageBarrier between
levels; on a register-starved GPU that monolith is the reduction's
spill hotspot and its single unbounded dispatch risks the watchdog.

The unfused variant is ba_reduce_level_bench.template.wgsl: one
schedule level per dispatch, with `kind` (0 = phase-A suffix add,
1 = phase-B/D tree-add, 2 = phase-C double) baked in as a compile-time
constant. Three kind-specialized pipelines are compiled; each
const-folds away the other two kinds' branches, so each carries only
its own path — a smaller per-shader register footprint than the
kind-0/1/2 monolith. WebGPU's between-pass ordering replaces the fused
storageBarrier; every dispatch is watchdog-bounded.

Default stays 'fused', so the M4 path is untouched by construction;
'unfused' is the variant for the S25/Adreno, selectable via
?reducevariant=unfused on the bench pages.

Also wires bench-msm-v2 into the BrowserStack runner: the page now
POSTs progress/results JSONL, run-browserstack.mjs gains the
bench-msm-v2 page mapping and an --n flag, and bench-msm-v2-check
accepts the algorithm knobs so a swept config can be validated.

M4 n=2^16: fused (default) 19.55 ms unchanged; unfused 19.70 ms
(+0.8%, the ~43-dispatch overhead). Both byte-exact vs noble at
n=4096 and n=65536.
At deep reduction levels ppw is smaller than the workgroup size, so a
thread whose entire candidate range is past ppw does no useful work:
every j2 >= ppw, no red_buf / is_present writes, and pref_scratch is
private per thread. It still ran the forward montmuls and — the
expensive part — the unconditional safegcd inversion.

ba_reduce_level returns those threads before the inversion. The unfused
per-level kernel can do this; the fused kernel cannot (every thread
must reach the storageBarrier). Idle lanes then issue no instructions
and, on a register-starved GPU, generate no register-spill traffic.

M4 n=2^16, unfused: redFused 6.40 -> 5.76 ms (idle subgroups skipped).
On a spilling GPU the gain should be larger. Byte-exact vs noble at
n=4096 and n=65536.
Makes the three unfused per-level reduction kernels (`ba_reduce_level`,
kinds 0/1/2) branchless, so the Adreno register allocator schedules a
single straight-line path with no spill traffic.

## What changed
- **Removed the `is_zero_f8` (P = ±Q) dual path.** Under the algorithm's
no-collision assumption (uniformly-random inputs) the affine-add
denominator `x_s - x_d` is never zero, so kinds 0/1 always do
point-addition. This was the main register hog — both the add and double
formulas were live across the branch.
- **`is_present` occupancy → `select`.** The add / copy-into-empty-slot
/ skip cases collapse to straight-line `select`s: the denominator
defaults to the identity `R`, and the stored x/y, the `inv` peel factor,
and the `is_present` flag are all selected.
- **`KIND` switch → compile-time mustache** (`{{#kind0/1/2}}`), so each
compiled variant has zero `KIND` branches.
- **`k==0` / `k>0` loop branches removed** — `acc` inits to `R`
(Montgomery 1) so `mont(acc,denom)` is correct at k=0, and the final
`inv` peel is dead.
- **Forward bounds branch removed** — the forward store target is
per-thread-private scratch, always safe.
- **Preserved the `tid*C >= ppw` early-exit** from this branch (a
workgroup-uniform whole-lane skip, distinct from the per-candidate
divergent branches), plus one store bounds-guard on the backward pass
(tail lanes must not write into neighbouring windows).
- Added a `fr_select_f8` elementwise branchless select helper (local to
the template).
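
As an illustration of the `select` pattern, an elementwise 8-word select might look like the sketch below. The argument order and the mask-based body are assumptions; in WGSL the helper would more likely wrap the built-in `select`:

```typescript
// Returns a copy of `a` when cond is true, of `b` otherwise, without
// branching on cond inside the loop.
function frSelectF8(a: Uint32Array, b: Uint32Array, cond: boolean): Uint32Array {
  const mask = -Number(cond) >>> 0; // 0xffffffff or 0
  const out = new Uint32Array(8);
  for (let i = 0; i < 8; i++) {
    out[i] = (a[i] & mask) | (b[i] & ~mask);
  }
  return out;
}
```

Used this way, the denominator can default to `R` when a slot is empty, for example `frSelectF8(realDenom, R, isPresent)` with hypothetical variable names, so the batched inversion always sees an invertible value.
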

Generated WGSL was regenerated via `yarn generate:wgsl`;
`shader_manager` now passes `kind0/1/2` booleans for the mustache
sections.

Also: `bench-msm-v2.ts` accepts a base64url `cfg` blob param —
BrowserStack real-mobile workers truncate the launch URL at the first
`&`, so multi-knob runs otherwise silently fall back to defaults.
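
One way to build such a blob is base64url over JSON, which leaves no `&`, `+`, `/` or `=` in the URL. The JSON payload and the helper names are assumptions; the page's actual encoding may differ:

```typescript
type Knobs = Record<string, string | number>;

// Encode knobs as base64url(JSON). ASCII-only values are assumed, since
// btoa only accepts Latin-1 input.
function encodeCfg(knobs: Knobs): string {
  const b64 = btoa(JSON.stringify(knobs));
  return b64.replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Reverse the alphabet swap, restore padding, then parse.
function decodeCfg(blob: string): Knobs {
  const b64 = blob.replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  return JSON.parse(atob(padded)) as Knobs;
}
```

A launch URL then carries a single parameter, e.g. `?cfg=` followed by `encodeCfg({ inv: "pk", c: 8, s: 2 })`, so nothing is lost when the URL is cut at the first `&`.
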

## Result: Samsung Galaxy S25 Ultra (BrowserStack), n=2^10, inv=pk, c=8, s=2
| reduce path | GPU reduction (`redFused`) | wall median |
|---|---:|---:|
| fused (default) | 468.9 ms | 484.8 ms |
| unfused, branched | 62.8 ms | 78.7 ms |
| **unfused, branchless (this PR)** | **~4.9 ms** | **~18 ms** |

The GPU reduction kernel drops **~12.8×** vs the branched unfused path
(and the MSM is now host-bound — the on-GPU work is ~7 ms).
**Bit-exact**: `resultX = 0x22ec42de…b8d85f19`, identical to both the
fused and branched-unfused paths on the same inputs, so the rewrite (and
the no-collision branch removal) is validated by result-hash equality
on-device.

Benchmarked S25-only per the WebGPU MSM workflow; the M2/macOS reference
path is unchanged.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/554522191f42098c) ·
group: `slackbot`*
The transpose count did one global atomicAdd per (point, window) onto a
per-bucket counter. Bucketing maps many points to the same bucket, so
those atomics are heavily contended — fine on GPUs with strong atomic
units, but they serialize hard where contended-atomic throughput is weak
(Adreno).

transpose_parallel_count_priv: one workgroup per window tallies the
window's column indices into a workgroup-shared histogram (shared-memory
atomics, on-chip) and writes the result with plain stores — zero global
atomics. When n_cols exceeds the shared histogram capacity the columns
are covered in tiles, each re-scanning the (coalesced) column array.
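
A CPU model of one workgroup's work shows the tiling. The function name and the `tileCap` parameter (the shared-histogram capacity) are hypothetical:

```typescript
// Count one window's column indices with a bounded "shared" histogram.
function countWindowTiled(cols: Uint32Array, nCols: number, tileCap: number): Uint32Array {
  const counts = new Uint32Array(nCols);
  for (let base = 0; base < nCols; base += tileCap) {
    const tileSize = Math.min(tileCap, nCols - base);
    const shared = new Uint32Array(tileSize); // stands in for workgroup memory
    // Each tile re-scans the whole column array and keeps only its range.
    for (let i = 0; i < cols.length; i++) {
      const local = cols[i] - base;
      if (local >= 0 && local < tileSize) shared[local]++;
    }
    counts.set(shared, base); // plain stores, no global atomics
  }
  return counts;
}
```

The cost of tiling is one extra pass over the columns per tile, which is why the re-scan being coalesced matters.
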

The base transpose_parallel_count is kept — it is still used by the
legacy bench, the wgsl unit test, and src/msm.ts. MsmV2 switches to the
privatized variant (one workgroup per window, workgroup size 256).

M4 n=2^16: transpose pass 0.31 ms (unchanged — M4 eats contended atomics
fine). Byte-exact vs noble at n=4096 and n=65536. The win is on the S25.
The transpose scatter did one contended global atomicAdd per (point,
window) onto a per-column write cursor — the same Adreno-hostile pattern
as the count.

transpose_parallel_scatter_priv: one workgroup per window, the per-column
write cursors held in workgroup-shared memory (shared atomics only, no
global atomics). One workgroup owns the whole window, so every point of
a column is allocated by the same shared cursor. Tiled when n_cols
exceeds the shared cursor capacity. The global `all_curr` counter buffer
is gone.

Within-column ordering becomes shared-atomic order rather than global
arrival order — a different valid permutation, identical multiset per
column. Bucket accumulation sums each bucket and is order-independent,
so the MSM result is unchanged: byte-exact vs noble at n=4096, 65536.

M4 n=2^16: transpose pass 0.33 -> 0.23 ms (privatized count + scatter).
The base transpose_parallel_scatter is kept for the legacy callers.
…ariant

PR #23485 made the unfused per-level reduction kernels branchless and
fast, but `reduceVariant` still defaulted to 'fused' — so a default
MsmV2 run (the canonical benchmark path) used the slow single-dispatch
fused monolith, not the branchless kernels. The fast reduction was only
reachable with an explicit ?reducevariant=unfused.

Flip the default to 'unfused'. The branchless unfused reduction is now
the better kernel on both M4 (neutral — the ~43-dispatch overhead is
absorbed) and register-starved GPUs (no spill); 'fused' stays selectable
as a fallback / for A/B.

M4 n=2^16: wall 19.6 ms, redFused 6.18 ms — unchanged. Byte-exact vs
noble at n=4096 and n=65536 with default config.
The fused single-dispatch reduction (ba_reduce_fused) was the slow path
— superseded by the unfused branchless per-level kernels, which became
the default in the previous commit. Delete it so the slow path can no
longer be selected anywhere.

- ba_reduce_fused_bench.template.wgsl removed; gen_ba_reduce_fused_bench_
  shader removed from shader_manager.
- The `reduceVariant` MsmConfig knob is gone — the reduction is now
  unconditionally the per-level kernels. reduceFusedLayout (reused by the
  per-level path) renamed reduceLevelLayout; the fused pipeline / bind /
  if-else branches removed from create / prepare / run / profiling.
- ?reducevariant= removed from bench-msm-v2 and bench-msm-v2-check.
- bench-msm-tree-v2 (the legacy v2 prototype, the only other caller of
  the fused reduction) deleted, with its run-browserstack page entry.

Byte-exact vs noble at n=4096 and n=65536 with default config; M4
n=2^16 wall ~19.4-20 ms, unchanged.
dev/msm-webgpu/ had ~20 separate microbench frontends (bench-*.html/.ts)
and ~10 headless scripts accumulated during the MSM work — almost all
superseded. Collapse to the only thing needed:

- index.html + main.ts — the MsmV2-vs-WASM benchmark site (WebGPU MSM
  cross-checked against barretenberg WASM Pippenger ST + MT, and noble
  at logN=16).
- drive-index.mjs — its headless driver (local Chrome).
- scripts/run-browserstack.mjs (+ bs-targets.mjs) — repointed at
  index.html (?autorun=msm-cross-check), for running the site on a real
  device (e.g. the S25).

Deleted: 21 bench-*.html, 21 bench-*.ts, run-bench.mjs, and 9
microbench/runner scripts (scripts/bench-*.mjs, profile-sanity,
run-bench-smvp-tree, run-local-webgpu, run-msm-page). The kept site's
import closure (msm_v2, pippenger_wasm, srs, gpu_decompress,
results_post, wgsl_unit_tests) is untouched.

Validated: index.html headless autorun cross-check — WebGPU, WASM ST,
WASM MT all agree, state=done.
After the bench-page cleanup, 23 shaders were left used by no live path
(MsmV2, wgsl_unit_tests, or the production compute_bn254_msm). Delete
them, the 21 corresponding shader_manager gen_* methods and imports, and
the now-dead renderByInverseFuncs / renderGetRFn helpers:

- microbench shaders: apply_matrix_bench, bench_field_mul, bench_field_inv,
  bench_batch_affine, divsteps_bench, field_mul_bench_u32/f32, fr_inv_bench
- pre-v2 experiments: ba_marshal_chain/pairs/tree_l0, ba_pair_disjoint(_tree),
  ba_planner_bench, ba_scatter_pairs, ba_tail_reduce, ba_rev_packed_carry,
  batch_affine_apply/finalize/fused_wg_scan, batch_inverse
- plain CIOS mont_pro_product — the live paths use the karat-yuval montmul
- plain by_inverse — the live paths use by_inverse_a / by_inverse_loop

mont_pro_product_f32_22_sos3uv3 (+ bigint_f32, mulhilo_22) kept by request
though currently uncalled. 75 -> 52 inlined shaders.

Validated: index.html headless autorun cross-check — WebGPU, WASM ST,
WASM MT all agree, state=done. shader_manager typechecks clean.
The WebGpuMsmHost bridge delegated to the obsolete modified-cuZK
compute_bn254_msm pipeline. Replace it with MsmV2, the carry-free
Booth / privatized-transpose / pair-tree pipeline.

- Move MsmV2 into src/msm_webgpu and add MsmV2Pool: the SRS point pool
  is uploaded and Montgomery-converted on the GPU once, then bound as a
  prefix by every MSM, replacing the per-instance host conversion.
- Rewrite WebGpuMsmHost around the pool: lazy build at SRS publish, a
  pinned SRS-sized instance plus a small LRU, no benchmark warm-up.
- Offload the Horner window-combine and modular inverse to native
  bb::g1 (combine_windows); the bridge ships the per-window sums and the
  C++ hook folds them.
- Drop the redundant GPU demont pass and the scalar Montgomery
  round-trip; the hook omits points for SRS-prefix MSMs and routes
  small MSMs to the native Pippenger.
- Delete msm.ts, the superseded cuzk modules and the dead WGSL shaders;
  the dev/msm-webgpu benchmark page is retained.
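
The "pinned instance plus a small LRU" policy can be sketched generically. The class below is an illustration, not WebGpuMsmHost itself; keying by MSM size and the `build` callback are assumptions:

```typescript
// One pinned entry that is never evicted, plus a small LRU for other keys.
class PinnedLru<V> {
  private readonly pinnedKey: number;
  private readonly pinned: V;
  private readonly capacity: number;
  private readonly lru = new Map<number, V>(); // Map iterates in insertion order

  constructor(pinnedKey: number, pinned: V, capacity: number) {
    this.pinnedKey = pinnedKey;
    this.pinned = pinned;
    this.capacity = capacity;
  }

  has(key: number): boolean {
    return key === this.pinnedKey || this.lru.has(key);
  }

  get(key: number, build: (key: number) => V): V {
    if (key === this.pinnedKey) return this.pinned;
    const hit = this.lru.get(key);
    if (hit !== undefined) {
      this.lru.delete(key); // re-insert to mark as most recently used
      this.lru.set(key, hit);
      return hit;
    }
    const made = build(key);
    this.lru.set(key, made);
    if (this.lru.size > this.capacity) {
      const oldest = this.lru.keys().next().value as number;
      this.lru.delete(oldest); // evict the least recently used entry
    }
    return made;
  }
}
```
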
The dev MSM comparison harness (barretenberg/ts/dev/msm-webgpu) ran a
redundant single-threaded WASM pass, timed input marshalling as compute,
and trapped outright at large sizes.

Harness changes:
- Drop the single-threaded WASM row — keep WebGPU vs multi-threaded WASM
  Pippenger only.
- Split the WASM path into an untimed bb_native_pippenger_bn254_load
  (decode + upload) and a timed bb_native_pippenger_bn254_run, so the
  measured window is pure Pippenger compute.
- Delete the unused by_inverse_a field-inversion shader variant.

Bug fixes:
- bb_native_pippenger_bn254_run calls batch_multi_scalar_mul_native, not
  pippenger_unsafe: in a BBERG_WEBGPU_MSM_HOOK build the latter routes an
  n >= 2^16 BN254 MSM into the uninstalled WebGPU bridge and throws.
- The MsmV2 warm-up uses a pseudo-random scalar spread; identical scalars
  collapsed every window sum onto one subgroup and made the host Horner
  combine hit a non-invertible value.
- Raise the harness WASM heap cap from 256 MiB to 1 GiB and free the
  previous size's input vectors before reallocating in _load — a 2^20
  sweep otherwise exhausts the heap and traps with `unreachable`.

Verified in headless Chrome: the full logN 10..20 sweep completes with
WebGPU/WASM cross-checks agreeing at every size.
… scaling

The WebGPU MSM transpose dispatched one workgroup per window (~17), so at
large n it was a latency-bound serial scan with no parallelism to hide DRAM
latency — the `transpose` pass blew up superlinearly (0.8 ms at 2^17 to
108 ms at 2^20, 37% of the whole MSM) and dragged the per-doubling cost to
2.2-2.3x where Pippenger should be O(N/log N).

Replace the 3 one-workgroup-per-window kernels with a tiled counting sort
dispatched across point-chunks (numChunks x windows, ~600-1000+ workgroups):
- count: each workgroup histograms its point-chunk into a workgroup-shared
  histogram (shared atomics only, ~1-deep contention — no global atomics)
  and writes a private partial-histogram row;
- reduce: folds the per-chunk partials over the chunk axis into the per-
  window column counts and chunk-exclusive prefixes;
- scan: the existing per-window prefix sum, unchanged;
- scatter: each workgroup scatters its chunk into the CSC slots at the
  scanned offsets via a workgroup-shared write cursor.
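
The four passes can be modelled on the CPU for a single window. This is a sketch under stated assumptions: the names, the single-window scope and the return shape are illustrative, and the GPU version's within-column order depends on shared-atomic timing rather than index order:

```typescript
// Chunked counting sort of point indices by column: count, reduce, scan, scatter.
function tiledCountingSort(
  cols: Uint32Array,
  nCols: number,
  chunkSize: number,
): { colPtr: Uint32Array; rowIdx: Uint32Array } {
  const n = cols.length;
  const numChunks = Math.ceil(n / chunkSize);

  // count: each chunk ("workgroup") writes its own partial-histogram row.
  const partials = new Uint32Array(numChunks * nCols);
  for (let ch = 0; ch < numChunks; ch++) {
    const end = Math.min(n, (ch + 1) * chunkSize);
    for (let i = ch * chunkSize; i < end; i++) partials[ch * nCols + cols[i]]++;
  }

  // reduce: fold over the chunk axis, leaving chunk-exclusive prefixes in place.
  const counts = new Uint32Array(nCols);
  for (let c = 0; c < nCols; c++) {
    let running = 0;
    for (let ch = 0; ch < numChunks; ch++) {
      const v = partials[ch * nCols + c];
      partials[ch * nCols + c] = running;
      running += v;
    }
    counts[c] = running;
  }

  // scan: exclusive prefix sum over the column counts.
  const colPtr = new Uint32Array(nCols + 1);
  for (let c = 0; c < nCols; c++) colPtr[c + 1] = colPtr[c] + counts[c];

  // scatter: each chunk places its points using a chunk-local write cursor.
  const rowIdx = new Uint32Array(n);
  for (let ch = 0; ch < numChunks; ch++) {
    const cursor = new Uint32Array(nCols); // stands in for the shared cursor
    const end = Math.min(n, (ch + 1) * chunkSize);
    for (let i = ch * chunkSize; i < end; i++) {
      const c = cols[i];
      rowIdx[colPtr[c] + partials[ch * nCols + c] + cursor[c]++] = i;
    }
  }
  return { colPtr, rowIdx };
}
```

Because every chunk knows its exclusive prefix per column before scattering, no chunk ever needs to coordinate with another through a global atomic.
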

The numChunks x BW partials matrix temporally reuses l0IdxBuf (dormant
until convActive overwrites it after the transpose) — no new allocation.

Measured on an M4, headless: transpose at 2^20 drops 108 ms -> 5.9 ms (18x),
total 290 ms -> 190 ms, and the per-doubling ratio falls from 2.27x to
1.81x. WebGPU/WASM/noble cross-checks agree at every size logN 14-20.
After the tiled-transpose fix, `planner` was the last superlinear pass —
one workgroup per Pippenger window (~17), O(pairs) per window, 16 ms at
2^20 (8.5%) scaling ~2.8x/doubling.

Split it into two passes:
- ba_planner_v2_offsets (pass A): the per-window scan — Phase A/B/D
  unchanged — but Phase C now writes only the per-bucket carry prefix
  plus new_counts/new_offsets, not the O(pairs) plans. O(BW), flat in n,
  so one workgroup per window is fine.
- ba_planner_v2_emit (pass B): dispatched (ceil(BW/256), numWindows) —
  one workgroup per (bucket-group, window) — emits the chunk / scatter /
  carry plans in parallel and cooperatively self-pads the plan tails.
  The pair prefix is derived as new_offsets - w*wstride - carry_off.

The per-bucket carry-prefix array temporally reuses valIdxBuf (dead once
convActive has consumed it, strictly before the planner runs) — no new
allocation, no atomics.

Measured on an M4, headless: planner at 2^20 drops 16 ms -> 6 ms and
scales 1.89x/doubling; total 190 ms -> 180 ms. Combined with the tiled
transpose, 2^20 is 287 ms -> 180 ms and the per-doubling ratio is
2.27x -> 1.76x — O(N/log N) restored. WebGPU/WASM/noble cross-checks
agree at every size logN 14-20.
@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 23, 2026
@socket-security

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---:|---:|---:|---:|---:|
| Added | npm/@types/mustache@4.2.6 | 100 | 100 | 74 | 80 | 100 |
| Added | npm/playwright-core@1.59.1 | 74 | 100 | 79 | 99 | 100 |
| Added | npm/mustache@4.2.0 | 100 | 100 | 100 | 75 | 100 |
| Added | npm/bigint-crypto-utils@3.3.0 | 100 | 100 | 96 | 80 | 100 |
| Added | npm/@webgpu/types@0.1.69 | 100 | 100 | 100 | 92 | 100 |
