fix(bb.js): stream large Android WebGPU MSM chunks#23523
Draft
AztecBot wants to merge 103 commits into
Plumbing for delegating BN254 batch MSMs from the WASM prover to a WebGPU MSM running in the browser's main thread, plus a dev page for running the GPU MSM against a noble Pippenger reference.
C++ side (BBERG_WEBGPU_MSM_HOOK, WASM-only):
- batch_multi_scalar_mul delegates to bb_external_msm_bn254 when the batch has at least one MSM hitting BBERG_WEBGPU_MSM_MIN_N (2^16). handle_edge_cases=true and Grumpkin stay on native Pippenger.
- Marshalling helpers (LE bytes, auto Montgomery strip/rewrap) split into a header that can be unit-tested without the WASM-only hook.
bb.js side (barretenberg/ts/src/msm_webgpu/):
- Port of tal-webgpu submission.ts / cuzk drivers / WGSL templates, BN254 only. BLS12-377, GLV, and atheonxyz entries removed.
- Bridge: SAB-based protocol + worker stub + main-thread host that owns the GpuContext and CachedBases.
- setExtraEnvImports on BarretenbergWasmBase so the bridge can inject bb_external_msm_bn254 into the WASM env without the base class knowing about WebGPU.
Dev page (barretenberg/ts/dev/msm-webgpu/):
- Vite-served HTML that generates random affine points + Fr scalars, runs compute_bn254_msm, cross-checks against noble's Pippenger MSM.
- WGSL unit tests for the decompose and parallel-transpose stages.
- Run via `yarn dev:msm-webgpu`.
Correctness fixes made while bringing the port up:
- SMVP dispatch math at chunk_size=15 (num_subtasks=18) was truncating to zero workgroups. Refactored to pick smvp_subtasks_per_iter as the largest divisor of num_subtasks <= 4 and dispatch the full precomputed geometry per iter.
- SMVP shader skips the (j=1, id%h=0) chunk=0 row. At the top BN254 subtask nearly all random Fr scalars decompose to chunk 0 there, so this one thread would otherwise iterate through ~n/2 mixed-adds and trip Chrome's GPU watchdog at n >= 2^17. The result was being discarded anyway (bucket_idx=0 sentinel).
- Dev page point layout switched to interleaved [x_0|y_0|x_1|y_1|...] per point, matching marshal_points and the convert shader's split.
Validated on n = 2^16, 2^17 against noble's MSM reference.
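The divisor rule from the SMVP dispatch fix above can be sketched as follows. This is a minimal illustration, not the PR's code: `largestDivisorAtMost` and its `cap` parameter are hypothetical names, and only the rule itself (largest divisor of num_subtasks that is <= 4) comes from the description.

```typescript
// Hypothetical helper (name and signature are illustrative, not from the PR).
// Returns the largest divisor of numSubtasks that does not exceed cap.
function largestDivisorAtMost(numSubtasks: number, cap: number): number {
  if (!Number.isInteger(numSubtasks) || numSubtasks < 1) {
    throw new RangeError("numSubtasks must be a positive integer");
  }
  for (let d = Math.min(cap, numSubtasks); d >= 1; d--) {
    if (numSubtasks % d === 0) {
      return d;
    }
  }
  return 1; // only reached when cap < 1
}

// With num_subtasks = 18 the candidates 4 is rejected (18 % 4 != 0) and 3 is chosen,
// so the subtasks split evenly into 18 / 3 = 6 iterations with no remainder.
const subtasksPerIter = largestDivisorAtMost(18, 4);
const iterations = 18 / subtasksPerIter;
```

Picking a divisor rather than a fixed group size keeps every iteration's dispatch geometry identical, which is one way to avoid a truncated final iteration.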
Replaces per-Run random-point regen with the first 2^20 points of the public BN254 G1 SRS (fetched from crs.aztec-cdn.foundation, decompressed once, cached in IndexedDB). Each Run now slices the prefix and pairs it with fresh random Fr scalars, exercising the same bases bb.js hands the WASM prover. log2(n) range is 16..20. Cold pull is slow (~2.5 min) because decompression is JS bigint sqrts; that is fine for a one-off, and it is replaced in the real prove-flow demo by bb.js WASM srsInitSrs. A tqdm-style progress bar reports download/decompress/cache phases with rate + ETA so the multi-minute wait is observable.
C++:
- Extract MSM::batch_multi_scalar_mul into a hook-bypassing
batch_multi_scalar_mul_native. The original entry now branches on
BBERG_WEBGPU_MSM_HOOK and otherwise delegates to native, so callers
that explicitly want the in-tree Pippenger (e.g. the in-browser
comparison harness) don't recurse back through the JS bridge.
- Add a WASM_EXPORT bb_native_pippenger_bn254 that runs Pippenger
on raw LE-32 byte inputs without going through cbindCall('bbapi', ...)
msgpack. Intended for direct JS callers that want to measure the
Pippenger cost without serialization overhead.
bb.js worker factory (only when BBERG_WEBGPU_MSM_HOOK is compiled in):
- main.worker.ts / thread.worker.ts install default stubs for the hook's
env imports (bb_external_msm_bn254 throws, bb_publish_srs_bn254 no-op)
before WASM instantiation. WebAssembly.instantiate refuses to link
without these even when the bridge isn't wired up.
- Both worker entries use dynamic import() inside an async IIFE so the
error/unhandledrejection listeners register synchronously at module
evaluation, before any import resolves. Bootstrap failures get
postMessaged back as 'worker-error' instead of vanishing into an
opaque parent-side error event.
- createMainWorker races the readiness signal against a 50 ms grace
window for these self-reported errors and surfaces the rich version
when available.
Adds the bb WASM Pippenger paths (single-threaded + multi-threaded) to the existing in-browser BN254 MSM dev page, alongside the WebGPU MSM and the noble reference. Lets us sweep log₂(n)=16..20 and read out median timings for all three implementations on the same SRS-backed inputs.
- pippenger_wasm.ts: bb.js-style harness that boots the threaded barretenberg WASM in a Worker, calls bb_native_pippenger_bn254 via the WASM exports table (bypassing the bbapi msgpack path), and returns the affine result. Used twice: one worker booted with threads=1 for the ST row, one with threads=N for the MT row. Two workers are required because bb's parallel_for static pool is sized at first call and fixed thereafter; sharing one worker for both roles produced a slower MT time than ST.
- main.ts / index.html: sweep + benchmark UI, lazy WASM boot, Stop button to tear down workers, ?coi=1 toggle for cross-origin isolation (gated because the threaded WASM needs SharedArrayBuffer).
- vite.config.ts: dev server config. Middleware serves the freshly-built barretenberg.wasm.gz directly out of cpp/build-wasm-threads/bin (no copy needed) and adds conditional COOP/COEP headers. The COI middleware additionally attaches Cross-Origin-Embedder-Policy: require-corp to any response whose Sec-Fetch-Dest is worker/sharedworker/serviceworker. Without this, a COI'd document can't spawn workers (the worker response must also carry COEP, and worker URLs don't carry the ?coi=1 query suffix the document does).
… threads to hwConcurrency
- Sweep result now renders two tables instead of one. A consistency table at log₂n = 16 (n = 65,536) cross-checks WebGPU / WASM-ST / WASM-MT / Noble pairwise, with each cell pass/FAIL independently so a regression can be pinpointed to which pair disagrees. A perf table for log₂n = 16..20 reports WebGPU and WASM MT medians + ratio; WASM ST is omitted from the perf table (strictly slower at these sizes and not the production path), and the Noble column also drops from the perf table since it only runs at the reference size.
- MT threads input now defaults to navigator.hardwareConcurrency (capped at 32) instead of the fixed value of 4. Falls back to 4 when the browser doesn't expose hardwareConcurrency. The HTML input no longer hard-codes value="4"; main.ts writes the computed default into the input at startup.
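The thread-count default described above amounts to a clamp with a fallback. A minimal sketch, assuming a hypothetical helper name (`defaultMtThreads`) and taking the concurrency value as a parameter so it runs outside a browser; the cap of 32 and the fallback of 4 are the values stated in the commit message.

```typescript
// Hypothetical helper: the name is illustrative, not from the PR.
// In a page this would be called with navigator.hardwareConcurrency.
function defaultMtThreads(hardwareConcurrency: number | undefined): number {
  const FALLBACK_THREADS = 4; // used when the browser does not expose a value
  const MAX_THREADS = 32; // cap stated in the commit message
  if (
    hardwareConcurrency === undefined ||
    !Number.isFinite(hardwareConcurrency) ||
    hardwareConcurrency < 1
  ) {
    return FALLBACK_THREADS;
  }
  return Math.min(Math.floor(hardwareConcurrency), MAX_THREADS);
}
```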
…ches
The dev page measured WebGPU at ~200 ms for 2^16, while the same WGSL running through tal-webgpu's Bn254Grid measured ~90 ms. The gap was entry-point selection, not algorithm: the page called the cold `compute_bn254_msm` (re-acquires device, recompiles every pipeline, re-uploads SRS, runs Stage-1 Barrett/Montgomery convert, destroys device) on the timed dispatch.
main.ts now mirrors the tal-webgpu measurement protocol: one persistent GpuContext for the page lifetime, one CachedBases per logN (regenerated when the slider changes), one warm-up dispatch, then time `compute_bn254_msm_batch_affine` across reps. Setup costs are logged under `[gpu-warm]` so the per-step amortisation is visible. Sweep at 2^16..2^20 now reports WebGPU 1.03x..2.06x faster than WASM MT (14t), with 2^16 at ~75 ms median.
Two bugs surfaced once the persistent context was held across logN transitions:
1. msm.ts: `bpr1_key` / `bpr2_key` omitted input_size. The cached bind groups bound bucket_sum_*_sb and g_points_*_sb from the first logN; SMVP at the next logN wrote into fresh ws_key-keyed buffers that BPR never read, and BPR wrote into stale buffers that Horner never read. Output was identity (gpu.x=0). Threaded input_size into bpr_1 / bpr_2 and appended it to the keys.
2. batch_affine.ts: `ws_key` (governs init/schedule/inverse/apply/finalize bind groups) omitted input_size. Those bind groups capture `point_x_sb` / `point_y_sb` from CachedBases plus all_csc_*_sb from the transpose, both recreated on a logN switch. Cached bind groups ended up referencing destroyed buffers; kernels dispatched fast (~4.5 ms total) writing nothing. Added input_size to the key.
Also replaced two `Buffer.alloc(0)` placeholders in compute_bn254_msm_cached and compute_bn254_msm_batch_affine with `new Uint8Array(0) as unknown as Buffer`. The placeholder is unread on the cached path; the change removes the dependency on a Node `Buffer` global so the warm entry points work in a Vite dev server / browser without a polyfill.
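Both cache bugs above share one shape: a cache key that leaves out a parameter the cached object depends on. The sketch below reproduces that shape with a stand-in object instead of a real GPU bind group. All names (`FakeBindGroup`, `BindGroupCache`, `keyWithoutSize`, `keyWithSize`) are hypothetical and the key format is an assumption; the PR's actual keys are not shown in the description.

```typescript
// Hypothetical stand-in for a GPU bind group: it only remembers which
// input size its captured buffers were created for.
interface FakeBindGroup {
  capturedInputSize: number;
}

type KeyFn = (stage: string, inputSize: number) => string;

// Buggy shape: the key ignores input_size, so every logN shares one entry.
const keyWithoutSize: KeyFn = (stage) => stage;
// Fixed shape: input_size is appended to the key.
const keyWithSize: KeyFn = (stage, inputSize) => `${stage}:${inputSize}`;

class BindGroupCache {
  private entries: Record<string, FakeBindGroup> = {};
  private keyFn: KeyFn;

  constructor(keyFn: KeyFn) {
    this.keyFn = keyFn;
  }

  // Returns the cached entry for this key, creating it on first use.
  get(stage: string, inputSize: number): FakeBindGroup {
    const key = this.keyFn(stage, inputSize);
    let bindGroup = this.entries[key];
    if (bindGroup === undefined) {
      bindGroup = { capturedInputSize: inputSize };
      this.entries[key] = bindGroup;
    }
    return bindGroup;
  }
}

// Visiting 2^16 and then 2^17 with the buggy key hands back the 2^16 entry.
const buggy = new BindGroupCache(keyWithoutSize);
buggy.get("bpr_1", 1 << 16);
const stale = buggy.get("bpr_1", 1 << 17);

const fixed = new BindGroupCache(keyWithSize);
fixed.get("bpr_1", 1 << 16);
const fresh = fixed.get("bpr_1", 1 << 17);
```

Here `stale.capturedInputSize` is still 65536 while `fresh.capturedInputSize` is 131072, which mirrors the stale-buffer symptom described in the two numbered items.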
Builds on the warm cached-bases path (76d08ed) to surface a per-pass breakdown of each WebGPU MSM rep.
- Re-export `type ProfileCapture` from src/msm_webgpu/index.ts.
- runWebGpuOnce passes a fresh ProfileCapture out-param to compute_bn254_msm_batch_affine on every call and threads the populated capture (per-pass GPU timestamps, CPU phase totals, GPU readback decomposition) up through BackendSample.
- New renderBreakdownTable aggregates captures across reps (subtasks and rounds summed within each rep, then medianised across reps) and renders a median-ms breakdown alongside the perf table on Sweep, and as the sole result panel on Run × 5 and Quick Sanity Check. STAGE_ORDER lays stages out in pipeline order (decompose → convert → transpose → ba_* → smvp → bpr → reduce) so rows read top-to-bottom along the dataflow.
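The aggregation order matters here: summing within a rep first and taking the median across reps second gives a per-stage figure that is robust to one slow rep. A minimal sketch of that order, assuming a simplified capture shape (`StageSample`) that is hypothetical and much smaller than the real ProfileCapture:

```typescript
// Hypothetical simplified capture: one row per timed pass. A stage can
// appear many times within a rep (once per subtask or round).
interface StageSample {
  stage: string;
  ms: number;
}

function median(values: number[]): number {
  if (values.length === 0) {
    return NaN;
  }
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Step 1: sum each stage within a rep. Step 2: median of those sums across reps.
function medianMsPerStage(reps: StageSample[][]): Record<string, number> {
  const totalsByStage: Record<string, number[]> = {};
  for (const rep of reps) {
    const repTotals: Record<string, number> = {};
    for (const sample of rep) {
      repTotals[sample.stage] = (repTotals[sample.stage] ?? 0) + sample.ms;
    }
    for (const stage of Object.keys(repTotals)) {
      if (totalsByStage[stage] === undefined) {
        totalsByStage[stage] = [];
      }
      totalsByStage[stage].push(repTotals[stage]);
    }
  }
  const result: Record<string, number> = {};
  for (const stage of Object.keys(totalsByStage)) {
    result[stage] = median(totalsByStage[stage]);
  }
  return result;
}
```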
Adds a WGSL compute shader (decompress_g1_bn254) that recovers BN254 G1 points from compressed CRS bytes via fr_pow sqrt in Montgomery form, plus a TS runner wired into loadSrsPoints with a JS-bigint fallback. A small sample cross-check against the JS reference runs on cold load so a shader regression can't silently corrupt every cached SRS. Cold first-load on 2^20 points drops from ~29s of JS bigint sqrts to ~0.7s of GPU compute; subsequent loads still hit IndexedDB.
…T table Renders a small ✓ on the WebGPU and WASM MT median-ms cells when the two backends produced the same MSM result for that size, ✗ on disagreement. Stays blank until the row's first rep finishes. Makes "is this comparison even valid" visible at a glance without scrolling down to the consistency table.
The batch-affine SMVP path caches init/schedule/apply bind groups under a key that includes input_size, and those bind groups reference `cachedBases.point_x_sb` / `point_y_sb` directly. When a caller rotated CachedBases across a logN sweep (16 → 17 → … → 20 → 16 again), the bind group inserted on the first visit to N=2^16 outlived the buffers it pointed at. The second visit to the same N hit the cache, dispatched against destroyed/zero memory, and the MSM returned the identity point — visible on the dev page as a ✗ on the per-row correctness tick. `CachedBases.destroy()` now calls `context.invalidateAllBindGroups()` so the next dispatch rebuilds bind groups against the live buffers. The existing `invalidateBindGroupsReferencing` private helper is reframed as a public clear-all method.
ba_dispatch_args was previously skipped under the assumption it was sub-millisecond; profiling it confirmed it is, but it also revealed the bulk of the residual lives elsewhere. Profiler.report() now derives an `encoder_all` row from min/max of existing stage timestamps (empty-pass markers are unreliable on Dawn — slots stay at 0). The dev page surfaces `inter_pass_overhead = encoder_all − Σ stages` and `post_encoder_tail = wall − encoder_all`, which splits the previous `untimestamped` row into Dawn barrier cost (small) and pre/post queue work (dominant — driven by the scalars writeBuffer).
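The two derived rows can be written out directly from the formulas in the message. The sketch below assumes a hypothetical `StageSpan` shape with start/end timestamps in nanoseconds; the real Profiler.report() structure is not shown in the description.

```typescript
// Hypothetical per-stage timestamp pair, in nanoseconds.
interface StageSpan {
  name: string;
  startNs: number;
  endNs: number;
}

interface OverheadSplit {
  encoderAllMs: number;
  interPassOverheadMs: number;
  postEncoderTailMs: number;
}

// encoder_all       = max(end) - min(start) over all stage timestamps
// inter_pass_overhead = encoder_all - sum of stage durations
// post_encoder_tail   = wall - encoder_all
function splitUntimestamped(stages: StageSpan[], wallMs: number): OverheadSplit {
  if (stages.length === 0) {
    throw new Error("at least one stage span is required");
  }
  let minStart = stages[0].startNs;
  let maxEnd = stages[0].endNs;
  let stageSumNs = 0;
  for (const span of stages) {
    minStart = Math.min(minStart, span.startNs);
    maxEnd = Math.max(maxEnd, span.endNs);
    stageSumNs += span.endNs - span.startNs;
  }
  const encoderAllMs = (maxEnd - minStart) / 1e6;
  return {
    encoderAllMs,
    interPassOverheadMs: encoderAllMs - stageSumNs / 1e6,
    postEncoderTailMs: wallMs - encoderAllMs,
  };
}
```

Deriving `encoder_all` from existing stage timestamps sidesteps the unreliable empty-pass markers mentioned above, at the cost of missing any GPU work before the first or after the last timed stage.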
…ults
Adds a standalone WebGPU micro-benchmark page comparing three BN254 Montgomery product implementations for chained-mul throughput:
- cios (u32): mitschabaude runtime-loop CIOS over 20×13-bit limbs. Baseline, ~109 ms at n=2^20, k=100.
- karat (u32): recursive Karatsuba + Yuval reduction. Nine 5×5 schoolbook sub-sub-products are computed independently and combined via two Karatsuba levels; reduction uses precomputed r_inv = W^-1 mod p with zero drains in the multiply phase (unsigned wrap unwinds via subsequent subtraction). ~80 ms (~27% faster than cios).
- sos3uv3 (f32, reference): 22-bit f32 limbs with separate per-slot tlo/thi accumulators that break the inner-j carry chain. Single drain per outer iter via bias_split_f32_le4w. ~79 ms.
The bench harness:
- bench-field-mul.html is a standalone page; reads ?path=u32|f32&n=N&k=K&validate-n=N&reps=R&variant=V from the URL.
- bench-field-mul.ts runs k chained Mont mults per thread, validates the first `validate-n` outputs against a host BigInt reference, and writes timing into window.__bench.
- scripts/bench-field-mul.mjs is a Playwright driver for headless invocation from the CLI (added playwright-core as devDependency).
Routes the `montgomery_product_funcs` mustache partial through a
pre-rendered Karatsuba+Yuval body in every MSM shader that does a
base-field multiply (15 callsites: convert_points, smvp, horner,
batch_affine_{apply,schedule,finalize_*,init,apply_scatter},
batch_inverse{,_parallel}, bpr, decompress_g1, montgomery_parity).
The Karatsuba body benches ~27% faster than the mitschabaude
runtime-loop CIOS at n=2^20, k=100 (80 ms vs 109 ms). It exposes the
same `fn montgomery_product(x, y) -> BigInt` symbol plus the same
`get_p` / `conditional_reduce` helpers and uses the same 20×13-bit
limb layout, so the swap is a drop-in change with no callsite churn.
The field-mul bench retains both options (`?variant=cios` renders the
original template inline, `?variant=karat` reuses the class-level
default) so the two bodies can be compared side-by-side.
Phase 1 LANDED — BY safegcd inversion (fr_inv_by_a, Option A: 20×13-bit, BATCH=26, carry-free apply_matrix):
- Production swap-in: wgsl/cuzk/batch_inverse{,_parallel}.template.wgsl call fr_inv_by_a
- 1.5× faster than legacy fr_inv (Pornin K=12) at chained-inverse bench
- ~8% MSM wall reduction at logN=16 sanity check
- TS port (cuzk/bernstein_yang.ts, bernstein_yang_a.ts) + Jest tests (24 passing)
- WGSL impls: wgsl/field/by_inverse{,_a}.template.wgsl + wgsl/bigint/bigint_by.template.wgsl
Phase 2 EXPLORATORY — multi-window pooled batch_inverse + multi-window BPR:
- WPB plumbing in batch_inverse_parallel + dispatch_args + batch_affine.ts
- Default WPB=1 (= legacy behavior, no perf change)
- BPR_WINDOWS_PER_BATCH knob in bpr_bn254.template.wgsl
- Empirical: pooling without growing WG count gives 0% gain — design needs restructure
Standalone bench infrastructure:
- bench-divsteps, bench-apply-matrix, bench-fr-inv, bench-batch-affine
- Each with HTML page + TS dispatcher + Playwright runner under dev/msm-webgpu/scripts/
- profile-sanity.mjs for per-pass GPU time breakdown on the Quick Sanity Check
Tree-reduce design (Stage B) for autonomous remote execution:
- .claude/plans/msm-tree-reduce.md — full design (adaptive batch sizing, analytical slice partition, 2 distinct phase kernels)
- .claude/plans/remote-agent-brief.md — remote agent execution brief
Co-authored with Claude.
Adds a remote-device bench loop for the MSM-webgpu dev pages so the tree-reduce work can validate against real WebGPU hardware (Apple M2, Snapdragon 8 Elite, Tensor G4) from a workstation without a local GPU.
- vite.config.ts: results/progress POST endpoints write JSONL to files named by MSM_WEBGPU_RESULTS_FILE / MSM_WEBGPU_PROGRESS_FILE; allow .trycloudflare.com so the dev server is reachable via Cloudflare Quick Tunnel.
- results_post.ts: tiny in-page client used by bench/sanity pages to POST progress + final-state payloads (no keepalive, since the page is alive when the bench completes).
- bench-batch-affine.ts: post per-batch progress and a terminal done/error row.
- scripts/run-browserstack.mjs: spawn vite + cloudflared, drive a BS worker through the REST API, watchdog-tail the JSONL with first-progress / stall / deadline budgets.
- scripts/bs-targets.mjs: macOS Sequoia Chrome, S25 Ultra, Pixel 9 Pro XL presets (WebGPU stable). iPhone 15 Pro listed but flagged as needs-iOS-26-or-newer.
Validated against macOS Sequoia Chrome 148 (Apple M2, hc=8) on ?total=8192&sizes=64,256,1024:
B=64 ns/pair=305.2 median=2.500ms
B=256 ns/pair=146.5 median=1.200ms
B=1024 ns/pair=219.7 median=1.800ms
Implements smvp_tree_partition.ts: the host computes per-WG slice
boundaries by binary search on bucketStart[], no GPU pre-pass. Uses
the analytical identity running_adds(i) = i - bucket_idx(i) from
msm-tree-reduce.md.
Documents a design ambiguity the plan didn't call out: the identity
under-counts when bucketStart contains empty buckets (bucket_idx
jumps faster than the entry count grows). Resolved by requiring
compacted input; compactBucketStart() + assertCompact() do the
one-pass cleanup and a side activeBucketIds[] map carries the
original bucket index for kernels that tag partials.
Exports:
- computeTotalAdds, bucketIdx, runningAdds, findAddsBoundary
- compactBucketStart, assertCompact
- buildSliceLayout(bucketStart, numWgs) -> SliceLayout
{ sliceStart, outputCount, outputOffset, totalAdds }
24 Jest tests pass — including the pair-detection brute-force walk
that catches the empty-bucket regression, the heavy-bucket-skew case
(7+ of 8 WGs fall inside a single 10k-population bucket), and the
pathological totalAdds < numWgs case.
No GPU code touched.
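The identity and the binary search it enables can be sketched as below. This is an illustration under stated assumptions, not the module's code: it assumes `bucketStart` is a CSR-style array of length numBuckets + 1 whose last element is the total entry count, and the `*Sketch` names are hypothetical stand-ins for the exports listed above.

```typescript
// Compact input means every bucket is non-empty, i.e. bucketStart is strictly increasing.
function assertCompactSketch(bucketStart: number[]): void {
  for (let b = 0; b + 1 < bucketStart.length; b++) {
    if (bucketStart[b + 1] <= bucketStart[b]) {
      throw new Error(`bucket ${b} is empty; compact the input first`);
    }
  }
}

// Index of the bucket containing entry i: the largest b with bucketStart[b] <= i.
function bucketIdxSketch(bucketStart: number[], i: number): number {
  let lo = 0;
  let hi = bucketStart.length - 2;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (bucketStart[mid] <= i) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return lo;
}

// Adds needed to fold entries 0..i (inclusive) into per-bucket sums:
// (i + 1) entries minus (bucketIdx(i) + 1) buckets touched = i - bucketIdx(i).
function runningAddsSketch(bucketStart: number[], i: number): number {
  return i - bucketIdxSketch(bucketStart, i);
}

// Smallest entry index whose running add count reaches target. On compact
// input the count grows by 0 or 1 per entry, so binary search is valid.
function findAddsBoundarySketch(bucketStart: number[], target: number): number {
  let lo = 0;
  let hi = bucketStart[bucketStart.length - 1]; // total entries; returned if target is never reached
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (runningAddsSketch(bucketStart, mid) >= target) {
      hi = mid;
    } else {
      lo = mid + 1;
    }
  }
  return lo;
}
```

For bucket populations 3, 1, 4 (`bucketStart = [0, 3, 4, 8]`) the running counts are 0, 1, 2, 2, 2, 3, 4, 5, ending at 8 - 3 = 5 total adds. With an empty bucket in the array the bucket index can jump by more than one per entry, the count stops being monotone, and the search is no longer sound, which is why compaction is required first.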
…validated)
Phase 1 of the tree-reduce SMVP: pair detection + cooperative batch-
affine + per-bucket-tagged write-out, one workgroup per slice.
Files:
- src/msm_webgpu/wgsl/cuzk/smvp_tree_phase1.template.wgsl — the kernel.
Thread-0 serial pair-detection preamble fills a workgroup-shared
pair_list (packed PAIR + UNPAIRED entries in slice walk order, which
is already bucket-sorted so no reorder postlude is needed). Phase
A/B/C/D batch-affine pattern from bench_batch_affine.template.wgsl,
with rank-indexed chunks over the PAIR sub-stream so a single
fr_inv_by_a amortises across the WG. UNPAIRED entries get a final
cooperative copy pass with sign-flip. Loop bounds all `const`
(MAX_PAIRS = MAX_SLICE_ENTRIES baked at compile time; v0 uses 128 to
keep workgroup memory comfortable).
- src/msm_webgpu/cuzk/shader_manager.ts — gen_smvp_tree_phase1_shader
generator + import wiring.
- dev/msm-webgpu/bench-smvp-tree-phase1.{html,ts} — standalone bench
page with a CPU reference. The reference walks the slice with the
same paired/unpaired state machine and computes Mont-form affine
adds via BigInt mod-inverse; correctness is checked bit-for-bit
against the GPU output.
Status: structure-complete but NOT yet correctness-validated on
hardware. The BS macOS Chrome 148 run hangs on the page before the
first log call lands (the previous BS run on the same tunnel for
bench-batch-affine worked fine, so the issue is page-specific not
infrastructure). Likely candidates: an early-eval import side effect
in smvp_tree_partition.ts, the buildSynthetic randomBelow loop
generating off the main thread, or a Mont-form-conversion stall.
Worth investigating with browser console access; the BS screenshot
API doesn't surface uncaught errors.
Documents a design decision in the shader header: Phase 1 does NOT
collapse same-bucket pair results sequentially into a single per-
bucket partial inside the slice (the plan's "merge consecutive same-
bucket results into running sum" wording). Sequential merging would
break batch-affine amortisation and would need (pop-1) sequential
adds per heavy bucket. Instead Phase 1 halves per bucket (ceil(p/2)
outputs per bucket per slice), letting the recursive Phase 2 dispatch
do the rest of the reduction in log layers.
The plan's wg_output_count[k] = "buckets touched" formula is
overridden here by the per-slice CPU pair-detection walk that
computes the actual output count.
The window.error / unhandledrejection listeners and skip_gpu URL flag were added to narrow down a BS-side hang in the phase1 bench page; they didn't surface the underlying issue and have been removed. Page remains structurally the same as bench-batch-affine.ts plus the buildSliceLayout import and the phase1-specific synthetic-data generation + CPU reference.
Phase 1 of the tree-reduce SMVP now passes correctness on local Chromium WebGPU (SwiftShader): 20/20 outputs match the CPU reference bit-for-bit on the small-N smoke (num_wgs=2, slice_entries=16).
Three real bugs found and fixed by getting local WebGPU into the debug loop (via Playwright + chrome-headless-shell; there is no GPU on the dev container, so SwiftShader is used):
1. randomBelow consumed only the LOW BYTE of each rng() output. For the 32-bit LCG the low 8 bits cycle every 256 outputs, so a 32-byte randomBelow draw cycles every 8 calls, which is fatal when the caller builds a Set of distinct values. Fixed to consume the full 32 bits. Latent bug in bench-batch-affine.ts too; harmless there because the only check is `pxMont !== qxMont` on adjacent calls.
2. WGSL `get_p()` redeclared in smvp_tree_phase1.template.wgsl. Already provided by the `montgomery_product_funcs` partial. Removed the local definition.
3. Shader needs 10 storage buffers per stage; WebGPU's default cap is 8. Adapter actually exposes 10+. get_device now requests the adapter max for `maxStorageBuffersPerShaderStage` alongside `maxComputeWorkgroupStorageSize`.
CPU reference rewritten to do all arithmetic in canonical (non-Mont) form, then convert back to Mont for the diff against GPU output. The prior Mont-form-in-place pass got the inverse semantics wrong: fr_inv_by_a(dx_mont) returns inv_dx_canon * R^2 (a "double Mont" form, picked because the subsequent montgomery_product strips one R factor to give Mont-form slope), not inv_dx_canon * R as the original reference assumed.
GPU bench wall-time: ~6.5ms for 32 entries / 20 outputs / 1 dispatch on SwiftShader CPU-emulated WebGPU. Not a benchmark number: real silicon is expected to be one to two orders of magnitude faster.
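The low-byte cycling in the first bug can be reproduced in a few lines. This sketch assumes a power-of-two-modulus LCG with the Numerical Recipes constants (1664525, 1013904223); the bench's actual constants are not given in the message, and `makeLcg`, `drawLowByte` and `drawFull32` are hypothetical names.

```typescript
type Rng = () => number;

// 32-bit LCG, modulus 2^32. Constants are an assumption (Numerical Recipes).
function makeLcg(seed: number): Rng {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state;
  };
}

// Buggy shape: one byte per rng() call, taken from the low 8 bits only.
function drawLowByte(rng: Rng, len: number): Uint8Array {
  const out = new Uint8Array(len);
  for (let i = 0; i < len; i++) {
    out[i] = rng() & 0xff;
  }
  return out;
}

// Fixed shape: all 32 bits of each rng() output are consumed (4 bytes per call).
function drawFull32(rng: Rng, len: number): Uint8Array {
  const out = new Uint8Array(len);
  for (let i = 0; i < len; i += 4) {
    const word = rng();
    out[i] = word & 0xff;
    out[i + 1] = (word >>> 8) & 0xff;
    out[i + 2] = (word >>> 16) & 0xff;
    out[i + 3] = (word >>> 24) & 0xff;
  }
  return out;
}

function toHex(bytes: Uint8Array): string {
  let hex = "";
  for (let i = 0; i < bytes.length; i++) {
    hex += (bytes[i] < 16 ? "0" : "") + bytes[i].toString(16);
  }
  return hex;
}

// Nine consecutive 32-byte draws from each variant.
function nineDraws(draw: (rng: Rng, len: number) => Uint8Array): string[] {
  const rng = makeLcg(42);
  const draws: string[] = [];
  for (let d = 0; d < 9; d++) {
    draws.push(toHex(draw(rng, 32)));
  }
  return draws;
}

const lowByteDraws = nineDraws(drawLowByte);
const full32Draws = nineDraws(drawFull32);
```

With the low-byte variant the ninth draw repeats the first, because 8 draws × 32 bytes consume exactly 256 outputs and the low 8 bits of such an LCG have period 256. The full-32-bit variant does not repeat within this window.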
Phase 2 of the tree-reduce SMVP: recursive halving over partials. Structurally identical to Phase 1 (same pair-detection state machine, same Phase A/B/C/D batch-affine, same per-WG output write-out) but takes `(bucket_id, AffinePoint)` tuples directly rather than `(sign_bit | scalar_idx)` from the raw schedule + a separate entry_bucket_id table. One less indirection, no sign flip. Output schema matches Phase 1 so the recursion can rebind the same buffers and just swap the input/output roles each layer. Correctness gate: 19/19 outputs match CPU reference bit-for-bit on the small smoke (num_wgs=2, slice_entries=16) on local SwiftShader. GPU bench wall: 5.4ms (CPU-emulated WebGPU; M2 would be ~10× faster based on Phase 1 readback). Done definition for this step met.
…artial)
Drives Phase 1 → CPU sort → Phase 2 → CPU sort → Phase 2 → ... until
every bucket has one partial. CPU-side resort between phases (Step 4
is deferred to GPU follow-up — choice documented in module header).
Standalone bench-smvp-tree.{html,ts} compares the final per-bucket
partials against a CPU reference that computes the full sequential
sum per bucket directly.
Status:
- Phase 1 alone: 1/1 buckets match (entries=2)
- Phase 1 + 1× Phase 2 with mixed pair_result+unpaired input
(entries=3): 1/1 buckets match
- Phase 1 + 1× Phase 2 with two pair_result inputs (entries=4):
1/1 MISMATCH
Repro: load `bench-smvp-tree.html?entries=4&buckets=1&seed=42` on
local SwiftShader Chromium. CPU reference matches the sequential-add
of 4 canonical points; orchestrator's Phase 2 output disagrees.
Phase 2 standalone test (against synthetic Mont-form pair-like
inputs) passes 19/19, so the bug must live in the boundary between
Phase 1's output buffers and Phase 2's input expectations — likely
a Mont-form / BigInt-stride mismatch that the standalone Phase 2
test wasn't hitting because its inputs are generated as fresh random
Mont values rather than the output of a previous batch-affine.
Next step in this debug path: instrument the orchestrator to print
the Phase 1 readback values and diff each (P_2k + P_2k+1) against
its corresponding CPU pair-add for entries=4. That narrows whether
Phase 1's emitted bytes are wrong vs. whether Phase 2 misreads them.
Step 6 (production swap) is unblocked from a structural standpoint
— if the Phase 1/2 chain is fed by the existing transpose +
bucket_start, the same bug surfaces and gives a concrete failing
Quick Sanity Check to triangulate with.
…5 validated) The previous reference summed each bucket's points sequentially: ((P0+P1)+P2)+P3+... which only matches the GPU's tree-reduce parenthesization (P0+P1)+(P2+P3)+... when the inputs are on the EC group. The synthetic bench uses random off-curve bigints (we test the algebraic affine-add formula, not the group law), so the two orderings produce different bytes. Fixed by walking each bucket via the same pair-detection state machine the GPU uses, recursing layer-by-layer until one partial remains. Bench passes 5/5 buckets bit-for-bit on local SwiftShader (entries=40, buckets=5, seed=99) — including bucket=4 which has pop=9 and recurses through 4 layers. This validates the full Phase 1 → CPU sort → Phase 2 → CPU sort → ... chain. Step 5 correctness gate met.
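The difference between the two parenthesizations, and the layer count for a given bucket population, can be illustrated with a generic pairwise reduction. This is a simplified sketch that treats one bucket as a single slice; `reduceLayer`, `treeReduce` and `layersFor` are hypothetical names, and the string "add" is only there to make the grouping visible.

```typescript
// One halving layer: adjacent entries are paired, an odd trailing entry passes through.
function reduceLayer<T>(items: T[], add: (a: T, b: T) => T): T[] {
  const out: T[] = [];
  for (let i = 0; i + 1 < items.length; i += 2) {
    out.push(add(items[i], items[i + 1]));
  }
  if (items.length % 2 === 1) {
    out.push(items[items.length - 1]);
  }
  return out;
}

// Repeats the halving until one partial remains, counting layers.
function treeReduce<T>(items: T[], add: (a: T, b: T) => T): { result: T; layers: number } {
  if (items.length === 0) {
    throw new Error("cannot reduce an empty bucket");
  }
  let current = items;
  let layers = 0;
  while (current.length > 1) {
    current = reduceLayer(current, add);
    layers++;
  }
  return { result: current[0], layers };
}

// Layer count for a bucket of the given population: pop -> ceil(pop / 2) until 1.
function layersFor(pop: number): number {
  let layers = 0;
  for (let p = pop; p > 1; p = Math.ceil(p / 2)) {
    layers++;
  }
  return layers;
}

// A non-associative stand-in for the affine-add formula on off-curve inputs.
const showGrouping = (a: string, b: string): string => `(${a}+${b})`;
const points = ["P0", "P1", "P2", "P3"];
const treeOrder = treeReduce(points, showGrouping).result; // ((P0+P1)+(P2+P3))
const sequentialOrder = points.slice(1).reduce(showGrouping, points[0]); // (((P0+P1)+P2)+P3)
```

Under this simplification a bucket of population 9 shrinks 9 → 5 → 3 → 2 → 1, which is 4 layers and agrees with the pop=9 bucket mentioned above.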
The tree-reduce orchestrator (cuzk/smvp_tree.ts) is correctness-validated standalone but not yet integrated into the production MSM pipeline. This marker documents the integration checklist at the swap site so a follow-up session can wire it in without re-discovering the contract.
Bumped to 256 + 200 entries / 12 buckets validated correctness OK on local SwiftShader (5 layers, 0 mismatches, 140 ms wall) but BS macOS Chrome 148 fails to compile the resulting shader within the worker's initial-load window — either maxComputeWorkgroupStorageSize exceeded or the static-bound pair_list loops blow out the WGSL compile budget. Keeping 128 for the validated path (5/5 buckets bit-for-bit on M2 at entries=40). Scaling further is a follow-up that needs pair_list hoisted to global memory + per-WG pair_count uniform sized for the runtime count instead of MAX_PAIRS-bounded loop iterations.
…SWEET_B=1024
Phase 1/2 shaders rearchitected for thread utilization at the plan's target SWEET_B=1024 batch-affine size. v1's two main flaws:
1. Per-thread O(MAX_PAIRS) scans for rank → raw_slot lookup AND backward search for prev PAIR's raw_slot in Phase D. At MAX_PAIRS=1024 that's 1024 idle iterations per thread per phase.
2. `pair_bucket` in workgroup memory inflated per-WG storage past the 32 KiB cap, forcing MAX_SLICE_ENTRIES=128 and 8× more WGs than the plan called for.
v2 fixes both. Thread-0 preamble builds 4 workgroup-shared arrays in ONE sequential pass:
- pair_idx_a, pair_idx_b: per-raw-slot (PAIR or UNPAIRED) input entry indices
- prev_raw_for_pair: per-raw-slot pointer to immediate prior PAIR's raw_slot (O(1) lookup in Phase D, no backward scan)
- rank_to_raw: per-PAIR-rank pointer to raw_slot (O(1) Phase A/D iteration over PER_THREAD_PAIRS, not MAX_PAIRS)
pair_bucket writes go straight to global `output_bucket_id` from the preamble and never touch workgroup memory.
Workgroup memory at MAX_PAIRS=1024 / TPB=64: 4 × 4 KB (pair arrays) + 2 × 5.12 KB (wg_fwd/bwd) + ~80 B = 26.4 KB, which fits in M2's 32 KiB cap.
Phase A/D inner loops now iterate exactly PER_THREAD_PAIRS = 16 times each (down from MAX_PAIRS = 1024 in v1): 64× fewer idle iterations per thread per phase.
Validation on local SwiftShader (Chromium headless, no GPU on dev container):
- Phase 1 standalone at 4096 entries / 8 WGs × 512 entries: 2057 outputs, 0 mismatches, 6.5 ms median.
- Orchestrator at 2048 entries / 64 buckets: 64/64 buckets match full-reduce CPU reference bit-for-bit. 3 layers, 18.8 ms total GPU wall (10.0 + 5.5 + 3.3 across phase1 + phase2 layer2 + layer3).
Apple M2 should be ~10× faster (SwiftShader is CPU-emulated WebGPU). Pending BS validation.
…y bucket-sorted
First-principles observation: Phase 1 / Phase 2 outputs are ALREADY globally bucket-sorted. Input entry_bucket_id is monotone non-decreasing (CSR layout); each WG walks its non-overlapping contiguous slice left-to-right emitting in walk order; WG outputs concatenated preserve monotonicity. No sort needed.
Removes the readback-of-points + JS sort + upload between every phase. Saves O(N × NUM_LIMBS_U32 × 4) bytes of bus traffic + the O(N log N) JS sort per layer × log layers. Still does a small (4 B / partial) bucket-id readback to compute per-WG pair-count + output offsets host-side. Asserts global sort on the readback as a debug guard, which is cheap and catches partition regressions.
Termination changed from "no more pair-adds possible" (required full bucket-id scan) to "count equals input num_active_buckets" (known from initial input). One bucket-id readback per phase, point data never moves between phases.
Bench at 8192 entries / 256 buckets / 5 layers on local SwiftShader:
- 256/256 buckets match full-reduce CPU reference bit-for-bit
- GPU wall: 21.9 + 9.9 + 8.7 + 8.8 + 5.5 = 54.8 ms total
For comparison the prior CPU-sort version at 2048 entries / 64 buckets / 3 layers was 140 ms total. 4× scale, 0.4× time: ~10× speedup from this change plus the v2 thread-utilization fix.
Bench entry cap raised from 512 → 2^18 (1 << 18) and bucket cap from 64 → 2^14 so we can run real production-scale workloads.
…to finalize pipeline Two small kernels that turn the orchestrator's sparse (bucket_id, AffinePoint) outputs into the dense (running_x, running_y, bucket_active) arrays the existing finalize_collect → finalize_inverse → finalize_apply pipeline expects. With these in place the production swap in msm.ts is mechanical: replace the round-loop dispatch with runTreeReduce + scatter_init + scatter, and re-use the finalize chain unchanged for the affine→Jacobian + magnitude-bucket fold. scatter_init: one thread per bucket slot, zeros running_x/y + bucket_active across the full T*num_columns dense layout. scatter: one thread per orchestrator output, writes running_x[bucket_id]=P.x, running_y[bucket_id]=P.y, bucket_active[bucket_id]=1. Both kernels are trivially parallel (no atomics, no synchronisation beyond the bucket_active write which is the only output ever written by any thread for that bucket_id since the orchestrator's output is unique-per-bucket).
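A host-side model of the two kernels makes the sparse-to-dense contract concrete. This is a sketch only: coordinates are single `number` placeholders rather than limb arrays, and `DenseWorkspace`, `SparsePartial`, `scatterInitSketch` and `scatterSketch` are hypothetical names, not the shader entry points.

```typescript
// Hypothetical dense workspace; each coordinate is one u32 placeholder
// standing in for a full limb array.
interface DenseWorkspace {
  runningX: Uint32Array;
  runningY: Uint32Array;
  bucketActive: Uint8Array;
}

// Hypothetical sparse orchestrator output: at most one partial per bucket.
interface SparsePartial {
  bucketId: number;
  x: number;
  y: number;
}

// scatter_init model: one "thread" per bucket slot zeroes the dense arrays.
function scatterInitSketch(numSlots: number): DenseWorkspace {
  return {
    runningX: new Uint32Array(numSlots), // typed arrays start zero-filled
    runningY: new Uint32Array(numSlots),
    bucketActive: new Uint8Array(numSlots),
  };
}

// scatter model: one "thread" per partial writes its own bucket slot.
// Uniqueness per bucket is what makes this safe without atomics; the guard
// below checks that assumption instead of silently overwriting.
function scatterSketch(workspace: DenseWorkspace, partials: SparsePartial[]): void {
  for (const partial of partials) {
    if (partial.bucketId < 0 || partial.bucketId >= workspace.bucketActive.length) {
      throw new RangeError(`bucket ${partial.bucketId} is outside the dense layout`);
    }
    if (workspace.bucketActive[partial.bucketId] === 1) {
      throw new Error(`bucket ${partial.bucketId} received two partials`);
    }
    workspace.runningX[partial.bucketId] = partial.x;
    workspace.runningY[partial.bucketId] = partial.y;
    workspace.bucketActive[partial.bucketId] = 1;
  }
}
```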
…alize pipeline
`smvp_batch_affine_gpu_tree` is the production adapter that:
1. Reads CSR row pointers from `all_csc_col_ptr_sb`, computes
per-entry bucket id, uploads.
2. Runs the v2 tree-reduce orchestrator (`runTreeReduce`).
3. Inits the dense workspace (`running_x/y_sb`, `bucket_active_sb`)
via `scatter_init` (one thread per bucket slot).
4. Scatters the tree-reduce output (sparse, one per active bucket)
into the dense workspace via `scatter` (one thread per output).
5. Returns. Caller continues with the existing `finalize_collect` →
`finalize_inverse` → `finalize_apply` chain unchanged for the
affine→Jacobian conversion and the magnitude-bucket fold.
`buildTreeAdapterPipelines` compiles all four pipelines (phase1,
phase2, scatter, scatter_init) once per (num_words, max_slice_entries)
shape; cache the handle for the warm bench loop.
ShaderManager wiring for `gen_smvp_tree_scatter_shader` +
`gen_smvp_tree_scatter_init_shader` added alongside the existing
phase1/phase2 generators.
The actual msm.ts call-site swap is one more edit: replace the
current `smvp_batch_affine_gpu(...)` call with two calls — first
`smvp_batch_affine_gpu_tree(...)` to populate running_x/y +
bucket_active via tree-reduce, then the existing finalize chain.
That swap is mechanical now that the adapter is in place; pending
the Quick Sanity Check correctness gate.
Validates the tree-reduce's main perf claim from the plan: a heavily skewed input (one bucket with pop = entries/2, the rest uniform) is handled in O(log heavy_pop) layers regardless of skew.
Measured on Apple M2 via BS at entries=65536 / buckets=512:
- skew=heavy (heavy bucket pop = 32 832): 16 layers, 34.6 ms total GPU wall
- skew=uniform (max pop ~256): 6 layers, 24.3 ms total GPU wall
Heavy skew costs only 1.4× more time despite a bucket that the current round-loop MSM would need ~32 832 sequential rounds to reduce. The plan's "5–10× faster on heavy-bucket workloads" claim looks conservative.
Bench page now accepts `?skew=heavy` and abbreviates the pops log for runs with > 16 buckets.
bench-c-sweep fed MsmV2 off-curve random points and raw 256-bit scalars:
- run()'s host window-combine is real elliptic-curve arithmetic, so
off-curve points drove it to a non-invertible Z.
- MsmV2 gives the GPU `s·R mod p` (de-Montgomery'd to `s mod p`) but the
host planner Booth-decodes the raw scalar; a scalar >= p decodes
differently on the two sides and corrupts the level plan.
Load real on-curve SRS points and reduce every scalar mod p.
run-bench.mjs forces HTTP/1.1 so headless Chrome can fetch the SRS.
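A minimal sketch of the scalar fix, assuming scalars are held as `bigint` on the host. `FR_MODULUS` is the BN254 scalar-field order (the modulus the message calls `p`); `reduceScalar` and `reduceScalars` are hypothetical helper names, not functions from the bench.

```typescript
// BN254 scalar-field modulus (called `p` in the commit message).
const FR_MODULUS =
  0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001n;

// Map any integer to its canonical representative in [0, FR_MODULUS).
// After this, the GPU's `s mod p` and the host planner's Booth decode
// both see the same value.
function reduceScalar(s: bigint): bigint {
  const r = s % FR_MODULUS;
  return r < 0n ? r + FR_MODULUS : r;
}

// Reduce a whole batch before handing it to either side.
function reduceScalars(scalars: bigint[]): bigint[] {
  return scalars.map(reduceScalar);
}
```

A raw 256-bit value can be up to about five times the modulus, so most uniformly random 256-bit inputs need the reduction.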
bench-c-sweep measured the fastest Pippenger window size c per MSM size; MsmV2.pickC now uses that table (logN 10..20). The index.html GPU-vs-WASM sweep extends down to 2^10 (was 2^16) — the v2 pipeline has no size floor, and the small sizes show where the GPU overtakes the WASM Pippenger.
Phase 1 of the optimisation-knob plan: make MsmV2's frozen pipeline
constants tunable so the benchmark suite can A/B them, and add per-pass
GPU profiling to see where a knob lands.
- MsmV2.create takes an MsmConfig { c, s, wgi, reduceWg, l0Log,
invVariant, profile, jacobianCrossover }; every field defaults to the
current constant, so an unset config reproduces today's behaviour.
- run() optionally records per-pass GPU timestamps (one query set reused
across runs) and returns a {demont..redFused, wall} breakdown.
- bench-msm-v2.{html,ts}: an A/B harness — drives MsmV2 at one config,
reports median wall + the per-pass breakdown. Open twice with one URL
param changed to compare.
- bench-c-sweep and index.html (main.ts) forward the knob URL params, so
index.html?...&s=4 cross-checks a knob against the WASM Pippenger.
jacobianCrossover is accepted but inert — the Phase-2 hook.
bench-msm-v2 sweeps (2^7..2^16, this GPU) gave clear per-size optima;
adopt them as the defaults so an unconfigured MsmV2 is the tuned config.
- pickC extended down to 2^7 (7->4, 8->4, 9->5): tiny n wants a small c
to shrink the bucket-reduction floor.
- pickS(n), the fused chunk size: 2 for n<=2^11, 4 for 2^12-2^13, 8
above. Small n is occupancy-starved and wants more, smaller chunks.
- pickReduceWg(c), the reduction workgroup size: 32 for c<=9, 64 for
c<=12, 128 for c>=13; tracks the reduction stride / GPU subgroup width.
- Default wgi 64->128 and invVariant 'a'->'loop' (>= old everywhere).
Net: ~12-30% faster at 2^10-2^14, neutral at 2^16 (already saturated).
Cross-check + noble still agree at every size; bench-msm-v2-check now
shares MsmV2's pickC instead of a stale local copy.
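The thresholds above translate into small lookup functions. The sketch below is an assumption about shape only: it takes `logN` rather than `n`, and it lists just the three small-size `pickC` entries the message states (the rest of the table is not reproduced here).

```typescript
// Fused chunk size by MSM size, per the thresholds in the commit message.
// Takes log2(n) as input; the real pickS signature is an assumption.
function pickS(logN: number): number {
  if (logN <= 11) return 2;
  if (logN <= 13) return 4;
  return 8;
}

// Reduction workgroup size by Pippenger window size c.
function pickReduceWg(c: number): number {
  if (c <= 9) return 32;
  if (c <= 12) return 64;
  return 128;
}

// Only the newly added small-n entries are known from the message.
const SMALL_N_WINDOW: Record<number, number> = { 7: 4, 8: 4, 9: 5 };

// Returns undefined for sizes whose table entry is not stated here.
function pickCSmall(logN: number): number | undefined {
  return SMALL_N_WINDOW[logN];
}
```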
fr_sub was three limb-passes: a+p, then -b, then fr_reduce. Rewrite it
as bigint_sub(a,b) plus a borrow-masked +p correction folded into one
add pass: branch-free, two passes. Verified bit-identical against the
GPU-vs-noble cross-check; -4.6% wall at 2^16 (28.2 -> 26.9 ms).
Fold lambda^2 - x1 - x2 to lambda^2 - (x1 + x2) in the affine point-add
across ba_fused_super, ba_reduce_fused and ba_window_combine; one fr_add
is a limb-pass cheaper than the fr_sub it replaces.
Add ba_window_combine.template.wgsl: an on-GPU single-workgroup Jacobian
tree fold that collapses the per-window sums into one affine point. Not
yet wired into MsmV2; the host window-combine still runs.
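The two-pass subtraction can be sketched on the host with the same 20 x 13-bit limb layout. This is an illustrative model, not the WGSL source: `frSubLimbs`, `toLimbs13` and `fromLimbs13` are hypothetical names, and inputs are assumed to be already reduced below the modulus.

```typescript
const LIMB_BITS = 13;
const NUM_LIMBS = 20;
const LIMB_MASK = (1 << LIMB_BITS) - 1;

// Split a bigint into 20 little-endian 13-bit limbs.
function toLimbs13(x: bigint): number[] {
  const out: number[] = [];
  for (let i = 0; i < NUM_LIMBS; i++) {
    out.push(Number((x >> BigInt(i * LIMB_BITS)) & BigInt(LIMB_MASK)));
  }
  return out;
}

// Recombine little-endian 13-bit limbs into a bigint.
function fromLimbs13(limbs: number[]): bigint {
  let x = 0n;
  for (let i = NUM_LIMBS - 1; i >= 0; i--) {
    x = (x << BigInt(LIMB_BITS)) | BigInt(limbs[i]);
  }
  return x;
}

// (a - b) mod p for a, b < p, with no data-dependent branch.
function frSubLimbs(a: number[], b: number[], p: number[]): number[] {
  // Pass 1: plain limb subtraction, tracking the borrow.
  const d = new Array<number>(NUM_LIMBS);
  let borrow = 0;
  for (let i = 0; i < NUM_LIMBS; i++) {
    const t = a[i] - b[i] - borrow;
    d[i] = t & LIMB_MASK;
    borrow = (t >> LIMB_BITS) & 1; // 1 when t went negative
  }
  // Pass 2: add p masked by the final borrow (all-ones if a < b, else 0).
  const mask = -borrow & LIMB_MASK;
  const out = new Array<number>(NUM_LIMBS);
  let carry = 0;
  for (let i = 0; i < NUM_LIMBS; i++) {
    const t = d[i] + (p[i] & mask) + carry;
    out[i] = t & LIMB_MASK;
    carry = t >> LIMB_BITS;
  }
  return out; // the carry out of the top limb is discarded (wraps the 2^260)
}
```

When `a < b` the first pass yields `a - b + 2^260`, and adding `p` wraps that back to `a - b + p`; when `a >= b` the mask is zero and the second pass is a no-op add.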
'active' is a WGSL reserved keyword; the tree-fold flag is renamed to is_active so the shader compiles. The shader is not yet wired into MsmV2 — wiring waits on a cooperative-arithmetic rewrite.
The on-GPU window-combine is a maximally-serial doubling chain; a single-workgroup kernel runs it ~50x slower than the host Horner fold (one GPU lane loses to a CPU core on serial work). The host window-combine stays — removing the dead shader.
Three register-pressure cuts in the canonical MsmV2 hot kernels,
targeting small-register-file GPUs (Adreno/Mali) where the kernels spill
to DRAM and collapse occupancy:
- montgomery_product: the modulus p is now individual P_LIMB_* constants
rather than a function-wide `var p` live across the whole multiply
(follows the existing R_INV precedent). conditional_reduce materialises
its own p after the register peak.
- ba_fused_super: the backward peel is split into an inverse pass and an
affine-add pass, so the running batched inverse is no longer a
register-resident value live across the affine add.
- byl_divsteps: rewritten branchless (select over the swap / add / shift
cases) to remove data-dependent divergence on wide-wave GPUs.
Bit-for-bit identical results, verified against noble at n=4096 and
n=65536. On M4 a net 10% speedup at n=2^16 (27.4 -> 24.7 ms): the lower
register footprint also lifts M4 occupancy.
fr_inv_by_loop_pk stores the safegcd working set (f,g,d,e,p) as two 13-bit limbs per u32 word — 10 words instead of 20 — halving the inverse's per-thread private-memory footprint and apply_matrix memory traffic. Same Bernstein-Yang algorithm and i32/13-bit arithmetic; only the storage is packed. The smaller footprint keeps more inverse threads resident, lifting occupancy on small-register-file GPUs (Adreno). Wired as the `pk` inverse variant in the ba_fused_super and ba_reduce_fused shader generators; MsmV2 defaults to it. Byte-exact vs noble at n=4096 and n=65536. On M4 (not register-bound) the pack/unpack shifts cost ~1 ms — pk's payoff is on the Adreno.
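The storage packing itself is simple to model. The sketch below assumes unsigned 13-bit limbs and uses hypothetical helper names (`pack13`, `unpack13`); the shader's own handling of signed intermediate values is not shown.

```typescript
// Pack pairs of 13-bit limbs into u32 words: low limb in bits 0..12,
// high limb in bits 13..25. 20 limbs -> 10 words.
function pack13(limbs: number[]): Uint32Array {
  const words = new Uint32Array(limbs.length / 2);
  for (let i = 0; i < words.length; i++) {
    words[i] = (limbs[2 * i] & 0x1fff) | ((limbs[2 * i + 1] & 0x1fff) << 13);
  }
  return words;
}

// Inverse of pack13: recover the 13-bit limbs from the packed words.
function unpack13(words: Uint32Array): number[] {
  const limbs: number[] = [];
  for (let i = 0; i < words.length; i++) {
    limbs.push(words[i] & 0x1fff);
    limbs.push((words[i] >>> 13) & 0x1fff);
  }
  return limbs;
}
```

The extra shift and mask on every access is the cost the message measures on M4; the halved footprint is what the Adreno is expected to benefit from.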
Make the 2x13-packed safegcd inverse (fr_inv_by_loop_pk) the default at
every entry point, not just MsmV2's DEFAULT_INV_VARIANT:
- gen_ba_fused_super_bench_shader / gen_ba_reduce_fused_bench_shader
function-default 'a' -> 'pk' (covers any caller that omits the arg, e.g.
bench-msm-tree-v3)
- bench-msm-tree-v2's INV_VARIANT now defaults to 'pk'
- the ?inv= parsers in main.ts, bench-msm-v2 and bench-c-sweep accept
a|loop|pk; with no ?inv they fall through to the 'pk' default
bench-msm-tree-v2 ?validate=1 host-replay passes byte-exact with pk.
The Karatsuba+Yuval montmul was emitted flat: all 9 schoolbook 5x5 sub-products (81 output limbs), then all 3 inner combines, then the outer combine. Every schoolbook output is a function-scope value consumed only later, so all 81 sit live at once — a ~120-u32 register peak that spills on small-register-file GPUs (Adreno). Re-emit grouped by half-product: each of P_lo / P_hi / P_cr gets one scoped block that computes its 3 schoolbooks, runs the inner+outer Karatsuba combine, and folds the result straight into the 40-limb accumulator t. Only one group's 27 schoolbook outputs are live at a time. Identical arithmetic — same 225 multiplies, same combine adds, the p_lo/p_hi/p_cr common-subexpression sharing preserved — purely a tighter live-range schedule. montmul-internal register peak ~120 -> ~80. Byte-exact vs noble at n=4096 and n=65536. On M4 the lower footprint also lifts occupancy: n=2^16 wall 26.7 -> 24.4 ms.
…er 2)
ba_fused_super now carries field elements in the canonical 8x u32 packed
form (which is also the storage form) instead of the 20x13-limb BigInt.
The affine add's live values cost 8 registers each instead of 20;
loads/stores are plain 8-word copies (the unpack256 / pack256 at every
load/store boundary is gone).
- fr_add / fr_sub run natively on 8x u32 (8-word modular add/sub; the
carry out of each word is `u32(sum < operand)`: one compare, no branch).
An unpack-repack variant is kept behind the `addsub` knob.
- montgomery_product_f8 wraps the grouped Karatsuba: expand both
operands 8 -> 20x13, multiply, contract 20x13 -> 8.
- get_r is now a constant derived from get_r_f8 (no var-builder).
Byte-exact vs noble at n=4096 and n=65536. A/B (n=2^16, fused pass):
native fr 10.84 ms vs unpack-repack 11.80 ms; native is ~9% faster, so
it stays the default. On M4 the fused pass drops 14.30 -> 10.84 ms
(cheaper 8-word fr ops + free loads/stores + the lower footprint lifts
occupancy); wall 24.4 -> 21.0 ms. On the Adreno the ~2.5x smaller live
state is the register-spill fix.
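The compare-based carry can be modelled on the host as follows. This is a sketch of the plain 8-word add only (no modular correction), with `addU32x8` as a hypothetical name; `Number(x < y)` plays the role of WGSL's `u32(sum < operand)`.

```typescript
// 256-bit add over 8 little-endian u32 words, with the carry out of each
// word recovered from a single unsigned compare rather than a wider type.
function addU32x8(
  a: Uint32Array,
  b: Uint32Array,
): { sum: Uint32Array; carryOut: number } {
  const sum = new Uint32Array(8);
  let carry = 0;
  for (let i = 0; i < 8; i++) {
    const s = (a[i] + b[i]) >>> 0; // wrapped 32-bit sum
    const c1 = Number(s < a[i]); // wrapped iff the sum fell below an operand
    const t = (s + carry) >>> 0; // fold in the incoming carry
    const c2 = Number(t < s); // that add can wrap too
    sum[i] = t;
    carry = c1 | c2; // at most one of c1, c2 is set
  }
  return { sum, carryOut: carry };
}
```

The two carry bits are never both set: if the first add wrapped, `s` is at most `2^32 - 2`, so adding a carry of 1 cannot wrap again.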
Extend Lever 2 (8x u32 live field representation) from the accumulate kernel to ba_reduce_fused, the 4-phase recursive bucket reduction. Field elements were 20x13-bit BigInts — 20 registers each, and with ~13 live values across the affine-add peel that kernel was the reduction's register-pressure hotspot. ba_reduce_fused now carries field elements as array<u32,8> (the canonical 256-bit packed form, which is also the red_buf storage form): 8 registers each, and load_x/load_y/store_x/store_y become plain 8-word copies with no unpack256/pack256 at the boundary. The 8x u32 field ops (montgomery_product_f8, native fr_add_f8/fr_sub_f8, get_r_f8, is_zero_f8) move out of ba_fused_super's inline block into a shared partial, wgsl/field/field8.template.wgsl, included by both hot kernels via the field8_funcs partial. shader_manager gains f8Context() for the mustache context and threads the addsub knob through the reduce-fused generator. M4 n=2^16: redFused 7.5 -> 6.2 ms, wall 21.0 -> 19.6 ms. Byte-exact vs noble at n=4096 and n=65536.
Add a `reduceVariant` knob to MsmConfig — 'fused' (default, unchanged) or 'unfused'. The fused ba_reduce_fused runs the whole 4-phase reduction in one dispatch with a level loop and storageBarrier between levels; on a register-starved GPU that monolith is the reduction's spill hotspot and its single unbounded dispatch risks the watchdog. The unfused variant is ba_reduce_level_bench.template.wgsl: one schedule level per dispatch, with `kind` (0 = phase-A suffix add, 1 = phase-B/D tree-add, 2 = phase-C double) baked in as a compile-time constant. Three kind-specialized pipelines are compiled; each const-folds away the other two kinds' branches, so each carries only its own path — a smaller per-shader register footprint than the kind-0/1/2 monolith. WebGPU's between-pass ordering replaces the fused storageBarrier; every dispatch is watchdog-bounded. Default stays 'fused', so the M4 path is untouched by construction; 'unfused' is the variant for the S25/Adreno, selectable via ?reducevariant=unfused on the bench pages. Also wires bench-msm-v2 into the BrowserStack runner: the page now POSTs progress/results JSONL, run-browserstack.mjs gains the bench-msm-v2 page mapping and an --n flag, and bench-msm-v2-check accepts the algorithm knobs so a swept config can be validated. M4 n=2^16: fused (default) 19.55 ms unchanged; unfused 19.70 ms (+0.8%, the ~43-dispatch overhead). Both byte-exact vs noble at n=4096 and n=65536.
At deep reduction levels ppw is smaller than the workgroup size, so a thread whose entire candidate range is past ppw does no useful work: every j2 >= ppw, no red_buf / is_present writes, and pref_scratch is private per thread. It still ran the forward montmuls and — the expensive part — the unconditional safegcd inversion. ba_reduce_level returns those threads before the inversion. The unfused per-level kernel can do this; the fused kernel cannot (every thread must reach the storageBarrier). Idle lanes then issue no instructions and, on a register-starved GPU, generate no register-spill traffic. M4 n=2^16, unfused: redFused 6.40 -> 5.76 ms (idle subgroups skipped). On a spilling GPU the gain should be larger. Byte-exact vs noble at n=4096 and n=65536.
Makes the three unfused per-level reduction kernels (`ba_reduce_level`,
kinds 0/1/2) branchless, so the Adreno register allocator schedules a
single straight-line path with no spill traffic.
## What changed
- **Removed the `is_zero_f8` (P = ±Q) dual path.** Under the algorithm's
no-collision assumption (uniformly-random inputs) the affine-add
denominator `x_s - x_d` is never zero, so kinds 0/1 always do
point-addition. This was the main register hog — both the add and double
formulas were live across the branch.
- **`is_present` occupancy → `select`.** The add / copy-into-empty-slot
/ skip cases collapse to straight-line `select`s: the denominator
defaults to the identity `R`, and the stored x/y, the `inv` peel factor,
and the `is_present` flag are all selected.
- **`KIND` switch → compile-time mustache** (`{{#kind0/1/2}}`), so each
compiled variant has zero `KIND` branches.
- **`k==0` / `k>0` loop branches removed** — `acc` inits to `R`
(Montgomery 1) so `mont(acc,denom)` is correct at k=0, and the final
`inv` peel is dead.
- **Forward bounds branch removed** — the forward store target is
per-thread-private scratch, always safe.
- **Preserved the `tid*C >= ppw` early-exit** from this branch (a
workgroup-uniform whole-lane skip, distinct from the per-candidate
divergent branches), plus one store bounds-guard on the backward pass
(tail lanes must not write into neighbouring windows).
- Added a `fr_select_f8` elementwise branchless select helper (local to
the template).
Generated WGSL was regenerated via `yarn generate:wgsl`;
`shader_manager` now passes `kind0/1/2` booleans for the mustache
sections.
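The `select`-style rewrite described above can be illustrated with a mask-based elementwise select over an 8-word field element. This is a host-side sketch; `frSelectF8` is a hypothetical stand-in whose argument order is assumed to follow WGSL's `select(f, t, cond)`, not the template's actual helper.

```typescript
// Elementwise branchless select over 8 u32 words:
// returns t when cond is true, f otherwise, with no data-dependent branch.
function frSelectF8(f: Uint32Array, t: Uint32Array, cond: boolean): Uint32Array {
  const mask = -Number(cond) >>> 0; // 0xffffffff if cond, else 0
  const out = new Uint32Array(8);
  for (let i = 0; i < 8; i++) {
    out[i] = (t[i] & mask) | (f[i] & ~mask);
  }
  return out;
}
```

Both inputs are always computed and one is discarded, which is the trade the branchless kernels make: a little redundant arithmetic in exchange for a single straight-line path.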
Also: `bench-msm-v2.ts` accepts a base64url `cfg` blob param —
BrowserStack real-mobile workers truncate the launch URL at the first
`&`, so multi-knob runs otherwise silently fall back to defaults.
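A base64url blob avoids `&` entirely, so the whole config survives as a single URL parameter. The sketch below is one way to build and read such a blob; `encodeCfg` and `decodeCfg` are hypothetical names, the JSON payload format is an assumption, and it relies on the global `btoa` / `atob` with ASCII-only config values.

```typescript
type KnobConfig = Record<string, string | number>;

// Serialize knobs to JSON, then base64url (no '+', '/', '=' or '&').
function encodeCfg(cfg: KnobConfig): string {
  return btoa(JSON.stringify(cfg))
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}

// Reverse of encodeCfg: restore the standard alphabet and padding first.
function decodeCfg(blob: string): KnobConfig {
  const b64 = blob.replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  return JSON.parse(atob(padded)) as KnobConfig;
}
```

A launch URL would then carry one parameter, for example `?cfg=` followed by `encodeCfg({ c: 8, s: 2, inv: "pk" })`.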
## Result: Samsung Galaxy S25 Ultra (BrowserStack), n=2^10, inv=pk, c=8, s=2
| reduce path | GPU reduction (`redFused`) | wall median |
|---|---:|---:|
| fused (default) | 468.9 ms | 484.8 ms |
| unfused, branched | 62.8 ms | 78.7 ms |
| **unfused, branchless (this PR)** | **~4.9 ms** | **~18 ms** |
The GPU reduction kernel drops **~12.8×** vs the branched unfused path
(and the MSM is now host-bound — the on-GPU work is ~7 ms).
**Bit-exact**: `resultX = 0x22ec42de…b8d85f19`, identical to both the
fused and branched-unfused paths on the same inputs, so the rewrite (and
the no-collision branch removal) is validated by result-hash equality
on-device.
Benchmarked S25-only per the WebGPU MSM workflow; the M2/macOS reference
path is unchanged.
---
*Created by
[claudebox](https://claudebox.work/v2/sessions/554522191f42098c) ·
group: `slackbot`*
The transpose count did one global atomicAdd per (point, window) onto a per-bucket counter. Bucketing maps many points to the same bucket, so those atomics are heavily contended — fine on GPUs with strong atomic units, but they serialize hard where contended-atomic throughput is weak (Adreno). transpose_parallel_count_priv: one workgroup per window tallies the window's column indices into a workgroup-shared histogram (shared-memory atomics, on-chip) and writes the result with plain stores — zero global atomics. When n_cols exceeds the shared histogram capacity the columns are covered in tiles, each re-scanning the (coalesced) column array. The base transpose_parallel_count is kept — it is still used by the legacy bench, the wgsl unit test, and src/msm.ts. MsmV2 switches to the privatized variant (one workgroup per window, workgroup size 256). M4 n=2^16: transpose pass 0.31 ms (unchanged — M4 eats contended atomics fine). Byte-exact vs noble at n=4096 and n=65536. The win is on the S25.
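The tiling scheme can be modelled on the CPU for a single window. This is a sketch under assumptions: `countColumnsTiled` is a hypothetical name, `tileCap` stands in for the shared-histogram capacity, and a plain array replaces workgroup-shared memory and its atomics.

```typescript
// Per-window column count with a bounded "shared" histogram.
// When nCols exceeds tileCap, the columns are covered tile by tile and the
// column-index array is re-scanned once per tile.
function countColumnsTiled(
  cols: Uint32Array,
  nCols: number,
  tileCap: number,
): Uint32Array {
  const counts = new Uint32Array(nCols);
  const shared = new Uint32Array(tileCap); // stands in for workgroup memory
  for (let base = 0; base < nCols; base += tileCap) {
    shared.fill(0);
    for (let i = 0; i < cols.length; i++) {
      const local = cols[i] - base;
      if (local >= 0 && local < tileCap) shared[local]++; // shared-atomic add on GPU
    }
    // Plain stores of the finished tile: no global atomics involved.
    const width = Math.min(tileCap, nCols - base);
    for (let j = 0; j < width; j++) counts[base + j] = shared[j];
  }
  return counts;
}
```

The re-scan costs extra reads per tile, which is the trade made for keeping every increment on-chip.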
The transpose scatter did one contended global atomicAdd per (point, window) onto a per-column write cursor — the same Adreno-hostile pattern as the count. transpose_parallel_scatter_priv: one workgroup per window, the per-column write cursors held in workgroup-shared memory (shared atomics only, no global atomics). One workgroup owns the whole window, so every point of a column is allocated by the same shared cursor. Tiled when n_cols exceeds the shared cursor capacity. The global `all_curr` counter buffer is gone. Within-column ordering becomes shared-atomic order rather than global arrival order — a different valid permutation, identical multiset per column. Bucket accumulation sums each bucket and is order-independent, so the MSM result is unchanged: byte-exact vs noble at n=4096, 65536. M4 n=2^16: transpose pass 0.33 -> 0.23 ms (privatized count + scatter). The base transpose_parallel_scatter is kept for the legacy callers.
…ariant PR #23485 made the unfused per-level reduction kernels branchless and fast, but `reduceVariant` still defaulted to 'fused' — so a default MsmV2 run (the canonical benchmark path) used the slow single-dispatch fused monolith, not the branchless kernels. The fast reduction was only reachable with an explicit ?reducevariant=unfused. Flip the default to 'unfused'. The branchless unfused reduction is now the better kernel on both M4 (neutral — the ~43-dispatch overhead is absorbed) and register-starved GPUs (no spill); 'fused' stays selectable as a fallback / for A/B. M4 n=2^16: wall 19.6 ms, redFused 6.18 ms — unchanged. Byte-exact vs noble at n=4096 and n=65536 with default config.
The fused single-dispatch reduction (ba_reduce_fused) was the slow path,
superseded by the unfused branchless per-level kernels, which became the
default in the previous commit. Delete it so the slow path can no longer
be selected anywhere.
- ba_reduce_fused_bench.template.wgsl removed;
gen_ba_reduce_fused_bench_shader removed from shader_manager.
- The `reduceVariant` MsmConfig knob is gone: the reduction is now
unconditionally the per-level kernels. reduceFusedLayout (reused by the
per-level path) renamed reduceLevelLayout; the fused pipeline / bind /
if-else branches removed from create / prepare / run / profiling.
- ?reducevariant= removed from bench-msm-v2 and bench-msm-v2-check.
- bench-msm-tree-v2 (the legacy v2 prototype, the only other caller of
the fused reduction) deleted, with its run-browserstack page entry.
Byte-exact vs noble at n=4096 and n=65536 with default config; M4
n=2^16 wall ~19.4-20 ms, unchanged.
dev/msm-webgpu/ had ~20 separate microbench frontends (bench-*.html/.ts)
and ~10 headless scripts accumulated during the MSM work, almost all
superseded. Collapse to the only thing needed:
- index.html + main.ts: the MsmV2-vs-WASM benchmark site (WebGPU MSM
cross-checked against barretenberg WASM Pippenger ST + MT, and noble at
logN=16).
- drive-index.mjs: its headless driver (local Chrome).
- scripts/run-browserstack.mjs (+ bs-targets.mjs): repointed at
index.html (?autorun=msm-cross-check), for running the site on a real
device (e.g. the S25).
Deleted: 21 bench-*.html, 21 bench-*.ts, run-bench.mjs, and 9
microbench/runner scripts (scripts/bench-*.mjs, profile-sanity,
run-bench-smvp-tree, run-local-webgpu, run-msm-page). The kept site's
import closure (msm_v2, pippenger_wasm, srs, gpu_decompress,
results_post, wgsl_unit_tests) is untouched.
Validated with the index.html headless autorun cross-check: WebGPU,
WASM ST, WASM MT all agree, state=done.
After the bench-page cleanup, 23 shaders were left used by no live path
(MsmV2, wgsl_unit_tests, or the production compute_bn254_msm). Delete
them, the 21 corresponding shader_manager gen_* methods and imports, and
the now-dead renderByInverseFuncs / renderGetRFn helpers:
- microbench shaders: apply_matrix_bench, bench_field_mul,
bench_field_inv, bench_batch_affine, divsteps_bench,
field_mul_bench_u32/f32, fr_inv_bench
- pre-v2 experiments: ba_marshal_chain/pairs/tree_l0,
ba_pair_disjoint(_tree), ba_planner_bench, ba_scatter_pairs,
ba_tail_reduce, ba_rev_packed_carry,
batch_affine_apply/finalize/fused_wg_scan, batch_inverse
- plain CIOS mont_pro_product: the live paths use the karat-yuval
montmul
- plain by_inverse: the live paths use by_inverse_a / by_inverse_loop
mont_pro_product_f32_22_sos3uv3 (+ bigint_f32, mulhilo_22) kept by
request though currently uncalled. 75 -> 52 inlined shaders.
Validated with the index.html headless autorun cross-check: WebGPU,
WASM ST, WASM MT all agree, state=done. shader_manager typechecks clean.
The WebGpuMsmHost bridge delegated to the obsolete modified-cuZK
compute_bn254_msm pipeline. Replace it with MsmV2, the carry-free Booth
/ privatized-transpose / pair-tree pipeline.
- Move MsmV2 into src/msm_webgpu and add MsmV2Pool: the SRS point pool
is uploaded and Montgomery-converted on the GPU once, then bound as a
prefix by every MSM, replacing the per-instance host conversion.
- Rewrite WebGpuMsmHost around the pool: lazy build at SRS publish, a
pinned SRS-sized instance plus a small LRU, no benchmark warm-up.
- Offload the Horner window-combine and modular inverse to native bb::g1
(combine_windows); the bridge ships the per-window sums and the C++ hook
folds them.
- Drop the redundant GPU demont pass and the scalar Montgomery
round-trip; the hook omits points for SRS-prefix MSMs and routes small
MSMs to the native Pippenger.
- Delete msm.ts, the superseded cuzk modules and the dead WGSL shaders;
the dev/msm-webgpu benchmark page is retained.
The dev MSM comparison harness (barretenberg/ts/dev/msm-webgpu) ran a
redundant single-threaded WASM pass, timed input marshalling as compute,
and trapped outright at large sizes.
Harness changes:
- Drop the single-threaded WASM row; keep WebGPU vs multi-threaded WASM
Pippenger only.
- Split the WASM path into an untimed bb_native_pippenger_bn254_load
(decode + upload) and a timed bb_native_pippenger_bn254_run, so the
measured window is pure Pippenger compute.
- Delete the unused by_inverse_a field-inversion shader variant.
Bug fixes:
- bb_native_pippenger_bn254_run calls batch_multi_scalar_mul_native, not
pippenger_unsafe: in a BBERG_WEBGPU_MSM_HOOK build the latter routes an
n >= 2^16 BN254 MSM into the uninstalled WebGPU bridge and throws.
- The MsmV2 warm-up uses a pseudo-random scalar spread; identical
scalars collapsed every window sum onto one subgroup and made the host
Horner combine hit a non-invertible value.
- Raise the harness WASM heap cap from 256 MiB to 1 GiB and free the
previous size's input vectors before reallocating in _load; a 2^20 sweep
otherwise exhausts the heap and traps with `unreachable`.
Verified in headless Chrome: the full logN 10..20 sweep completes with
WebGPU/WASM cross-checks agreeing at every size.
… scaling
The WebGPU MSM transpose dispatched one workgroup per window (~17), so
at large n it was a latency-bound serial scan with no parallelism to
hide DRAM latency: the `transpose` pass blew up superlinearly (0.8 ms at
2^17 to 108 ms at 2^20, 37% of the whole MSM) and dragged the
per-doubling cost to 2.2-2.3x where Pippenger should be O(N/log N).
Replace the 3 one-workgroup-per-window kernels with a tiled counting
sort dispatched across point-chunks (numChunks x windows, ~600-1000+
workgroups):
- count: each workgroup histograms its point-chunk into a
workgroup-shared histogram (shared atomics only, ~1-deep contention, no
global atomics) and writes a private partial-histogram row;
- reduce: folds the per-chunk partials over the chunk axis into the
per-window column counts and chunk-exclusive prefixes;
- scan: the existing per-window prefix sum, unchanged;
- scatter: each workgroup scatters its chunk into the CSC slots at the
scanned offsets via a workgroup-shared write cursor.
The numChunks x BW partials matrix temporally reuses l0IdxBuf (dormant
until convActive overwrites it after the transpose), so there is no new
allocation.
Measured on an M4, headless: transpose at 2^20 drops 108 ms -> 5.9 ms
(18x), total 290 ms -> 190 ms, and the per-doubling ratio falls from
2.27x to 1.81x. WebGPU/WASM/noble cross-checks agree at every size logN
14-20.
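The four stages can be followed in a sequential CPU model for one window. This is a sketch under assumptions: `tiledCountingSort` is a hypothetical name, each loop iteration over `k` stands in for one workgroup, and the output is a CSC-style `colPtr` / `rowIdx` pair rather than the shader's actual buffers.

```typescript
// Chunked counting sort: count -> reduce -> scan -> scatter.
function tiledCountingSort(
  cols: Uint32Array,
  nCols: number,
  chunkSize: number,
): { colPtr: Uint32Array; rowIdx: Uint32Array } {
  const n = cols.length;
  const numChunks = Math.ceil(n / chunkSize);

  // count: each chunk histograms its own points into a private row.
  const partials = new Uint32Array(numChunks * nCols);
  for (let k = 0; k < numChunks; k++) {
    const end = Math.min(n, (k + 1) * chunkSize);
    for (let i = k * chunkSize; i < end; i++) partials[k * nCols + cols[i]]++;
  }

  // reduce: fold over the chunk axis, leaving chunk-exclusive prefixes in
  // place and producing the total count per column.
  const colCounts = new Uint32Array(nCols);
  for (let c = 0; c < nCols; c++) {
    let running = 0;
    for (let k = 0; k < numChunks; k++) {
      const v = partials[k * nCols + c];
      partials[k * nCols + c] = running;
      running += v;
    }
    colCounts[c] = running;
  }

  // scan: exclusive prefix sum of the column counts.
  const colPtr = new Uint32Array(nCols + 1);
  for (let c = 0; c < nCols; c++) colPtr[c + 1] = colPtr[c] + colCounts[c];

  // scatter: each chunk writes its points at column base + chunk prefix +
  // a chunk-local cursor (workgroup-shared on the GPU).
  const rowIdx = new Uint32Array(n);
  for (let k = 0; k < numChunks; k++) {
    const cursor = new Uint32Array(nCols);
    const end = Math.min(n, (k + 1) * chunkSize);
    for (let i = k * chunkSize; i < end; i++) {
      const c = cols[i];
      rowIdx[colPtr[c] + partials[k * nCols + c] + cursor[c]++] = i;
    }
  }
  return { colPtr, rowIdx };
}
```

Because every chunk owns a disjoint slice of each column's output range, chunks never contend with each other; on the GPU the order within a chunk's slice depends on atomic arrival order, whereas this sequential model happens to be stable.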
After the tiled-transpose fix, `planner` was the last superlinear pass:
one workgroup per Pippenger window (~17), O(pairs) per window, 16 ms at
2^20 (8.5%) scaling ~2.8x/doubling. Split it into two passes:
- ba_planner_v2_offsets (pass A): the per-window scan, with Phase A/B/D
unchanged, but Phase C now writes only the per-bucket carry prefix plus
new_counts/new_offsets, not the O(pairs) plans. O(BW), flat in n, so one
workgroup per window is fine.
- ba_planner_v2_emit (pass B): dispatched (ceil(BW/256), numWindows),
i.e. one workgroup per (bucket-group, window); emits the chunk / scatter
/ carry plans in parallel and cooperatively self-pads the plan tails.
The pair prefix is derived as new_offsets - w*wstride - carry_off.
The per-bucket carry-prefix array temporally reuses valIdxBuf (dead once
convActive has consumed it, strictly before the planner runs), so there
is no new allocation and no atomics.
Measured on an M4, headless: planner at 2^20 drops 16 ms -> 6 ms and
scales 1.89x/doubling; total 190 ms -> 180 ms. Combined with the tiled
transpose, 2^20 is 287 ms -> 180 ms and the per-doubling ratio is 2.27x
-> 1.76x: O(N/log N) restored. WebGPU/WASM/noble cross-checks agree at
every size logN 14-20.
## Summary
- …`MsmV2`, add per-window chunk sums, and return the combined window sums.
- …`MsmV2` prepare by materializing level-0 CSR on the host, and keeps desktop/non-Android on the existing full-size path.
## Validation
- `yarn generate:wgsl`
- `./node_modules/.bin/prettier -w src/msm_webgpu/msm_v2.ts src/msm_webgpu/bridge/main.ts src/msm_webgpu/cuzk/shader_manager.ts dev/msm-webgpu/main.ts`
- `./node_modules/.bin/tsgo -p dev/msm-webgpu/tsconfig.json --noEmit`
- `2^18`, `scalar_seed=2`, streamed `2^17` chunks: WebGPU completed and cross-checked against WASM. GPU compute reported `403.4 ms`; WASM reported `392.4 ms`; prior full-size S25 path lost the device twice.
## Known local check limitation
- `./node_modules/.bin/tsgo -p tsconfig.json --noEmit` still fails before reaching this change because generated cbind / aztec-wsdb files are absent in this checkout.

Created by claudebox · group: `slackbot`