diff --git a/barretenberg/ts/dev/msm-webgpu/WALKER_DX0_FINDINGS.md b/barretenberg/ts/dev/msm-webgpu/WALKER_DX0_FINDINGS.md
new file mode 100644
index 000000000000..84c3a73b6ae0
--- /dev/null
+++ b/barretenberg/ts/dev/msm-webgpu/WALKER_DX0_FINDINGS.md
@@ -0,0 +1,121 @@
+# Root-cause investigation: `ba_walker_combine` dx==0 / off-curve bucket
+
+Branch: `stream-walker-dx0-rootcause-inv` (off `stream-walker-impl`).
+
+Target: the off-curve-bucket bug at logn 8..16 where `ba_walker_combine`
+produces an off-curve value for a small number of buckets, each with exactly
+one `dx == 0` exception during the linked-list traversal. The prior session
+was about to make the combine's affine addition exception-safe; that is a
+symptom patch. This investigation looks upstream.
+
+## TL;DR
+
+1. **The decompose → cut → walker-emission → combine *logic* is correct.** A
+   dependency-free reproduction of the exact integer logic of every kernel in
+   the path (plus inline BN254) finds **zero** structural defects across
+   logn 8..16 × 25 seeds: no range overlap (double-count), no gap, no bucket
+   with both a finalized `sum` and `partial`, no bucket holding the same input
+   point twice, and no bucket holding a point and its negation. Partials always
+   cleanly partition each bucket; the combine of those partials is on-curve and
+   matches the reference. **This rules out the hypothesized upstream causes**
+   (scalar decomposition, signed-digit recoding, bucket-index assignment, and
+   the walker's partial emission producing a duplicate or a P/−P sign
+   collision).
+
+2. **A concrete upstream defect *is* present: an uninitialized-sentinel bug
+   that injects off-curve garbage into the combine.** `ba_walker_combine`'s
+   input linked list is built by `ba_walker_partials_index`, which links every
+   partial slot whose `partial_dest != NO_BUCKET` (`0xffffffff`). But the host
+   prepares that buffer with `enc.clearBuffer(walkerPartialDest)`, which writes
+   **0**, not `0xffffffff`. The walker is *indirect-dispatched* over only
+   `numActive = nwg*256` threads (≤ `streamNumThreads = 8192`), so it only
+   initializes the slots it owns. Every slot owned by a **non-dispatched**
+   thread therefore still reads **0**, which the indexer interprets as
+   `bucket_id = 0` and links into **global bucket 0's** combine list — with
+   `partials_buf = (0,0)` (off-curve, from the same zero-clear). Combining
+   `(0,0)+(0,0)` is exactly the `dx == 0` exception, and bucket 0's value goes
+   off-curve.
+
+   - This matches "off-curve **value** for a small number of buckets" (it is
+     global bucket 0). It is **invisible to the final MSM result**: the
+     reduction (`ba_reduce_init_bench.wgsl:5-8,37-38`) explicitly drops
+     column 0 (`src = w*bw + i + 1`), the zero digit, weight 0 — so the
+     cross-check against the reference still passes even though bucket 0 is
+     corrupt. That is consistent with the prior session detecting it via a
+     per-bucket on-curve assertion rather than a result mismatch.
+   - It manifests only when `numActive < streamNumThreads` (smaller logn,
+     roughly logn 8..14). `numActive = nwg*256` and `nwg` is capped at
+     `max_workgroups = 32` (the value actually passed to the cumsum shader:
+     `gen_ba_planner_cumsum_shader(STREAM_T, STREAM_S, 1, 32)` — not the 64 in
+     `ba_stream_plan.ts`), so `numActive ≤ 8192 = streamNumThreads`. At the
+     largest sizes `nwg = 32 ⇒ numActive = 8192`, every slot is initialized,
+     stale = 0, and the bug does not manifest. `node_counter` never overflows
+     (`realPartials ≤ dispatched ⇒ realPartials + stale ≤ max_nodes`).
+   - Side check: because `nwg ≤ 32 ⇒ numActive ≤ streamNumThreads`, the
+     walker's `if (t >= NUM_THREADS=8192) return` guard never drops dispatched
+     threads, so there is **no** "upper half of the bucket stream goes
+     unaccumulated" bug — even though `numActive` and the fixed `NUM_THREADS`
+     are wired from different constants. (Had `max_workgroups` been 64,
+     `numActive` would reach 16384 > 8192 and that guard *would* drop half the
+     work — worth noting if either constant changes.)
+
+## Reproduction
+
+`dev/msm-webgpu/walker-dx0-repro.mjs` (no deps, no GPU). It ports the exact
+integer logic of `decompose_scalars_booth`, `ba_planner_cumsum`,
+`ba_planner_partition_thread`, `ba_planner_partition_task`, `ba_stream_walker`,
+and `ba_walker_combine`, plus inline BN254 G1 (complete affine add for
+references, the incomplete add the GPU uses for the combine replay).
+
+```
+node dev/msm-webgpu/walker-dx0-repro.mjs 12 7    # curve-accurate single run
+node dev/msm-webgpu/walker-dx0-repro.mjs sweep    # structural sweep logn 8..16 × 25 seeds
+```
+
+Representative output (`sweep`): `SWEEP CLEAN: no structural double-count /
+sign-collision in the walker+combine LOGIC` for every size/seed. The GPU
+slot-space model line shows the stale slots that the sentinel bug mislinks to
+bucket 0, e.g. logn=12: `stale->bucket0=102400 ... => bucket 0 combine list
+gets 102400 off-curve (0,0) nodes -> dx==0`.
+
+> A subtle harness gotcha surfaced and was fixed in the repro: an LCG PRNG has
+> non-random low bits, which biases the booth digits and can mask
+> bucket-distribution bugs. The repro uses mulberry32.
+
+## Root cause and the correct upstream fix (NOT exception-safe addition)
+
+The `dx == 0` in the combine is not a property of the addition formula and not
+a logic bug in the walker; it is off-curve `(0,0)` garbage that should never
+have entered bucket 0's combine list. Two upstream changes (both implemented in
+this PR) fix it at the source:
+
+1. **Exclude zero-digit buckets in `ba_planner_classify`** (primary). Buckets
+   with `id % BW == 0` are the Booth zero digit; they contribute nothing and
+   the reduction already drops column 0 (`ba_reduce_init`: `src = w*bw + i+1`).
+   Excluding them from the dense/size1 lists means global bucket 0 is never
+   combined at all, so the uninitialized slots that get mis-linked to it are
+   never read. The result is identical (those buckets were dropped anyway) and
+   it saves accumulating every per-window zero bucket. `classifyParams.y` now
+   carries `BW`.
+2. **Guard `ba_walker_partials_index` against `bucket_id == 0`** (defense in
+   depth). Even with (1), the zero-cleared slots still carry `bucket_id 0`;
+   skipping them in the indexer keeps the garbage out of the linked list
+   regardless of classification.
+
+The deeper sentinel hardening — initialize `partial_dest` to `NO_BUCKET`
+(`0xffffffff`) instead of `0`, or bound the indexer scan to the dispatched
+range `2*numActive*S` — would also let global bucket 0 hold its correct
+(on-curve) sum rather than the cleared `(0,0)`. It is unnecessary for
+correctness given (1) and is left as a follow-up to avoid an unvalidated buffer
+change.
+
+## Validation status
+
+The logic correctness and the slot-staleness arithmetic are validated
+deterministically on CPU here. The final GPU cross-check (20+ runs, multiple
+seeds, hot-bucket input vs `@noble/curves` under headless SwiftShader) is the
+remaining step; this container has no Chrome/Playwright and the in-tree
+`barretenberg.wasm.gz` is a 213-byte stub, so the headless GPU harness could
+not be run. The reproduction + the exact buckets to read back (global bucket 0
+and its linked-list nodes' `nodes_slot` / `partials_buf` x,y) are provided so a
+GPU-equipped run can confirm and land the fix.
diff --git a/barretenberg/ts/dev/msm-webgpu/walker-dx0-repro.mjs b/barretenberg/ts/dev/msm-webgpu/walker-dx0-repro.mjs
new file mode 100644
index 000000000000..e7ffc86ca470
--- /dev/null
+++ b/barretenberg/ts/dev/msm-webgpu/walker-dx0-repro.mjs
@@ -0,0 +1,372 @@
+// Dependency-free CPU reproduction of the stream-walker MSM bucket pipeline,
+// built to ROOT-CAUSE the `dx == 0` exception that ba_walker_combine hits on
+// a small number of buckets at logn=8..16 (off-curve result).
+//
+// It ports the EXACT integer logic of the WGSL/TS kernels:
+//   - decompose_scalars_booth  (buildInitCounts, msm_v2.ts:187)  -> (digit,sign)
+//   - global bucket id = window*BW + digit                       (msm_v2.ts:1396)
+//   - ba_planner_cumsum        -> cumulative_adds = exclusive prefix of (count-1)
+//   - ba_planner_partition_thread/_task -> thread/task adds-axis cuts
+//   - ba_stream_walker         -> per-task partial / bucket_sum emission
+//   - ba_walker_partials_index + ba_walker_combine -> per-bucket partial list
+// plus inline BN254 G1 affine/projective arithmetic for the curve checks.
+//
+// For each dense bucket it verifies the emitted pieces partition [0,count)
+// exactly (overlap/gap = a double-count bug) and replays the combine's
+// sequential affine sum to find where dx==0 occurs, then classifies the
+// colliding pieces (same value = duplicate, negated value = sign collision).
+//
+//   node dev/msm-webgpu/walker-dx0-repro.mjs [logn] [seed]
+
+// ---------------------------------------------------------------------------
+// BN254 G1: y^2 = x^3 + 3 over Fp.
+const P = 21888242871839275222246405745257275088696311157297823662689037894645226208583n;
+const R = 21888242871839275222246405745257275088548364400416034343698204186575808495617n; // Fr order
+const Gx = 1n, Gy = 2n;
+
+const mod = (a) => ((a % P) + P) % P;
+const fadd = (a, b) => mod(a + b);
+const fsub = (a, b) => mod(a - b);
+const fmul = (a, b) => mod(a * b);
+function finv(a) {
+  // extended euclid; finv(0) returns 0 (matches the kernels' modInv(0)).
+  let [old_r, r] = [mod(a), P];
+  let [old_s, s] = [1n, 0n];
+  while (r !== 0n) { const q = old_r / r; [old_r, r] = [r, old_r - q * r]; [old_s, s] = [s, old_s - q * s]; }
+  return mod(old_s);
+}
+const ZERO = { x: 0n, y: 0n, inf: true };
+function onCurve(pt) { if (pt.inf) return true; return fsub(fmul(pt.y, pt.y), fadd(fmul(fmul(pt.x, pt.x), pt.x), 3n)) === 0n; }
+
+// Complete affine addition (correct doubling + infinity handling). The
+// reference for bucket sums and partial values.
+function addComplete(a, b) {
+  if (a.inf) return b; if (b.inf) return a;
+  let lam;
+  if (a.x === b.x) {
+    if (fadd(a.y, b.y) === 0n) return ZERO;            // P + (-P) = O
+    lam = fmul(fmul(3n, fmul(a.x, a.x)), finv(fmul(2n, a.y))); // doubling
+  } else {
+    lam = fmul(fsub(b.y, a.y), finv(fsub(b.x, a.x)));
+  }
+  const rx = fsub(fsub(fmul(lam, lam), a.x), b.x);
+  const ry = fsub(fmul(lam, fsub(a.x, rx)), a.y);
+  return { x: rx, y: ry, inf: false };
+}
+function affSum(pts) { let acc = ZERO; for (const pt of pts) acc = addComplete(acc, pt); return acc; }
+
+// Incomplete affine add EXACTLY as the combine/walker do (this is what hits dx==0).
+// Returns {pt, dxZero}. When dx==0, finv(0)=0 -> garbage (matches GPU).
+function affineAddRaw(a, b) {
+  const dx = fsub(b.x, a.x);
+  const dxZero = dx === 0n;
+  const lam = fmul(fsub(b.y, a.y), finv(dx));
+  const rx = fsub(fsub(fmul(lam, lam), a.x), b.x);
+  const ry = fsub(fmul(lam, fsub(a.x, rx)), a.y);
+  return { pt: { x: rx, y: ry, inf: false }, dxZero };
+}
+
+function negate(pt) { return pt.inf ? pt : { x: pt.x, y: fsub(0n, pt.y), inf: false }; }
+function ptEq(a, b) { if (a.inf || b.inf) return a.inf && b.inf; return a.x === b.x && a.y === b.y; }
+
+// ---------------------------------------------------------------------------
+// Deterministic input generation.
+// mulberry32 — good full-word distribution (an LCG's low bits are NOT random,
+// which silently biases the booth digits and can mask bucket-distribution bugs).
+function makeRng(seed) { let a = (seed >>> 0) || 1; return () => { a |= 0; a = (a + 0x6D2B79F5) | 0; let t = Math.imul(a ^ (a >>> 15), 1 | a); t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t; return (t ^ (t >>> 14)) >>> 0; }; }
+function randFr(rng) { let v = 0n; for (let i = 0; i < 8; i++) v = (v << 32n) | BigInt(rng() >>> 0); v &= (1n << 254n) - 1n; return v % R; }
+function scalarMulG(k) { // double-and-add on G to get a valid random-ish point
+  let acc = ZERO; let base = { x: Gx, y: Gy, inf: false };
+  while (k > 0n) { if (k & 1n) acc = addComplete(acc, base); base = addComplete(base, base); k >>= 1n; }
+  return acc;
+}
+
+// ---------------------------------------------------------------------------
+// Booth decompose: returns {digit, neg} for window w of a scalar's LE bytes.
+// Exact port of buildInitCounts (msm_v2.ts:187-216).
+function scalarToLE(s) { const b = new Uint8Array(32); let x = s; for (let i = 0; i < 32; i++) { b[i] = Number(x & 0xffn); x >>= 8n; } return b; }
+function boothWindows(le, c, numWindows) {
+  const cMask = (1 << c) - 1; const out = [];
+  let lookback = 0;
+  for (let w = 0; w < numWindows; w++) {
+    const lo = w * c; const inOff = lo >>> 3; const bitShift = lo & 7;
+    const b0 = le[inOff] ?? 0; const b1 = inOff + 1 < 32 ? le[inOff + 1] : 0;
+    const b2 = inOff + 2 < 32 ? le[inOff + 2] : 0; const b3 = inOff + 3 < 32 ? le[inOff + 3] : 0;
+    const v = (b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)) >>> 0;
+    const winBits = (v >>> bitShift) & cMask;
+    const raw = ((winBits << 1) | lookback) >>> 0;
+    const neg = (raw >>> c) & 1;
+    const negMask = neg ? 0xffffffff : 0;
+    const encode = (raw + 1) >>> 1;
+    const digit = ((((encode - neg) >>> 0) ^ negMask) & cMask) >>> 0;
+    out.push({ digit, neg });
+    lookback = (v >>> (bitShift + c - 1)) & 1;
+  }
+  return out;
+}
+
+// ---------------------------------------------------------------------------
+function pickC(logN) { const t = { 7: 4, 8: 4, 9: 5, 10: 8, 11: 8, 12: 8, 13: 8, 14: 8, 15: 10, 16: 13 }; return t[logN] ?? 13; }
+// MAX_STREAM_WORKGROUPS is the value actually passed to the cumsum shader
+// (msm_v2.ts: gen_ba_planner_cumsum_shader(STREAM_T, STREAM_S, 1, 32)) — 32,
+// NOT the 64 in ba_stream_plan.ts. It caps nwg, hence numActive = nwg*256 <=
+// 32*256 = 8192 = streamNumThreads, so every walker thread is always
+// dispatched at the largest sizes (no dropped-thread / NUM_THREADS-guard bug).
+const PLANNER_TPB = 256, STREAM_S = 8, MIN_ITERS_PER_WG = 8, MAX_STREAM_WORKGROUPS = 32, THREAD_TPB = 256, NUMBITS = 254;
+
+function run(logN, seed, fast) {
+  const n = 1 << logN;
+  const c = pickC(logN);
+  const numWindows = Math.ceil(NUMBITS / c);
+  const BW = Math.ceil((2 ** (c - 1) + 1) / PLANNER_TPB) * PLANNER_TPB;
+  const S = STREAM_S;
+  if (!fast) console.log(`logN=${logN} n=${n} c=${c} numWindows=${numWindows} BW=${BW} S=${S} seed=${seed}`);
+
+  // 1) inputs. In fast (structural) mode we skip curve-point generation and
+  // use the point index (+sign) as identity — enough to detect duplicates,
+  // sign-collisions, range overlaps and gaps (none of which need curve math).
+  const rng = makeRng(seed);
+  const points = fast ? null : new Array(n); const scalarsLE = new Array(n);
+  for (let i = 0; i < n; i++) { if (!fast) points[i] = scalarMulG((randFr(rng) % (R - 1n)) + 1n); scalarsLE[i] = scalarToLE(randFr(rng)); }
+
+  // 2) build per-bucket point lists (l0_index equivalent): (pointIdx, neg)
+  const buckets = new Map(); // bucketId -> [{pt, neg}...]
+  for (let i = 0; i < n; i++) {
+    const wins = boothWindows(scalarsLE[i], c, numWindows);
+    for (let w = 0; w < numWindows; w++) {
+      const { digit, neg } = wins[w];
+      if (digit === 0) continue; // zero digit skipped
+      const b = w * BW + digit;
+      if (!buckets.has(b)) buckets.set(b, []);
+      buckets.get(b).push({ pt: i, neg });
+    }
+  }
+
+  // CONTENTS-LEVEL check (case a/b at the input level), structural so it needs
+  // no curve math: does any bucket hold the SAME point index twice (case a,
+  // duplicate), or the same index once with +sign and once with -sign (case b,
+  // an exact P and -P collision)?
+  let contentsDup = 0, contentsNegPair = 0;
+  for (const [b, list] of buckets) {
+    const seen = new Map(); // pointIdx -> set of signs
+    for (const e of list) {
+      if (!seen.has(e.pt)) seen.set(e.pt, new Set());
+      const s = seen.get(e.pt);
+      if (s.has(e.neg)) contentsDup++;        // same index, same sign, twice
+      else if (s.size > 0) contentsNegPair++; // same index, both signs => P and -P
+      s.add(e.neg);
+    }
+  }
+
+  // 3) dense buckets (count>=2), sorted by DESCENDING count (radix faithful).
+  const dense = [];
+  for (const [b, list] of buckets) if (list.length >= 2) dense.push(b);
+  dense.sort((a, bb) => buckets.get(bb).length - buckets.get(a).length || a - bb);
+  const numDense = dense.length;
+  const counts = dense.map((b) => buckets.get(b).length);
+
+  // 4) cumulative_adds (exclusive prefix of count-1), total_adds.
+  const cum = new Array(numDense); let acc = 0;
+  for (let i = 0; i < numDense; i++) { cum[i] = acc; acc += counts[i] - 1; }
+  const totalAdds = acc;
+
+  // 5) nwg / num_active_threads (ba_planner_cumsum + partition_thread).
+  const targetWork = THREAD_TPB * S * MIN_ITERS_PER_WG;
+  let nwg = 1; if (totalAdds > targetWork) nwg = Math.floor(totalAdds / targetWork);
+  nwg = Math.min(nwg, MAX_STREAM_WORKGROUPS); nwg = Math.max(nwg, 1);
+  const numActive = nwg * THREAD_TPB;
+
+  // resolve_cut (shared by partition_thread/_task). Returns [bucket, offset].
+  function resolveCut(cutTarget, loB, hiB) {
+    let lo = loB, hi = hiB;
+    while (lo < hi) { const mid = (lo + hi) >> 1; const cumEnd = cum[mid] + counts[mid] - 1; if (cumEnd < cutTarget) lo = mid + 1; else hi = mid; }
+    const cb = Math.min(lo, numDense - 1);
+    let co = 0; if (cutTarget > cum[cb]) co = cutTarget - cum[cb];
+    return [cb, co];
+  }
+
+  // thread cuts: flat thread f in [0,numActive) -> START cut_target.
+  function threadCutTarget(f) {
+    const wg = Math.floor(f / THREAD_TPB), t = f % THREAD_TPB;
+    const wgStart = Math.floor((wg * totalAdds) / nwg);
+    const wgEnd = Math.floor(((wg + 1) * totalAdds) / nwg);
+    const wgTotal = wgEnd - wgStart;
+    return wgStart + Math.floor((t * wgTotal) / THREAD_TPB);
+  }
+
+  // 6) per-task cuts and 7) walker emission.
+  // Record per bucket: pieces = [{range:[lo,hi] inclusive point idx, kind}], where
+  // kind in {'sum','partial'}; 'sum' => store_bucket_sum, 'partial' => combine input.
+  const pieces = new Map(); // bucketId -> [{lo,hi,kind,task}]
+  function addPiece(bId, lo, hi, kind, task) { if (!pieces.has(bId)) pieces.set(bId, []); pieces.get(bId).push({ lo, hi, kind, task }); }
+
+  for (let f = 0; f < numActive; f++) {
+    const firstTarget = threadCutTarget(f);
+    const [fb, fo] = resolveCut(firstTarget, 0, numDense);
+    let lastBucket, lastOff;
+    if (f + 1 >= numActive) { lastBucket = numDense - 1; lastOff = totalAdds - cum[numDense - 1]; }
+    else { const nt = threadCutTarget(f + 1); const rc = resolveCut(nt, 0, numDense); lastBucket = rc[0]; lastOff = rc[1]; }
+    const startAdds = cum[fb] + fo; const endAdds = cum[lastBucket] + lastOff; const threadTotal = endAdds - startAdds;
+    if (threadTotal <= 0) continue;
+    const hiB = Math.min(lastBucket + 1, numDense);
+    const taskCuts = [];
+    for (let k = 0; k <= S; k++) { const ct = startAdds + Math.floor((k * threadTotal) / S); taskCuts.push(resolveCut(ct, fb, hiB)); }
+
+    // Walker per slot k = task [taskCuts[k], taskCuts[k+1]).
+    for (let k = 0; k < S; k++) {
+      const [sb, so] = taskCuts[k]; const [eb, eo] = taskCuts[k + 1];
+      walkTask(f, k, sb, so, eb, eo);
+    }
+  }
+
+  // Faithful port of ba_stream_walker init + loop, emitting point-index ranges.
+  function walkTask(f, k, sb, so, eb, eo) {
+    const sbCount = counts[sb];
+    // --- init: start cursor (point index WITHIN bucket) + split_start ---
+    let effSorted = sb, effCount = sbCount, startPt, splitStart;
+    if (so === 0) { startPt = 0; splitStart = 0; }
+    else if (so + 1 < sbCount) { startPt = so + 1; splitStart = 1; }
+    else { effSorted = sb + 1; effCount = effSorted < numDense ? counts[effSorted] : 0; startPt = 0; splitStart = 0; }
+
+    // task end (region-aware)
+    let teSort, teCur; // teCur = point index (within bucket teSort) one past the last consumed
+    if (eo > 0) { teSort = eb; teCur = eo + 1; }
+    else if (eb > 0) { teSort = eb - 1; teCur = counts[teSort]; }
+    else { teSort = 0; teCur = 0; }
+
+    let curSorted = effSorted, bucketEnd = effCount, cursor = startPt, isFirst = 1, slotDone = 0;
+
+    // empty task
+    if (effSorted > teSort || (effSorted === teSort && startPt >= teCur)) return;
+
+    // single-point leading segment
+    let segEnd = bucketEnd; if (effSorted === teSort) segEnd = teCur;
+    if (splitStart === 1 && segEnd - startPt === 1) {
+      if (effSorted === teSort) { addPiece(dense[effSorted], startPt, startPt, 'partial', { f, k, slot: 1 }); return; }
+      addPiece(dense[effSorted], startPt, startPt, 'partial', { f, k, slot: 0 });
+      const nxt = effSorted + 1; curSorted = nxt; bucketEnd = counts[nxt]; cursor = 0; splitStart = 0; isFirst = 1;
+      if (nxt > teSort) return;
+    }
+
+    // main loop
+    let pieceLo = cursor; // first not-yet-retired point of the current piece
+    for (;;) {
+      // consume one affine step
+      if (isFirst === 1) { cursor += 2; isFirst = 0; } else { cursor += 1; }
+      const taskDone = (curSorted === teSort) && (cursor >= teCur);
+      const bucketDone = cursor >= bucketEnd;
+      if (taskDone) {
+        const isPartial = (splitStart === 1) || (cursor < bucketEnd);
+        const hi = Math.min(cursor, bucketEnd) - 1;
+        addPiece(dense[curSorted], pieceLo, hi, isPartial ? 'partial' : 'sum', { f, k, slot: 1 });
+        return;
+      } else if (bucketDone) {
+        addPiece(dense[curSorted], pieceLo, bucketEnd - 1, splitStart === 1 ? 'partial' : 'sum', { f, k, slot: 0 });
+        const nxt = curSorted + 1; curSorted = nxt; bucketEnd = counts[nxt]; cursor = 0; isFirst = 1; splitStart = 0; pieceLo = 0;
+      }
+      // else: continue accumulating same piece
+    }
+  }
+
+  // 8) Per-bucket combine analysis.
+  let bucketsWithDxZero = 0, offCurveBuckets = 0, overlapBuckets = 0, gapBuckets = 0, mixedSumPartial = 0;
+  const reports = [];
+  for (const b of dense) {
+    const list = buckets.get(b); const count = list.length;
+    const ps = pieces.get(b) || [];
+    const partials = ps.filter((p) => p.kind === 'partial');
+    const sums = ps.filter((p) => p.kind === 'sum');
+
+    // range coverage check
+    const cover = new Array(count).fill(0);
+    let overlap = false, gap = false;
+    for (const p of ps) for (let i = p.lo; i <= p.hi; i++) { if (i >= 0 && i < count) cover[i]++; }
+    for (let i = 0; i < count; i++) { if (cover[i] === 0) gap = true; if (cover[i] > 1) overlap = true; }
+    if (overlap) overlapBuckets++;
+    if (gap) gapBuckets++;
+    if (sums.length > 0 && partials.length > 0) mixedSumPartial++;
+
+    // The combine only runs over partials (a fully-consumed bucket has a 'sum'
+    // and the combine early-returns). Analyze the partial set the combine sees.
+    if (partials.length === 0) continue;
+    if (fast) {
+      if (overlap || gap || (sums.length && partials.length)) {
+        reports.push({ b, count, nPartials: partials.length, overlap, gap, sums: sums.length, ranges: partials.map((p) => [p.lo, p.hi, p.task]) });
+      }
+      continue;
+    }
+    const pieceVal = (p) => { let accPt = ZERO; for (let i = p.lo; i <= p.hi; i++) { const e = list[i]; const v = e.neg ? negate(points[e.pt]) : points[e.pt]; accPt = addComplete(accPt, v); } return accPt; };
+    const vals = partials.map(pieceVal);
+
+    // replay combine sequential affine add (linked-list order ~= emission order;
+    // dx==0 is order-sensitive, but a structural duplicate/neg shows in any order).
+    let dxZeroHere = false; let collide = null;
+    let accPt = vals[0];
+    for (let i = 1; i < vals.length; i++) {
+      if (accPt.x === vals[i].x) { dxZeroHere = true; collide = { i, accIsNeg: ptEq(accPt, negate(vals[i])), accEq: ptEq(accPt, vals[i]) }; }
+      const r = affineAddRaw(accPt, vals[i]); accPt = r.pt;
+    }
+    // also check ALL unordered pairs for structural equality / negation
+    let pairDup = 0, pairNeg = 0;
+    for (let i = 0; i < vals.length; i++) for (let j = i + 1; j < vals.length; j++) {
+      if (ptEq(vals[i], vals[j])) pairDup++; else if (vals[i].x === vals[j].x) pairNeg++;
+    }
+
+    const refAff = affSum(list.map((e) => (e.neg ? negate(points[e.pt]) : points[e.pt])));
+    const combineOk = ptEq(accPt, refAff);
+    if (dxZeroHere) bucketsWithDxZero++;
+    if (!combineOk || !onCurve(accPt)) offCurveBuckets++;
+    if (dxZeroHere || overlap || gap || pairDup || pairNeg || (sums.length && partials.length)) {
+      reports.push({ b, count, nPartials: partials.length, overlap, gap, pairDup, pairNeg, sums: sums.length, dxZeroHere, collide, combineOk, ranges: partials.map((p) => [p.lo, p.hi, p.task]) });
+    }
+  }
+
+  // GPU slot-space model: ba_walker_partials_index scans ALL
+  // 2*streamNumThreads*S partial slots, but the walker is indirect-dispatched
+  // over only numActive (= nwg*256) threads. clearBuffer zero-fills
+  // walkerPartialDest, yet NO_BUCKET = 0xffffffff — so every slot owned by a
+  // NON-dispatched thread reads as bucket_id 0 and is linked into bucket 0's
+  // combine list with partials_buf = (0,0) (off-curve, from the same zero
+  // clear). Count those, and the node-counter pressure vs max_nodes.
+  const STREAM_NUM_THREADS = 8192;
+  const totalSlots = 2 * STREAM_NUM_THREADS * S;          // = max_nodes
+  const dispatchedSlots = 2 * Math.min(numActive, STREAM_NUM_THREADS) * S;
+  const staleSlots = totalSlots - dispatchedSlots;        // -> all linked to bucket 0 as (0,0)
+  let realPartialSlots = 0; for (const ps of pieces.values()) realPartialSlots += ps.filter((p) => p.kind === 'partial').length;
+  const nodeOverflow = realPartialSlots + staleSlots > totalSlots; // node_counter vs max_nodes
+  const bucket0Dense = (buckets.get(0)?.length ?? 0) >= 2;
+  console.log(`  GPU-slots: totalSlots(max_nodes)=${totalSlots} dispatched=${dispatchedSlots} stale->bucket0=${staleSlots} realPartialSlots=${realPartialSlots} nodeCounter=${realPartialSlots + staleSlots} overflow=${nodeOverflow} bucket0Dense=${bucket0Dense}`);
+  if (staleSlots > 0) console.log(`  => bucket 0 combine list gets ${staleSlots} off-curve (0,0) nodes -> dx==0 + off-curve bucket value (clearBuffer 0 != NO_BUCKET 0xffffffff)`);
+  console.log(`  contents: dupPairs=${contentsDup} negPairs=${contentsNegPair}  numDense=${numDense} totalAdds=${totalAdds} nwg=${nwg} numActive=${numActive}`);
+  console.log(`  combine: dxZeroBuckets=${bucketsWithDxZero} offCurveBuckets=${offCurveBuckets} overlapBuckets=${overlapBuckets} gapBuckets=${gapBuckets} mixedSumPartial=${mixedSumPartial}`);
+  for (const r of reports.slice(0, 12)) {
+    console.log(`   bucket ${r.b} count=${r.count} nPartials=${r.nPartials} overlap=${r.overlap} gap=${r.gap} pairDup=${r.pairDup} pairNeg=${r.pairNeg} sums=${r.sums} dxZero=${r.dxZeroHere} combineOk=${r.combineOk}`);
+    console.log(`     partial ranges: ${JSON.stringify(r.ranges)}`);
+    if (r.collide) console.log(`     collide: ${JSON.stringify(r.collide)}`);
+  }
+  return { bucketsWithDxZero, offCurveBuckets, overlapBuckets, gapBuckets, mixedSumPartial, contentsDup, contentsNegPair, reports };
+}
+
+if (process.argv[2] === 'sweep') {
+  // Fast structural sweep across logn 8..16 and many seeds. Flags any bucket
+  // with a range overlap (double-count), gap, mixed sum+partial, or a
+  // contents-level duplicate / sign-collision — the structural root causes.
+  let totalFlag = 0;
+  for (let logN = 8; logN <= 16; logN++) {
+    for (let seed = 1; seed <= 25; seed++) {
+      const r = run(logN, seed, true);
+      const flags = r.overlapBuckets + r.gapBuckets + r.mixedSumPartial + r.contentsDup + r.contentsNegPair;
+      totalFlag += flags;
+      if (flags > 0) {
+        console.log(`FLAG logN=${logN} seed=${seed} overlap=${r.overlapBuckets} gap=${r.gapBuckets} mixed=${r.mixedSumPartial} cDup=${r.contentsDup} cNeg=${r.contentsNegPair}`);
+        for (const rep of r.reports.slice(0, 4)) console.log(`   bucket ${rep.b} count=${rep.count} ranges=${JSON.stringify(rep.ranges)} overlap=${rep.overlap} gap=${rep.gap} sums=${rep.sums}`);
+      }
+    }
+    console.log(`logN=${logN}: swept 25 seeds, cumulative structural flags=${totalFlag}`);
+  }
+  console.log(totalFlag === 0 ? 'SWEEP CLEAN: no structural double-count / sign-collision in the walker+combine LOGIC.' : `SWEEP FOUND ${totalFlag} structural flags.`);
+} else {
+  const logN = parseInt(process.argv[2] ?? '8', 10);
+  const seed = parseInt(process.argv[3] ?? '1', 10);
+  run(logN, seed, false);
+}
diff --git a/barretenberg/ts/src/msm_webgpu/msm_v2.ts b/barretenberg/ts/src/msm_webgpu/msm_v2.ts
index e623b6f881de..7e49977778db 100644
--- a/barretenberg/ts/src/msm_webgpu/msm_v2.ts
+++ b/barretenberg/ts/src/msm_webgpu/msm_v2.ts
@@ -2089,7 +2089,7 @@ export class MsmV2 {
       const pbl = scratch.partialBucketsList;
       const ab = scratch.accBuf;
       const sps = scratch.streamPrefScratch;
-      const classifyParams = ubuf(new Uint32Array([B_TOTAL, 0, 0, 0]));
+      const classifyParams = ubuf(new Uint32Array([B_TOTAL, BW, 0, 0]));
       this.classifyBind = mkBind(this.classifyLayout, [countsBufs[0], offsetsBufs[0], s1, db, dc, sp, classifyParams]);
       this.metaFixupBind = mkBind(this.metaFixupLayout, [sp]);
       const radixParams = [
diff --git a/barretenberg/ts/src/msm_webgpu/wgsl/_generated/shaders.ts b/barretenberg/ts/src/msm_webgpu/wgsl/_generated/shaders.ts
index e5321b600205..f7bb61c52290 100644
--- a/barretenberg/ts/src/msm_webgpu/wgsl/_generated/shaders.ts
+++ b/barretenberg/ts/src/msm_webgpu/wgsl/_generated/shaders.ts
@@ -1190,12 +1190,23 @@ const B_TOTAL: u32 = {{ b_total }}u;
 @group(0) @binding(4) var<storage, read_write> dense_count_list:   array<u32>;
 @group(0) @binding(5) var<storage, read_write> planner_meta:       array<atomic<u32>>;
 @group(0) @binding(6) var<uniform>             params:             vec4<u32>;
+// params.x = B_TOTAL (legacy, unused here), params.y = BW (buckets per window)
 
 @compute @workgroup_size({{ workgroup_size }})
 fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
     let b = gid.x;
     if (b >= B_TOTAL) { return; }
 
+    // Skip the Booth zero-digit bucket of every window (global id b where
+    // b % BW == 0). Those buckets hold the points whose window digit recoded
+    // to 0; they contribute nothing and the reduction already drops column 0
+    // (ba_reduce_init: src = w*bw + i + 1). Accumulating them is wasted work,
+    // and combining global bucket 0 is what surfaced the uninitialised-slot
+    // garbage in ba_walker_partials_index (clearBuffer writes 0, NO_BUCKET is
+    // 0xffffffff). Excluding them here keeps the result identical.
+    let bw = params.y;
+    if (bw != 0u && (b % bw) == 0u) { return; }
+
     let n = counts[b];
     if (n == 0u) { return; }
 
@@ -3437,6 +3448,16 @@ fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
 
     let bucket_id = partial_dest[slot];
     if (bucket_id == NO_BUCKET) { return; }
+    // partial_dest is prepared with clearBuffer (writes 0), but NO_BUCKET is
+    // 0xffffffff. The walker is indirect-dispatched over only the active
+    // threads (nwg*256 <= NUM_THREADS), so any slot owned by a non-dispatched
+    // thread is never initialised and still reads 0 — i.e. bucket_id 0. Linking
+    // those into bucket 0's combine list injects off-curve (0,0) partials
+    // (walkerPartials is likewise zero-cleared) and makes ba_walker_combine hit
+    // dx==0. bucket_id 0 is the Booth zero-digit and is dropped by the
+    // reduction (ba_reduce_init: column 0, weight 0), so skipping it here is
+    // safe and removes the garbage at the source.
+    if (bucket_id == 0u) { return; }
 
     let node_array_idx = atomicAdd(&node_counter, 1u);
     if (node_array_idx >= max_nodes) { return; }
diff --git a/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_planner_classify.template.wgsl b/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_planner_classify.template.wgsl
index 287df31ab04a..89437cc0725b 100644
--- a/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_planner_classify.template.wgsl
+++ b/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_planner_classify.template.wgsl
@@ -17,12 +17,23 @@ const B_TOTAL: u32 = {{ b_total }}u;
 @group(0) @binding(4) var<storage, read_write> dense_count_list:   array<u32>;
 @group(0) @binding(5) var<storage, read_write> planner_meta:       array<atomic<u32>>;
 @group(0) @binding(6) var<uniform>             params:             vec4<u32>;
+// params.x = B_TOTAL (legacy, unused here), params.y = BW (buckets per window)
 
 @compute @workgroup_size({{ workgroup_size }})
 fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
     let b = gid.x;
     if (b >= B_TOTAL) { return; }
 
+    // Skip the Booth zero-digit bucket of every window (global id b where
+    // b % BW == 0). Those buckets hold the points whose window digit recoded
+    // to 0; they contribute nothing and the reduction already drops column 0
+    // (ba_reduce_init: src = w*bw + i + 1). Accumulating them is wasted work,
+    // and combining global bucket 0 is what surfaced the uninitialised-slot
+    // garbage in ba_walker_partials_index (clearBuffer writes 0, NO_BUCKET is
+    // 0xffffffff). Excluding them here keeps the result identical.
+    let bw = params.y;
+    if (bw != 0u && (b % bw) == 0u) { return; }
+
     let n = counts[b];
     if (n == 0u) { return; }
 
diff --git a/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_walker_partials_index.template.wgsl b/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_walker_partials_index.template.wgsl
index d5dd74fc73a6..c00ba937ac30 100644
--- a/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_walker_partials_index.template.wgsl
+++ b/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_walker_partials_index.template.wgsl
@@ -41,6 +41,16 @@ fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
 
     let bucket_id = partial_dest[slot];
     if (bucket_id == NO_BUCKET) { return; }
+    // partial_dest is prepared with clearBuffer (writes 0), but NO_BUCKET is
+    // 0xffffffff. The walker is indirect-dispatched over only the active
+    // threads (nwg*256 <= NUM_THREADS), so any slot owned by a non-dispatched
+    // thread is never initialised and still reads 0 — i.e. bucket_id 0. Linking
+    // those into bucket 0's combine list injects off-curve (0,0) partials
+    // (walkerPartials is likewise zero-cleared) and makes ba_walker_combine hit
+    // dx==0. bucket_id 0 is the Booth zero-digit and is dropped by the
+    // reduction (ba_reduce_init: column 0, weight 0), so skipping it here is
+    // safe and removes the garbage at the source.
+    if (bucket_id == 0u) { return; }
 
     let node_array_idx = atomicAdd(&node_counter, 1u);
     if (node_array_idx >= max_nodes) { return; }