diff --git a/barretenberg/ts/dev/msm-webgpu/WALKER_DX0_FINDINGS.md b/barretenberg/ts/dev/msm-webgpu/WALKER_DX0_FINDINGS.md new file mode 100644 index 000000000000..84c3a73b6ae0 --- /dev/null +++ b/barretenberg/ts/dev/msm-webgpu/WALKER_DX0_FINDINGS.md @@ -0,0 +1,121 @@ +# Root-cause investigation: `ba_walker_combine` dx==0 / off-curve bucket + +Branch: `stream-walker-dx0-rootcause-inv` (off `stream-walker-impl`). + +Target: the off-curve-bucket bug at logn 8..16 where `ba_walker_combine` +produces an off-curve value for a small number of buckets, each with exactly +one `dx == 0` exception during the linked-list traversal. The prior session +was about to make the combine's affine addition exception-safe; that is a +symptom patch. This investigation looks upstream. + +## TL;DR + +1. **The decompose → cut → walker-emission → combine *logic* is correct.** A + dependency-free reproduction of the exact integer logic of every kernel in + the path (plus inline BN254) finds **zero** structural defects across + logn 8..16 × 25 seeds: no range overlap (double-count), no gap, no bucket + with both a finalized `sum` and `partial`, no bucket holding the same input + point twice, and no bucket holding a point and its negation. Partials always + cleanly partition each bucket; the combine of those partials is on-curve and + matches the reference. **This rules out the hypothesized upstream causes** + (scalar decomposition, signed-digit recoding, bucket-index assignment, and + the walker's partial emission producing a duplicate or a P/−P sign + collision). + +2. **A concrete upstream defect *is* present: an uninitialized-sentinel bug + that injects off-curve garbage into the combine.** `ba_walker_combine`'s + input linked list is built by `ba_walker_partials_index`, which links every + partial slot whose `partial_dest != NO_BUCKET` (`0xffffffff`). But the host + prepares that buffer with `enc.clearBuffer(walkerPartialDest)`, which writes + **0**, not `0xffffffff`. The walker is *indirect-dispatched* over only + `numActive = nwg*256` threads (≤ `streamNumThreads = 8192`), so it only + initializes the slots it owns. Every slot owned by a **non-dispatched** + thread therefore still reads **0**, which the indexer interprets as + `bucket_id = 0` and links into **global bucket 0's** combine list — with + `partials_buf = (0,0)` (off-curve, from the same zero-clear). Combining + `(0,0)+(0,0)` is exactly the `dx == 0` exception, and bucket 0's value goes + off-curve. + + - This matches "off-curve **value** for a small number of buckets" (it is + global bucket 0). It is **invisible to the final MSM result**: the + reduction (`ba_reduce_init_bench.wgsl:5-8,37-38`) explicitly drops + column 0 (`src = w*bw + i + 1`), the zero digit, weight 0 — so the + cross-check against the reference still passes even though bucket 0 is + corrupt. That is consistent with the prior session detecting it via a + per-bucket on-curve assertion rather than a result mismatch. + - It manifests only when `numActive < streamNumThreads` (smaller logn, + roughly logn 8..14). `numActive = nwg*256` and `nwg` is capped at + `max_workgroups = 32` (the value actually passed to the cumsum shader: + `gen_ba_planner_cumsum_shader(STREAM_T, STREAM_S, 1, 32)` — not the 64 in + `ba_stream_plan.ts`), so `numActive ≤ 8192 = streamNumThreads`. At the + largest sizes `nwg = 32 ⇒ numActive = 8192`, every slot is initialized, + stale = 0, and the bug does not manifest. `node_counter` never overflows + (`realPartials ≤ dispatched ⇒ realPartials + stale ≤ max_nodes`). + - Side check: because `nwg ≤ 32 ⇒ numActive ≤ streamNumThreads`, the + walker's `if (t >= NUM_THREADS=8192) return` guard never drops dispatched + threads, so there is **no** "upper half of the bucket stream goes + unaccumulated" bug — even though `numActive` and the fixed `NUM_THREADS` + are wired from different constants. (Had `max_workgroups` been 64, + `numActive` would reach 16384 > 8192 and that guard *would* drop half the + work — worth noting if either constant changes.) + +## Reproduction + +`dev/msm-webgpu/walker-dx0-repro.mjs` (no deps, no GPU). It ports the exact +integer logic of `decompose_scalars_booth`, `ba_planner_cumsum`, +`ba_planner_partition_thread`, `ba_planner_partition_task`, `ba_stream_walker`, +and `ba_walker_combine`, plus inline BN254 G1 (complete affine add for +references, the incomplete add the GPU uses for the combine replay). + +``` +node dev/msm-webgpu/walker-dx0-repro.mjs 12 7 # curve-accurate single run +node dev/msm-webgpu/walker-dx0-repro.mjs sweep # structural sweep logn 8..16 × 25 seeds +``` + +Representative output (`sweep`): `SWEEP CLEAN: no structural double-count / +sign-collision in the walker+combine LOGIC` for every size/seed. The GPU +slot-space model line shows the stale slots that the sentinel bug mislinks to +bucket 0, e.g. logn=12: `stale->bucket0=102400 ... => bucket 0 combine list +gets 102400 off-curve (0,0) nodes -> dx==0`. + +> A subtle harness gotcha surfaced and was fixed in the repro: an LCG PRNG has +> non-random low bits, which biases the booth digits and can mask +> bucket-distribution bugs. The repro uses mulberry32. + +## Root cause and the correct upstream fix (NOT exception-safe addition) + +The `dx == 0` in the combine is not a property of the addition formula and not +a logic bug in the walker; it is off-curve `(0,0)` garbage that should never +have entered bucket 0's combine list. Two upstream changes (both implemented in +this PR) fix it at the source: + +1. **Exclude zero-digit buckets in `ba_planner_classify`** (primary). Buckets + with `id % BW == 0` are the Booth zero digit; they contribute nothing and + the reduction already drops column 0 (`ba_reduce_init`: `src = w*bw + i+1`). + Excluding them from the dense/size1 lists means global bucket 0 is never + combined at all, so the uninitialized slots that get mis-linked to it are + never read. The result is identical (those buckets were dropped anyway) and + it saves accumulating every per-window zero bucket. `classifyParams.y` now + carries `BW`. +2. **Guard `ba_walker_partials_index` against `bucket_id == 0`** (defense in + depth). Even with (1), the zero-cleared slots still carry `bucket_id 0`; + skipping them in the indexer keeps the garbage out of the linked list + regardless of classification. + +The deeper sentinel hardening — initialize `partial_dest` to `NO_BUCKET` +(`0xffffffff`) instead of `0`, or bound the indexer scan to the dispatched +range `2*numActive*S` — would also let global bucket 0 hold its correct +(on-curve) sum rather than the cleared `(0,0)`. It is unnecessary for +correctness given (1) and is left as a follow-up to avoid an unvalidated buffer +change. + +## Validation status + +The logic correctness and the slot-staleness arithmetic are validated +deterministically on CPU here. The final GPU cross-check (20+ runs, multiple +seeds, hot-bucket input vs `@noble/curves` under headless SwiftShader) is the +remaining step; this container has no Chrome/Playwright and the in-tree +`barretenberg.wasm.gz` is a 213-byte stub, so the headless GPU harness could +not be run. The reproduction + the exact buckets to read back (global bucket 0 +and its linked-list nodes' `nodes_slot` / `partials_buf` x,y) are provided so a +GPU-equipped run can confirm and land the fix. diff --git a/barretenberg/ts/dev/msm-webgpu/walker-dx0-repro.mjs b/barretenberg/ts/dev/msm-webgpu/walker-dx0-repro.mjs new file mode 100644 index 000000000000..e7ffc86ca470 --- /dev/null +++ b/barretenberg/ts/dev/msm-webgpu/walker-dx0-repro.mjs @@ -0,0 +1,372 @@ +// Dependency-free CPU reproduction of the stream-walker MSM bucket pipeline, +// built to ROOT-CAUSE the `dx == 0` exception that ba_walker_combine hits on +// a small number of buckets at logn=8..16 (off-curve result). +// +// It ports the EXACT integer logic of the WGSL/TS kernels: +// - decompose_scalars_booth (buildInitCounts, msm_v2.ts:187) -> (digit,sign) +// - global bucket id = window*BW + digit (msm_v2.ts:1396) +// - ba_planner_cumsum -> cumulative_adds = exclusive prefix of (count-1) +// - ba_planner_partition_thread/_task -> thread/task adds-axis cuts +// - ba_stream_walker -> per-task partial / bucket_sum emission +// - ba_walker_partials_index + ba_walker_combine -> per-bucket partial list +// plus inline BN254 G1 affine/projective arithmetic for the curve checks. +// +// For each dense bucket it verifies the emitted pieces partition [0,count) +// exactly (overlap/gap = a double-count bug) and replays the combine's +// sequential affine sum to find where dx==0 occurs, then classifies the +// colliding pieces (same value = duplicate, negated value = sign collision). +// +// node dev/msm-webgpu/walker-dx0-repro.mjs [logn] [seed] + +// --------------------------------------------------------------------------- +// BN254 G1: y^2 = x^3 + 3 over Fp. +const P = 21888242871839275222246405745257275088696311157297823662689037894645226208583n; +const R = 21888242871839275222246405745257275088548364400416034343698204186575808495617n; // Fr order +const Gx = 1n, Gy = 2n; + +const mod = (a) => ((a % P) + P) % P; +const fadd = (a, b) => mod(a + b); +const fsub = (a, b) => mod(a - b); +const fmul = (a, b) => mod(a * b); +function finv(a) { + // extended euclid; finv(0) returns 0 (matches the kernels' modInv(0)). + let [old_r, r] = [mod(a), P]; + let [old_s, s] = [1n, 0n]; + while (r !== 0n) { const q = old_r / r; [old_r, r] = [r, old_r - q * r]; [old_s, s] = [s, old_s - q * s]; } + return mod(old_s); +} +const ZERO = { x: 0n, y: 0n, inf: true }; +function onCurve(pt) { if (pt.inf) return true; return fsub(fmul(pt.y, pt.y), fadd(fmul(fmul(pt.x, pt.x), pt.x), 3n)) === 0n; } + +// Complete affine addition (correct doubling + infinity handling). The +// reference for bucket sums and partial values. +function addComplete(a, b) { + if (a.inf) return b; if (b.inf) return a; + let lam; + if (a.x === b.x) { + if (fadd(a.y, b.y) === 0n) return ZERO; // P + (-P) = O + lam = fmul(fmul(3n, fmul(a.x, a.x)), finv(fmul(2n, a.y))); // doubling + } else { + lam = fmul(fsub(b.y, a.y), finv(fsub(b.x, a.x))); + } + const rx = fsub(fsub(fmul(lam, lam), a.x), b.x); + const ry = fsub(fmul(lam, fsub(a.x, rx)), a.y); + return { x: rx, y: ry, inf: false }; +} +function affSum(pts) { let acc = ZERO; for (const pt of pts) acc = addComplete(acc, pt); return acc; } + +// Incomplete affine add EXACTLY as the combine/walker do (this is what hits dx==0). +// Returns {pt, dxZero}. When dx==0, finv(0)=0 -> garbage (matches GPU). +function affineAddRaw(a, b) { + const dx = fsub(b.x, a.x); + const dxZero = dx === 0n; + const lam = fmul(fsub(b.y, a.y), finv(dx)); + const rx = fsub(fsub(fmul(lam, lam), a.x), b.x); + const ry = fsub(fmul(lam, fsub(a.x, rx)), a.y); + return { pt: { x: rx, y: ry, inf: false }, dxZero }; +} + +function negate(pt) { return pt.inf ? pt : { x: pt.x, y: fsub(0n, pt.y), inf: false }; } +function ptEq(a, b) { if (a.inf || b.inf) return a.inf && b.inf; return a.x === b.x && a.y === b.y; } + +// --------------------------------------------------------------------------- +// Deterministic input generation. +// mulberry32 — good full-word distribution (an LCG's low bits are NOT random, +// which silently biases the booth digits and can mask bucket-distribution bugs). +function makeRng(seed) { let a = (seed >>> 0) || 1; return () => { a |= 0; a = (a + 0x6D2B79F5) | 0; let t = Math.imul(a ^ (a >>> 15), 1 | a); t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t; return (t ^ (t >>> 14)) >>> 0; }; } +function randFr(rng) { let v = 0n; for (let i = 0; i < 8; i++) v = (v << 32n) | BigInt(rng() >>> 0); v &= (1n << 254n) - 1n; return v % R; } +function scalarMulG(k) { // double-and-add on G to get a valid random-ish point + let acc = ZERO; let base = { x: Gx, y: Gy, inf: false }; + while (k > 0n) { if (k & 1n) acc = addComplete(acc, base); base = addComplete(base, base); k >>= 1n; } + return acc; +} + +// --------------------------------------------------------------------------- +// Booth decompose: returns {digit, neg} for window w of a scalar's LE bytes. +// Exact port of buildInitCounts (msm_v2.ts:187-216). +function scalarToLE(s) { const b = new Uint8Array(32); let x = s; for (let i = 0; i < 32; i++) { b[i] = Number(x & 0xffn); x >>= 8n; } return b; } +function boothWindows(le, c, numWindows) { + const cMask = (1 << c) - 1; const out = []; + let lookback = 0; + for (let w = 0; w < numWindows; w++) { + const lo = w * c; const inOff = lo >>> 3; const bitShift = lo & 7; + const b0 = le[inOff] ?? 0; const b1 = inOff + 1 < 32 ? le[inOff + 1] : 0; + const b2 = inOff + 2 < 32 ? le[inOff + 2] : 0; const b3 = inOff + 3 < 32 ? le[inOff + 3] : 0; + const v = (b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)) >>> 0; + const winBits = (v >>> bitShift) & cMask; + const raw = ((winBits << 1) | lookback) >>> 0; + const neg = (raw >>> c) & 1; + const negMask = neg ? 0xffffffff : 0; + const encode = (raw + 1) >>> 1; + const digit = ((((encode - neg) >>> 0) ^ negMask) & cMask) >>> 0; + out.push({ digit, neg }); + lookback = (v >>> (bitShift + c - 1)) & 1; + } + return out; +} + +// --------------------------------------------------------------------------- +function pickC(logN) { const t = { 7: 4, 8: 4, 9: 5, 10: 8, 11: 8, 12: 8, 13: 8, 14: 8, 15: 10, 16: 13 }; return t[logN] ?? 13; } +// MAX_STREAM_WORKGROUPS is the value actually passed to the cumsum shader +// (msm_v2.ts: gen_ba_planner_cumsum_shader(STREAM_T, STREAM_S, 1, 32)) — 32, +// NOT the 64 in ba_stream_plan.ts. It caps nwg, hence numActive = nwg*256 <= +// 32*256 = 8192 = streamNumThreads, so every walker thread is always +// dispatched at the largest sizes (no dropped-thread / NUM_THREADS-guard bug). +const PLANNER_TPB = 256, STREAM_S = 8, MIN_ITERS_PER_WG = 8, MAX_STREAM_WORKGROUPS = 32, THREAD_TPB = 256, NUMBITS = 254; + +function run(logN, seed, fast) { + const n = 1 << logN; + const c = pickC(logN); + const numWindows = Math.ceil(NUMBITS / c); + const BW = Math.ceil((2 ** (c - 1) + 1) / PLANNER_TPB) * PLANNER_TPB; + const S = STREAM_S; + if (!fast) console.log(`logN=${logN} n=${n} c=${c} numWindows=${numWindows} BW=${BW} S=${S} seed=${seed}`); + + // 1) inputs. In fast (structural) mode we skip curve-point generation and + // use the point index (+sign) as identity — enough to detect duplicates, + // sign-collisions, range overlaps and gaps (none of which need curve math). + const rng = makeRng(seed); + const points = fast ? null : new Array(n); const scalarsLE = new Array(n); + for (let i = 0; i < n; i++) { if (!fast) points[i] = scalarMulG((randFr(rng) % (R - 1n)) + 1n); scalarsLE[i] = scalarToLE(randFr(rng)); } + + // 2) build per-bucket point lists (l0_index equivalent): (pointIdx, neg) + const buckets = new Map(); // bucketId -> [{pt, neg}...] + for (let i = 0; i < n; i++) { + const wins = boothWindows(scalarsLE[i], c, numWindows); + for (let w = 0; w < numWindows; w++) { + const { digit, neg } = wins[w]; + if (digit === 0) continue; // zero digit skipped + const b = w * BW + digit; + if (!buckets.has(b)) buckets.set(b, []); + buckets.get(b).push({ pt: i, neg }); + } + } + + // CONTENTS-LEVEL check (case a/b at the input level), structural so it needs + // no curve math: does any bucket hold the SAME point index twice (case a, + // duplicate), or the same index once with +sign and once with -sign (case b, + // an exact P and -P collision)? + let contentsDup = 0, contentsNegPair = 0; + for (const [b, list] of buckets) { + const seen = new Map(); // pointIdx -> set of signs + for (const e of list) { + if (!seen.has(e.pt)) seen.set(e.pt, new Set()); + const s = seen.get(e.pt); + if (s.has(e.neg)) contentsDup++; // same index, same sign, twice + else if (s.size > 0) contentsNegPair++; // same index, both signs => P and -P + s.add(e.neg); + } + } + + // 3) dense buckets (count>=2), sorted by DESCENDING count (radix faithful). + const dense = []; + for (const [b, list] of buckets) if (list.length >= 2) dense.push(b); + dense.sort((a, bb) => buckets.get(bb).length - buckets.get(a).length || a - bb); + const numDense = dense.length; + const counts = dense.map((b) => buckets.get(b).length); + + // 4) cumulative_adds (exclusive prefix of count-1), total_adds. + const cum = new Array(numDense); let acc = 0; + for (let i = 0; i < numDense; i++) { cum[i] = acc; acc += counts[i] - 1; } + const totalAdds = acc; + + // 5) nwg / num_active_threads (ba_planner_cumsum + partition_thread). + const targetWork = THREAD_TPB * S * MIN_ITERS_PER_WG; + let nwg = 1; if (totalAdds > targetWork) nwg = Math.floor(totalAdds / targetWork); + nwg = Math.min(nwg, MAX_STREAM_WORKGROUPS); nwg = Math.max(nwg, 1); + const numActive = nwg * THREAD_TPB; + + // resolve_cut (shared by partition_thread/_task). Returns [bucket, offset]. + function resolveCut(cutTarget, loB, hiB) { + let lo = loB, hi = hiB; + while (lo < hi) { const mid = (lo + hi) >> 1; const cumEnd = cum[mid] + counts[mid] - 1; if (cumEnd < cutTarget) lo = mid + 1; else hi = mid; } + const cb = Math.min(lo, numDense - 1); + let co = 0; if (cutTarget > cum[cb]) co = cutTarget - cum[cb]; + return [cb, co]; + } + + // thread cuts: flat thread f in [0,numActive) -> START cut_target. + function threadCutTarget(f) { + const wg = Math.floor(f / THREAD_TPB), t = f % THREAD_TPB; + const wgStart = Math.floor((wg * totalAdds) / nwg); + const wgEnd = Math.floor(((wg + 1) * totalAdds) / nwg); + const wgTotal = wgEnd - wgStart; + return wgStart + Math.floor((t * wgTotal) / THREAD_TPB); + } + + // 6) per-task cuts and 7) walker emission. + // Record per bucket: pieces = [{range:[lo,hi] inclusive point idx, kind}], where + // kind in {'sum','partial'}; 'sum' => store_bucket_sum, 'partial' => combine input. + const pieces = new Map(); // bucketId -> [{lo,hi,kind,task}] + function addPiece(bId, lo, hi, kind, task) { if (!pieces.has(bId)) pieces.set(bId, []); pieces.get(bId).push({ lo, hi, kind, task }); } + + for (let f = 0; f < numActive; f++) { + const firstTarget = threadCutTarget(f); + const [fb, fo] = resolveCut(firstTarget, 0, numDense); + let lastBucket, lastOff; + if (f + 1 >= numActive) { lastBucket = numDense - 1; lastOff = totalAdds - cum[numDense - 1]; } + else { const nt = threadCutTarget(f + 1); const rc = resolveCut(nt, 0, numDense); lastBucket = rc[0]; lastOff = rc[1]; } + const startAdds = cum[fb] + fo; const endAdds = cum[lastBucket] + lastOff; const threadTotal = endAdds - startAdds; + if (threadTotal <= 0) continue; + const hiB = Math.min(lastBucket + 1, numDense); + const taskCuts = []; + for (let k = 0; k <= S; k++) { const ct = startAdds + Math.floor((k * threadTotal) / S); taskCuts.push(resolveCut(ct, fb, hiB)); } + + // Walker per slot k = task [taskCuts[k], taskCuts[k+1]). + for (let k = 0; k < S; k++) { + const [sb, so] = taskCuts[k]; const [eb, eo] = taskCuts[k + 1]; + walkTask(f, k, sb, so, eb, eo); + } + } + + // Faithful port of ba_stream_walker init + loop, emitting point-index ranges. + function walkTask(f, k, sb, so, eb, eo) { + const sbCount = counts[sb]; + // --- init: start cursor (point index WITHIN bucket) + split_start --- + let effSorted = sb, effCount = sbCount, startPt, splitStart; + if (so === 0) { startPt = 0; splitStart = 0; } + else if (so + 1 < sbCount) { startPt = so + 1; splitStart = 1; } + else { effSorted = sb + 1; effCount = effSorted < numDense ? counts[effSorted] : 0; startPt = 0; splitStart = 0; } + + // task end (region-aware) + let teSort, teCur; // teCur = point index (within bucket teSort) one past the last consumed + if (eo > 0) { teSort = eb; teCur = eo + 1; } + else if (eb > 0) { teSort = eb - 1; teCur = counts[teSort]; } + else { teSort = 0; teCur = 0; } + + let curSorted = effSorted, bucketEnd = effCount, cursor = startPt, isFirst = 1, slotDone = 0; + + // empty task + if (effSorted > teSort || (effSorted === teSort && startPt >= teCur)) return; + + // single-point leading segment + let segEnd = bucketEnd; if (effSorted === teSort) segEnd = teCur; + if (splitStart === 1 && segEnd - startPt === 1) { + if (effSorted === teSort) { addPiece(dense[effSorted], startPt, startPt, 'partial', { f, k, slot: 1 }); return; } + addPiece(dense[effSorted], startPt, startPt, 'partial', { f, k, slot: 0 }); + const nxt = effSorted + 1; curSorted = nxt; bucketEnd = counts[nxt]; cursor = 0; splitStart = 0; isFirst = 1; + if (nxt > teSort) return; + } + + // main loop + let pieceLo = cursor; // first not-yet-retired point of the current piece + for (;;) { + // consume one affine step + if (isFirst === 1) { cursor += 2; isFirst = 0; } else { cursor += 1; } + const taskDone = (curSorted === teSort) && (cursor >= teCur); + const bucketDone = cursor >= bucketEnd; + if (taskDone) { + const isPartial = (splitStart === 1) || (cursor < bucketEnd); + const hi = Math.min(cursor, bucketEnd) - 1; + addPiece(dense[curSorted], pieceLo, hi, isPartial ? 'partial' : 'sum', { f, k, slot: 1 }); + return; + } else if (bucketDone) { + addPiece(dense[curSorted], pieceLo, bucketEnd - 1, splitStart === 1 ? 'partial' : 'sum', { f, k, slot: 0 }); + const nxt = curSorted + 1; curSorted = nxt; bucketEnd = counts[nxt]; cursor = 0; isFirst = 1; splitStart = 0; pieceLo = 0; + } + // else: continue accumulating same piece + } + } + + // 8) Per-bucket combine analysis. + let bucketsWithDxZero = 0, offCurveBuckets = 0, overlapBuckets = 0, gapBuckets = 0, mixedSumPartial = 0; + const reports = []; + for (const b of dense) { + const list = buckets.get(b); const count = list.length; + const ps = pieces.get(b) || []; + const partials = ps.filter((p) => p.kind === 'partial'); + const sums = ps.filter((p) => p.kind === 'sum'); + + // range coverage check + const cover = new Array(count).fill(0); + let overlap = false, gap = false; + for (const p of ps) for (let i = p.lo; i <= p.hi; i++) { if (i >= 0 && i < count) cover[i]++; } + for (let i = 0; i < count; i++) { if (cover[i] === 0) gap = true; if (cover[i] > 1) overlap = true; } + if (overlap) overlapBuckets++; + if (gap) gapBuckets++; + if (sums.length > 0 && partials.length > 0) mixedSumPartial++; + + // The combine only runs over partials (a fully-consumed bucket has a 'sum' + // and the combine early-returns). Analyze the partial set the combine sees. + if (partials.length === 0) continue; + if (fast) { + if (overlap || gap || (sums.length && partials.length)) { + reports.push({ b, count, nPartials: partials.length, overlap, gap, sums: sums.length, ranges: partials.map((p) => [p.lo, p.hi, p.task]) }); + } + continue; + } + const pieceVal = (p) => { let accPt = ZERO; for (let i = p.lo; i <= p.hi; i++) { const e = list[i]; const v = e.neg ? negate(points[e.pt]) : points[e.pt]; accPt = addComplete(accPt, v); } return accPt; }; + const vals = partials.map(pieceVal); + + // replay combine sequential affine add (linked-list order ~= emission order; + // dx==0 is order-sensitive, but a structural duplicate/neg shows in any order). + let dxZeroHere = false; let collide = null; + let accPt = vals[0]; + for (let i = 1; i < vals.length; i++) { + if (accPt.x === vals[i].x) { dxZeroHere = true; collide = { i, accIsNeg: ptEq(accPt, negate(vals[i])), accEq: ptEq(accPt, vals[i]) }; } + const r = affineAddRaw(accPt, vals[i]); accPt = r.pt; + } + // also check ALL unordered pairs for structural equality / negation + let pairDup = 0, pairNeg = 0; + for (let i = 0; i < vals.length; i++) for (let j = i + 1; j < vals.length; j++) { + if (ptEq(vals[i], vals[j])) pairDup++; else if (vals[i].x === vals[j].x) pairNeg++; + } + + const refAff = affSum(list.map((e) => (e.neg ? negate(points[e.pt]) : points[e.pt]))); + const combineOk = ptEq(accPt, refAff); + if (dxZeroHere) bucketsWithDxZero++; + if (!combineOk || !onCurve(accPt)) offCurveBuckets++; + if (dxZeroHere || overlap || gap || pairDup || pairNeg || (sums.length && partials.length)) { + reports.push({ b, count, nPartials: partials.length, overlap, gap, pairDup, pairNeg, sums: sums.length, dxZeroHere, collide, combineOk, ranges: partials.map((p) => [p.lo, p.hi, p.task]) }); + } + } + + // GPU slot-space model: ba_walker_partials_index scans ALL + // 2*streamNumThreads*S partial slots, but the walker is indirect-dispatched + // over only numActive (= nwg*256) threads. clearBuffer zero-fills + // walkerPartialDest, yet NO_BUCKET = 0xffffffff — so every slot owned by a + // NON-dispatched thread reads as bucket_id 0 and is linked into bucket 0's + // combine list with partials_buf = (0,0) (off-curve, from the same zero + // clear). Count those, and the node-counter pressure vs max_nodes. + const STREAM_NUM_THREADS = 8192; + const totalSlots = 2 * STREAM_NUM_THREADS * S; // = max_nodes + const dispatchedSlots = 2 * Math.min(numActive, STREAM_NUM_THREADS) * S; + const staleSlots = totalSlots - dispatchedSlots; // -> all linked to bucket 0 as (0,0) + let realPartialSlots = 0; for (const ps of pieces.values()) realPartialSlots += ps.filter((p) => p.kind === 'partial').length; + const nodeOverflow = realPartialSlots + staleSlots > totalSlots; // node_counter vs max_nodes + const bucket0Dense = (buckets.get(0)?.length ?? 0) >= 2; + console.log(` GPU-slots: totalSlots(max_nodes)=${totalSlots} dispatched=${dispatchedSlots} stale->bucket0=${staleSlots} realPartialSlots=${realPartialSlots} nodeCounter=${realPartialSlots + staleSlots} overflow=${nodeOverflow} bucket0Dense=${bucket0Dense}`); + if (staleSlots > 0) console.log(` => bucket 0 combine list gets ${staleSlots} off-curve (0,0) nodes -> dx==0 + off-curve bucket value (clearBuffer 0 != NO_BUCKET 0xffffffff)`); + console.log(` contents: dupPairs=${contentsDup} negPairs=${contentsNegPair} numDense=${numDense} totalAdds=${totalAdds} nwg=${nwg} numActive=${numActive}`); + console.log(` combine: dxZeroBuckets=${bucketsWithDxZero} offCurveBuckets=${offCurveBuckets} overlapBuckets=${overlapBuckets} gapBuckets=${gapBuckets} mixedSumPartial=${mixedSumPartial}`); + for (const r of reports.slice(0, 12)) { + console.log(` bucket ${r.b} count=${r.count} nPartials=${r.nPartials} overlap=${r.overlap} gap=${r.gap} pairDup=${r.pairDup} pairNeg=${r.pairNeg} sums=${r.sums} dxZero=${r.dxZeroHere} combineOk=${r.combineOk}`); + console.log(` partial ranges: ${JSON.stringify(r.ranges)}`); + if (r.collide) console.log(` collide: ${JSON.stringify(r.collide)}`); + } + return { bucketsWithDxZero, offCurveBuckets, overlapBuckets, gapBuckets, mixedSumPartial, contentsDup, contentsNegPair, reports }; +} + +if (process.argv[2] === 'sweep') { + // Fast structural sweep across logn 8..16 and many seeds. Flags any bucket + // with a range overlap (double-count), gap, mixed sum+partial, or a + // contents-level duplicate / sign-collision — the structural root causes. + let totalFlag = 0; + for (let logN = 8; logN <= 16; logN++) { + for (let seed = 1; seed <= 25; seed++) { + const r = run(logN, seed, true); + const flags = r.overlapBuckets + r.gapBuckets + r.mixedSumPartial + r.contentsDup + r.contentsNegPair; + totalFlag += flags; + if (flags > 0) { + console.log(`FLAG logN=${logN} seed=${seed} overlap=${r.overlapBuckets} gap=${r.gapBuckets} mixed=${r.mixedSumPartial} cDup=${r.contentsDup} cNeg=${r.contentsNegPair}`); + for (const rep of r.reports.slice(0, 4)) console.log(` bucket ${rep.b} count=${rep.count} ranges=${JSON.stringify(rep.ranges)} overlap=${rep.overlap} gap=${rep.gap} sums=${rep.sums}`); + } + } + console.log(`logN=${logN}: swept 25 seeds, cumulative structural flags=${totalFlag}`); + } + console.log(totalFlag === 0 ? 'SWEEP CLEAN: no structural double-count / sign-collision in the walker+combine LOGIC.' : `SWEEP FOUND ${totalFlag} structural flags.`); +} else { + const logN = parseInt(process.argv[2] ?? '8', 10); + const seed = parseInt(process.argv[3] ?? '1', 10); + run(logN, seed, false); +} diff --git a/barretenberg/ts/src/msm_webgpu/msm_v2.ts b/barretenberg/ts/src/msm_webgpu/msm_v2.ts index e623b6f881de..7e49977778db 100644 --- a/barretenberg/ts/src/msm_webgpu/msm_v2.ts +++ b/barretenberg/ts/src/msm_webgpu/msm_v2.ts @@ -2089,7 +2089,7 @@ export class MsmV2 { const pbl = scratch.partialBucketsList; const ab = scratch.accBuf; const sps = scratch.streamPrefScratch; - const classifyParams = ubuf(new Uint32Array([B_TOTAL, 0, 0, 0])); + const classifyParams = ubuf(new Uint32Array([B_TOTAL, BW, 0, 0])); this.classifyBind = mkBind(this.classifyLayout, [countsBufs[0], offsetsBufs[0], s1, db, dc, sp, classifyParams]); this.metaFixupBind = mkBind(this.metaFixupLayout, [sp]); const radixParams = [ diff --git a/barretenberg/ts/src/msm_webgpu/wgsl/_generated/shaders.ts b/barretenberg/ts/src/msm_webgpu/wgsl/_generated/shaders.ts index e5321b600205..f7bb61c52290 100644 --- a/barretenberg/ts/src/msm_webgpu/wgsl/_generated/shaders.ts +++ b/barretenberg/ts/src/msm_webgpu/wgsl/_generated/shaders.ts @@ -1190,12 +1190,23 @@ const B_TOTAL: u32 = {{ b_total }}u; @group(0) @binding(4) var dense_count_list: array; @group(0) @binding(5) var planner_meta: array>; @group(0) @binding(6) var params: vec4; +// params.x = B_TOTAL (legacy, unused here), params.y = BW (buckets per window) @compute @workgroup_size({{ workgroup_size }}) fn main(@builtin(global_invocation_id) gid: vec3) { let b = gid.x; if (b >= B_TOTAL) { return; } + // Skip the Booth zero-digit bucket of every window (global id b where + // b % BW == 0). Those buckets hold the points whose window digit recoded + // to 0; they contribute nothing and the reduction already drops column 0 + // (ba_reduce_init: src = w*bw + i + 1). Accumulating them is wasted work, + // and combining global bucket 0 is what surfaced the uninitialised-slot + // garbage in ba_walker_partials_index (clearBuffer writes 0, NO_BUCKET is + // 0xffffffff). Excluding them here keeps the result identical. + let bw = params.y; + if (bw != 0u && (b % bw) == 0u) { return; } + let n = counts[b]; if (n == 0u) { return; } @@ -3437,6 +3448,16 @@ fn main(@builtin(global_invocation_id) gid: vec3) { let bucket_id = partial_dest[slot]; if (bucket_id == NO_BUCKET) { return; } + // partial_dest is prepared with clearBuffer (writes 0), but NO_BUCKET is + // 0xffffffff. The walker is indirect-dispatched over only the active + // threads (nwg*256 <= NUM_THREADS), so any slot owned by a non-dispatched + // thread is never initialised and still reads 0 — i.e. bucket_id 0. Linking + // those into bucket 0's combine list injects off-curve (0,0) partials + // (walkerPartials is likewise zero-cleared) and makes ba_walker_combine hit + // dx==0. bucket_id 0 is the Booth zero-digit and is dropped by the + // reduction (ba_reduce_init: column 0, weight 0), so skipping it here is + // safe and removes the garbage at the source. + if (bucket_id == 0u) { return; } let node_array_idx = atomicAdd(&node_counter, 1u); if (node_array_idx >= max_nodes) { return; } diff --git a/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_planner_classify.template.wgsl b/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_planner_classify.template.wgsl index 287df31ab04a..89437cc0725b 100644 --- a/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_planner_classify.template.wgsl +++ b/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_planner_classify.template.wgsl @@ -17,12 +17,23 @@ const B_TOTAL: u32 = {{ b_total }}u; @group(0) @binding(4) var dense_count_list: array; @group(0) @binding(5) var planner_meta: array>; @group(0) @binding(6) var params: vec4; +// params.x = B_TOTAL (legacy, unused here), params.y = BW (buckets per window) @compute @workgroup_size({{ workgroup_size }}) fn main(@builtin(global_invocation_id) gid: vec3) { let b = gid.x; if (b >= B_TOTAL) { return; } + // Skip the Booth zero-digit bucket of every window (global id b where + // b % BW == 0). Those buckets hold the points whose window digit recoded + // to 0; they contribute nothing and the reduction already drops column 0 + // (ba_reduce_init: src = w*bw + i + 1). Accumulating them is wasted work, + // and combining global bucket 0 is what surfaced the uninitialised-slot + // garbage in ba_walker_partials_index (clearBuffer writes 0, NO_BUCKET is + // 0xffffffff). Excluding them here keeps the result identical. + let bw = params.y; + if (bw != 0u && (b % bw) == 0u) { return; } + let n = counts[b]; if (n == 0u) { return; } diff --git a/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_walker_partials_index.template.wgsl b/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_walker_partials_index.template.wgsl index d5dd74fc73a6..c00ba937ac30 100644 --- a/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_walker_partials_index.template.wgsl +++ b/barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_walker_partials_index.template.wgsl @@ -41,6 +41,16 @@ fn main(@builtin(global_invocation_id) gid: vec3) { let bucket_id = partial_dest[slot]; if (bucket_id == NO_BUCKET) { return; } + // partial_dest is prepared with clearBuffer (writes 0), but NO_BUCKET is + // 0xffffffff. The walker is indirect-dispatched over only the active + // threads (nwg*256 <= NUM_THREADS), so any slot owned by a non-dispatched + // thread is never initialised and still reads 0 — i.e. bucket_id 0. Linking + // those into bucket 0's combine list injects off-curve (0,0) partials + // (walkerPartials is likewise zero-cleared) and makes ba_walker_combine hit + // dx==0. bucket_id 0 is the Booth zero-digit and is dropped by the + // reduction (ba_reduce_init: column 0, weight 0), so skipping it here is + // safe and removes the garbage at the source. + if (bucket_id == 0u) { return; } let node_array_idx = atomicAdd(&node_counter, 1u); if (node_array_idx >= max_nodes) { return; }