feat: bf16 GPU residency — device-resident weight upload + axis reductions + Rsqrt (ADR-075 L4) by dndungu · Pull Request #155 · zerfoo/ztensor

dndungu · 2026-06-17T00:08:33Z

Summary

Closes the two bf16 GPU-residency gaps that prevented a fair on-device bf16 training step (the rest of the bf16 graph landed in v1.13.0/v1.14.0), plus the trivial Rsqrt fallback.

- Generic bf16 bulk weight upload: UploadWeightsT[T]/bulkUploadT[T] (sizeof(T)-generic mirror of bulkUploadF32; capture-skip, chunk bounds, managedMem, non-owning views). GPUEngine[float16.BFloat16] weights now go device-resident instead of per-op H2D. (The f32-only UploadWeights + Q4/quantized path is untouched; the WeightUploader interface is unchanged since Go forbids generic interface methods.)
- bf16 axis reductions: kernel_sum_axis_bf16 (FP32 shared-mem + warp-shuffle reduction) with an invDivisor fold (1.0 = Sum, 1/axisSize = Mean, on-device). Wired gpuSum/gpuReduceSum/gpuReduceMean bf16 branches matching the f32 output-shape/keepDims semantics. Unblocks bf16 LayerNorm + cross-entropy on-device.
- bf16 Rsqrt: kernel_rsqrt_bf16 + gpuRsqrt bf16 branch.
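To make the output-shape/keepDims semantics mentioned above concrete, here is a minimal CPU-side sketch in Go. The names reducedShape and stripes are hypothetical and not taken from the repository; the sketch assumes a row-major tensor and a non-negative axis, and only illustrates how a reduction axis splits a shape into (outer, axisSize, inner) and how keepDims changes the result shape.

```go
package main

import "fmt"

// reducedShape returns the output shape after reducing `axis` of `shape`.
// With keepDims the reduced axis stays as size 1; otherwise it is squeezed out.
// Hypothetical helper: only axes in [0, rank) are handled here.
func reducedShape(shape []int, axis int, keepDims bool) ([]int, error) {
	if axis < 0 || axis >= len(shape) {
		return nil, fmt.Errorf("axis %d out of range for rank %d", axis, len(shape))
	}
	out := make([]int, 0, len(shape))
	for i, d := range shape {
		switch {
		case i != axis:
			out = append(out, d)
		case keepDims:
			out = append(out, 1)
		}
	}
	return out, nil
}

// stripes splits a row-major shape around `axis` into the product of the
// dimensions before it (outer), the axis length, and the product after it (inner).
func stripes(shape []int, axis int) (outer, axisSize, inner int) {
	outer, inner = 1, 1
	for i, d := range shape {
		switch {
		case i < axis:
			outer *= d
		case i > axis:
			inner *= d
		}
	}
	return outer, shape[axis], inner
}

func main() {
	shape := []int{2, 3, 4}
	keep, _ := reducedShape(shape, 1, true)
	drop, _ := reducedShape(shape, 1, false)
	o, a, in := stripes(shape, 1)
	fmt.Println(keep, drop, o, a, in)
}
```

Each of the outer*inner stripes yields exactly one output element, which is why the output element count is the same with or without keepDims.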

GB10 verified (sm_121)

Rebuilt libkernels.so from this branch via Spark and ran the CUDA-gated compute suite on GB10 — all green: TestGPUEngine_UploadWeightsT_BF16, TestGPUBF16_UnaryParity/Rsqrt, TestGPUBF16_ReductionParity (Sum/ReduceSum/ReduceMean/SumKeepDims), plus all prior bf16 + MatMulBF16 parity. ok compute 2.046s.

General framework mechanisms; nothing Wolf-specific. CPU f32/f64 paths byte-identical. Builds verified under metal/rocm/fpga/sycl tags.

Add a native bf16 GPU Rsqrt so a bf16 GPUEngine no longer falls back to CPU for 1/sqrt(x): kernel_rsqrt_bf16 (FP32 transcendental, mirrors kernel_sqrt_bf16) + launch_rsqrt_bf16, RsqrtBF16 purego binding + KernelRunner method across all six implementers (CUDA real; ROCm/Metal/FPGA/SYCL/OpenCL + test stubs), and the isBFloat16 dispatch branch in gpuRsqrt.

CPU f32/f64 paths are untouched. Adds a Rsqrt case to the CUDA-gated UnaryParity test (~2 bf16 ulps).

GPU-UNVERIFIED: parity test will run on the GB10 verify pod.

…ean)

bf16 Sum/ReduceSum/ReduceMean fell back to CPU for non-f32, forcing a D2H of every reduced tensor.

Add a native bf16 axis reduction: kernel_sum_axis_bf16 + launch_sum_axis_bf16 accumulate each (outer,inner) stripe in FP32 (mirroring kernel_scaled_softmax_bf16's reduction) and round once to bf16. invDivisor folds the mean's 1/axisSize into the FP32 accumulation so ReduceMean stays on-device and rounds exactly once. SumAxisBF16 purego binding + KernelRunner method across all six implementers (CUDA real; ROCm/Metal/FPGA/SYCL/OpenCL + test stubs).

gpuSumAxisBF16 helper mirrors the f32 gpuSum body exactly (axis normalization, keepDims/squeeze output shape, dst reuse, negative-axis and OOM CPU fallbacks -- which reapply the mean divide). Dispatch: gpuSum/gpuReduceSum -> invDivisor=1.0; gpuReduceMean -> 1/axisSize. CPU f32/f64 paths untouched.

Adds CUDA-gated TestGPUBF16_ReductionParity (Sum/ReduceSum/ReduceMean + keepDims shape) vs an f64 reference with a sqrt(axisSize)-scaled bf16 tolerance (FP32-accumulated reductions carry more rounding than the f64-then-bf16 oracle).

GPU-UNVERIFIED: kernel correctness, the warp-floor launch config, and output-shape matching will be confirmed on GB10.
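The stripe-wise reduction with the invDivisor fold can be sketched on the CPU as follows. This is a serial illustration under stated assumptions, not the CUDA kernel: the bf16 type stands in for float16.BFloat16, sumAxisBF16 is a hypothetical name, the input is assumed row-major with shape [outer, axis, inner], and the shared-memory and warp-shuffle stages are replaced by a plain loop.

```go
package main

import (
	"fmt"
	"math"
)

// bf16 is a hypothetical stand-in for float16.BFloat16: the top 16 bits of an
// IEEE-754 float32.
type bf16 uint16

// toF32 widens a bf16 value to float32; this conversion is exact.
func (b bf16) toF32() float32 { return math.Float32frombits(uint32(b) << 16) }

// fromF32 rounds a float32 to bf16 with round-to-nearest-even.
// NaN handling is omitted for brevity.
func fromF32(f float32) bf16 {
	bits := math.Float32bits(f)
	bits += 0x7FFF + ((bits >> 16) & 1)
	return bf16(bits >> 16)
}

// sumAxisBF16 reduces the middle dimension of a row-major [outer, axis, inner]
// tensor. Each (outer, inner) stripe accumulates in float32, is scaled by
// invDivisor (1.0 for Sum, 1/axis for Mean), and is rounded to bf16 once.
func sumAxisBF16(in []bf16, outer, axis, inner int, invDivisor float32) []bf16 {
	out := make([]bf16, outer*inner)
	for o := 0; o < outer; o++ {
		for i := 0; i < inner; i++ {
			var acc float32
			for a := 0; a < axis; a++ {
				acc += in[(o*axis+a)*inner+i].toF32()
			}
			out[o*inner+i] = fromF32(acc * invDivisor)
		}
	}
	return out
}

func main() {
	// Shape [2, 3, 2] holding 1..12; small integers are exact in bf16.
	in := make([]bf16, 12)
	for i := range in {
		in[i] = fromF32(float32(i + 1))
	}
	sum := sumAxisBF16(in, 2, 3, 2, 1)
	mean := sumAxisBF16(in, 2, 3, 2, float32(1)/3)
	for i := range sum {
		fmt.Println(sum[i].toF32(), mean[i].toF32())
	}
}
```

Folding the divide into the accumulator means the mean is rounded to bf16 only once; dividing an already-rounded bf16 sum would add a second rounding step.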

UploadWeights/bulkUploadF32 were hardcoded to []*TensorNumeric[float32], so a GPUEngine[float16.BFloat16] could not make its bf16 weights device-resident -- they stayed host-backed and paid a per-op H2D firehose.

Add a generic path:
- bulkUploadT[T] mirrors bulkUploadF32 exactly (capture-skip, MinTensors threshold, bulkUploadChunkRanges bounds, managedMem direct-copy vs staged Memcpy, non-owning NewGPUStorageViewFromPtr[T] views) but sizes by unsafe.Sizeof(T) and skips tensors already *GPUStorage[T].
- UploadWeightsT[T] is the T-typed analogue of UploadWeights.

The f32 UploadWeights + its Q4/quantized handling stay intact for the f32 inference path. UploadWeightsT is a generic method, so it cannot join the (float32-only) WeightUploader interface; callers use it directly off GPUEngine[T].

Adds CUDA-gated TestGPUEngine_UploadWeightsT_BF16: builds 128 host-backed bf16 weight tensors, asserts UploadWeightsT makes them *GPUStorage[float16.BFloat16], values round-trip, and a second call is a no-op (already-resident skip).

GPU-UNVERIFIED until run on the GB10 verify pod.
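The idea of sizing a bulk upload by unsafe.Sizeof(T) rather than a hardcoded 4 bytes can be sketched as below. This is only a layout planner under assumptions: view and planBulkUpload are hypothetical names, plain slices stand in for tensors, a bool slice stands in for the "already *GPUStorage[T]" check, and alignment, chunking, and the actual device copy are left out.

```go
package main

import (
	"fmt"
	"unsafe"
)

// view is a hypothetical record of where one tensor would land inside a single
// contiguous device allocation.
type view struct {
	offset, bytes int
}

// planBulkUpload lays host tensors out back to back, sizing each element with
// unsafe.Sizeof(T). Tensors flagged as resident are skipped and keep a zero
// view. It returns one view per tensor and the total byte count to allocate.
func planBulkUpload[T any](tensors [][]T, resident []bool) ([]view, int) {
	var zero T
	elem := int(unsafe.Sizeof(zero))
	views := make([]view, len(tensors))
	total := 0
	for i, t := range tensors {
		if resident[i] {
			continue // already on device: nothing to copy
		}
		n := len(t) * elem
		views[i] = view{offset: total, bytes: n}
		total += n
	}
	return views, total
}

func main() {
	// uint16 stands in for a 2-byte bf16 element.
	bf := [][]uint16{{1, 2, 3}, {4}, {5, 6}}
	views, total := planBulkUpload(bf, []bool{false, true, false})
	fmt.Println(views, total)

	// The same planner with 4-byte float32 elements.
	f32 := [][]float32{{1, 2, 3}, {4}, {5, 6}}
	views32, total32 := planBulkUpload(f32, []bool{false, false, false})
	fmt.Println(views32, total32)
}
```

Because the planner skips resident tensors, running it again after every tensor has been marked resident yields a total of zero bytes, which corresponds to the no-op second call that the test asserts.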

dndungu · 2026-06-17T00:10:17Z

GB10 verified (sm_121): rebuilt libkernels.so from this branch via Spark, CUDA-gated compute suite all green — UploadWeightsT_BF16, UnaryParity/Rsqrt, ReductionParity (Sum/ReduceSum/ReduceMean/SumKeepDims) + all prior bf16/MatMulBF16. ok compute 2.046s.

dndungu added 3 commits June 16, 2026 16:58

dndungu merged commit 419564c into main Jun 17, 2026
1 check passed

dndungu merged 3 commits into main from feat/bf16-upload-reductions
