feat: bf16 GPU residency — device-resident weight upload + axis reductions + Rsqrt (ADR-075 L4) #155

Merged: dndungu merged 3 commits into main from feat/bf16-upload-reductions, Jun 17, 2026
@dndungu dndungu commented Jun 17, 2026

Summary

Closes the two bf16 GPU-residency gaps that prevented a fair on-device bf16 training step (the rest of the bf16 graph landed in v1.13.0/v1.14.0), plus the trivial Rsqrt fallback.

  • Generic bf16 bulk weight upload: UploadWeightsT[T]/bulkUploadT[T] (sizeof(T)-generic mirror of bulkUploadF32; capture-skip, chunk bounds, managedMem, non-owning views). GPUEngine[float16.BFloat16] weights now go device-resident instead of per-op H2D. (The f32-only UploadWeights + Q4/quantized path is untouched; WeightUploader interface unchanged since Go forbids generic interface methods.)
  • bf16 axis reductions: kernel_sum_axis_bf16 (FP32 shared-mem + warp-shuffle reduction) with an invDivisor fold (1.0 = Sum, 1/axisSize = Mean, on-device). Wired gpuSum/gpuReduceSum/gpuReduceMean bf16 branches matching the f32 output-shape/keepDims semantics. Unblocks bf16 LayerNorm + cross-entropy on-device.
  • bf16 Rsqrt: kernel_rsqrt_bf16 + gpuRsqrt bf16 branch.

GB10 verified (sm_121)

Rebuilt libkernels.so from this branch via Spark and ran the CUDA-gated compute suite on GB10 — all green: TestGPUEngine_UploadWeightsT_BF16, TestGPUBF16_UnaryParity/Rsqrt, TestGPUBF16_ReductionParity (Sum/ReduceSum/ReduceMean/SumKeepDims), plus all prior bf16 + MatMulBF16 parity. ok compute 2.046s.

General framework mechanisms; nothing Wolf-specific. CPU f32/f64 paths byte-identical. Builds verified under metal/rocm/fpga/sycl tags.

dndungu added 3 commits June 16, 2026 16:58
Add a native bf16 GPU Rsqrt so a bf16 GPUEngine no longer falls back to
CPU for 1/sqrt(x): kernel_rsqrt_bf16 (FP32 transcendental, mirrors
kernel_sqrt_bf16) + launch_rsqrt_bf16, RsqrtBF16 purego binding +
KernelRunner method across all six implementers (CUDA real; ROCm/Metal/
FPGA/SYCL/OpenCL + test stubs), and the isBFloat16 dispatch branch in
gpuRsqrt. CPU f32/f64 paths are untouched.
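
For illustration, a minimal host-side sketch of the widen, compute in FP32, round once pattern described above. The `bf16` type, `fromFloat32`, `toFloat32`, and `rsqrtBF16` are hypothetical names for this sketch only (they are not the repo's `float16.BFloat16` API or the CUDA kernel), and round-to-nearest-even is assumed for the final conversion.

```go
package main

import (
	"fmt"
	"math"
)

// bf16 is a hypothetical stand-in for float16.BFloat16:
// the upper 16 bits of an IEEE-754 float32.
type bf16 uint16

// toFloat32 widens a bf16 to float32; this is exact.
func (b bf16) toFloat32() float32 {
	return math.Float32frombits(uint32(b) << 16)
}

// fromFloat32 rounds a float32 to bf16 with round-to-nearest-even
// (assumed rounding mode for this sketch).
func fromFloat32(f float32) bf16 {
	bits := math.Float32bits(f)
	if f != f { // NaN: force a quiet-NaN payload bit so it stays NaN
		return bf16(bits>>16) | 0x0040
	}
	// Add half an ulp of the kept part, plus the kept lsb to break ties to even.
	bits += 0x7FFF + ((bits >> 16) & 1)
	return bf16(bits >> 16)
}

// rsqrtBF16 computes 1/sqrt(x) in float32 and rounds to bf16 once.
func rsqrtBF16(x bf16) bf16 {
	f := x.toFloat32()
	r := 1 / float32(math.Sqrt(float64(f)))
	return fromFloat32(r)
}

func main() {
	for _, v := range []float32{1, 2, 4, 16} {
		x := fromFloat32(v)
		fmt.Println(v, rsqrtBF16(x).toFloat32())
	}
}
```

A parity test in this style would compare such a reference against the device result within a small ulp budget, as the commit's UnaryParity case does.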

Adds a Rsqrt case to the CUDA-gated UnaryParity test (~2 bf16 ulps).
GPU-UNVERIFIED: parity test will run on the GB10 verify pod.

…ean)

bf16 Sum/ReduceSum/ReduceMean fell back to CPU for non-f32, forcing a
D2H of every reduced tensor. Add a native bf16 axis reduction:
kernel_sum_axis_bf16 + launch_sum_axis_bf16 accumulate each (outer,inner)
stripe in FP32 (mirroring kernel_scaled_softmax_bf16's reduction) and
round once to bf16. invDivisor folds the mean's 1/axisSize into the FP32
accumulation so ReduceMean stays on-device and rounds exactly once.
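
The accumulate-in-FP32, round-once idea can be sketched on the host as below. This is a serial reference, not the shared-memory/warp-shuffle kernel; `roundBF16` and `sumAxisRef` are hypothetical names, the tensor is assumed row-major with logical shape [outer, axisSize, inner], and applying `invDivisor` to the accumulator just before rounding is one plausible reading of the fold.

```go
package main

import (
	"fmt"
	"math"
)

// roundBF16 rounds f to the nearest bf16 value (ties to even) and returns
// it widened back to float32. Hypothetical helper for this sketch.
func roundBF16(f float32) float32 {
	if f != f {
		return f // NaN passes through
	}
	bits := math.Float32bits(f)
	bits += 0x7FFF + ((bits >> 16) & 1)
	return math.Float32frombits(bits &^ 0xFFFF)
}

// sumAxisRef reduces the middle axis of a row-major [outer, axisSize, inner]
// tensor. Each (outer, inner) stripe is accumulated in float32, scaled by
// invDivisor (1.0 for Sum, 1/axisSize for Mean), and rounded to bf16 once.
func sumAxisRef(in []float32, outer, axisSize, inner int, invDivisor float32) []float32 {
	out := make([]float32, outer*inner)
	for o := 0; o < outer; o++ {
		for i := 0; i < inner; i++ {
			var acc float32
			for a := 0; a < axisSize; a++ {
				acc += in[(o*axisSize+a)*inner+i]
			}
			out[o*inner+i] = roundBF16(acc * invDivisor)
		}
	}
	return out
}

func main() {
	// Shape [2, 3, 2], values 1..12, reduced over the middle axis.
	in := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
	fmt.Println(sumAxisRef(in, 2, 3, 2, 1.0))          // sum
	fmt.Println(sumAxisRef(in, 2, 3, 2, float32(1)/3)) // mean
}
```

Folding the divide into the same pass means the mean never exists as an intermediate bf16 sum, so only one bf16 rounding is applied to the result.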

SumAxisBF16 purego binding + KernelRunner method across all six
implementers (CUDA real; ROCm/Metal/FPGA/SYCL/OpenCL + test stubs).
gpuSumAxisBF16 helper mirrors the f32 gpuSum body exactly (axis
normalization, keepDims/squeeze output shape, dst reuse, negative-axis
and OOM CPU fallbacks -- which reapply the mean divide). Dispatch:
gpuSum/gpuReduceSum -> invDivisor=1.0; gpuReduceMean -> 1/axisSize.
CPU f32/f64 paths untouched.
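
As a rough sketch of the shape bookkeeping such a helper needs, the function below derives the keepDims/squeezed output shape and the (outer, axisSize, inner) extents for a single reduced axis. `reduceShape` is a hypothetical name, not the repo's function; returning an empty shape for a squeezed rank-1 input and reporting negative or out-of-range axes as "not handled here" (so a caller could take a fallback path) are assumptions of this sketch.

```go
package main

import "fmt"

// reduceShape computes the output shape and stripe extents for reducing
// one axis of a row-major tensor. ok is false for axes this sketch does
// not handle (negative or out of range).
func reduceShape(shape []int, axis int, keepDims bool) (out []int, outer, axisSize, inner int, ok bool) {
	if axis < 0 || axis >= len(shape) {
		return nil, 0, 0, 0, false
	}
	outer, inner = 1, 1
	for _, d := range shape[:axis] {
		outer *= d // product of dimensions before the reduced axis
	}
	for _, d := range shape[axis+1:] {
		inner *= d // product of dimensions after the reduced axis
	}
	axisSize = shape[axis]

	out = make([]int, 0, len(shape))
	out = append(out, shape[:axis]...)
	if keepDims {
		out = append(out, 1) // keep the reduced axis as size 1
	}
	out = append(out, shape[axis+1:]...)
	return out, outer, axisSize, inner, true
}

func main() {
	shape := []int{2, 3, 4}
	out, outer, n, inner, _ := reduceShape(shape, 1, false)
	fmt.Println(out, outer, n, inner)
	out, _, _, _, _ = reduceShape(shape, 1, true)
	fmt.Println(out)
}
```

With these extents, Sum would pass invDivisor = 1.0 and Mean would pass 1/axisSize, matching the dispatch described above.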

Adds CUDA-gated TestGPUBF16_ReductionParity (Sum/ReduceSum/ReduceMean +
keepDims shape) vs an f64 reference with a sqrt(axisSize)-scaled bf16
tolerance (FP32-accumulated reductions carry more rounding than the
f64-then-bf16 oracle). GPU-UNVERIFIED: kernel correctness, the warp-floor
launch config, and output-shape matching will be confirmed on GB10.

UploadWeights/bulkUploadF32 were hardcoded to []*TensorNumeric[float32],
so a GPUEngine[float16.BFloat16] could not make its bf16 weights
device-resident -- they stayed host-backed and paid a per-op H2D
firehose. Add a generic path:

- bulkUploadT[T] mirrors bulkUploadF32 exactly (capture-skip, MinTensors
  threshold, bulkUploadChunkRanges bounds, managedMem direct-copy vs
  staged Memcpy, non-owning NewGPUStorageViewFromPtr[T] views) but sizes
  by unsafe.Sizeof(T) and skips tensors already *GPUStorage[T].
- UploadWeightsT[T] is the T-typed analogue of UploadWeights. The f32
  UploadWeights + its Q4/quantized handling stay intact for the f32
  inference path. UploadWeightsT is a generic method, so it cannot join
  the (float32-only) WeightUploader interface; callers use it directly
  off GPUEngine[T].
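
To make the "sizes by unsafe.Sizeof(T)" point concrete, here is a small host-only sketch that packs several tensors of a generic element type into one contiguous buffer and hands back non-owning typed views. All names (`bf16`, `planOffsets`, `stage`, `view`) are hypothetical, a plain byte slice stands in for device memory, and the sketch omits the thresholds, chunk bounds, and any alignment padding the real path may apply.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bf16 is a hypothetical 2-byte element type standing in for float16.BFloat16.
type bf16 uint16

// planOffsets returns each tensor's byte offset inside one contiguous
// allocation and the total byte size, sizing elements by unsafe.Sizeof(T).
func planOffsets[T any](tensors [][]T) ([]int, int) {
	var zero T
	elem := int(unsafe.Sizeof(zero))
	offsets := make([]int, len(tensors))
	total := 0
	for i, t := range tensors {
		offsets[i] = total
		total += len(t) * elem
	}
	return offsets, total
}

// stage copies every tensor into a single byte buffer (the stand-in for
// one bulk device allocation) at its planned offset.
func stage[T any](tensors [][]T) ([]byte, []int) {
	offsets, total := planOffsets(tensors)
	buf := make([]byte, total)
	for i, t := range tensors {
		if len(t) == 0 {
			continue
		}
		nbytes := len(t) * int(unsafe.Sizeof(t[0]))
		src := unsafe.Slice((*byte)(unsafe.Pointer(&t[0])), nbytes)
		copy(buf[offsets[i]:], src)
	}
	return buf, offsets
}

// view returns a non-owning []T over n elements of buf starting at byte offset off.
func view[T any](buf []byte, off, n int) []T {
	if n == 0 {
		return nil
	}
	return unsafe.Slice((*T)(unsafe.Pointer(&buf[off])), n)
}

func main() {
	weights := [][]bf16{{1, 2, 3}, {4, 5}}
	buf, offsets := stage(weights)
	fmt.Println(len(buf), offsets)
	fmt.Println(view[bf16](buf, offsets[1], len(weights[1])))
}
```

The same functions work unchanged for `float32`, where each element contributes 4 bytes instead of 2; that is the sense in which the generic path mirrors the f32 one.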

Adds CUDA-gated TestGPUEngine_UploadWeightsT_BF16: builds 128 host-backed
bf16 weight tensors, asserts UploadWeightsT makes them *GPUStorage[
float16.BFloat16], values round-trip, and a second call is a no-op
(already-resident skip). GPU-UNVERIFIED until run on the GB10 verify pod.
dndungu commented Jun 17, 2026

GB10 verified (sm_121): rebuilt libkernels.so from this branch via Spark, CUDA-gated compute suite all green — UploadWeightsT_BF16, UnaryParity/Rsqrt, ReductionParity (Sum/ReduceSum/ReduceMean/SumKeepDims) + all prior bf16/MatMulBF16. ok compute 2.046s.

@dndungu dndungu merged commit 419564c into main Jun 17, 2026
1 check passed