feat: bf16 GPU residency — device-resident weight upload + axis reductions + Rsqrt (ADR-075 L4)#155
Merged
Add a native bf16 GPU Rsqrt so a bf16 GPUEngine no longer falls back to CPU for 1/sqrt(x):
- kernel_rsqrt_bf16 (FP32 transcendental, mirrors kernel_sqrt_bf16) + launch_rsqrt_bf16.
- RsqrtBF16 purego binding + KernelRunner method across all six implementers (CUDA real; ROCm/Metal/FPGA/SYCL/OpenCL + test stubs).
- The isBFloat16 dispatch branch in gpuRsqrt.
CPU f32/f64 paths are untouched. Adds a Rsqrt case to the CUDA-gated UnaryParity test (tolerance ~2 bf16 ulps). GPU-UNVERIFIED: the parity test will run on the GB10 verify pod.
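The "FP32 transcendental, round to bf16" pattern described above can be sketched on the CPU in Go. This is only an illustrative reference under stated assumptions, not the kernel from this PR: the `bf16` type and the `toFloat32`, `fromFloat32`, and `rsqrtBF16` names are hypothetical stand-ins, and round-to-nearest-even is assumed for the narrowing step.

```go
package main

import (
	"fmt"
	"math"
)

// bf16 is a hypothetical stand-in for float16.BFloat16: the upper 16 bits
// of an IEEE-754 float32 (1 sign, 8 exponent, 7 mantissa bits).
type bf16 uint16

// toFloat32 widens exactly; the low 16 mantissa bits become zero.
func (b bf16) toFloat32() float32 {
	return math.Float32frombits(uint32(b) << 16)
}

// fromFloat32 narrows with round-to-nearest-even (assumed rounding mode).
func fromFloat32(f float32) bf16 {
	bits := math.Float32bits(f)
	if f != f {
		// NaN: truncate and force a mantissa bit so the result stays NaN.
		return bf16(bits>>16) | 0x0040
	}
	// Adding 0x7FFF plus the lowest kept bit rounds ties to even.
	bits += 0x7FFF + ((bits >> 16) & 1)
	return bf16(bits >> 16)
}

// rsqrtBF16 widens to float32, computes 1/sqrt(x) there, and rounds once.
func rsqrtBF16(x bf16) bf16 {
	f := x.toFloat32()
	r := 1 / float32(math.Sqrt(float64(f)))
	return fromFloat32(r)
}

func main() {
	for _, v := range []float32{0.25, 2, 4} {
		fmt.Println(v, rsqrtBF16(fromFloat32(v)).toFloat32())
	}
}
```

A reference like this is the kind of oracle a parity test could compare the GPU result against, within a small ulp tolerance.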
…ean)
bf16 Sum/ReduceSum/ReduceMean fell back to CPU for non-f32, forcing a D2H of every reduced tensor. Add a native bf16 axis reduction:
- kernel_sum_axis_bf16 + launch_sum_axis_bf16 accumulate each (outer, inner) stripe in FP32 (mirroring kernel_scaled_softmax_bf16's reduction) and round once to bf16. invDivisor folds the mean's 1/axisSize into the FP32 accumulation so ReduceMean stays on-device and rounds exactly once.
- SumAxisBF16 purego binding + KernelRunner method across all six implementers (CUDA real; ROCm/Metal/FPGA/SYCL/OpenCL + test stubs).
- gpuSumAxisBF16 helper mirrors the f32 gpuSum body exactly (axis normalization, keepDims/squeeze output shape, dst reuse, negative-axis and OOM CPU fallbacks, which reapply the mean divide).
- Dispatch: gpuSum/gpuReduceSum -> invDivisor=1.0; gpuReduceMean -> 1/axisSize.
CPU f32/f64 paths untouched. Adds CUDA-gated TestGPUBF16_ReductionParity (Sum/ReduceSum/ReduceMean + keepDims shape) vs an f64 reference with a sqrt(axisSize)-scaled bf16 tolerance (FP32-accumulated reductions carry more rounding than the f64-then-bf16 oracle). GPU-UNVERIFIED: kernel correctness, the warp-floor launch config, and output-shape matching will be confirmed on GB10.
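The (outer, axis, inner) striping and the invDivisor fold can be illustrated with a sequential Go reference. This is a sketch under assumptions, not the CUDA kernel: `bf16`, `toFloat32`, `fromFloat32`, and `sumAxisBF16` are hypothetical names, the tensor is assumed to be contiguous row-major, and the accumulator is scaled once after the sum (the source does not state whether the real kernel scales per element or at the end).

```go
package main

import (
	"fmt"
	"math"
)

// bf16 is a hypothetical stand-in for float16.BFloat16.
type bf16 uint16

func (b bf16) toFloat32() float32 {
	return math.Float32frombits(uint32(b) << 16)
}

// fromFloat32 rounds to nearest-even; NaN handling is omitted for brevity.
func fromFloat32(f float32) bf16 {
	bits := math.Float32bits(f)
	bits += 0x7FFF + ((bits >> 16) & 1)
	return bf16(bits >> 16)
}

// sumAxisBF16 reduces the middle dimension of a contiguous
// [outer, axisSize, inner] tensor. Each (outer, inner) stripe is
// accumulated in float32, scaled by invDivisor, and rounded to bf16 once.
// invDivisor = 1 gives a sum; invDivisor = 1/axisSize gives a mean.
func sumAxisBF16(src []bf16, outer, axisSize, inner int, invDivisor float32) []bf16 {
	out := make([]bf16, outer*inner)
	for o := 0; o < outer; o++ {
		for i := 0; i < inner; i++ {
			var acc float32
			for a := 0; a < axisSize; a++ {
				acc += src[(o*axisSize+a)*inner+i].toFloat32()
			}
			out[o*inner+i] = fromFloat32(acc * invDivisor)
		}
	}
	return out
}

func main() {
	// A 2x3 tensor holding 1..6, reduced over its last axis.
	src := make([]bf16, 6)
	for k := range src {
		src[k] = fromFloat32(float32(k + 1))
	}
	sum := sumAxisBF16(src, 2, 3, 1, 1.0)
	mean := sumAxisBF16(src, 2, 3, 1, 1.0/3.0)
	fmt.Println(sum[0].toFloat32(), sum[1].toFloat32())
	fmt.Println(mean[0].toFloat32(), mean[1].toFloat32())
}
```

Rounding only once at the end is what keeps the mean from losing precision twice, as it would if the sum were first rounded to bf16 and then divided.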
UploadWeights/bulkUploadF32 were hardcoded to []*TensorNumeric[float32], so a GPUEngine[float16.BFloat16] could not make its bf16 weights device-resident: they stayed host-backed and paid a per-op H2D firehose. Add a generic path:
- bulkUploadT[T] mirrors bulkUploadF32 exactly (capture-skip, MinTensors threshold, bulkUploadChunkRanges bounds, managedMem direct-copy vs staged Memcpy, non-owning NewGPUStorageViewFromPtr[T] views) but sizes by unsafe.Sizeof(T) and skips tensors already *GPUStorage[T].
- UploadWeightsT[T] is the T-typed analogue of UploadWeights.
The f32 UploadWeights + its Q4/quantized handling stay intact for the f32 inference path. UploadWeightsT is a generic method, so it cannot join the (float32-only) WeightUploader interface; callers use it directly off GPUEngine[T].
Adds CUDA-gated TestGPUEngine_UploadWeightsT_BF16: builds 128 host-backed bf16 weight tensors, asserts UploadWeightsT makes them *GPUStorage[float16.BFloat16], values round-trip, and a second call is a no-op (already-resident skip). GPU-UNVERIFIED until run on the GB10 verify pod.
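The two generic ingredients named above, sizing by unsafe.Sizeof(T) and skipping already-resident tensors, can be sketched without a GPU. Everything in this block is hypothetical: `hostTensor`, its `resident` flag, and `planUpload` are illustrative stand-ins and do not reflect the real TensorNumeric, GPUStorage, or bulkUploadT types.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bf16 is a hypothetical 2-byte element type standing in for float16.BFloat16.
type bf16 uint16

// hostTensor is a hypothetical tensor; resident marks storage that is
// already on the device (the real code checks for *GPUStorage[T]).
type hostTensor[T any] struct {
	data     []T
	resident bool
}

// planUpload returns the total byte count to copy and the indices of the
// tensors that still need uploading. Element size comes from
// unsafe.Sizeof, so the same code serves float32, bf16, or any fixed-size T.
func planUpload[T any](ts []*hostTensor[T]) (total uintptr, pending []int) {
	var zero T
	elem := unsafe.Sizeof(zero)
	for i, t := range ts {
		if t.resident {
			continue // already device-resident: skip
		}
		total += uintptr(len(t.data)) * elem
		pending = append(pending, i)
	}
	return total, pending
}

func main() {
	ts := []*hostTensor[bf16]{
		{data: make([]bf16, 4)},
		{data: make([]bf16, 4), resident: true},
		{data: make([]bf16, 4)},
	}
	total, pending := planUpload(ts)
	fmt.Println(total, pending)

	// Pretend the upload happened, then plan again: nothing is left to do.
	for _, i := range pending {
		ts[i].resident = true
	}
	total, pending = planUpload(ts)
	fmt.Println(total, len(pending))
}
```

The second call returning an empty plan corresponds to the "second call is a no-op" property that the test above asserts.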
GB10 verified (sm_121): rebuilt libkernels.so from this branch via Spark; the CUDA-gated compute suite is all green: UploadWeightsT_BF16, UnaryParity/Rsqrt, ReductionParity (Sum/ReduceSum/ReduceMean/SumKeepDims) + all prior bf16/MatMulBF16. ok compute 2.046s.
Summary
Closes the two bf16 GPU-residency gaps that prevented a fair on-device bf16 training step (the rest of the bf16 graph landed in v1.13.0/v1.14.0), plus the trivial Rsqrt fallback.
- UploadWeightsT[T] / bulkUploadT[T] (sizeof(T)-generic mirror of bulkUploadF32; capture-skip, chunk bounds, managedMem, non-owning views). GPUEngine[float16.BFloat16] weights now go device-resident instead of per-op H2D. (The f32-only UploadWeights + Q4/quantized path is untouched; WeightUploader interface unchanged since Go forbids generic interface methods.)
- kernel_sum_axis_bf16 (FP32 shared-mem + warp-shuffle reduction) with an invDivisor fold (1.0 = Sum, 1/axisSize = Mean, on-device). Wired gpuSum/gpuReduceSum/gpuReduceMean bf16 branches matching the f32 output-shape/keepDims semantics. Unblocks bf16 LayerNorm + cross-entropy on-device.
- kernel_rsqrt_bf16 + gpuRsqrt bf16 branch.
GB10 verified (sm_121)
Rebuilt libkernels.so from this branch via Spark and ran the CUDA-gated compute suite on GB10 — all green: TestGPUEngine_UploadWeightsT_BF16, TestGPUBF16_UnaryParity/Rsqrt, TestGPUBF16_ReductionParity (Sum/ReduceSum/ReduceMean/SumKeepDims), plus all prior bf16 + MatMulBF16 parity.
ok compute 2.046s.
General framework mechanisms; nothing Wolf-specific. CPU f32/f64 paths byte-identical. Builds verified under metal/rocm/fpga/sycl tags.