
[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps#2186

Draft
sifakis wants to merge 15 commits into AcademySoftwareFoundation:master from sifakis:vbm-cpu-port

Conversation

@sifakis
Contributor

@sifakis sifakis commented Mar 27, 2026

Draft for CI and diff review. WIP.

@sifakis sifakis changed the title NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps [WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps Mar 30, 2026
sifakis and others added 2 commits April 2, 2026 13:46
Add WenoLeafPtrs<BuildT>, resolveWenoLeafPtrs, and computeWenoStencil as
static __device__ members of VoxelBlockManager<Log2BlockWidth>. These
implement the first phase of a two-function WENO5 stencil gather:
resolveWenoLeafPtrs performs exactly 3 probeLeaf calls (one per axis) to
resolve neighbor leaf pointers; computeWenoStencil fills a caller-provided
array with the 19 global sequential indices using WenoPt<i,j,k>::idx.

voxelOffset arithmetic uses octal notation: NanoVDB leaf layout encodes
(x,y,z) as x*64+y*8+z, so x/y/z strides are 0100/010/1 in octal.
WenoPt<i,j,k>::idx is used throughout to remain independent of any future
re-alignment with OpenVDB's NineteenPt (which uses a different convention).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds the ex_voxelBlockManager_host_cuda example demonstrating the CPU and
CUDA VoxelBlockManager implementations, along with design documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
sifakis and others added 13 commits April 2, 2026 13:50
Remove DecodeInverseMapsCPUPlan.md (implementation complete) and distill
its non-obvious design decisions into the knowledge base:

- §11: decodeInverseMaps is intentionally single-threaded/stateless;
  caller distributes blocks; contrast with cooperative GPU version.
- §12: mPrefixSum is bypassed for bulk access — recomputing from raw mask
  words via buildMaskPrefixSums is cheaper than unpacking 9-bit fields;
  mPrefixSum is still used for the cross-word offset in Step 5.
- §13: output fill is range-fill + contiguous copy (not scatter) because
  shuffleDownMask produces a sorted compacted array; std::fill/copy caveat
  for alignment when output arrays come from TLS or stack pointers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Implementation complete; design rationale distilled into VBMImplementationKnowledge.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Design reference for the per-block stencil gather kernel: decodes inverse
maps into block-local scratch, then resolves neighbor leaf pointers and
fills N-point stencil index arrays for all active voxels in the block.
WENO5 (N=19, R=3) is the motivating instance; architecture is stencil-agnostic.
Covers GPU inner loop, CPU SIMD batch design (SIMDw=16, probeLeaf dedup),
unified StencilLeafPtrs template, and reach-R generalization considerations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- §3: Stencil type as template parameter needs index→offsets direction
  (for-each-slot gather loop), not the offsets→index direction of WenoPt.
  Clarify relationship to BaseStencil/WenoStencil: geometry-only descriptor,
  no accessor coupling.
- §4: Kernel lambda signature std::array<ValueType,K> kernel(const ValueType* u);
  output is homogeneous std::array (not tuple); K=1 degenerates to scalar;
  SoA output layout results[k][BlockWidth] for SIMD efficiency.
- Renumber §4-§8 → §5-§9; update open questions accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…notes

Adds a self-contained test (lift_test.cpp) exploring a generic SIMD-lifting
abstraction: given a scalar tuple→tuple kernel, liftToSimd<W> produces an
SoA-wide version that loops over W lanes and is the auto-vectorization target.

The motivating kernel is WENO5 normSqGrad (19-point stencil, matching
WenoStencil::normSqGrad from Stencils.h).  The six weno5() calls vectorize
cleanly; godunovsNormSqrd() blocks vectorization in two distinct ways
depending on how it is written:

  1. std::max / bool isOutside ternaries → "control flow in loop"
  2. float sign + fmaxf (no ternaries) → "no vectype for stmt" due to
     GCC's inability to see through std::tuple's recursive-inheritance
     struct layout in GIMPLE alias analysis

INVESTIGATION.md documents all experiments, findings, current blockers,
and proposed next steps (pointer-cache approach, Clang comparison, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Evangelos Sifakis <esifakis@gmail.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduce nanovdb::util::Simd<T,W> (simd_test/Simd.h) — a minimal
header-only SIMD abstraction backed by std::array<T,W> with arithmetic
operators, SimdMask, min/max, and where().  Mirrors the C++26 std::simd
interface for forward compatibility.

Rewrite the WENO5 normSqGrad kernel as a template on T:
- T=float  : scalar __hostdev__ path for GPU (one thread per voxel)
- T=Simd<float,W> : W-wide CPU path (one call per batch)

A single templated godunovsNormSqrd + normSqGrad definition serves both
execution contexts with no #ifdef, structurally matching Stencils.h.
Clang 18 vectorizes the Simd<float,16> instantiation (691 ymm instructions
in the hot function, assembly-verified); GCC 13 does not.

Update INVESTIGATION.md with the full scoreboard, both approaches, and
next steps (GCC intrinsics path, benchmarking, nanovdb/util/ integration).

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Auto-detect <experimental/simd> (Parallelism TS v2) via __has_include and
__cpp_lib_experimental_parallel_simd.  When available, Simd<T,W> and
SimdMask<T,W> become thin wrappers around fixed_size_simd / fixed_size_simd_mask,
delegating all arithmetic to the standard type.

The TS v2 where(mask, v) is a 2-arg masked-assignment proxy; wrap it into
the 3-arg select(mask, a, b) form expected by the kernels.

Verified with clang++-18 -std=c++26: both paths produce identical assembly
(1275 ymm instructions, PASS on all 16 lanes), confirming Clang optimizes
through the wrapper completely.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Document the std::experimental::simd backend alongside the std::array
default, including the TS v2 where() adaptation, the auto-detection
mechanism, and the assembly comparison showing byte-for-byte identical
output between the two backends under Clang 18.

Update the vectorization results table and open questions accordingly.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
StencilKernel.h — new prototype header:
  - BaseStencilKernel<T, SIZE>: owns mValues[], mDx2, mInvDx2; no grid coupling
  - WenoStencilKernel<T>: derives from above, provides normSqGrad()
  - WENO5<T> and GodunovsNormSqrd<T, MaskT>: free functions mirroring Stencils.h
  - T=float for GPU scalar path, T=Simd<float,W> for CPU batch path

lift_test.cpp — rewritten to use WenoStencilKernel<T> directly:
  - SIMD and scalar reference paths both instantiate the same class
  - dx passed to constructor; mValues populated via operator[]

Simd.h — refinements:
  - Simd<T,W> and SimdMask<T,W> in Backend A are now pure type aliases for
    stdx::fixed_size_simd / fixed_size_simd_mask (no wrapper struct)
  - element_aligned_tag / element_aligned: portable load/store tag, always
    present; aliases stdx::element_aligned_tag in Backend A, dummy struct in B
  - Backend B load constructor and store() accept element_aligned_tag (defaulted)
  - NANOVDB_NO_STD_SIMD opt-out flag to force Backend B

INVESTIGATION.md — updated:
  - Approach B section updated to reflect class hierarchy instead of free functions
  - Backend B GCC note: the struct-access failure was specific to Approach A's
    liftToSimd outer-lane loop; Backend B's fixed-count operator loops do vectorize
    on GCC when used with the Generic-T class hierarchy
  - New ymm tables for both backends under GCC (Backend A: 1267 total,
    Backend B: 619 total); both pass correctness

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…chPtrs)

Add full CPU batch neighbor-leaf resolution design to the planning doc:

- §6: Replace StencilLeafPtrs struct with layered design — shared 3×3×3 bit
  encoding for probedMask/ptrs[27], stencil-specific batchPtrs population
  (batchPtrs[4][SIMDw] for WENO5, batchPtrs[3][3][3][SIMDw] for box stencil),
  and GPU scalar design note kept separate.

- §8d: Update lazy-probe section to reference ptrs[27] and 27-bit probedMask;
  add batchPtrs population step (Phase 2) after the probeLeaf loop.

- §8e: Update computeNeededDirs direction table to use 3×3×3 bit positions
  (bits 4,10,12,14,16,22 for WENO5 face neighbors).

- §8f/§8g: Minor notation updates to match ptrs[27] naming.

- §9: Resolve ptrs-layout and nExtraLeaves open questions; add prototype scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Standalone CPU-only executable that verifies the neighbor leaf resolution
design from StencilGather.md §8d–§8f:

- For each VBM block: calls decodeInverseMaps, recomputes nLeaves from
  jumpMap, then processes SIMDw=16 batches with the full probedMask /
  lazy-probeLeaf / batchPtrs[4][SIMDw] pipeline.
- Does not call computeStencil.  Instead verifies batchPtrs against a
  direct probeLeaf reference for all 18 non-center WENO5 stencil offsets
  that cross leaf boundaries.
- Passes at 0.1, 0.25, 0.5, 0.9 occupancy (2.3M–2.9M lane checks each).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Refactor computeNeededDirs to accept a pre-expanded Simd<uint32_t,SIMDw>
vector, moving sentinel/masking responsibility to the single gather site
where leafMask is known:

- kSentinelExpanded = expandVoxelOffset(292) = 0x41044104 (constexpr)
- Caller broadcasts sentinel to all lanes, overwrites leafMask lanes with
  real expandVoxelOffset() values before calling computeNeededDirs
- computeNeededDirs is now a pure add+reduce with no masking or cross-check

Carry trick (§8e): expandVoxelOffset packs lz/lx/ly into 6 guarded 3-bit
groups; a single vpaddd ymm × 2 + vpor + vpand + shuffle-tree detects all
six WENO5 directions simultaneously.  kExpandCarryK = 0x514530C3.

AVX2 codegen confirmed via objdump:
- computeNeededDirs: vpbroadcastd + 2×vpaddd ymm + vpor/vpand ymm +
  vextracti128/vpsrldq shuffle-tree, no branches or calls in hot path
- activeMask/leafMask in runPrototype: vpcmpeqd ymm × 4 + vmovmskps ymm × 2
- Sentinel broadcast: 0x41044104 literal → vpbroadcastd → 2×vmovdqa ymm

Always-on scalar cross-check at every computeNeededDirs call site.
verifyComputeNeededDirsSentinel() tests both the sentinel carry property
and the straddle-lane non-pollution scenario before runPrototype().

StencilGather.md §8e and §8f updated to match new API and codegen notes.
Phase 1 prototype marked complete in §9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Documents the BatchAccessor — the SIMD-batch analog of ValueAccessor —
developed from the ex_stencil_gather_cpu Phase 1 prototype discussion.

Core concept: instead of caching the path to one leaf, cache the full
3×3×3 neighborhood of 27 leaf pointers around the current center leaf,
serving SIMDw voxels per call.

Key design elements documented:

Eviction policy: fires on none_of(leafMask) only — straddle lanes do not
evict. leafMask is the "partial-hit" signal with no scalar-accessor analog.

Prefetch coverage argument:
- WENO5 (R=3): 6 extremal taps (±R,0,0),(0,±R,0),(0,0,±R) are necessary
  and sufficient — equivalent to the computeNeededDirs carry trick
- Box stencil (R=1): 8 corner taps (±1,±1,±1) collectively cover all 26
  non-center directions for any voxel position in the batch

Three-tier API:
- prefetch<di,dj,dk>(vo, leafMask, treeAcc)
- cachedGetValue<di,dj,dk>(vo, leafMask)  — no treeAcc, cache assumed warm
- getValue<di,dj,dk>(vo, leafMask, treeAcc) — lazy combined (vanilla style)

Template <di,dj,dk> rationale vs runtime Coord: compile-time direction bit,
dead-axis elimination, VDB convention alignment; runtime Coord overload
provided for generic stencil adapters.

AVX2 profile: offset arithmetic (vpaddd ymm), lane split (vpcmpgtd ymm),
gather from ≤2 leaf pointers (vgatherdps×2 + vpblendvb) — both scalar
bottlenecks from Phase 1 prototype are eliminated.

StencilGather.md: add cross-reference to BatchAccessor.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>