[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps #2186
Draft
sifakis wants to merge 15 commits into AcademySoftwareFoundation:master from
Conversation
Add WenoLeafPtrs<BuildT>, resolveWenoLeafPtrs, and computeWenoStencil as static __device__ members of VoxelBlockManager<Log2BlockWidth>. These implement the first phase of a two-function WENO5 stencil gather:

- resolveWenoLeafPtrs performs exactly 3 probeLeaf calls (one per axis) to resolve neighbor leaf pointers.
- computeWenoStencil fills a caller-provided array with the 19 global sequential indices using WenoPt<i,j,k>::idx.

voxelOffset arithmetic uses octal notation: the NanoVDB leaf layout encodes (x,y,z) as x*64 + y*8 + z, so the x/y/z strides are 0100/010/1 in octal. WenoPt<i,j,k>::idx is used throughout to remain independent of any future re-alignment with OpenVDB's NineteenPt (which uses a different convention).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds the ex_voxelBlockManager_host_cuda example demonstrating the CPU and CUDA VoxelBlockManager implementations, along with design documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Remove DecodeInverseMapsCPUPlan.md (implementation complete) and distill its non-obvious design decisions into the knowledge base:

- §11: decodeInverseMaps is intentionally single-threaded/stateless; the caller distributes blocks; contrast with the cooperative GPU version.
- §12: mPrefixSum is bypassed for bulk access — recomputing from raw mask words via buildMaskPrefixSums is cheaper than unpacking 9-bit fields; mPrefixSum is still used for the cross-word offset in Step 5.
- §13: output fill is range-fill + contiguous copy (not scatter) because shuffleDownMask produces a sorted compacted array; std::fill/copy alignment caveat when output arrays come from TLS or stack pointers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Implementation complete; design rationale distilled into VBMImplementationKnowledge.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Design reference for the per-block stencil gather kernel: decodes inverse maps into block-local scratch, then resolves neighbor leaf pointers and fills N-point stencil index arrays for all active voxels in the block. WENO5 (N=19, R=3) is the motivating instance; the architecture is stencil-agnostic. Covers the GPU inner loop, the CPU SIMD batch design (SIMDw=16, probeLeaf dedup), the unified StencilLeafPtrs template, and reach-R generalization considerations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- §3: Stencil type as template parameter needs the index→offsets direction (a for-each-slot gather loop), not the offsets→index direction of WenoPt. Clarify the relationship to BaseStencil/WenoStencil: geometry-only descriptor, no accessor coupling.
- §4: Kernel lambda signature std::array<ValueType,K> kernel(const ValueType* u); output is a homogeneous std::array (not a tuple); K=1 degenerates to scalar; SoA output layout results[k][BlockWidth] for SIMD efficiency.
- Renumber §4-§8 → §5-§9; update open questions accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…notes
Adds a self-contained test (lift_test.cpp) exploring a generic SIMD-lifting
abstraction: given a scalar tuple→tuple kernel, liftToSimd<W> produces an
SoA-wide version that loops over W lanes and is the auto-vectorization target.
The motivating kernel is WENO5 normSqGrad (19-point stencil, matching
WenoStencil::normSqGrad from Stencils.h). The six weno5() calls vectorize
cleanly; godunovsNormSqrd() blocks vectorization in two distinct ways
depending on how it is written:
1. std::max / bool isOutside ternaries → "control flow in loop"
2. float sign + fmaxf (no ternaries) → "no vectype for stmt" due to
GCC's inability to see through std::tuple's recursive-inheritance
struct layout in GIMPLE alias analysis
INVESTIGATION.md documents all experiments, findings, current blockers,
and proposed next steps (pointer-cache approach, Clang comparison, etc.).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Evangelos Sifakis <esifakis@gmail.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduce nanovdb::util::Simd<T,W> (simd_test/Simd.h) — a minimal header-only SIMD abstraction backed by std::array<T,W> with arithmetic operators, SimdMask, min/max, and where(). Mirrors the C++26 std::simd interface for forward compatibility.

Rewrite the WENO5 normSqGrad kernel as a template on T:
- T=float : scalar __hostdev__ path for GPU (one thread per voxel)
- T=Simd<float,W> : W-wide CPU path (one call per batch)

A single templated godunovsNormSqrd + normSqGrad definition serves both execution contexts with no #ifdef, structurally matching Stencils.h. Clang 18 vectorizes the Simd<float,16> instantiation (691 ymm instructions in the hot function, assembly-verified); GCC 13 does not.

Update INVESTIGATION.md with the full scoreboard, both approaches, and next steps (GCC intrinsics path, benchmarking, nanovdb/util/ integration).

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Auto-detect <experimental/simd> (Parallelism TS v2) via __has_include and __cpp_lib_experimental_parallel_simd. When available, Simd<T,W> and SimdMask<T,W> become thin wrappers around fixed_size_simd / fixed_size_simd_mask, delegating all arithmetic to the standard type. The TS v2 where(mask, v) is a 2-arg masked-assignment proxy; wrap it into the 3-arg select(mask, a, b) form expected by the kernels.

Verified with clang++-18 -std=c++26: both paths produce identical assembly (1275 ymm instructions, PASS on all 16 lanes), confirming Clang optimizes through the wrapper completely.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Document the std::experimental::simd backend alongside the std::array default, including the TS v2 where() adaptation, the auto-detection mechanism, and the assembly comparison showing byte-for-byte identical output between the two backends under Clang 18. Update the vectorization results table and open questions accordingly.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
StencilKernel.h — new prototype header:
- BaseStencilKernel<T, SIZE>: owns mValues[], mDx2, mInvDx2; no grid coupling
- WenoStencilKernel<T>: derives from above, provides normSqGrad()
- WENO5<T> and GodunovsNormSqrd<T, MaskT>: free functions mirroring Stencils.h
- T=float for GPU scalar path, T=Simd<float,W> for CPU batch path
lift_test.cpp — rewritten to use WenoStencilKernel<T> directly:
- SIMD and scalar reference paths both instantiate the same class
- dx passed to constructor; mValues populated via operator[]
Simd.h — refinements:
- Simd<T,W> and SimdMask<T,W> in Backend A are now pure type aliases for
stdx::fixed_size_simd / fixed_size_simd_mask (no wrapper struct)
- element_aligned_tag / element_aligned: portable load/store tag, always
present; aliases stdx::element_aligned_tag in Backend A, dummy struct in B
- Backend B load constructor and store() accept element_aligned_tag (defaulted)
- NANOVDB_NO_STD_SIMD opt-out flag to force Backend B
INVESTIGATION.md — updated:
- Approach B section updated to reflect class hierarchy instead of free functions
- Backend B GCC note: the struct-access failure was specific to Approach A's
liftToSimd outer-lane loop; Backend B's fixed-count operator loops do vectorize
on GCC when used with the Generic-T class hierarchy
- New ymm tables for both backends under GCC (Backend A: 1267 total,
Backend B: 619 total); both pass correctness
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…chPtrs)

Add full CPU batch neighbor-leaf resolution design to the planning doc:

- §6: Replace the StencilLeafPtrs struct with a layered design — shared 3×3×3 bit encoding for probedMask/ptrs[27], stencil-specific batchPtrs population (batchPtrs[4][SIMDw] for WENO5, batchPtrs[3][3][3][SIMDw] for a box stencil), and the GPU scalar design note kept separate.
- §8d: Update the lazy-probe section to reference ptrs[27] and the 27-bit probedMask; add the batchPtrs population step (Phase 2) after the probeLeaf loop.
- §8e: Update the computeNeededDirs direction table to use 3×3×3 bit positions (bits 4, 10, 12, 14, 16, 22 for the WENO5 face neighbors).
- §8f/§8g: Minor notation updates to match the ptrs[27] naming.
- §9: Resolve the ptrs-layout and nExtraLeaves open questions; add prototype scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Standalone CPU-only executable that verifies the neighbor leaf resolution design from StencilGather.md §8d–§8f:

- For each VBM block: calls decodeInverseMaps, recomputes nLeaves from jumpMap, then processes SIMDw=16 batches with the full probedMask / lazy-probeLeaf / batchPtrs[4][SIMDw] pipeline.
- Does not call computeStencil. Instead verifies batchPtrs against a direct probeLeaf reference for all 18 non-center WENO5 stencil offsets that cross leaf boundaries.
- Passes at 0.1, 0.25, 0.5, 0.9 occupancy (2.3M–2.9M lane checks each).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Refactor computeNeededDirs to accept a pre-expanded Simd<uint32_t,SIMDw> vector, moving sentinel/masking responsibility to the single gather site where leafMask is known:

- kSentinelExpanded = expandVoxelOffset(292) = 0x41044104 (constexpr)
- The caller broadcasts the sentinel to all lanes, then overwrites leafMask lanes with real expandVoxelOffset() values before calling computeNeededDirs
- computeNeededDirs is now a pure add+reduce with no masking or cross-check

Carry trick (§8e): expandVoxelOffset packs lz/lx/ly into 6 guarded 3-bit groups; a single vpaddd ymm × 2 + vpor + vpand + shuffle-tree detects all six WENO5 directions simultaneously. kExpandCarryK = 0x514530C3.

AVX2 codegen confirmed via objdump:
- computeNeededDirs: vpbroadcastd + 2×vpaddd ymm + vpor/vpand ymm + vextracti128/vpsrldq shuffle-tree, no branches or calls in the hot path
- activeMask/leafMask in runPrototype: vpcmpeqd ymm × 4 + vmovmskps ymm × 2
- Sentinel broadcast: 0x41044104 literal → vpbroadcastd → 2×vmovdqa ymm

An always-on scalar cross-check runs at every computeNeededDirs call site. verifyComputeNeededDirsSentinel() tests both the sentinel carry property and the straddle-lane non-pollution scenario before runPrototype(). StencilGather.md §8e and §8f updated to match the new API and codegen notes. Phase 1 prototype marked complete in §9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Documents the BatchAccessor — the SIMD-batch analog of ValueAccessor — developed from the ex_stencil_gather_cpu Phase 1 prototype discussion. Core concept: instead of caching the path to one leaf, cache the full 3×3×3 neighborhood of 27 leaf pointers around the current center leaf, serving SIMDw voxels per call.

Key design elements documented:

Eviction policy: fires on none_of(leafMask) only — straddle lanes do not evict. leafMask is the "partial-hit" signal with no scalar-accessor analog.

Prefetch coverage argument:
- WENO5 (R=3): the 6 extremal taps (±R,0,0), (0,±R,0), (0,0,±R) are necessary and sufficient — equivalent to the computeNeededDirs carry trick
- Box stencil (R=1): the 8 corner taps (±1,±1,±1) collectively cover all 26 non-center directions for any voxel position in the batch

Three-tier API:
- prefetch<di,dj,dk>(vo, leafMask, treeAcc)
- cachedGetValue<di,dj,dk>(vo, leafMask) — no treeAcc, cache assumed warm
- getValue<di,dj,dk>(vo, leafMask, treeAcc) — lazy combined (vanilla style)

Template <di,dj,dk> rationale vs runtime Coord: compile-time direction bit, dead-axis elimination, VDB convention alignment; a runtime Coord overload is provided for generic stencil adapters.

AVX2 profile: offset arithmetic (vpaddd ymm), lane split (vpcmpgtd ymm), gather from ≤2 leaf pointers (vgatherdps×2 + vpblendvb) — both scalar bottlenecks from the Phase 1 prototype are eliminated.

StencilGather.md: add a cross-reference to BatchAccessor.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Draft for CI and diff review. WIP.