
[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps#2186

Draft
sifakis wants to merge 15 commits into AcademySoftwareFoundation:master from sifakis:vbm-cpu-port

Conversation

@sifakis
Contributor

@sifakis sifakis commented Mar 27, 2026

Draft for CI and diff review. WIP.

@sifakis sifakis changed the title NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps [WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps Mar 30, 2026
sifakis and others added 2 commits April 2, 2026 13:46
Add WenoLeafPtrs<BuildT>, resolveWenoLeafPtrs, and computeWenoStencil as
static __device__ members of VoxelBlockManager<Log2BlockWidth>. These
implement the first phase of a two-function WENO5 stencil gather:
resolveWenoLeafPtrs performs exactly 3 probeLeaf calls (one per axis) to
resolve neighbor leaf pointers; computeWenoStencil fills a caller-provided
array with the 19 global sequential indices using WenoPt<i,j,k>::idx.

voxelOffset arithmetic uses octal notation: NanoVDB leaf layout encodes
(x,y,z) as x*64+y*8+z, so x/y/z strides are 0100/010/1 in octal.
WenoPt<i,j,k>::idx is used throughout to remain independent of any future
re-alignment with OpenVDB's NineteenPt (which uses a different convention).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds the ex_voxelBlockManager_host_cuda example demonstrating the CPU and
CUDA VoxelBlockManager implementations, along with design documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
sifakis and others added 13 commits April 2, 2026 13:50
Remove DecodeInverseMapsCPUPlan.md (implementation complete) and distill
its non-obvious design decisions into the knowledge base:

- §11: decodeInverseMaps is intentionally single-threaded/stateless;
  caller distributes blocks; contrast with cooperative GPU version.
- §12: mPrefixSum is bypassed for bulk access — recomputing from raw mask
  words via buildMaskPrefixSums is cheaper than unpacking 9-bit fields;
  mPrefixSum is still used for the cross-word offset in Step 5.
- §13: output fill is range-fill + contiguous copy (not scatter) because
  shuffleDownMask produces a sorted compacted array; std::fill/copy caveat
  for alignment when output arrays come from TLS or stack pointers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Implementation complete; design rationale distilled into VBMImplementationKnowledge.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Design reference for the per-block stencil gather kernel: decodes inverse
maps into block-local scratch, then resolves neighbor leaf pointers and
fills N-point stencil index arrays for all active voxels in the block.
WENO5 (N=19, R=3) is the motivating instance; architecture is stencil-agnostic.
Covers GPU inner loop, CPU SIMD batch design (SIMDw=16, probeLeaf dedup),
unified StencilLeafPtrs template, and reach-R generalization considerations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- §3: Stencil type as template parameter needs index→offsets direction
  (for-each-slot gather loop), not the offsets→index direction of WenoPt.
  Clarify relationship to BaseStencil/WenoStencil: geometry-only descriptor,
  no accessor coupling.
- §4: Kernel lambda signature std::array<ValueType,K> kernel(const ValueType* u);
  output is homogeneous std::array (not tuple); K=1 degenerates to scalar;
  SoA output layout results[k][BlockWidth] for SIMD efficiency.
- Renumber §4-§8 → §5-§9; update open questions accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…notes

Adds a self-contained test (lift_test.cpp) exploring a generic SIMD-lifting
abstraction: given a scalar tuple→tuple kernel, liftToSimd<W> produces an
SoA-wide version that loops over W lanes and is the auto-vectorization target.

The motivating kernel is WENO5 normSqGrad (19-point stencil, matching
WenoStencil::normSqGrad from Stencils.h).  The six weno5() calls vectorize
cleanly; godunovsNormSqrd() blocks vectorization in two distinct ways
depending on how it is written:

  1. std::max / bool isOutside ternaries → "control flow in loop"
  2. float sign + fmaxf (no ternaries) → "no vectype for stmt" due to
     GCC's inability to see through std::tuple's recursive-inheritance
     struct layout in GIMPLE alias analysis

INVESTIGATION.md documents all experiments, findings, current blockers,
and proposed next steps (pointer-cache approach, Clang comparison, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Evangelos Sifakis <esifakis@gmail.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduce nanovdb::util::Simd<T,W> (simd_test/Simd.h) — a minimal
header-only SIMD abstraction backed by std::array<T,W> with arithmetic
operators, SimdMask, min/max, and where().  Mirrors the C++26 std::simd
interface for forward compatibility.

Rewrite the WENO5 normSqGrad kernel as a template on T:
- T=float  : scalar __hostdev__ path for GPU (one thread per voxel)
- T=Simd<float,W> : W-wide CPU path (one call per batch)

A single templated godunovsNormSqrd + normSqGrad definition serves both
execution contexts with no #ifdef, structurally matching Stencils.h.
Clang 18 vectorizes the Simd<float,16> instantiation (691 ymm instructions
in the hot function, assembly-verified); GCC 13 does not.

Update INVESTIGATION.md with the full scoreboard, both approaches, and
next steps (GCC intrinsics path, benchmarking, nanovdb/util/ integration).

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Auto-detect <experimental/simd> (Parallelism TS v2) via __has_include and
__cpp_lib_experimental_parallel_simd.  When available, Simd<T,W> and
SimdMask<T,W> become thin wrappers around fixed_size_simd / fixed_size_simd_mask,
delegating all arithmetic to the standard type.

The TS v2 where(mask, v) is a 2-arg masked-assignment proxy; wrap it into
the 3-arg select(mask, a, b) form expected by the kernels.

Verified with clang++-18 -std=c++26: both paths produce identical assembly
(1275 ymm instructions, PASS on all 16 lanes), confirming Clang optimizes
through the wrapper completely.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Document the std::experimental::simd backend alongside the std::array
default, including the TS v2 where() adaptation, the auto-detection
mechanism, and the assembly comparison showing byte-for-byte identical
output between the two backends under Clang 18.

Update the vectorization results table and open questions accordingly.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
StencilKernel.h — new prototype header:
  - BaseStencilKernel<T, SIZE>: owns mValues[], mDx2, mInvDx2; no grid coupling
  - WenoStencilKernel<T>: derives from above, provides normSqGrad()
  - WENO5<T> and GodunovsNormSqrd<T, MaskT>: free functions mirroring Stencils.h
  - T=float for GPU scalar path, T=Simd<float,W> for CPU batch path

lift_test.cpp — rewritten to use WenoStencilKernel<T> directly:
  - SIMD and scalar reference paths both instantiate the same class
  - dx passed to constructor; mValues populated via operator[]

Simd.h — refinements:
  - Simd<T,W> and SimdMask<T,W> in Backend A are now pure type aliases for
    stdx::fixed_size_simd / fixed_size_simd_mask (no wrapper struct)
  - element_aligned_tag / element_aligned: portable load/store tag, always
    present; aliases stdx::element_aligned_tag in Backend A, dummy struct in B
  - Backend B load constructor and store() accept element_aligned_tag (defaulted)
  - NANOVDB_NO_STD_SIMD opt-out flag to force Backend B

INVESTIGATION.md — updated:
  - Approach B section updated to reflect class hierarchy instead of free functions
  - Backend B GCC note: the struct-access failure was specific to Approach A's
    liftToSimd outer-lane loop; Backend B's fixed-count operator loops do vectorize
    on GCC when used with the Generic-T class hierarchy
  - New ymm tables for both backends under GCC (Backend A: 1267 total,
    Backend B: 619 total); both pass correctness

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…chPtrs)

Add full CPU batch neighbor-leaf resolution design to the planning doc:

- §6: Replace StencilLeafPtrs struct with layered design — shared 3×3×3 bit
  encoding for probedMask/ptrs[27], stencil-specific batchPtrs population
  (batchPtrs[4][SIMDw] for WENO5, batchPtrs[3][3][3][SIMDw] for box stencil),
  and GPU scalar design note kept separate.

- §8d: Update lazy-probe section to reference ptrs[27] and 27-bit probedMask;
  add batchPtrs population step (Phase 2) after the probeLeaf loop.

- §8e: Update computeNeededDirs direction table to use 3×3×3 bit positions
  (bits 4,10,12,14,16,22 for WENO5 face neighbors).

- §8f/§8g: Minor notation updates to match ptrs[27] naming.

- §9: Resolve ptrs-layout and nExtraLeaves open questions; add prototype scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Standalone CPU-only executable that verifies the neighbor leaf resolution
design from StencilGather.md §8d–§8f:

- For each VBM block: calls decodeInverseMaps, recomputes nLeaves from
  jumpMap, then processes SIMDw=16 batches with the full probedMask /
  lazy-probeLeaf / batchPtrs[4][SIMDw] pipeline.
- Does not call computeStencil.  Instead verifies batchPtrs against a
  direct probeLeaf reference for all 18 non-center WENO5 stencil offsets
  that cross leaf boundaries.
- Passes at 0.1, 0.25, 0.5, 0.9 occupancy (2.3M–2.9M lane checks each).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Refactor computeNeededDirs to accept a pre-expanded Simd<uint32_t,SIMDw>
vector, moving sentinel/masking responsibility to the single gather site
where leafMask is known:

- kSentinelExpanded = expandVoxelOffset(292) = 0x41044104 (constexpr)
- Caller broadcasts sentinel to all lanes, overwrites leafMask lanes with
  real expandVoxelOffset() values before calling computeNeededDirs
- computeNeededDirs is now a pure add+reduce with no masking or cross-check

Carry trick (§8e): expandVoxelOffset packs lz/lx/ly into 6 guarded 3-bit
groups; a single vpaddd ymm × 2 + vpor + vpand + shuffle-tree detects all
six WENO5 directions simultaneously.  kExpandCarryK = 0x514530C3.

AVX2 codegen confirmed via objdump:
- computeNeededDirs: vpbroadcastd + 2×vpaddd ymm + vpor/vpand ymm +
  vextracti128/vpsrldq shuffle-tree, no branches or calls in hot path
- activeMask/leafMask in runPrototype: vpcmpeqd ymm × 4 + vmovmskps ymm × 2
- Sentinel broadcast: 0x41044104 literal → vpbroadcastd → 2×vmovdqa ymm

Always-on scalar cross-check at every computeNeededDirs call site.
verifyComputeNeededDirsSentinel() tests both the sentinel carry property
and the straddle-lane non-pollution scenario before runPrototype().

StencilGather.md §8e and §8f updated to match new API and codegen notes.
Phase 1 prototype marked complete in §9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Documents the BatchAccessor — the SIMD-batch analog of ValueAccessor —
developed from the ex_stencil_gather_cpu Phase 1 prototype discussion.

Core concept: instead of caching the path to one leaf, cache the full
3×3×3 neighborhood of 27 leaf pointers around the current center leaf,
serving SIMDw voxels per call.

Key design elements documented:

Eviction policy: fires on none_of(leafMask) only — straddle lanes do not
evict. leafMask is the "partial-hit" signal with no scalar-accessor analog.

Prefetch coverage argument:
- WENO5 (R=3): 6 extremal taps (±R,0,0),(0,±R,0),(0,0,±R) are necessary
  and sufficient — equivalent to the computeNeededDirs carry trick
- Box stencil (R=1): 8 corner taps (±1,±1,±1) collectively cover all 26
  non-center directions for any voxel position in the batch

Three-tier API:
- prefetch<di,dj,dk>(vo, leafMask, treeAcc)
- cachedGetValue<di,dj,dk>(vo, leafMask)  — no treeAcc, cache assumed warm
- getValue<di,dj,dk>(vo, leafMask, treeAcc) — lazy combined (vanilla style)

Template <di,dj,dk> rationale vs runtime Coord: compile-time direction bit,
dead-axis elimination, VDB convention alignment; runtime Coord overload
provided for generic stencil adapters.

AVX2 profile: offset arithmetic (vpaddd ymm), lane split (vpcmpgtd ymm),
gather from ≤2 leaf pointers (vgatherdps×2 + vpblendvb) — both scalar
bottlenecks from Phase 1 prototype are eliminated.

StencilGather.md: add cross-reference to BatchAccessor.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>