
simd: Generic-T Simd<T,W> abstraction#2192

Open
swahtz wants to merge 2 commits into AcademySoftwareFoundation:master from sifakis:simd-generic-t

Conversation

@swahtz
Contributor

@swahtz swahtz commented Apr 10, 2026

Summary

This PR adds openvdb/simd/Simd.h — a zero-dependency SIMD wrapper that lets code be written once as templates on a value type T and compiled for both scalar (T=float/double) and W-wide SIMD (T=Simd<float,W>) paths without #ifdef or duplicated logic. Two backends are selected automatically at compile time:

  • Backend A (OPENVDB_USE_STD_SIMD): wraps std::experimental::simd (C++ Parallelism TS v2), emitting native vector instructions directly without relying on the auto-vectorizer.
  • Backend B (default, C++17): wraps std::array<T,W> with fixed-count element-wise loops, from which auto-vectorizers produce equivalent SIMD code.

Migration to std::simd (C++26) will be a one-line change in the backend detection guard; all call sites remain unchanged.

Proposed Benefits

  • Dependency Management: PR #2190 (Introduce VCL, SIMD wrappers, and Vectorized RasterizeSDF(Spheres)) introduces a new dependency on the VCL library. We offer an alternative that replicates the API/semantics of std::simd<type, width> (either as a thin C++17-compatible emulation layer, or as a wrapper for the STL-proposed types via <experimental/simd>) and introduces no new dependencies.
  • Hardware Portability: The VCL approach is strictly limited to x86 and does not support other ISAs. ARM support is increasingly critical for platforms such as Apple Silicon and NVIDIA's Grace CPU. Our alternative approach natively supports both architectures as well as SIMD on other ISAs (e.g. RISC-V, POWER, Qualcomm's Hexagon) via the compilers' auto-vectorizer.
  • Performance Parity: While VCL guarantees SIMD code generation, our preliminary tests show that our method generates almost identical SIMD instructions with -O2 or higher on both GCC and Clang. In practice, there does not appear to be a significant performance benefit compared to using VCL.
  • Single-Source Portability: VCL potentially requires 2 separate implementations of kernels, one for scalar CPU (or CUDA), and one for SIMD CPU (using simd::op() methods), whereas our approach is truly single-source. You write the code once, and it works seamlessly across CPU scalar, GPU/CUDA, and SIMD environments.
  • Architectural Flexibility: The VCL approach hardwires the width of data-level parallelism to the current architecture (e.g., 8 floats on AVX2, 16 on AVX-512). Our approach—aligning with what std::simd has converged on—makes the data width a programmatic decision rather than an architectural one. This offers significant advantages for handling padding and memory alignment.
  • Sustainability/Maintenance: The VCL library ultimately depends on intrinsics to emit assembly-level code. This requires maintenance of the operator set, even within the x86 architecture. Any new operator that is introduced in a newer generation of x86 will default to a Tuple<type,width> scalar implementation, until a maintainer introduces the respective intrinsics in our VCL clone. Our alternative has the potential to use the autovectorization capability of compilers to target newer architectures as they emerge.

sifakis and others added 2 commits April 8, 2026 23:01
…zeSdf(SphereSettings)

Adds openvdb/simd/Simd.h — a zero-dependency SIMD wrapper that enables
kernels to be written once as templates on a value type T and compiled
for both scalar (T=float/double) and W-wide SIMD (T=Simd<float,W>)
paths without #ifdef or duplicated logic.

Two backends are provided and selected automatically:
  - Backend A (OPENVDB_USE_STD_SIMD): wraps std::experimental::simd
    (C++ Parallelism TS v2) in a thin class; emits native vector
    instructions without relying on the auto-vectorizer.
  - Backend B (default, C++17): wraps std::array<T,W> with fixed-count
    element-wise loops; the auto-vectorizer produces equivalent code.

Unlike an explicit intrinsic wrapper library, Simd<T,W> uses operator
overloading so that kernels written with plain arithmetic (+, -, *, /,
comparisons) and the where()/hmin()/hall()/hany() helpers compile
identically for scalar and SIMD instantiations. explicit operator T()
on Simd<T,W> extracts lane 0 at write boundaries; the Scalar<T> trait
(detected via std::void_t on T::value_type) recovers the element type
generically. No external dependency is required in either backend.

Demonstrates the approach by porting rasterizeSdf(SphereSettings) in
PointRasterizeSDFImpl.h: SphericalTransfer gains rasterizePoints() for
batched dispatch and a Generic-T stamp<ScalarT>() whose body is shared
word-for-word between the scalar and SIMD paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- EllipsoidTransferQuat/Mat3: add rasterizePoints override that loops
  per-point, preventing the framework from routing ellipsoid transfers
  through SphericalTransfer::rasterizePoints (which instantiates
  rasterizeN2/stamp for FixedBandRadius<Vec3f>, a type that lacks
  minSq()/maxSq()). All 8 TestPointRasterizeSDF tests now pass.

- Simd.h hmin/hmax: fix stdx::reduce binary-op lambda. The library
  performs a tree reduction and passes intermediate simd<T,abi> chunks
  to the binary op, not scalars; change [](T a, T b) to [](auto a,
  auto b) using stdx::min/max for element-wise selection.

- simd/ASSEMBLY_NOTES.md: in-vivo assembly analysis of rasterizeN2<4>
  (Simd<double,4>, NullCodec, FixedBandRadius<double>, -O3 -mavx).
  Confirms 256-bit YMM throughout the hot path: vsqrtpd (4 sqrts in
  one instruction), vcmplepd+vmovmskpd (all-outside branch), vfmadd213pd
  (fused multiply-add with -march=native). Documents the two
  vzeroupper+call sequences (stdx reduction helpers) as a known minor
  overhead with mitigation strategy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
