
simd: Generic-T Simd<T,W> abstraction#2192

Open
swahtz wants to merge 2 commits into AcademySoftwareFoundation:master from sifakis:simd-generic-t

Conversation

@swahtz
Contributor

@swahtz swahtz commented Apr 10, 2026

Summary

This PR adds openvdb/simd/Simd.h — a zero-dependency SIMD wrapper that lets code be written once as templates on a value type T and compiled for both scalar (T=float/double) and W-wide SIMD (T=Simd<float,W>) paths without #ifdef or duplicated logic. Two backends are selected automatically at compile time:

  • Backend A (OPENVDB_USE_STD_SIMD): wraps std::experimental::simd (C++ Parallelism TS v2), emitting native vector instructions directly without relying on the auto-vectorizer.
  • Backend B (default, C++17): wraps std::array<T,W> with fixed-count element-wise loops, from which auto-vectorizers produce equivalent SIMD code.

Migration to std::simd (C++26) will be a one-line change in the backend detection guard; all call sites remain unchanged.

Proposed Benefits

  • Dependency Management: PR #2190 (Introduce VCL, SIMD wrappers, and Vectorized RasterizeSDF(Spheres)) introduces a new dependency on the VCL library. We offer an alternative that replicates the API/semantics of std::simd<type, width> (either as a thin C++17-compatible emulation layer, or as a wrapper for the STL-proposed types via <experimental/simd>) and introduces no new dependencies.
  • Hardware Portability: The VCL approach is strictly limited to x86 and does not support other ISAs. ARM support is increasingly critical for platforms such as Apple Silicon and NVIDIA's Grace CPU. Our alternative approach natively supports both architectures as well as SIMD on other ISAs (e.g. RISC-V, POWER, Qualcomm's Hexagon) via the compilers' auto-vectorizer.
  • Performance Parity: While VCL guarantees SIMD code generation, our preliminary tests show that our method generates almost identical SIMD instructions with -O2 or higher on both GCC and Clang. In practice, there does not appear to be a significant performance benefit compared to using VCL.
  • Single-Source Portability: VCL potentially requires 2 separate implementations of kernels, one for scalar CPU (or CUDA), and one for SIMD CPU (using simd::op() methods), whereas our approach is truly single-source. You write the code once, and it works seamlessly across CPU scalar, GPU/CUDA, and SIMD environments.
  • Architectural Flexibility: The VCL approach hardwires the width of data-level parallelism to the current architecture (e.g., 8 floats on AVX2, 16 on AVX-512). Our approach—aligning with what std::simd has converged on—makes the data width a programmatic decision rather than an architectural one. This offers significant advantages for handling padding and memory alignment.
  • Sustainability/Maintenance: The VCL library ultimately depends on intrinsics to emit assembly-level code. This requires maintenance of the operator set, even within the x86 architecture. Any new operator that is introduced in a newer generation of x86 will default to a Tuple<type,width> scalar implementation, until a maintainer introduces the respective intrinsics in our VCL clone. Our alternative has the potential to use the autovectorization capability of compilers to target newer architectures as they emerge.

sifakis and others added 2 commits April 8, 2026 23:01
…zeSdf(SphereSettings)

Adds openvdb/simd/Simd.h — a zero-dependency SIMD wrapper that enables
kernels to be written once as templates on a value type T and compiled
for both scalar (T=float/double) and W-wide SIMD (T=Simd<float,W>)
paths without #ifdef or duplicated logic.

Two backends are provided and selected automatically:
  - Backend A (OPENVDB_USE_STD_SIMD): wraps std::experimental::simd
    (C++ Parallelism TS v2) in a thin class; emits native vector
    instructions without relying on the auto-vectorizer.
  - Backend B (default, C++17): wraps std::array<T,W> with fixed-count
    element-wise loops; the auto-vectorizer produces equivalent code.

Unlike an explicit intrinsic wrapper library, Simd<T,W> uses operator
overloading so that kernels written with plain arithmetic (+, -, *, /,
comparisons) and the where()/hmin()/hall()/hany() helpers compile
identically for scalar and SIMD instantiations. explicit operator T()
on Simd<T,W> extracts lane 0 at write boundaries; the Scalar<T> trait
(detected via std::void_t on T::value_type) recovers the element type
generically. No external dependency is required in either backend.

Demonstrates the approach by porting rasterizeSdf(SphereSettings) in
PointRasterizeSDFImpl.h: SphericalTransfer gains rasterizePoints() for
batched dispatch and a Generic-T stamp<ScalarT>() whose body is shared
word-for-word between the scalar and SIMD paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- EllipsoidTransferQuat/Mat3: add rasterizePoints override that loops
  per-point, preventing the framework from routing ellipsoid transfers
  through SphericalTransfer::rasterizePoints (which instantiates
  rasterizeN2/stamp for FixedBandRadius<Vec3f>, a type that lacks
  minSq()/maxSq()). All 8 TestPointRasterizeSDF tests now pass.

- Simd.h hmin/hmax: fix stdx::reduce binary-op lambda. The library
  performs a tree reduction and passes intermediate simd<T,abi> chunks
  to the binary op, not scalars; change [](T a, T b) to [](auto a,
  auto b) using stdx::min/max for element-wise selection.

- simd/ASSEMBLY_NOTES.md: in-vivo assembly analysis of rasterizeN2<4>
  (Simd<double,4>, NullCodec, FixedBandRadius<double>, -O3 -mavx).
  Confirms 256-bit YMM throughout the hot path: vsqrtpd (4 sqrts in
  one instruction), vcmplepd+vmovmskpd (all-outside branch), vfmadd213pd
  (fused multiply-add with -march=native). Documents the two
  vzeroupper+call sequences (stdx reduction helpers) as a known minor
  overhead with mitigation strategy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
