simd: Generic-T Simd<T,W> abstraction #2192
Open
swahtz wants to merge 2 commits into AcademySoftwareFoundation:master from …
…zeSdf(SphereSettings)
Adds openvdb/simd/Simd.h — a zero-dependency SIMD wrapper that enables
kernels to be written once as templates on a value type T and compiled
for both scalar (T=float/double) and W-wide SIMD (T=Simd<float,W>)
paths without #ifdef or duplicated logic.
Two backends are provided and selected automatically:
- Backend A (OPENVDB_USE_STD_SIMD): wraps std::experimental::simd
(C++ Parallelism TS v2) in a thin class; emits native vector
instructions without relying on the auto-vectorizer.
- Backend B (default, C++17): wraps std::array<T,W> with fixed-count
element-wise loops; the auto-vectorizer produces equivalent code.
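To make the Backend B idea concrete, here is a minimal sketch of an array-backed wrapper and a kernel written once as a template on T. The names SimdSketch, distSq and distSqDemo are hypothetical and are not taken from Simd.h; the actual header may differ in every detail.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an array-backed backend (not the PR's Simd.h).
template <typename T, std::size_t W>
struct SimdSketch {
    std::array<T, W> lanes{};

    SimdSketch() = default;
    SimdSketch(T s) { lanes.fill(s); }  // broadcast one scalar to all lanes

    T  operator[](std::size_t i) const { return lanes[i]; }
    T& operator[](std::size_t i)       { return lanes[i]; }
};

// Fixed-count element-wise loops: W is a compile-time constant, which is the
// loop shape auto-vectorizers are expected to turn into vector instructions.
template <typename T, std::size_t W>
SimdSketch<T, W> operator+(SimdSketch<T, W> a, SimdSketch<T, W> b) {
    SimdSketch<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r[i] = a[i] + b[i];
    return r;
}

template <typename T, std::size_t W>
SimdSketch<T, W> operator-(SimdSketch<T, W> a, SimdSketch<T, W> b) {
    SimdSketch<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r[i] = a[i] - b[i];
    return r;
}

template <typename T, std::size_t W>
SimdSketch<T, W> operator*(SimdSketch<T, W> a, SimdSketch<T, W> b) {
    SimdSketch<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r[i] = a[i] * b[i];
    return r;
}

// Kernel written once: T may be float, double, or SimdSketch<float, W>.
template <typename T>
T distSq(T px, T py, T pz, T cx, T cy, T cz) {
    const T dx = px - cx, dy = py - cy, dz = pz - cz;
    return dx * dx + dy * dy + dz * dz;
}

// Usage: the same kernel body instantiated for a scalar and a 4-wide type.
inline void distSqDemo() {
    assert(distSq(1.0f, 2.0f, 3.0f, 0.0f, 0.0f, 0.0f) == 14.0f);

    SimdSketch<float, 4> px;
    for (std::size_t i = 0; i < 4; ++i) px[i] = static_cast<float>(i);
    const SimdSketch<float, 4> zero(0.0f);
    const SimdSketch<float, 4> r = distSq(px, zero, zero, zero, zero, zero);
    assert(r[3] == 9.0f);  // lane 3: (3 - 0)^2
}
```

Whether the loops are actually vectorized depends on the compiler and optimization flags; the sketch only shows why no #ifdef is needed at the call site.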
Unlike an explicit intrinsic wrapper library, Simd<T,W> uses operator
overloading so that kernels written with plain arithmetic (+, -, *, /,
comparisons) and the where()/hmin()/hall()/hany() helpers compile
identically for scalar and SIMD instantiations. explicit operator T()
on Simd<T,W> extracts lane 0 at write boundaries; the Scalar<T> trait
(detected via std::void_t on T::value_type) recovers the element type
generically. No external dependency is required in either backend.
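The helper and trait mechanics described above could look roughly like the following sketch. It assumes a hypothetical array-backed SimdSketch and MaskSketch, and the kernels smallestClamped and allOutside are invented for illustration; only the names where(), hmin(), hall(), Scalar<T> and the std::void_t detection idea come from the commit message.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

template <std::size_t W>
struct MaskSketch { std::array<bool, W> lanes{}; };  // hypothetical mask type

template <typename T, std::size_t W>
struct SimdSketch {                                   // hypothetical SIMD type
    using value_type = T;
    std::array<T, W> lanes{};
    SimdSketch() = default;
    SimdSketch(T s) { lanes.fill(s); }
    explicit operator T() const { return lanes[0]; }  // extract lane 0
};

template <typename T, std::size_t W>
MaskSketch<W> operator<(const SimdSketch<T, W>& a, const SimdSketch<T, W>& b) {
    MaskSketch<W> m;
    for (std::size_t i = 0; i < W; ++i) m.lanes[i] = a.lanes[i] < b.lanes[i];
    return m;
}

// Scalar overloads, so a kernel compiles unchanged when T is float or double.
template <typename T> T where(bool m, T a, T b) { return m ? a : b; }
template <typename T>
std::enable_if_t<std::is_arithmetic_v<T>, T> hmin(T x) { return x; }
inline bool hall(bool m) { return m; }

// SIMD overloads: per-lane select, horizontal minimum, all-lanes-true test.
template <typename T, std::size_t W>
SimdSketch<T, W> where(const MaskSketch<W>& m, const SimdSketch<T, W>& a,
                       const SimdSketch<T, W>& b) {
    SimdSketch<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r.lanes[i] = m.lanes[i] ? a.lanes[i] : b.lanes[i];
    return r;
}
template <typename T, std::size_t W>
T hmin(const SimdSketch<T, W>& x) {
    T r = x.lanes[0];
    for (std::size_t i = 1; i < W; ++i) r = x.lanes[i] < r ? x.lanes[i] : r;
    return r;
}
template <std::size_t W>
bool hall(const MaskSketch<W>& m) {
    for (bool b : m.lanes) if (!b) return false;
    return true;
}

// Scalar<T>: T itself for plain types, T::value_type when that member exists.
template <typename T, typename = void>
struct Scalar { using type = T; };
template <typename T>
struct Scalar<T, std::void_t<typename T::value_type>> {
    using type = typename T::value_type;
};
static_assert(std::is_same_v<Scalar<float>::type, float>);
static_assert(std::is_same_v<Scalar<SimdSketch<float, 4>>::type, float>);

// Kernels written once against the helpers.
template <typename T>
typename Scalar<T>::type smallestClamped(T x, T hi) {
    return hmin(where(x < hi, x, hi));  // clamp each lane to hi, then reduce
}
template <typename T>
bool allOutside(T distSq, T maxSq) { return hall(maxSq < distSq); }
```

The scalar overloads are declared before the kernels because argument-dependent lookup does not apply to fundamental types such as float.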
Demonstrates the approach by porting rasterizeSdf(SphereSettings) in
PointRasterizeSDFImpl.h: SphericalTransfer gains rasterizePoints() for
batched dispatch and a Generic-T stamp<ScalarT>() whose body is shared
word-for-word between the scalar and SIMD paths.
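A batched dispatch of this kind can be sketched as follows, under loud assumptions: SimdSketch, stampSketch and rasterizePointsSketch are hypothetical, and the "stamp" body is a placeholder (squaring a radius) rather than the real SDF stamping logic. The point is only the shape: full W-wide batches go through the SIMD instantiation and the leftover points go through the scalar instantiation of the same template.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical array-backed SIMD type with just the operator the demo needs.
template <typename T, std::size_t W>
struct SimdSketch {
    using value_type = T;
    std::array<T, W> lanes{};
};

template <typename T, std::size_t W>
SimdSketch<T, W> operator*(const SimdSketch<T, W>& a, const SimdSketch<T, W>& b) {
    SimdSketch<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r.lanes[i] = a.lanes[i] * b.lanes[i];
    return r;
}

// Placeholder for a generic-T stamp: one body shared by scalar and SIMD paths.
template <typename ScalarT>
ScalarT stampSketch(ScalarT radius) {
    return radius * radius;
}

// Hypothetical batched dispatch over a flat list of per-point radii.
inline std::vector<float> rasterizePointsSketch(const std::vector<float>& radii) {
    constexpr std::size_t W = 4;
    std::vector<float> out(radii.size());
    std::size_t i = 0;

    // Full batches: load W points, run the SIMD instantiation, store W results.
    for (; i + W <= radii.size(); i += W) {
        SimdSketch<float, W> r;
        for (std::size_t k = 0; k < W; ++k) r.lanes[k] = radii[i + k];
        const SimdSketch<float, W> s = stampSketch(r);
        for (std::size_t k = 0; k < W; ++k) out[i + k] = s.lanes[k];
    }

    // Tail: fewer than W points remain, so use the scalar instantiation.
    for (; i < radii.size(); ++i) out[i] = stampSketch(radii[i]);
    return out;
}
```

A scalar tail is one possible way to handle point counts that are not a multiple of W; padding the last batch is another.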
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- EllipsoidTransferQuat/Mat3: add rasterizePoints override that loops
per-point, preventing the framework from routing ellipsoid transfers
through SphericalTransfer::rasterizePoints (which instantiates
rasterizeN2/stamp for FixedBandRadius<Vec3f>, a type that lacks
minSq()/maxSq()). All 8 TestPointRasterizeSDF tests now pass.
- Simd.h hmin/hmax: fix stdx::reduce binary-op lambda. The library
performs a tree reduction and passes intermediate simd<T,abi> chunks
to the binary op, not scalars; change [](T a, T b) to
[](auto a, auto b) using stdx::min/max for element-wise selection.
- simd/ASSEMBLY_NOTES.md: in-vivo assembly analysis of rasterizeN2<4>
(Simd<double,4>, NullCodec, FixedBandRadius<double>, -O3 -mavx).
Confirms 256-bit YMM throughout the hot path: vsqrtpd (4 sqrts in one
instruction), vcmplepd+vmovmskpd (all-outside branch), vfmadd213pd
(fused multiply-add with -march=native). Documents the two
vzeroupper+call sequences (stdx reduction helpers) as a known minor
overhead with mitigation strategy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Summary
This PR adds openvdb/simd/Simd.h, a zero-dependency SIMD wrapper that lets code be written once as templates on a value type T and compiled for both scalar (T=float/double) and W-wide SIMD (T=Simd<float,W>) paths without #ifdef or duplicated logic. Two backends are selected automatically at compile time:
- Backend A (OPENVDB_USE_STD_SIMD): wraps std::experimental::simd (C++ Parallelism TS v2), emitting native vector instructions directly without relying on the auto-vectorizer.
- Backend B (default, C++17): wraps std::array<T,W> with fixed-count element-wise loops, from which auto-vectorizers produce equivalent SIMD code.
Migration to std::simd (C++26) will be a one-line change in the backend detection guard; all call sites remain unchanged.
Proposed Benefits
- … std::simd<type, width> (either by a thin C++17-compatible emulation layer, or as a wrapper for the STL-proposed types) that introduces no new dependencies and is supported via <experimental/simd>.
- … -O2 or higher on both GCC and Clang. In practice, there does not appear to be a significant performance benefit compared to using VCL.
- … simd::op() methods), whereas our approach is truly single-source. You write the code once, and it works seamlessly across CPU scalar, GPU/CUDA, and SIMD environments.
- … AVX2, 16 on AVX-512). Our approach, aligning with what std::simd has converged on, makes the data width a programmatic decision rather than an architectural one. This offers significant advantages for handling padding and memory alignment.
- … Tuple<type,width> scalar implementation, until a maintainer introduces the respective intrinsics in our VCL clone. Our alternative has the potential to use the autovectorization capability of compilers to target newer architectures as they emerge.
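For readers wondering what a compile-time backend detection guard might look like, here is a rough sketch. Only the OPENVDB_USE_STD_SIMD macro name comes from this PR; SKETCH_BACKEND_STD_SIMD, the sketch namespace, Lanes and backendName are hypothetical, and the real guard in Simd.h may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Opt in to the TS v2 backend only when requested and when the header exists.
#if defined(OPENVDB_USE_STD_SIMD) && defined(__has_include)
#  if __has_include(<experimental/simd>)
#    include <experimental/simd>
#    define SKETCH_BACKEND_STD_SIMD 1
#  endif
#endif

#ifndef SKETCH_BACKEND_STD_SIMD
#  include <array>
#endif

namespace sketch {

#ifdef SKETCH_BACKEND_STD_SIMD
// Backend A: storage is the TS v2 fixed-size vector type.
template <typename T, int W>
using Lanes = std::experimental::fixed_size_simd<T, W>;
constexpr std::string_view backendName() { return "std::experimental::simd"; }
#else
// Backend B: storage is a plain array; arithmetic would be fixed-count loops.
template <typename T, int W>
using Lanes = std::array<T, static_cast<std::size_t>(W)>;
constexpr std::string_view backendName() { return "std::array"; }
#endif

} // namespace sketch
```

In a layout like this, call sites refer only to sketch::Lanes, so switching the underlying type means editing the guard and the alias rather than any kernel code.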