Add fast approximate reciprocal methods for float vectors #204
Open
tomcur wants to merge 2 commits into linebender:main
Conversation
x86 and AArch64 have instructions to calculate fast approximate reciprocals, and these can speed up some algorithms quite nicely (e.g. sprinkling this in Vello's `flatten_simd.rs` results in -4% flattening timings for GhostScript Tiger; actually landing that there requires a bit of thought about whether the lowered precision is acceptable, of course!).

There is some detail here that this PR as-is doesn't attempt to solve. x86's `rcp` has about 12 bits of precision, AArch64's `vrecpe` about 8 bits. AArch64 has an additional instruction however, `vrecps`, to perform a Newton refinement step, which bumps the precision to about 16 bits. That'd look something like the following.

```rust
let x0 = vrecpeq_f32(a);
// `vrecpsq_f32(a, x0)` computes `2 - a * x0`, so this calculates
// x0 * (2 - x0 * a), roughly doubling the precision of the `x0` estimate
vmulq_f32(x0, vrecpsq_f32(a, x0))
```

Then, AVX512 introduces `rcp14`, which allows calculating to 14-bit precision with (I believe) the same performance as `rcp`, and extends support to `f64`.

In any case, this method does the simplest thing of just exposing the cheapest hardware estimate, similar to e.g. Highway's `ApproximateReciprocal`.
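To make the effect of the Newton refinement step concrete, here is a minimal, portable sketch in scalar Rust. It does not use any hardware instruction: `coarse_recip` is a hypothetical stand-in that fakes a low-precision estimate by truncating the mantissa of the exact reciprocal, and `newton_step` and `relative_error` are likewise illustrative names, not part of this PR. If the estimate has relative error `e`, one step `x0 * (2 - a * x0)` leaves a relative error of about `e * e`, which is why an 8-bit estimate ends up at roughly 16 bits.

```rust
/// Hypothetical stand-in for a hardware estimate: the exact reciprocal
/// with its mantissa truncated to `bits` bits (assumes `bits < 23`).
fn coarse_recip(a: f32, bits: u32) -> f32 {
    let exact = 1.0 / a;
    // Clear the low `23 - bits` mantissa bits; sign and exponent are kept.
    let mask = !((1u32 << (23 - bits)) - 1);
    f32::from_bits(exact.to_bits() & mask)
}

/// One Newton-Raphson step for the reciprocal: x1 = x0 * (2 - a * x0).
fn newton_step(a: f32, x0: f32) -> f32 {
    x0 * (2.0 - a * x0)
}

/// Relative error of `x` as an approximation of 1/a, measured in f64.
fn relative_error(a: f32, x: f32) -> f64 {
    let exact = 1.0 / f64::from(a);
    ((f64::from(x) - exact) / exact).abs()
}

fn main() {
    for a in [0.3f32, 1.7, 3.0, 7.0, 123.456] {
        let x0 = coarse_recip(a, 8);
        let x1 = newton_step(a, x0);
        println!(
            "a = {a}: estimate error {:.3e}, refined error {:.3e}",
            relative_error(a, x0),
            relative_error(a, x1),
        );
    }
}
```

The refined error is bounded by the square of the estimate's error plus a few `f32` rounding errors, so the step is only worth its extra multiply and fused step when the caller actually needs the added bits.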
691c363 to 52520f7
LaurenzV approved these changes Feb 23, 2026
    }
    #[inline(always)]
    fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
        self.splat_f32x4(1.0) / a
Collaborator
I haven't tried it, but does division work without splatting? I think it works for multiplication at least.
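For context on what "division without splatting" would require, here is a minimal sketch of the Rust operator-overloading pattern involved. `F32x4` below is a hypothetical stand-in type, not fearless_simd's `f32x4<S>`, and whether the library provides an equivalent `impl` is not stated in this thread. Writing `1.0 / a` with a scalar on the left only compiles if there is an `impl Div<Vector> for f32`.

```rust
use std::ops::Div;

/// Hypothetical four-lane vector type used only for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);

impl F32x4 {
    /// Broadcast a scalar to all four lanes.
    fn splat(v: f32) -> Self {
        F32x4([v; 4])
    }
}

/// Lane-wise vector / vector division.
impl Div for F32x4 {
    type Output = F32x4;
    fn div(self, rhs: F32x4) -> F32x4 {
        F32x4(std::array::from_fn(|i| self.0[i] / rhs.0[i]))
    }
}

/// Scalar / vector division: this impl is what lets `1.0 / a` compile
/// without an explicit splat at the call site.
impl Div<F32x4> for f32 {
    type Output = F32x4;
    fn div(self, rhs: F32x4) -> F32x4 {
        F32x4::splat(self) / rhs
    }
}

fn main() {
    let a = F32x4([1.0, 2.0, 4.0, 8.0]);
    let explicit = F32x4::splat(1.0) / a;
    let implicit = 1.0 / a;
    println!("{explicit:?} {implicit:?}");
}
```

The scalar-on-the-left `impl` is permitted by the orphan rules because the local type appears as the trait's type parameter. Multiplication is symmetric in practice, which may be why it is more commonly implemented for both operand orders.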
    }
    #[inline(always)]
    fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
        self.div_f32x4(self.splat_f32x4(1.0), a)
Collaborator
Same comment as in fallback
        unsafe { _mm_sqrt_ps(a.into()).simd_into(self) }
    }
    #[inline(always)]
    fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
Collaborator
I'm wondering whether we should just spell reciprocal out. But should be fine this way!
Member
Author
Wondered the same thing, but decided to mirror e.g. `f32::recip`.
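As a rough illustration of the naming and the precision trade-off discussed in this thread, the sketch below compares the standard library's exact `f32::recip` with SSE's `_mm_rcp_ps` estimate. The function name `approximate_recip4` and the array-based signature are assumptions made for this example; the body of the PR's x86 method is not shown in the diff above and may differ. On targets other than x86_64 the sketch falls back to the exact reciprocal.

```rust
/// Approximate 1/x for four lanes using SSE's `rcpps` estimate.
#[cfg(target_arch = "x86_64")]
fn approximate_recip4(a: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::{_mm_loadu_ps, _mm_rcp_ps, _mm_storeu_ps};
    let mut out = [0.0f32; 4];
    // SAFETY: SSE is part of the x86_64 baseline, and both pointers refer
    // to four valid `f32` lanes; the unaligned load/store variants are used.
    unsafe {
        let v = _mm_loadu_ps(a.as_ptr());
        _mm_storeu_ps(out.as_mut_ptr(), _mm_rcp_ps(v));
    }
    out
}

/// Fallback for other targets: the exact reciprocal, lane by lane.
#[cfg(not(target_arch = "x86_64"))]
fn approximate_recip4(a: [f32; 4]) -> [f32; 4] {
    a.map(f32::recip)
}

fn main() {
    let a = [1.0f32, 3.0, 0.5, 123.0];
    let approx = approximate_recip4(a);
    for (x, r) in a.iter().zip(approx.iter()) {
        let exact = x.recip();
        let rel = ((r - exact) / exact).abs();
        println!("1/{x}: approx {r}, exact {exact}, relative error {rel:e}");
    }
}
```

The exact values returned by the estimate can differ between CPU vendors, so code that needs reproducible results across machines may prefer the exact division.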