
Add fast approximate reciprocal methods for float vectors #204

Open
tomcur wants to merge 2 commits into linebender:main from tomcur:approximate-recip

@tomcur (Member) commented Feb 22, 2026

x86 and AArch64 have instructions to calculate fast approximate
reciprocals, and these can speed up some algorithms quite nicely. For
example, sprinkling this into Vello's `flatten_simd.rs` cuts flattening
timings for GhostScript Tiger by 4% (though actually landing that there
requires a bit of thought about whether the lowered precision is
acceptable, of course!).

There is some detail here that this PR as-is doesn't attempt to solve:
the hardware estimates differ in precision. x86's `rcp` has about 12
bits of precision, AArch64's `vrecpe` about 8 bits. AArch64 has an
additional instruction, however, `vrecps`, to perform a Newton
refinement step, which bumps the precision to about 16 bits. That'd
look something like the following.

```rust
let x0 = vrecpeq_f32(a); // initial estimate, about 8 bits
// x0 * (2 - a * x0): one Newton step, roughly doubling the precision of `x0`
let x1 = vmulq_f32(x0, vrecpsq_f32(a, x0));
```
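
For intuition on why a single step roughly doubles the number of good bits, here is a minimal scalar sketch that runs on any target. It does not use the hardware instructions: `coarse_recip` is a hypothetical stand-in that fakes a low-precision estimate by truncating the mantissa of the exact reciprocal, and `refine` and `rel_error` are likewise illustrative helpers that are not part of this PR. If the estimate is `(1/a) * (1 - e)`, the step yields `(1/a) * (1 - e * e)`, so an error below 2^-8 becomes an error below roughly 2^-16, up to `f32` rounding.

```rust
/// Hypothetical stand-in for a hardware estimate: the exact `f32`
/// reciprocal with only the top `bits` mantissa bits kept.
/// Assumes `a` is finite and non-zero.
fn coarse_recip(a: f32, bits: u32) -> f32 {
    assert!(bits <= 23, "f32 has 23 explicit mantissa bits");
    let exact = 1.0 / a;
    // Clear the low (23 - bits) mantissa bits; sign and exponent are kept.
    let mask = !0u32 << (23 - bits);
    f32::from_bits(exact.to_bits() & mask)
}

/// One Newton-Raphson step for the reciprocal: x1 = x0 * (2 - a * x0).
fn refine(a: f32, x0: f32) -> f32 {
    x0 * (2.0 - a * x0)
}

/// Relative error of `x` as an estimate of 1/a, evaluated in f64.
fn rel_error(a: f32, x: f32) -> f64 {
    (f64::from(a) * f64::from(x) - 1.0).abs()
}
```

Calling `refine(a, coarse_recip(a, 8))` and comparing `rel_error` before and after the step shows the effect. Real estimates from `vrecpe` or `rcp` have differently shaped errors, so this only illustrates the quadratic convergence and says nothing about the exact bit counts on either target.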

AVX-512 additionally introduces `rcp14`, which computes the estimate to
14-bit precision with (I believe) the same performance as `rcp`, and
extends support to `f64`.

In any case, this method does the simplest thing of just exposing the
cheapest hardware estimate, similar to e.g. Highway's
`ApproximateReciprocal`.
}
#[inline(always)]
fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
self.splat_f32x4(1.0) / a
Collaborator

I haven't tried it, but does division work without splatting? I think it works for multiplication, at least.

}
#[inline(always)]
fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
self.div_f32x4(self.splat_f32x4(1.0), a)
Collaborator

Same comment as in the fallback.

unsafe { _mm_sqrt_ps(a.into()).simd_into(self) }
}
#[inline(always)]
fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
Collaborator

I'm wondering whether we should just spell out `reciprocal`. But it should be fine this way!

Member Author

Wondered the same thing, but decided to mirror e.g. `f32::recip`.

Collaborator

Fair!
