-
Notifications
You must be signed in to change notification settings - Fork 21
Add mul_neg_add method to floats #162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Can you not just do |
|
Unfortunately, no. I have fully working code based on My hot loop before: My hot loop after: Notice that after the change the However, digging deeper, this may be an artifact of the way
I've reported this to
|
|
I've fixed float negation in |
|
I have to reopen this. While the regular |
|
Ah that's sad. :( I mean from my side it would be fine to merge (at least temporarily), but would be good to hear some other opinons. |
Have you checked that this PR actually improves ARM codegen? The Rust intrinsics themselves lower to a separate negate + multiply-add (the file is too big to display in GitHub, but here it is): pub fn vfmsq_f32(a: float32x4_t, b: float32x4_t, c: float32x4_t) -> float32x4_t {
unsafe {
let b: float32x4_t = simd_neg(b);
vfmaq_f32(a, b, c)
}
} |
|
Not yet. I have verified that the |
|
You were suggested:
But maybe also try If only one of the two actually lowers to a negate-multiply-add instruction, that's definitely something we should document. |
|
I've figured out why I've opened a PR to |
|
That's great, thank you! However, that also seems pretty fragile and relies on compiler optimizations far more than I'd like. This regressing at any point due to a change to LLVM behavior would be a dramatic performance hit for PhastFT. Given that I'm already expressing the algorithm in terms of wrappers around intrinsics, I'd much rather use a dedicated |
|
As I mentioned above, all of the "fused" negate-multiply-add intrinsics are already implemented in Rust itself as a negate followed by a regular multiply-add. There is no way to avoid relying on compiler optimizations here. |
This is needed for porting https://github.com/QuState/PHastFT/ to fearless_simd
Named after the equivalent method in
wide: https://docs.rs/wide/latest/wide/struct.f32x8.html#method.mul_neg_addAI use disclosure: this is generated wholesale by Claude Opus 4.5, using mul_add and mul_sub as reference. The structure is nearly identical, so I'm not terribly surprised that AI could manage it, even though it's still impressive. The result holds up to my review and the newly added tests pass, but I'm not deeply familiar with this crate so take that with a grain of salt.