
Conversation

@Shnatsel (Contributor) commented Dec 11, 2025

This is needed for porting https://github.com/QuState/PHastFT/ to fearless_simd

Named after the equivalent method in wide: https://docs.rs/wide/latest/wide/struct.f32x8.html#method.mul_neg_add
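
For clarity, here is a scalar sketch of the intended semantics, assuming they match wide's mul_neg_add (the product is negated before the add); this is an illustration only, not code from this PR:

// Illustrative scalar model of mul_neg_add: negate the product, then add.
fn mul_neg_add_scalar(x: f32, m: f32, a: f32) -> f32 {
    -(x * m) + a
}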

AI use disclosure: this is generated wholesale by Claude Opus 4.5, using mul_add and mul_sub as reference. The structure is nearly identical, so I'm not terribly surprised that AI could manage it, even though it's still impressive. The result holds up to my review and the newly added tests pass, but I'm not deeply familiar with this crate so take that with a grain of salt.

@LaurenzV (Collaborator)

Can you not just do a.mul_add(-b, c) for this? I think this is one of the cases where we should check whether the compiler can automatically optimize it before adding a dedicated method. See also here; we will likely remove the existing mul_sub method at some point.

@Shnatsel (Contributor, Author) commented Dec 11, 2025

Unfortunately, no. I have fully working code based on wide, and when I tried swapping mul_neg_add(b, ...) for mul_add(-b, ...) the code didn't optimize properly.

My hot loop before:

vfmadd213ps ymm13, ymm9, ymm1
vmovaps ymm14, ymm4
vfmadd213ps ymm14, ymm10, ymm2
vfnmadd231ps ymm13, ymm11, ymm7
vfnmadd231ps ymm14, ymm12, ymm8
vfmadd213ps ymm7, ymm9, ymm5
vfmadd213ps ymm8, ymm10, ymm6
vfmadd231ps ymm7, ymm11, ymm3
vfmadd231ps ymm8, ymm12, ymm4
vfmsub213ps ymm1, ymm0, ymm13
vfmsub213ps ymm2, ymm0, ymm14
vfmsub213ps ymm5, ymm0, ymm7
vfmsub213ps ymm6, ymm0, ymm8

My hot loop after:

vsubps ymm14, ymm0, ymm8
vsubps ymm15, ymm0, ymm9
vfmadd213ps ymm8, ymm10, ymm6
vfmadd231ps ymm8, ymm12, ymm4
vfmadd213ps ymm4, ymm10, ymm2
vfmadd213ps ymm9, ymm11, ymm7
vfmadd231ps ymm9, ymm13, ymm5
vfmadd213ps ymm5, ymm11, ymm3
vfmadd231ps ymm4, ymm12, ymm14
vfmadd231ps ymm5, ymm13, ymm15
vfmsub213ps ymm2, ymm1, ymm4
vfmsub213ps ymm3, ymm1, ymm5
vfmsub213ps ymm6, ymm1, ymm8
vfmsub213ps ymm7, ymm1, ymm9

Notice that after the change the vfnmadd instruction (the one with an N in it) is gone from the assembly, and two vsubps instructions appear that weren't there before.

However, digging deeper, this may be an artifact of the way wide incorrectly implements floating-point sign negation: https://docs.rs/wide/latest/src/wide/lib.rs.html#371-390

The 0.0 - x that wide uses is not a semantically correct way to do sign negation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=99548e3611548e6fbcc1a185dde58642
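
A minimal standalone demonstration of the difference (mirroring the linked playground, using plain f32 rather than either crate's vector types):

fn main() {
    let x = 0.0_f32;
    let sub = 0.0 - x; // +0.0: subtracting from zero loses the sign of zero
    let neg = -x;      // -0.0: true IEEE 754 negation flips the sign bit
    assert!(sub.is_sign_positive());
    assert!(neg.is_sign_negative());
}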

I've reported this to wide: Lokathor/wide#241

fearless_simd does sign negation correctly via an XOR of the sign bit, so the semantics are at least correct.
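
A scalar sketch of that XOR approach, to show why it handles zero correctly (an illustration of the idea, not fearless_simd's actual code):

// Flip only the sign bit; correct for +/-0.0, infinities, and NaNs alike.
fn neg_via_xor(x: f32) -> f32 {
    f32::from_bits(x.to_bits() ^ 0x8000_0000)
}

fn main() {
    // Unlike 0.0 - x, this preserves the sign of zero.
    assert_eq!(neg_via_xor(0.0_f32).to_bits(), (-0.0_f32).to_bits());
}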

@Shnatsel (Contributor, Author)

I've fixed float negation in wide, and after my fix it optimizes into the negated FMA instruction on both AVX2 and AVX-512. SSE4.2 doesn't have FMA, so this covers everything.

@Shnatsel (Contributor, Author)

I have to reopen this. While the regular mul_add optimizes fine on x86 (both AVX2 and AVX-512), it does NOT automatically optimize on ARM (NEON). This caused a massive regression for my FFT implementation, because FFTs are just FMAs plus loads/stores.

@Shnatsel reopened this Dec 13, 2025
@LaurenzV (Collaborator)

Ah, that's sad. :( I mean, from my side it would be fine to merge (at least temporarily), but it would be good to hear some other opinions.

@valadaptive (Contributor)

I have to reopen this. While the regular mul_add optimizes fine on x86 (both AVX2 and AVX-512), it does NOT automatically optimize on ARM (NEON). This caused a massive regression for my FFT implementation, because FFTs are just FMAs plus loads/stores.

Have you checked that this PR actually improves ARM codegen? The Rust intrinsics themselves lower to a separate negate + multiply-add (the file is too big to display in GitHub, but here it is):

pub fn vfmsq_f32(a: float32x4_t, b: float32x4_t, c: float32x4_t) -> float32x4_t {
    unsafe {
        let b: float32x4_t = simd_neg(b);
        vfmaq_f32(a, b, c)
    }
}

@Shnatsel (Contributor, Author)

Not yet.

I have verified that the mul_neg_add implementation in the wide crate improves NEON performance significantly. I'll inspect the codegen for wide and this PR more closely.

@valadaptive (Contributor) commented Dec 14, 2025

It was suggested above:

Can you not just do a.mul_add(-b, c) for this?

But maybe also try (-a).mul_add(b, c). The operand order gets a bit confusing, but I think that's what the Rust intrinsics lower to.
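
In other words, the two candidate spellings are the following (a hypothetical scalar helper just to show the operand orders; with vector types the same two forms apply lane-wise, and both compute c - a*b):

fn both_forms(a: f32, b: f32, c: f32) -> (f32, f32) {
    let v1 = a.mul_add(-b, c);   // negate the second multiplicand
    let v2 = (-a).mul_add(b, c); // negate the first multiplicand instead
    (v1, v2)
}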

If only one of the two actually lowers to a negate-multiply-add instruction, that's definitely something we should document.

@valadaptive (Contributor)

I've figured out why mul_neg_add is still faster in wide on ARM: LLVM recognizes that x ^ -0.0 is equivalent to -x only on x86. On other architectures, wide should use the "negate" intrinsics that those architectures provide.
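
For the record, a sketch of what "use the architecture's negate intrinsic" could look like on AArch64; this is illustrative only, not the actual patch in Lokathor/wide#243:

#[cfg(target_arch = "aarch64")]
fn neg_lanes(x: core::arch::aarch64::float32x4_t) -> core::arch::aarch64::float32x4_t {
    // vnegq_f32 lowers to the NEON fneg instruction, which the backend can
    // then fold into a fused negate-multiply-add alongside an adjacent FMA.
    unsafe { core::arch::aarch64::vnegq_f32(x) }
}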

I've opened a PR to wide to fix this (Lokathor/wide#243). With that fix, PhastFT compiles to the exact same ARM assembly whether you use mul_neg_add or mul_add with a negation.

@Shnatsel (Contributor, Author)

That's great, thank you!

However, that also seems pretty fragile and relies on compiler optimizations far more than I'd like. If this regressed at any point due to a change in LLVM behavior, it would be a dramatic performance hit for PhastFT. Given that I'm already expressing the algorithm in terms of wrappers around intrinsics, I'd much rather use a dedicated mul_neg_add than rely on compiler optimizations that may change from version to version.

@valadaptive (Contributor)

As I mentioned above, all of the "fused" negate-multiply-add intrinsics are already implemented in Rust itself as a negate followed by a regular multiply-add. There is no way to avoid relying on compiler optimizations here.
