
Conversation

@Shnatsel (Contributor) commented Dec 11, 2025

This is needed for porting https://github.com/QuState/PHastFT/ to fearless_simd

Named after the equivalent method in wide: https://docs.rs/wide/latest/wide/struct.f32x8.html#method.mul_neg_add
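
For clarity, here is a scalar sketch of the intended semantics, assuming they match wide's mul_neg_add (the product is negated before the add); this is an illustration only, not code from this PR:

// Illustrative scalar model of mul_neg_add: negate the product, then add.
fn mul_neg_add_scalar(x: f32, m: f32, a: f32) -> f32 {
    -(x * m) + a
}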

AI use disclosure: this is generated wholesale by Claude Opus 4.5, using mul_add and mul_sub as reference. The structure is nearly identical, so I'm not terribly surprised that AI could manage it, even though it's still impressive. The result holds up to my review and the newly added tests pass, but I'm not deeply familiar with this crate so take that with a grain of salt.

@LaurenzV (Collaborator)

Can you not just do a.mul_add(-b, c) for this? I think this is one of the cases where we should check whether the compiler can automatically optimize it before adding a dedicated method. See also here; we will likely remove the existing mul_sub method at some point.

@Shnatsel (Contributor, Author) commented Dec 11, 2025

Unfortunately, no. I have fully working code based on wide, and when I tried swapping mul_neg_add(b, ...) for mul_add(-b, ...) the code didn't optimize properly.

My hot loop before:

vfmadd213ps ymm13, ymm9, ymm1
vmovaps ymm14, ymm4
vfmadd213ps ymm14, ymm10, ymm2
vfnmadd231ps ymm13, ymm11, ymm7
vfnmadd231ps ymm14, ymm12, ymm8
vfmadd213ps ymm7, ymm9, ymm5
vfmadd213ps ymm8, ymm10, ymm6
vfmadd231ps ymm7, ymm11, ymm3
vfmadd231ps ymm8, ymm12, ymm4
vfmsub213ps ymm1, ymm0, ymm13
vfmsub213ps ymm2, ymm0, ymm14
vfmsub213ps ymm5, ymm0, ymm7
vfmsub213ps ymm6, ymm0, ymm8

My hot loop after:

vsubps ymm14, ymm0, ymm8
vsubps ymm15, ymm0, ymm9
vfmadd213ps ymm8, ymm10, ymm6
vfmadd231ps ymm8, ymm12, ymm4
vfmadd213ps ymm4, ymm10, ymm2
vfmadd213ps ymm9, ymm11, ymm7
vfmadd231ps ymm9, ymm13, ymm5
vfmadd213ps ymm5, ymm11, ymm3
vfmadd231ps ymm4, ymm12, ymm14
vfmadd231ps ymm5, ymm13, ymm15
vfmsub213ps ymm2, ymm1, ymm4
vfmsub213ps ymm3, ymm1, ymm5
vfmsub213ps ymm6, ymm1, ymm8
vfmsub213ps ymm7, ymm1, ymm9

Notice that after the change the vfnmadd instruction (the one with an N in it) is gone from the assembly, and two vsubps instructions appear that weren't there before.

However, digging deeper, this may be an artifact of the way wide incorrectly implements floating-point sign negation: https://docs.rs/wide/latest/src/wide/lib.rs.html#371-390

The 0.0 - x that wide uses is not a semantically correct way to do sign negation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=99548e3611548e6fbcc1a185dde58642
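
A minimal standalone demonstration of the difference (mirroring the linked playground, using plain f32 rather than either crate's vector types):

fn main() {
    let x = 0.0_f32;
    let sub = 0.0 - x; // +0.0: subtracting from zero loses the sign of zero
    let neg = -x;      // -0.0: true IEEE 754 negation flips the sign bit
    assert!(sub.is_sign_positive());
    assert!(neg.is_sign_negative());
}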

I've reported this to wide: Lokathor/wide#241

fearless_simd does sign negation correctly via an XOR of the sign bit, so the semantics are at least correct.
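
A scalar sketch of that XOR approach, to show why it handles zero correctly (an illustration of the idea, not fearless_simd's actual code):

// Flip only the sign bit; correct for +/-0.0, infinities, and NaNs alike.
fn neg_via_xor(x: f32) -> f32 {
    f32::from_bits(x.to_bits() ^ 0x8000_0000)
}

fn main() {
    // Unlike 0.0 - x, this preserves the sign of zero.
    assert_eq!(neg_via_xor(0.0_f32).to_bits(), (-0.0_f32).to_bits());
}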

@Shnatsel (Contributor, Author)

I've fixed float negation in wide, and after my fix it optimizes into the negated FMA instruction on both AVX2 and AVX-512. SSE4.2 doesn't have FMA, so this covers everything.

@Shnatsel (Contributor, Author)

I have to reopen this. While the regular mul_add optimizes fine on x86 (both AVX2 and AVX-512), it does NOT automatically optimize on ARM (NEON). This caused a massive regression for my FFT implementation, because FFTs are just FMAs plus loads/stores.

@Shnatsel reopened this Dec 13, 2025
@LaurenzV (Collaborator)

Ah, that's sad. :( I mean, from my side it would be fine to merge (at least temporarily), but it would be good to hear some other opinions.

@valadaptive (Contributor)

I have to reopen this. While the regular mul_add optimizes fine on x86 (both AVX2 and AVX-512), it does NOT automatically optimize on ARM (NEON). This caused a massive regression for my FFT implementation, because FFTs are just FMAs plus loads/stores.

Have you checked that this PR actually improves ARM codegen? The Rust intrinsics themselves lower to a separate negate + multiply-add (the file is too big to display in GitHub, but here it is):

pub fn vfmsq_f32(a: float32x4_t, b: float32x4_t, c: float32x4_t) -> float32x4_t {
    unsafe {
        let b: float32x4_t = simd_neg(b);
        vfmaq_f32(a, b, c)
    }
}

@Shnatsel (Contributor, Author)

Not yet.

I have verified that the mul_neg_add implementation in the wide crate improves NEON performance significantly. I'll inspect the codegen for wide and this PR more closely.

@valadaptive (Contributor) commented Dec 14, 2025

It was suggested above:

Can you not just do a.mul_add(-b, c) for this?

But maybe also try (-a).mul_add(b, c). The operand order gets a bit confusing, but I think that's what the Rust intrinsics lower to.
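
In other words, the two candidate spellings are the following (a hypothetical scalar helper just to show the operand orders; with vector types the same two forms apply lane-wise, and both compute c - a*b):

fn both_forms(a: f32, b: f32, c: f32) -> (f32, f32) {
    let v1 = a.mul_add(-b, c);   // negate the second multiplicand
    let v2 = (-a).mul_add(b, c); // negate the first multiplicand instead
    (v1, v2)
}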

If only one of the two actually lowers to a negate-multiply-add instruction, that's definitely something we should document.

@valadaptive (Contributor)

I've figured out why mul_neg_add is still faster in wide on ARM: LLVM recognizes that x ^ -0.0 is equivalent to -x only on x86. On other architectures, wide should use the "negate" intrinsics that those architectures provide.
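
For the record, a sketch of what "use the architecture's negate intrinsic" could look like on AArch64; this is illustrative only, not the actual patch in Lokathor/wide#243:

#[cfg(target_arch = "aarch64")]
fn neg_lanes(x: core::arch::aarch64::float32x4_t) -> core::arch::aarch64::float32x4_t {
    // vnegq_f32 lowers to the NEON fneg instruction, which the backend can
    // then fold into a fused negate-multiply-add alongside an adjacent FMA.
    unsafe { core::arch::aarch64::vnegq_f32(x) }
}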

I've opened a PR to wide to fix this (Lokathor/wide#243). With that fix, PhastFT compiles to the exact same ARM assembly whether you use mul_neg_add or mul_add with a negation.

@Shnatsel (Contributor, Author)

That's great, thank you!

However, that also seems pretty fragile and relies on compiler optimizations far more than I'd like. If this regressed at any point due to a change in LLVM behavior, it would be a dramatic performance hit for PhastFT. Given that I'm already expressing the algorithm in terms of wrappers around intrinsics, I'd much rather use a dedicated mul_neg_add than rely on compiler optimizations that may change from version to version.

@valadaptive (Contributor)

As I mentioned above, all of the "fused" negate-multiply-add intrinsics are already implemented in Rust itself as a negate followed by a regular multiply-add. There is no way to avoid relying on compiler optimizations here.
