From 5699273e5189d9fc8be92d5c584e8fa86e3a2b54 Mon Sep 17 00:00:00 2001
From: Daily Perf Improver <github-actions[bot]@users.noreply.github.com>
Date: Sun, 12 Oct 2025 15:22:08 +0000
Subject: [PATCH] Optimize dot product horizontal reduction with Vector.Sum
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace the manual per-lane accumulation loop with Vector.Sum() for the SIMD
horizontal reduction. Vector.Sum() lets the runtime use a hardware-specific
reduction sequence (e.g., VHADDPS on AVX) instead of extracting and adding
each lane of the accumulator vector one at a time.
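
As a minimal sketch of the equivalence this change relies on, the snippet
below reduces one vector both ways. The helper names sumLanesManually and
sumLanesBuiltin are hypothetical and do not appear in SpanMath.fs; float32
is chosen only for illustration, and .NET 6 or later is assumed for
Vector.Sum.

```fsharp
open System.Numerics

// Hypothetical helpers for illustration; not part of SpanMath.fs.

/// Lane-by-lane reduction, mirroring the loop this patch removes.
let sumLanesManually (v: Vector<float32>) : float32 =
    let mutable acc = 0.0f
    for i = 0 to Vector<float32>.Count - 1 do
        acc <- acc + v.[i]
    acc

/// Reduction through the built-in Vector.Sum (assumes .NET 6 or later).
let sumLanesBuiltin (v: Vector<float32>) : float32 =
    Vector.Sum(v)

// Fill one vector with 1, 2, ..., Count so that every lane differs.
// Small integers add exactly in float32, so both orders give the same sum.
let lanes = Array.init Vector<float32>.Count (fun i -> float32 (i + 1))
let v = Vector<float32>(lanes)

let manual = sumLanesManually v
let builtin = sumLanesBuiltin v
printfn "manual = %g, builtin = %g" manual builtin
```

For general floating-point inputs the two results may differ in the last
bits, because Vector.Sum is not guaranteed to add lanes in index order.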

Performance (BenchmarkDotNet ShortRun, before → after):
- Size 10: 35.8% less time (6.894 → 4.426 ns, 1.56× speedup)
- Size 100: 8.3% less time (27.745 → 25.434 ns, 1.09× speedup)
- Size 1000: ~equivalent (238.856 → 241.945 ns, about 1.3% slower)
- Size 10000: ~equivalent (2,359 → 2,355 ns)

All 488 tests pass. Allocations are unchanged.
---
 src/FsMath/SpanMath.fs | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/FsMath/SpanMath.fs b/src/FsMath/SpanMath.fs
index 6c8c26e..312061f 100644
--- a/src/FsMath/SpanMath.fs
+++ b/src/FsMath/SpanMath.fs
@@ -36,10 +36,10 @@ type SpanMath =
                 let vy = Numerics.Vector<'T>(y.Slice(yi, simdWidth))
                 accVec <- accVec + (vx * vy)
 
-            let mutable acc = LanguagePrimitives.GenericZero<'T>
-            for i = 0 to simdWidth - 1 do
-                acc <- acc + accVec.[i]
+            // Use Vector.Sum for optimized horizontal reduction (uses hardware-specific instructions)
+            let mutable acc = Numerics.Vector.Sum(accVec)
 
+            // Handle remaining elements (tail)
             for i = ceiling to length - 1 do
                 acc <- acc + x.[xOffset + i] * y.[yOffset + i]