From ad941b8fb86ddff295f009d5c6c4f1d576e993df Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 10:30:25 +0300
Subject: [PATCH 01/24] =?UTF-8?q?Add=20design=20doc=20for=20SQ8=E2=86=94FP?=
 =?UTF-8?q?16=20SIMD=20x86=20kernels=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Captures the architecture, file-level plan, CMake F16C gating, and risk
register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the
asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them
into the existing dispatcher tables and SQ8_FP16 unit/benchmark
scaffolding from MOD-15141.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../2026-05-26-sq8-fp16-x86-kernels-design.md | 385 ++++++++++++++++++
 1 file changed, 385 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md
diff --git a/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md b/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md
new file mode 100644
index 000000000..1ef7a787a
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md
@@ -0,0 +1,385 @@
+# SQ8↔FP16 SIMD distance kernels — Intel x86 (MOD-14954)
+
+## Goal
+
+Add asymmetric SQ8 (storage) ↔ FP16 (query) distance kernels for Inner
+Product, Cosine, and L2² on Intel x86 across four ISA tiers:
+
+- AVX-512 (F + BW + VL + VNNI bundle already used for SQ8_FP32)
+- AVX2 + FMA
+- AVX2 (no FMA)
+- SSE4.1
+
+Each kernel converts FP16 query lanes to FP32 per SIMD chunk; the inner
+multiply-accumulate runs in FP32. SQ8 metadata and FP32 query metadata
+(precomputed sums) stay scalar and are combined with the same algebraic
+identity used by the SQ8_FP32 kernels, where `x_i = min + delta · q_i`:
+
+```text
+IP(x, y) = min · y_sum + delta · Σ(q_i · y_i)
+L2²(x, y) = x_sum_squares + y_sum_squares − 2 · IP(x, y)
+```
+
+Wire the new kernels into the dispatcher tables so
+`{IP,Cosine,L2}_SQ8_FP16_GetDistFunc` returns the best SIMD path
+available at runtime instead of the scalar fallback delivered by
+MOD-15141.
+
+## Non-goals
+
+- No new metric (only IP / Cosine / L2²).
+- No change to scalar `SQ8_FP16_*` reference; existing tests against
+  `SQ8_FP16_NotOptimized_*` remain the correctness baseline.
+- No ARM kernels (MOD-14972 covers ARM).
+- No SQ8↔FP32 changes; existing kernels untouched.
+
+## Scope and constraints
+
+- FP16 query layout is `[float16 values (dim)] [y_sum (float)]
+  [y_sum_squares (float, L2 only)]`. Trailing metadata is FP32 and may
+  sit at an offset that is not a multiple of 4 when `dim` is odd; use
+  `load_unaligned<float>` to read it (mirrors scalar `SQ8_FP16_Impl`).
+- All four ISA tiers need a way to widen FP16 → FP32. The 512-bit
+  variant (`_mm512_cvtph_ps`) is in AVX512F. The 256-bit and 128-bit
+  variants (`_mm256_cvtph_ps`, `_mm_cvtph_ps`) require the F16C
+  extension. F16C is its own ISA flag; AVX2/SSE4.1 do not imply it.
+- Existing dispatcher source files (`AVX2_FMA.cpp`, `AVX2.cpp`,
+  `SSE4.cpp`) are compiled without `-mf16c`. We add `-mf16c` to those
+  files in CMake (conditional on `CXX_F16C`), guard the new SQ8_FP16
+  symbols behind `#ifdef OPT_F16C`, and add `features.f16c &&` to the
+  dispatch gates for the AVX2/SSE4 tiers. The AVX-512 tier needs no
+  F16C gate.
+- `dim` must be ≥ 16 for every SIMD path (AVX-512, AVX2+FMA, AVX2, and
+  SSE4); this matches the existing SQ8_FP32 contract.
+- SQ8 storage is read as `uint8_t`; alignment hint returned by
+  `*_GetDistFunc` continues to refer to the SQ8 (first) operand. Hints:
+  16 / 8 / 8 / 4 bytes for AVX-512 / AVX2+FMA / AVX2 / SSE4 when
+  `dim % chunk == 0`, else 0.
+
+## File-level design
+
+### New SIMD headers (8 files)
+
+Per ISA tier × {IP, L2}:
+
+```text
+src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
+src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
+src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
+src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
+src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
+src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
+```
+
+Each IP header exposes:
+
+- `template <unsigned char residual> float SQ8_FP16_InnerProductImp_<tier>(const void*, const void*, size_t)` — raw inner product (no `1 -`), used by both InnerProduct/Cosine wrappers and the L2 kernel.
+- `template <unsigned char residual> float SQ8_FP16_InnerProductSIMD16_<tier>(...)` — returns `1.0f - Imp`.
+- `template <unsigned char residual> float SQ8_FP16_CosineSIMD16_<tier>(...)` — aliases InnerProduct (vectors are pre-normalised, mirrors SQ8_FP32 pattern).
+
+Each L2 header `#include`s the matching IP header and exposes:
+
+- `template <unsigned char residual> float SQ8_FP16_L2SqrSIMD16_<tier>(...)` — computes `x_sum_sq + y_sum_sq − 2·Imp(...)`.
+
+`<tier>` strings:
+
+- `AVX512F_BW_VL_VNNI`
+- `AVX2_FMA`
+- `AVX2`
+- `SSE4`
+
+The inner loops in all four tiers:
+
+1. Load 16 SQ8 bytes (one chunk) and widen to 16×FP32.
+2. Load 16 FP16 query lanes and widen to 16×FP32 (one `_mm512_cvtph_ps`
+   for AVX-512, two `_mm256_cvtph_ps` calls for AVX2+FMA and plain AVX2,
+   or four `_mm_cvtph_ps` for SSE4; chunk granularity matches the
+   existing SQ8_FP32 layout for that tier).
+3. Fused multiply-add (or mul + add for SSE4 and plain AVX2) into the
+   FP32 accumulator(s).
+4. After the loop, horizontal-reduce and apply
+   `min_val · y_sum + delta · quantized_dot`.
+
+L2 kernels additionally read `x_sum_squares` from SQ8 metadata and
+`y_sum_squares` from query metadata, and return
+`x_sum_sq + y_sum_sq − 2·ip`. **Both** the SQ8 storage metadata
+(`min_val`, `delta`, `x_sum_squares`) and the FP16 query metadata
+(`y_sum`, `y_sum_squares`) are read with `load_unaligned<float>`. SQ8
+metadata starts at byte offset `dim` after the quantised lanes — for
+odd `dim` that offset is not 4-byte aligned. FP16 query metadata
+starts at byte offset `2*dim` after the FP16 lanes — odd `dim` leaves
+it 2-byte aligned. Mirrors the scalar `SQ8_FP16_InnerProduct_Impl`
+pattern in `src/VecSim/spaces/IP/IP.cpp`.
+
+Residual handling:
+
+- **AVX-512** (residual 0..15): load the full 256-bit FP16 chunk
+  (`_mm256_loadu_si256` over 32 bytes; the chunk is always within the
+  query blob since `dim >= 16` and the FP16 metadata follows), convert with
+  `_mm512_cvtph_ps`, then mask away unused lanes via
+  `_mm512_maskz_mov_ps(mask, v2_f)` (or fold the mask into the
+  FP32 multiply with `_mm512_maskz_mul_ps`). The SQ8 side uses
+  `_mm_loadu_si128` + `_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`
+  and is also masked.
+- **AVX2+FMA / AVX2** (residual 0..15, split into a 0..7 head plus a
+  conditional 8-wide pre-step): for the 0..7 head, load the full
+  128-bit FP16 block (`_mm_loadu_si128`), convert with
+  `_mm256_cvtph_ps`, then zero out unused lanes via
+  `_mm256_blend_ps(_mm256_setzero_ps(), v2_f, residuals_mask)` —
+  mirroring the existing F16C `FP16_InnerProductSIMD32_F16C` blend
+  pattern. The SQ8 side uses `_mm_loadl_epi64` (8 bytes) +
+  `_mm256_cvtepu8_epi32` + `_mm256_cvtepi32_ps`. When residual ≥ 8,
+  one extra full 8-wide step runs before the do-while loop, matching
+  the SQ8_FP32 AVX2[+FMA] residual layout.
+- **SSE4** (residual 0..15, split into 4-wide pre-steps): for the
+  0..3 head, materialise the FP32 lanes via `_mm_set_ps(0, ..., 0,
+  FP16_to_FP32(pVec2[k]), ...)` paired with `_mm_set_ps` on the SQ8
+  side — mirrors the existing SSE4 SQ8_FP32 `_mm_set_ps` residual
+  path. For residual ≥ 4 / ≥ 8 / ≥ 12, run 1 / 2 / 3 extra 4-wide
+  steps before the do-while loop. Each 4-wide step loads 8 bytes of
+  FP16 (`_mm_loadl_epi64`), converts with `_mm_cvtph_ps`, and loads
+  4 SQ8 bytes via `_mm_cvtsi32_si128` + `_mm_cvtepu8_epi32` +
+  `_mm_cvtepi32_ps`.
+
+### Dispatcher edits
+
+Per existing ISA dispatcher (no new dispatcher files):
+
+| File | Add declarations / definitions |
+| --- | --- |
+| `src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI` |
+| `src/VecSim/spaces/functions/AVX2_FMA.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX2_FMA`, guarded by `#ifdef OPT_F16C` |
+| `src/VecSim/spaces/functions/AVX2.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX2`, guarded by `#ifdef OPT_F16C` |
+| `src/VecSim/spaces/functions/SSE4.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_SSE4`, guarded by `#ifdef OPT_F16C` |
+
+Each `Choose_*` uses the existing `CHOOSE_IMPLEMENTATION(out, dim, 16,
+func)` macro (16-element residual table — matches SQ8_FP32 contract).
+
+`src/VecSim/spaces/IP_space.cpp` — extend `IP_SQ8_FP16_GetDistFunc` and
+`Cosine_SQ8_FP16_GetDistFunc`. `L2_space.cpp` — extend
+`L2_SQ8_FP16_GetDistFunc`. New body shape (IP shown; L2/Cosine
+identical):
+
+```cpp
+dist_func_t<float> ret_dist_func = SQ8_FP16_InnerProduct;
+[[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt);
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+if (dim < 16) {
+    return ret_dist_func;
+}
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
+    if (dim % 16 == 0) *alignment = 16 * sizeof(uint8_t);
+    return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim);
+}
+#endif
+#ifdef OPT_AVX2_FMA
+#ifdef OPT_F16C
+if (features.avx2 && features.fma3 && features.f16c) {
+    if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t);
+    return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim);
+}
+#endif
+#endif
+#ifdef OPT_AVX2
+#ifdef OPT_F16C
+if (features.avx2 && features.f16c) {
+    if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t);
+    return Choose_SQ8_FP16_IP_implementation_AVX2(dim);
+}
+#endif
+#endif
+#ifdef OPT_SSE4
+#ifdef OPT_F16C
+// F16C instructions are VEX-encoded — require AVX as well, matching the
+// existing FP16/F16C dispatcher gate in IP_space.cpp.
+if (features.sse4_1 && features.f16c && features.avx) {
+    if (dim % 4 == 0) *alignment = 4 * sizeof(uint8_t);
+    return Choose_SQ8_FP16_IP_implementation_SSE4(dim);
+}
+#endif
+#endif
+#endif // x86_64
+return ret_dist_func;
+```
+
+ARM block (`OPT_SVE2` / `OPT_SVE` / `OPT_NEON`) is left as-is — the
+SQ8_FP16 ARM kernels arrive via MOD-14972.
+
+### CMake change
+
+`src/VecSim/spaces/CMakeLists.txt` — when both `CXX_F16C` and the
+parent ISA flag are present, add `-mf16c` to the dispatcher file:
+
+```cmake
+if(CXX_AVX2 AND CXX_FMA)
+    set(_avx2_fma_flags "-mavx2 -mfma")
+    if(CXX_F16C)
+        set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c")
+    endif()
+    set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "${_avx2_fma_flags}")
+    list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp)
+endif()
+
+if(CXX_AVX2)
+    set(_avx2_flags "-mavx2")
+    if(CXX_F16C)
+        set(_avx2_flags "${_avx2_flags} -mf16c")
+    endif()
+    set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "${_avx2_flags}")
+    list(APPEND OPTIMIZATIONS functions/AVX2.cpp)
+endif()
+
+if(CXX_SSE4)
+    set(_sse4_flags "-msse4.1")
+    if(CXX_F16C)
+        set(_sse4_flags "${_sse4_flags} -mf16c")
+    endif()
+    set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "${_sse4_flags}")
+    list(APPEND OPTIMIZATIONS functions/SSE4.cpp)
+endif()
+```
+
+AVX-512 dispatcher (`AVX512F_BW_VL_VNNI.cpp`) needs no flag change —
+`-mavx512f` already enables `_mm512_cvtph_ps`.
+
+`-mf16c` adds no F16C instructions to the existing sources, but GCC and
+Clang treat it as implying `-mavx`, so SSE4.cpp is built with AVX enabled.
+
+### Tests (`tests/unit/test_spaces.cpp`)
+
+1. New parameterised class `SQ8_FP16_SpacesOptimizationTest` mirroring
+   `SQ8_FP32_SpacesOptimizationTest`. Three test bodies for L2 / IP /
+   Cosine, each comparing the chosen optimised function against the
+   scalar `SQ8_FP16_*` baseline (`ASSERT_NEAR ... 0.01`). Walks down
+   AVX512 → AVX2_FMA → AVX2 → SSE4 → scalar by zeroing feature flags
+   between assertions, exactly like `SQ8_FP32_SpacesOptimizationTest`.
+   `INSTANTIATE_TEST_SUITE_P` with `testing::Range(16UL, 16 * 2UL + 1)`.
+
+2. Update existing `SpacesTest.GetDistFunc_*_SQ8_FP16` assertions at
+   lines ~563–575: when running on x86, the dispatcher now returns the
+   SIMD `Choose_*` symbol instead of the scalar. AVX-512 selection
+   depends on `avx512f && avx512bw && avx512vl && avx512vnni` only
+   (no F16C requirement — 512-bit `_mm512_cvtph_ps` is part of
+   AVX512F). AVX2+FMA, AVX2, and SSE4 selection additionally requires
+   `features.f16c` (and `features.avx` for the SSE4 gate). The tests
+   should call `getCpuOptimizationFeatures()` and assert the expected
+   `Choose_*` for the host's highest supported tier (same shape used
+   by `SQ8_FP32_SpacesOptimizationTest`).
+
+3. Reuse existing helpers: `populate_sq8_fp16_query`,
+   `populate_float_vec_to_sq8_with_metadata`,
+   `SQ8_FP16_NotOptimized_{InnerProduct,Cosine,L2Sqr}`.
+
+### Benchmarks (`tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp`)
+
+Add per-ISA benches mirroring `bm_spaces_sq8_fp32.cpp`:
+
+```cpp
+#ifdef CPU_FEATURES_ARCH_X86_64
+cpu_features::X86Features opt = cpu_features::GetX86Info().features;
+
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, avx512_f_bw_vl_vnni_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, avx512_f_bw_vl_vnni_supported);
+#endif
+
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+bool avx2_fma3_f16c_supported = opt.avx2 && opt.fma3 && opt.f16c;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, avx2_fma3_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, avx2_fma3_f16c_supported);
+#endif
+
+#ifdef OPT_AVX2
+bool avx2_f16c_supported = opt.avx2 && opt.f16c;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported);
+#endif
+
+#ifdef OPT_SSE4
+bool sse4_f16c_supported = opt.sse4_1 && opt.f16c && opt.avx;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported);
+#endif
+#endif // OPT_F16C
+#endif // x86_64
+```
+
+The naive bench lines stay; they cover the scalar fallback case.
+
+## Validation strategy
+
+1. Unit tests (`SQ8_FP16_SpacesOptimizationTest`) assert numerical
+   parity against the scalar baseline for all dims in `[16, 32]`
+   (covers every residual class for the 16-wide chunk). Existing
+   `SQ8_FP16_NoOpt` parameterised suite continues to exercise small
+   and odd dims for the scalar reference; combined with the new
+   optimisation tests this covers each SIMD residual class plus the
+   scalar fallback.
+2. Existing edge-case tests (`SQ8_FP16_EdgeCases.ZeroQueryTest`,
+   `SQ8_FP16_l2sqr_odd_dim_unaligned_metadata_test`) keep running
+   against the scalar implementation directly — they exercise
+   alignment-sensitive paths that are deliberately scalar-only.
+3. Microbenchmarks compare per-ISA SQ8_FP16 throughput to the matching
+   SQ8_FP32 throughput on the same machine. Acceptance: SQ8_FP16 time
+   per call should be at most ~1.5× that of SQ8_FP32 (one extra
+   widening per chunk, no extra memory pressure since the FP16 query
+   is half the size of FP32).
+4. CI: the existing x86 jobs verify that the CMake change keeps
+   building. No new toolchain requirement (binutils 2.34+ already
+   covers F16C, no AVX-512 FP16 dependency).
+
+## Risk register
+
+| Risk | Likelihood | Mitigation |
+| --- | --- | --- |
+| Adding `-mf16c` to AVX2_FMA.cpp / AVX2.cpp / SSE4.cpp accidentally enables F16C codegen elsewhere | Low | Those sources contain only SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 code; no F16C intrinsics — compiler cannot synthesise F16C without an explicit intrinsic. |
+| Older toolchain without F16C support | Low | `CXX_F16C` already detected; `-mf16c` only appended when present. Dispatcher symbols guarded by `#ifdef OPT_F16C`; missing → falls through to scalar. |
+| Backport branches diverge in dispatcher | Medium | Change is additive (new headers, new symbols, new gates). No SQ8_FP32 path touched. CMake change is conditional. Backport just cherry-picks the commit. |
+| Pre-Ivy Bridge SSE4-only CPUs lose a SIMD tier (no F16C) | Negligible | Fall through to scalar SQ8_FP16. Such CPUs are out of practical support anyway. |
+| Numerical drift between FP16→FP32 widening and the scalar `FP16_to_FP32` software path | Low | `vcvtph2ps` follows IEEE 754 half→single conversion exactly; the scalar `FP16_to_FP32` in `float16.h` is bit-faithful for finite values. Tests use `ASSERT_NEAR ... 0.01` slack. |
+
+## Out-of-scope follow-ups
+
+- AVX512FP16-native kernels (would use `__m512h` and `vfmadd*ph`
+  directly on 32 FP16 lanes per 512-bit register, skipping the
+  widen-to-FP32 step). Deferred for four concrete reasons, not just
+  "lower priority":
+    1. **Deployment baseline.** AVX512FP16 is Sapphire Rapids and
+       newer (Intel server 2023+) plus very recent AMD parts. Most
+       production hosts running this library do not have it. The
+       AVX-512F path delivered here is the right default for the
+       widely-deployed AVX-512 tier, and a Sapphire-Rapids-only
+       variant would land underneath the same gating tree, not as a
+       replacement.
+    2. **Numerical fit is awkward for SQ8↔FP16.** The kernel computes
+       `Σ(q_i · y_i)` where `q_i ∈ [0,255]` (uint8) and `y_i` is
+       FP16. Each lane product can be as large as
+       `255 · 65504 ≈ 1.67e7`, which is well above the FP16 finite
+       range (`±65504`). A pure FP16 accumulator would overflow on
+       realistic data; the only safe path is to accumulate in FP32
+       after a per-chunk `vcvtph2ps`-equivalent — which is exactly
+       what the AVX-512F path already does. AVX512FP16 mainly buys
+       FP16-native multiply-add, which we cannot safely use here.
+    3. **Marginal speedup over the AVX-512F path proposed here.**
+       The widening cost is one `_mm512_cvtph_ps` per 16-element
+       chunk against a kernel that is already memory-bandwidth-bound
+       (16 bytes of SQ8 storage + 32 bytes of FP16 query per chunk).
+       Eliminating that one conversion saves a few cycles per chunk
+       on a path that is gated on memory, not arithmetic throughput.
+    4. **Ticket scope.** MOD-14954 enumerates AVX-512, AVX2+FMA, and
+       SSE4; the plain-AVX2 tier was added during brainstorming as
+       free coverage. An AVX512FP16 variant is its own ISA tier with
+       its own gating column in the dispatcher and its own residual
+       table, and warrants a separate design / benchmarking pass
+       once the deployment baseline justifies the maintenance cost.
+  Pure FP16↔FP16 (no SQ8 involved) already has an AVX512FP16_VL path
+  at `src/VecSim/spaces/functions/AVX512FP16_VL.cpp`; that file is the
+  natural home should we revisit this later.
+- ARM SQ8_FP16 (MOD-14972).
+- Reranking flow integration tests under HNSW (separate ticket).

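The algebraic identity in the design doc above can be checked with a minimal scalar sketch, assuming the dequantisation `x_i = min + delta · q_i`. The helper names `ip_direct` and `ip_factored` are hypothetical, plain `float` stands in for the FP16 query lanes after widening, and `y_sum` is recomputed here rather than read from the query metadata. This is an illustration, not the library's scalar reference.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Direct form: dequantise each lane, then multiply: sum((min + delta * q_i) * y_i).
static float ip_direct(const std::vector<std::uint8_t> &q, const std::vector<float> &y,
                       float min_val, float delta) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        acc += (min_val + delta * static_cast<float>(q[i])) * y[i];
    }
    return acc;
}

// Factored form: min * y_sum + delta * sum(q_i * y_i).
// In the real layout y_sum is precomputed and stored after the query lanes;
// it is recomputed here for illustration only.
static float ip_factored(const std::vector<std::uint8_t> &q, const std::vector<float> &y,
                         float min_val, float delta) {
    float y_sum = 0.0f;
    float quantized_dot = 0.0f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        y_sum += y[i];
        quantized_dot += static_cast<float>(q[i]) * y[i];
    }
    return min_val * y_sum + delta * quantized_dot;
}
```

With `q = {0, 1, 2, 255}`, `y = {1, 2, -1, 0.5}`, `min = -1`, and `delta = 0.5`, both forms evaluate to 61.25. The factored form is what lets the SIMD loop work on raw `uint8` lanes and apply `min` and `delta` once after the horizontal reduction.
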
From 97467b25b42a5408672ea75eaef6551692edcd42 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 11:24:06 +0300
Subject: [PATCH 02/24] Append -mf16c to AVX2_FMA/AVX2/SSE4 dispatcher sources
 [MOD-14954]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher
translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes
to FP32. The flag is appended only when CXX_F16C is detected; existing
SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics, so
no F16C instructions are added to those kernels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 src/VecSim/spaces/CMakeLists.txt | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt
index fe354ded5..8babf844b 100644
--- a/src/VecSim/spaces/CMakeLists.txt
+++ b/src/VecSim/spaces/CMakeLists.txt
@@ -51,14 +51,26 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 	endif()
 
 	if(CXX_AVX2)
-		message("Building with AVX2")
-		set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS -mavx2)
+		set(_avx2_flags "-mavx2")
+		if(CXX_F16C)
+			message("Building functions/AVX2.cpp with AVX2 and F16C")
+			set(_avx2_flags "${_avx2_flags} -mf16c")
+		else()
+			message("Building functions/AVX2.cpp with AVX2 (no F16C)")
+		endif()
+		set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "${_avx2_flags}")
 		list(APPEND OPTIMIZATIONS functions/AVX2.cpp)
 	endif()
 
 	if(CXX_AVX2 AND CXX_FMA)
-		message("Building with AVX2 and FMA")
-		set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma")
+		set(_avx2_fma_flags "-mavx2 -mfma")
+		if(CXX_F16C)
+			message("Building functions/AVX2_FMA.cpp with AVX2, FMA, and F16C")
+			set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c")
+		else()
+			message("Building functions/AVX2_FMA.cpp with AVX2 and FMA (no F16C)")
+		endif()
+		set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "${_avx2_fma_flags}")
 		list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp)
 	endif()
 
@@ -81,8 +93,14 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 	endif()
 
 	if(CXX_SSE4)
-		message("Building with SSE4")
-		set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS -msse4.1)
+		set(_sse4_flags "-msse4.1")
+		if(CXX_F16C)
+			message("Building functions/SSE4.cpp with SSE4.1 and F16C")
+			set(_sse4_flags "${_sse4_flags} -mf16c")
+		else()
+			message("Building functions/SSE4.cpp with SSE4.1 (no F16C)")
+		endif()
+		set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "${_sse4_flags}")
 		list(APPEND OPTIMIZATIONS functions/SSE4.cpp)
 	endif()
 

From bab74734c1a2a7a856dee66dc609b2468416f699 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 11:26:05 +0300
Subject: [PATCH 03/24] Add SQ8_FP16_SpacesOptimizationTest skeleton
 [MOD-14954]

Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest;
currently asserts only the scalar fallback path. Per-tier SIMD
assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside
the kernel implementations in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 tests/unit/test_spaces.cpp | 95 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index a6bb88cef..dfbe81f5d 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3070,6 +3070,101 @@ INSTANTIATE_TEST_SUITE_P(SQ8_FP16_NoOpt, SQ8_FP16_NoOptimizationSpacesTest,
                          testing::Values(1, 5, 7, 8, 9, 15, 16, 17, 31, 32, 33, 47, 48, 49, 63, 64,
                                          65, 127, 128));
 
+/* ======================== SQ8_FP16 SIMD optimisation tests ========================= */
+
+// Walks down the x86 ISA tiers (AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar) and asserts
+// that {IP,Cosine,L2}_SQ8_FP16_GetDistFunc returns the expected Choose_* symbol and that
+// its output matches the scalar baseline within 0.01.
+class SQ8_FP16_SpacesOptimizationTest : public testing::TestWithParam<size_t> {};
+
+TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
+    auto optimization = getCpuOptimizationFeatures();
+    size_t dim = GetParam();
+
+    size_t query_count =
+        dim + sq8::query_metadata_count<VecSimMetric_L2>() * (sizeof(float) / sizeof(float16));
+    std::vector<float16> v1_query(query_count);
+    test_utils::populate_sq8_fp16_query(v1_query.data(), dim, false, 1234);
+
+    size_t quantized_size =
+        dim * sizeof(uint8_t) + sq8::storage_metadata_count<VecSimMetric_L2>() * sizeof(float);
+    std::vector<uint8_t> v2_compressed(quantized_size);
+    test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, false, 5678);
+
+    dist_func_t<float> arch_opt_func;
+    float baseline = SQ8_FP16_L2Sqr(v2_compressed.data(), v1_query.data(), dim);
+
+    // Per-tier assertion blocks are added by Tasks 3–6.
+
+    // Scalar fallback.
+    unsigned char alignment = 0;
+    arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+    ASSERT_EQ(arch_opt_func, SQ8_FP16_L2Sqr)
+        << "Unexpected scalar fallback function for dim " << dim;
+    ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+        << "Scalar fallback with dim " << dim;
+    ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim;
+}
+
+TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
+    auto optimization = getCpuOptimizationFeatures();
+    size_t dim = GetParam();
+
+    size_t query_count =
+        dim + sq8::query_metadata_count<VecSimMetric_L2>() * (sizeof(float) / sizeof(float16));
+    std::vector<float16> v1_query(query_count);
+    test_utils::populate_sq8_fp16_query(v1_query.data(), dim, true, 1234);
+
+    size_t quantized_size =
+        dim * sizeof(uint8_t) + sq8::storage_metadata_count<VecSimMetric_L2>() * sizeof(float);
+    std::vector<uint8_t> v2_compressed(quantized_size);
+    test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, true, 5678);
+
+    dist_func_t<float> arch_opt_func;
+    float baseline = SQ8_FP16_InnerProduct(v2_compressed.data(), v1_query.data(), dim);
+
+    // Per-tier assertion blocks are added by Tasks 3–6.
+
+    unsigned char alignment = 0;
+    arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+    ASSERT_EQ(arch_opt_func, SQ8_FP16_InnerProduct)
+        << "Unexpected scalar fallback function for dim " << dim;
+    ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+        << "Scalar fallback with dim " << dim;
+    ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim;
+}
+
+TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
+    auto optimization = getCpuOptimizationFeatures();
+    size_t dim = GetParam();
+
+    size_t query_count =
+        dim + sq8::query_metadata_count<VecSimMetric_L2>() * (sizeof(float) / sizeof(float16));
+    std::vector<float16> v1_query(query_count);
+    test_utils::populate_sq8_fp16_query(v1_query.data(), dim, true, 1234);
+
+    size_t quantized_size =
+        dim * sizeof(uint8_t) + sq8::storage_metadata_count<VecSimMetric_L2>() * sizeof(float);
+    std::vector<uint8_t> v2_compressed(quantized_size);
+    test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, true, 5678);
+
+    dist_func_t<float> arch_opt_func;
+    float baseline = SQ8_FP16_Cosine(v2_compressed.data(), v1_query.data(), dim);
+
+    // Per-tier assertion blocks are added by Tasks 3–6.
+
+    unsigned char alignment = 0;
+    arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+    ASSERT_EQ(arch_opt_func, SQ8_FP16_Cosine)
+        << "Unexpected scalar fallback function for dim " << dim;
+    ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+        << "Scalar fallback with dim " << dim;
+    ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim;
+}
+
+INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest,
+                         testing::Range(16UL, 16 * 2UL + 1));
+
 /* ======================== Tests SQ8_FP16 (edge cases) ========================= */
 
 // Zero FP16 query against a non-zero SQ8 storage. IP must be exactly 1.0 (1 - 0),

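The `testing::Range(16UL, 16 * 2UL + 1)` instantiation in the fixture above covers dims 16 through 32, so every `dim % 16` residual appears at least once, plus the two-chunk case at 32. The scalar emulation below is a rough sketch of the residual-first loop shape those tests are meant to exercise, assuming the layout described in the design doc (head lanes first, then full 16-lane chunks in a do-while loop). All helper names are hypothetical and none of this is library code.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar emulation of the residual-first loop (hypothetical helper).
// Requires dim >= 16, like the SIMD kernels, because the do-while body
// always consumes at least one full chunk.
static float residual_first_dot(const std::uint8_t *q, const float *y, std::size_t dim) {
    const std::size_t residual = dim % 16;
    const std::uint8_t *end = q + dim;
    float sum = 0.0f;

    // Head: only the first `residual` lanes contribute; the SIMD kernel
    // achieves the same effect by masking or blending the unused lanes.
    for (std::size_t i = 0; i < residual; ++i) {
        sum += static_cast<float>(q[i]) * y[i];
    }
    q += residual;
    y += residual;

    // Body: full 16-lane chunks.
    do {
        for (std::size_t i = 0; i < 16; ++i) {
            sum += static_cast<float>(q[i]) * y[i];
        }
        q += 16;
        y += 16;
    } while (q < end);
    return sum;
}

// Plain reference loop for comparison.
static float naive_dot(const std::uint8_t *q, const float *y, std::size_t dim) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        sum += static_cast<float>(q[i]) * y[i];
    }
    return sum;
}

// Returns true when both loops agree for every dim in [16, 32].
static bool all_residuals_agree() {
    std::uint8_t q[32];
    float y[32];
    for (std::size_t i = 0; i < 32; ++i) {
        q[i] = static_cast<std::uint8_t>(i * 7);  // small integers, exact in float
        y[i] = static_cast<float>(i % 5) - 2.0f;  // values in {-2, ..., 2}
    }
    for (std::size_t dim = 16; dim <= 32; ++dim) {
        if (residual_first_dot(q, y, dim) != naive_dot(q, y, dim)) {
            return false;
        }
    }
    return true;
}
```

The inputs are small integers so the comparison is exact; the real tests compare against the scalar baseline with a 0.01 tolerance because SIMD accumulation order differs from the scalar loop.
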
From 671a7cc3cef3bf313f373ff5e51a899bed12d7bb Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 11:30:13 +0300
Subject: [PATCH 04/24] =?UTF-8?q?Add=20AVX-512=20SQ8=E2=86=94FP16=20SIMD?=
 =?UTF-8?q?=20distance=20kernels=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product,
Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk
widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes
via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8
storage and FP16 query metadata reads use load_unaligned to tolerate
odd dimensions. Dispatcher branches in IP_space.cpp / L2_space.cpp
select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI
when features.avx512f && features.avx512bw && features.avx512vl &&
features.avx512vnni; otherwise behaviour is unchanged from MOD-15141.
A parameterised gtest fixture exercises every residual class in
[16, 32] against the scalar baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h       | 102 ++++++++++++++++++
 src/VecSim/spaces/IP_space.cpp                |  45 ++++++--
 .../L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h       |  36 +++++++
 src/VecSim/spaces/L2_space.cpp                |  23 +++-
 .../spaces/functions/AVX512F_BW_VL_VNNI.cpp   |  21 ++++
 .../spaces/functions/AVX512F_BW_VL_VNNI.h     |   4 +
 tests/unit/test_spaces.cpp                    |  39 ++++++-
 7 files changed, 252 insertions(+), 18 deletions(-)
 create mode 100644 src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
 create mode 100644 src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h

diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
new file mode 100644
index 000000000..55d63d711
--- /dev/null
+++ b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
@@ -0,0 +1,102 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+#include <immintrin.h>
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+// Helper: load 16 SQ8 + 16 FP16 lanes, widen both to FP32, fused-multiply-add into sum.
+static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1,
+                                                    const float16 *&pVec2,
+                                                    __m512 &sum) {
+    // 16 uint8 -> 16 fp32
+    __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1));
+    __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128);
+    __m512 v1_f = _mm512_cvtepi32_ps(v1_512);
+
+    // 16 fp16 -> 16 fp32. _mm512_cvtph_ps is part of AVX512F.
+    __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2));
+    __m512 v2_f = _mm512_cvtph_ps(v2_16);
+
+    sum = _mm512_fmadd_ps(v1_f, v2_f, sum);
+
+    pVec1 += 16;
+    pVec2 += 16;
+}
+
+// Raw inner product Σ((min + delta * q_i) * y_i). Used by both InnerProduct/Cosine wrappers
+// and by the L2 kernel.
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v); // SQ8 storage
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v); // FP16 query
+    const uint8_t *pEnd1 = pVec1 + dimension;
+
+    __m512 sum = _mm512_setzero_ps();
+
+    if constexpr (residual > 0) {
+        __mmask16 mask = (1U << residual) - 1;
+
+        __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1));
+        __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128);
+        __m512 v1_f = _mm512_cvtepi32_ps(v1_512);
+
+        // Safe to read the full 32-byte FP16 chunk: dim >= 16 and the FP16 metadata follows
+        // the lanes, so the load stays within the query blob.
+        __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2));
+        __m512 v2_f = _mm512_cvtph_ps(v2_16);
+
+        // Mask out unused lanes by folding the mask into the multiply.
+        sum = _mm512_maskz_mul_ps(mask, v1_f, v2_f);
+
+        pVec1 += residual;
+        pVec2 += residual;
+    }
+
+    do {
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum);
+    } while (pVec1 < pEnd1);
+
+    float quantized_dot = _mm512_reduce_add_ps(sum);
+
+    // SQ8 metadata starts at byte offset `dimension`, which is 4-byte aligned only when
+    // `dimension % 4 == 0`, so use load_unaligned. Mirrors the scalar SQ8_FP16_Impl pattern.
+    const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
+    const uint8_t *params_bytes = pVec1Base + dimension;
+    const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float));
+    const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float));
+
+    // FP16 query metadata sits at byte offset 2*dimension; for odd `dimension` it is
+    // 2-byte aligned only.
+    const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
+    const float y_sum =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+
+    return min_val * y_sum + delta * quantized_dot;
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI(const void *pVec1v, const void *pVec2v,
+                                                     size_t dimension) {
+    return 1.0f - SQ8_FP16_InnerProductImp_AVX512<residual>(pVec1v, pVec2v, dimension);
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_CosineSIMD16_AVX512F_BW_VL_VNNI(const void *pVec1v, const void *pVec2v,
+                                               size_t dimension) {
+    // Cosine distance = 1 - IP for pre-normalised vectors. Aliases InnerProduct, matching the
+    // SQ8_FP32 pattern.
+    return SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI<residual>(pVec1v, pVec2v, dimension);
+}
diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index 55979e25a..a99241180 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -172,31 +172,56 @@ dist_func_t<float> Cosine_SQ8_FP32_GetDistFunc(size_t dim, unsigned char *alignm
 }
 
 // SQ8-FP16: asymmetric inner product distance between SQ8 storage and FP16 query.
-// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always
-// returns the scalar implementation.
 dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
                                            const void *arch_opt) {
     unsigned char dummy_alignment;
     if (alignment == nullptr) {
         alignment = &dummy_alignment;
     }
-    (void)dim;
-    (void)arch_opt;
-    return SQ8_FP16_InnerProduct;
+
+    dist_func_t<float> ret_dist_func = SQ8_FP16_InnerProduct;
+    [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt);
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+    if (dim < 16) {
+        return ret_dist_func;
+    }
+    // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+    if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
+        if (dim % 16 == 0) // SQ8 chunk = 16 bytes
+            *alignment = 16 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim);
+    }
+#endif
+#endif // x86_64
+    return ret_dist_func;
 }
 
 // SQ8-FP16: asymmetric cosine distance between SQ8 storage and FP16 query.
-// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always
-// returns the scalar implementation.
 dist_func_t<float> Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
                                                const void *arch_opt) {
     unsigned char dummy_alignment;
     if (alignment == nullptr) {
         alignment = &dummy_alignment;
     }
-    (void)dim;
-    (void)arch_opt;
-    return SQ8_FP16_Cosine;
+
+    dist_func_t<float> ret_dist_func = SQ8_FP16_Cosine;
+    [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt);
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+    if (dim < 16) {
+        return ret_dist_func;
+    }
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+    if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
+        if (dim % 16 == 0)
+            *alignment = 16 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
+    }
+#endif
+#endif // x86_64
+    return ret_dist_func;
 }
 
 // SQ8-to-SQ8 Inner Product distance function (both vectors are uint8 quantized with precomputed
diff --git a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
new file mode 100644
index 000000000..101bf285e
--- /dev/null
+++ b/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+// L2² = x_sum_squares + y_sum_squares - 2 * IP(x, y), computed via the AVX-512 IP impl above.
+template <unsigned char residual> // 0..15
+float SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI(const void *pVect1v, const void *pVect2v,
+                                              size_t dimension) {
+    const float ip = SQ8_FP16_InnerProductImp_AVX512<residual>(pVect1v, pVect2v, dimension);
+
+    const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
+    const uint8_t *params_bytes = pVect1 + dimension;
+    const float x_sum_sq =
+        load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+
+    const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
+    const float y_sum_sq =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float));
+
+    return x_sum_sq + y_sum_sq - 2.0f * ip;
+}
diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp
index ba3dd7cab..eaf383443 100644
--- a/src/VecSim/spaces/L2_space.cpp
+++ b/src/VecSim/spaces/L2_space.cpp
@@ -104,17 +104,30 @@ dist_func_t<float> L2_SQ8_FP32_GetDistFunc(size_t dim, unsigned char *alignment,
 }
 
 // SQ8-FP16: asymmetric L2 distance between SQ8 storage and FP16 query.
-// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always
-// returns the scalar implementation.
 dist_func_t<float> L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
                                            const void *arch_opt) {
     unsigned char dummy_alignment;
     if (!alignment) {
         alignment = &dummy_alignment;
     }
-    (void)dim;
-    (void)arch_opt;
-    return SQ8_FP16_L2Sqr;
+
+    dist_func_t<float> ret_dist_func = SQ8_FP16_L2Sqr;
+    [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt);
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+    if (dim < 16) {
+        return ret_dist_func;
+    }
+    // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+    if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
+        if (dim % 16 == 0)
+            *alignment = 16 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim);
+    }
+#endif
+#endif // x86_64
+    return ret_dist_func;
 }
 
 dist_func_t<float> L2_FP32_GetDistFunc(size_t dim, unsigned char *alignment, const void *arch_opt) {
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
index 3b8813b89..e5e8bb1c2 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
@@ -17,6 +17,9 @@
 #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP32.h"
 
+#include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h"
+
 #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_SQ8.h"
 #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_SQ8.h"
 
@@ -75,6 +78,24 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d
     CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP32_L2SqrSIMD16_AVX512F_BW_VL_VNNI);
     return ret_dist_func;
 }
+
+// SQ8-to-FP16 distance functions
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F_BW_VL_VNNI);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI);
+    return ret_dist_func;
+}
+
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
     dist_func_t<float> ret_dist_func;
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
index fe1583491..b68bfd0a4 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
@@ -24,6 +24,10 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(size_t d
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim);
+
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_SQ8_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index dfbe81f5d..117886dba 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3094,7 +3094,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
     dist_func_t<float> arch_opt_func;
     float baseline = SQ8_FP16_L2Sqr(v2_compressed.data(), v1_query.data(), dim);
 
-    // Per-tier assertion blocks are added by Tasks 3–6.
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+    if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
+        optimization.avx512vnni) {
+        unsigned char alignment = 0;
+        arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX512 with dim " << dim;
+        optimization.avx512f = 0;
+    }
+#endif
 
     // Scalar fallback.
     unsigned char alignment = 0;
@@ -3123,7 +3134,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
     dist_func_t<float> arch_opt_func;
     float baseline = SQ8_FP16_InnerProduct(v2_compressed.data(), v1_query.data(), dim);
 
-    // Per-tier assertion blocks are added by Tasks 3–6.
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+    if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
+        optimization.avx512vnni) {
+        unsigned char alignment = 0;
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX512 with dim " << dim;
+        optimization.avx512f = 0;
+    }
+#endif
 
     unsigned char alignment = 0;
     arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3151,7 +3173,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
     dist_func_t<float> arch_opt_func;
     float baseline = SQ8_FP16_Cosine(v2_compressed.data(), v1_query.data(), dim);
 
-    // Per-tier assertion blocks are added by Tasks 3–6.
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+    if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
+        optimization.avx512vnni) {
+        unsigned char alignment = 0;
+        arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX512 with dim " << dim;
+        optimization.avx512f = 0;
+    }
+#endif
 
     unsigned char alignment = 0;
     arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);

From c2f8340efbaacb4e3846a6e8c176e039195fe41b Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 11:33:48 +0300
Subject: [PATCH 05/24] =?UTF-8?q?Add=20AVX2+FMA=20SQ8=E2=86=94FP16=20SIMD?=
 =?UTF-8?q?=20distance=20kernels=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 +
cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a
256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte
FP16 block, convert, then blend zero across unused lanes — mirroring
the existing F16C FP16 kernel pattern. Dispatcher branch in
{IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new
Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 &&
features.fma3 && features.f16c.
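
As a rough illustration of the traversal described above, the scalar
model below counts how many lanes an 8-wide kernel of this shape would
consume: a masked head of residual % 8 lanes, one extra 8-lane step
when residual >= 8, then a loop of two 8-lane steps per iteration. The
function name lanes_consumed_avx2_model is hypothetical and the sketch
uses no intrinsics; it assumes dim >= 16, which the dispatcher enforces
by returning the scalar path for smaller dimensions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar model of the 8-wide kernel's traversal (not part of the patch).
// Returns the number of lanes the kernel would consume for a given dimension.
inline std::size_t lanes_consumed_avx2_model(std::size_t dim) {
    assert(dim >= 16); // smaller dimensions stay on the scalar path
    const std::size_t residual = dim % 16; // a template parameter in the real kernel
    std::size_t consumed = 0;

    // Masked head: only residual % 8 lanes contribute, the rest are blended to zero.
    consumed += residual % 8;

    // One full 8-lane step when the residual covers a whole chunk.
    if (residual >= 8) {
        consumed += 8;
    }

    // Main loop: two 8-lane steps per iteration, executed at least once.
    do {
        consumed += 16;
    } while (consumed < dim);

    return consumed;
}
```

For every dim >= 16 the model lands exactly on dim, which is why the
do-while loop can run unconditionally without overshooting the lanes.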

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h | 95 +++++++++++++++++++++
 src/VecSim/spaces/IP_space.cpp              | 18 ++++
 src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h | 35 ++++++++
 src/VecSim/spaces/L2_space.cpp              |  9 ++
 src/VecSim/spaces/functions/AVX2_FMA.cpp    | 23 +++++
 src/VecSim/spaces/functions/AVX2_FMA.h      |  6 ++
 tests/unit/test_spaces.cpp                  | 39 +++++++++
 7 files changed, 225 insertions(+)
 create mode 100644 src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
 create mode 100644 src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h

diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
new file mode 100644
index 000000000..1d6b4e676
--- /dev/null
+++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
@@ -0,0 +1,95 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/AVX_utils.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+// 8-wide AVX2+FMA step: 8 SQ8 lanes + 8 FP16 lanes -> 8 FP32 fused-multiply-add.
+static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1,
+                                                      const float16 *&pVect2,
+                                                      __m256 &sum256) {
+    __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect1));
+    pVect1 += 8;
+    __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
+    __m256 v1_f = _mm256_cvtepi32_ps(v1_256);
+
+    __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVect2));
+    __m256 v2_f = _mm256_cvtph_ps(v2_128);
+    pVect2 += 8;
+
+    sum256 = _mm256_fmadd_ps(v1_f, v2_f, sum256);
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
+    const uint8_t *pEnd1 = pVec1 + dimension;
+
+    __m256 sum256 = _mm256_setzero_ps();
+
+    if constexpr (residual % 8) {
+        constexpr int mask = (1 << (residual % 8)) - 1;
+
+        // SQ8 side: load 8 bytes regardless of residual; unused lanes drop out of the product
+        // because the blend zeroes the matching lanes of the widened FP16 query.
+        __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1));
+        pVec1 += residual % 8;
+        __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
+        __m256 v1_f = _mm256_cvtepi32_ps(v1_256);
+
+        // FP16 side: load full 16-byte block (safe — dim >= 16 and metadata follows).
+        __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec2));
+        __m256 v2_f = _mm256_cvtph_ps(v2_128);
+        v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask);
+        pVec2 += residual % 8;
+
+        sum256 = _mm256_mul_ps(v1_f, v2_f);
+    }
+
+    if constexpr (residual >= 8) {
+        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256);
+    }
+
+    do {
+        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256);
+        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256);
+    } while (pVec1 < pEnd1);
+
+    float quantized_dot = my_mm256_reduce_add_ps(sum256);
+
+    const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
+    const uint8_t *params_bytes = pVec1Base + dimension;
+    const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float));
+    const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float));
+
+    const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
+    const float y_sum =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+
+    return min_val * y_sum + delta * quantized_dot;
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_AVX2_FMA(const void *pVec1v, const void *pVec2v,
+                                           size_t dimension) {
+    return 1.0f - SQ8_FP16_InnerProductImp_AVX2_FMA<residual>(pVec1v, pVec2v, dimension);
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_CosineSIMD16_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return SQ8_FP16_InnerProductSIMD16_AVX2_FMA<residual>(pVec1v, pVec2v, dimension);
+}
diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index a99241180..1af5d2c35 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -194,6 +194,15 @@ dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
         return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim);
     }
 #endif
+#ifdef OPT_AVX2_FMA
+#ifdef OPT_F16C
+    if (features.avx2 && features.fma3 && features.f16c) {
+        if (dim % 8 == 0) // SQ8 chunk = 8 bytes
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
@@ -220,6 +229,15 @@ dist_func_t<float> Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm
         return Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
     }
 #endif
+#ifdef OPT_AVX2_FMA
+#ifdef OPT_F16C
+    if (features.avx2 && features.fma3 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
diff --git a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
new file mode 100644
index 000000000..5f9ad0db6
--- /dev/null
+++ b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/AVX_utils.h"
+#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_L2SqrSIMD16_AVX2_FMA(const void *pVect1v, const void *pVect2v, size_t dimension) {
+    const float ip = SQ8_FP16_InnerProductImp_AVX2_FMA<residual>(pVect1v, pVect2v, dimension);
+
+    const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
+    const uint8_t *params_bytes = pVect1 + dimension;
+    const float x_sum_sq =
+        load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+
+    const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
+    const float y_sum_sq =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float));
+
+    return x_sum_sq + y_sum_sq - 2.0f * ip;
+}
diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp
index eaf383443..79f6c21ae 100644
--- a/src/VecSim/spaces/L2_space.cpp
+++ b/src/VecSim/spaces/L2_space.cpp
@@ -126,6 +126,15 @@ dist_func_t<float> L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
         return Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim);
     }
 #endif
+#ifdef OPT_AVX2_FMA
+#ifdef OPT_F16C
+    if (features.avx2 && features.fma3 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
diff --git a/src/VecSim/spaces/functions/AVX2_FMA.cpp b/src/VecSim/spaces/functions/AVX2_FMA.cpp
index c859128b2..5745a4ddf 100644
--- a/src/VecSim/spaces/functions/AVX2_FMA.cpp
+++ b/src/VecSim/spaces/functions/AVX2_FMA.cpp
@@ -10,6 +10,11 @@
 #include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP32.h"
 #include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP32.h"
 
+#ifdef OPT_F16C
+#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h"
+#endif
+
 namespace spaces {
 
 #include "implementation_chooser.h"
@@ -31,6 +36,24 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim) {
     return ret_dist_func;
 }
 
+#ifdef OPT_F16C
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+#endif
+
 #include "implementation_chooser_cleanup.h"
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_FMA.h b/src/VecSim/spaces/functions/AVX2_FMA.h
index b20b1a588..413f55081 100644
--- a/src/VecSim/spaces/functions/AVX2_FMA.h
+++ b/src/VecSim/spaces/functions/AVX2_FMA.h
@@ -16,4 +16,10 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX2_FMA(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX2_FMA(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim);
 
+#ifdef OPT_F16C
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim);
+#endif
+
 } // namespace spaces
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 117886dba..32f5e6991 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3105,6 +3105,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
             << "AVX512 with dim " << dim;
         optimization.avx512f = 0;
     }
+#endif
+#ifdef OPT_AVX2_FMA
+#ifdef OPT_F16C
+    if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2+FMA with dim " << dim;
+        optimization.fma3 = 0;
+    }
+#endif
 #endif
 
     // Scalar fallback.
@@ -3145,6 +3158,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
             << "AVX512 with dim " << dim;
         optimization.avx512f = 0;
     }
+#endif
+#ifdef OPT_AVX2_FMA
+#ifdef OPT_F16C
+    if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2+FMA with dim " << dim;
+        optimization.fma3 = 0;
+    }
+#endif
 #endif
 
     unsigned char alignment = 0;
@@ -3184,6 +3210,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
             << "AVX512 with dim " << dim;
         optimization.avx512f = 0;
     }
+#endif
+#ifdef OPT_AVX2_FMA
+#ifdef OPT_F16C
+    if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2+FMA with dim " << dim;
+        optimization.fma3 = 0;
+    }
+#endif
 #endif
 
     unsigned char alignment = 0;

From 415c2ed64397656126c04e8811a0d5deef0acf11 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 11:37:19 +0300
Subject: [PATCH 06/24] =?UTF-8?q?Add=20AVX2=20(no=20FMA)=20SQ8=E2=86=94FP1?=
 =?UTF-8?q?6=20SIMD=20distance=20kernels=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps
instead of _mm256_fmadd_ps so it can run on AVX2 hardware that lacks
FMA support (uncommon, but matches the existing SQ8_FP32 tiering).
Dispatcher gate requires features.avx2 && features.f16c and runs
between the AVX2+FMA and SSE4 tiers.
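
To make the difference between the two tiers concrete, here is a
minimal scalar sketch of both accumulation styles. The names dot_fused
and dot_mul_add are hypothetical stand-ins, not functions from this
patch: the first rounds once per step through std::fma, the second
rounds the product and the sum separately. Note that a compiler may
still contract the separate form into a fused operation depending on
its floating-point contraction settings.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical scalar stand-in for the fused tier: one rounding per step.
inline float dot_fused(const float *a, const float *b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum = std::fma(a[i], b[i], sum);
    }
    return sum;
}

// Hypothetical scalar stand-in for the non-FMA tier: product and sum round separately.
inline float dot_mul_add(const float *a, const float *b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float prod = a[i] * b[i]; // rounded product
        sum = sum + prod;               // rounded sum
    }
    return sum;
}
```

The two forms agree exactly when every intermediate value is
representable and can differ in the last bits otherwise, so comparing
either tier against the scalar baseline with a tolerance, as the unit
tests do, is the safer check.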

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h | 91 +++++++++++++++++++++++++
 src/VecSim/spaces/IP_space.cpp          | 18 +++++
 src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h | 35 ++++++++++
 src/VecSim/spaces/L2_space.cpp          |  9 +++
 src/VecSim/spaces/functions/AVX2.cpp    | 23 +++++++
 src/VecSim/spaces/functions/AVX2.h      |  6 ++
 tests/unit/test_spaces.cpp              | 39 +++++++++++
 7 files changed, 221 insertions(+)
 create mode 100644 src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
 create mode 100644 src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h

diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
new file mode 100644
index 000000000..e68e5fa11
--- /dev/null
+++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
@@ -0,0 +1,91 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/AVX_utils.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+// 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum.
+static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1,
+                                                  const float16 *&pVect2,
+                                                  __m256 &sum256) {
+    __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect1));
+    pVect1 += 8;
+    __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
+    __m256 v1_f = _mm256_cvtepi32_ps(v1_256);
+
+    __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVect2));
+    __m256 v2_f = _mm256_cvtph_ps(v2_128);
+    pVect2 += 8;
+
+    sum256 = _mm256_add_ps(sum256, _mm256_mul_ps(v1_f, v2_f));
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
+    const uint8_t *pEnd1 = pVec1 + dimension;
+
+    __m256 sum256 = _mm256_setzero_ps();
+
+    if constexpr (residual % 8) {
+        constexpr int mask = (1 << (residual % 8)) - 1;
+
+        __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1));
+        pVec1 += residual % 8;
+        __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
+        __m256 v1_f = _mm256_cvtepi32_ps(v1_256);
+
+        __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec2));
+        __m256 v2_f = _mm256_cvtph_ps(v2_128);
+        v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask);
+        pVec2 += residual % 8;
+
+        sum256 = _mm256_mul_ps(v1_f, v2_f);
+    }
+
+    if constexpr (residual >= 8) {
+        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256);
+    }
+
+    do {
+        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256);
+        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256);
+    } while (pVec1 < pEnd1);
+
+    float quantized_dot = my_mm256_reduce_add_ps(sum256);
+
+    const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
+    const uint8_t *params_bytes = pVec1Base + dimension;
+    const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float));
+    const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float));
+
+    const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
+    const float y_sum =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+
+    return min_val * y_sum + delta * quantized_dot;
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return 1.0f - SQ8_FP16_InnerProductImp_AVX2<residual>(pVec1v, pVec2v, dimension);
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_CosineSIMD16_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return SQ8_FP16_InnerProductSIMD16_AVX2<residual>(pVec1v, pVec2v, dimension);
+}
diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index 1af5d2c35..68308d0b0 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -203,6 +203,15 @@ dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
     }
 #endif
 #endif
+#ifdef OPT_AVX2
+#ifdef OPT_F16C
+    if (features.avx2 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_IP_implementation_AVX2(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
@@ -238,6 +247,15 @@ dist_func_t<float> Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm
     }
 #endif
 #endif
+#ifdef OPT_AVX2
+#ifdef OPT_F16C
+    if (features.avx2 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_Cosine_implementation_AVX2(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
diff --git a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
new file mode 100644
index 000000000..86ec4b66e
--- /dev/null
+++ b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/AVX_utils.h"
+#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_L2SqrSIMD16_AVX2(const void *pVect1v, const void *pVect2v, size_t dimension) {
+    const float ip = SQ8_FP16_InnerProductImp_AVX2<residual>(pVect1v, pVect2v, dimension);
+
+    const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
+    const uint8_t *params_bytes = pVect1 + dimension;
+    const float x_sum_sq =
+        load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+
+    const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
+    const float y_sum_sq =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float));
+
+    return x_sum_sq + y_sum_sq - 2.0f * ip;
+}
diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp
index 79f6c21ae..2b6a31166 100644
--- a/src/VecSim/spaces/L2_space.cpp
+++ b/src/VecSim/spaces/L2_space.cpp
@@ -135,6 +135,15 @@ dist_func_t<float> L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
     }
 #endif
 #endif
+#ifdef OPT_AVX2
+#ifdef OPT_F16C
+    if (features.avx2 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_L2_implementation_AVX2(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
diff --git a/src/VecSim/spaces/functions/AVX2.cpp b/src/VecSim/spaces/functions/AVX2.cpp
index 322ed0aec..7e229b003 100644
--- a/src/VecSim/spaces/functions/AVX2.cpp
+++ b/src/VecSim/spaces/functions/AVX2.cpp
@@ -13,6 +13,11 @@
 #include "VecSim/spaces/IP/IP_AVX2_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX2_SQ8_FP32.h"
 
+#ifdef OPT_F16C
+#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h"
+#endif
+
 namespace spaces {
 
 #include "implementation_chooser.h"
@@ -47,6 +52,24 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim) {
     return ret_dist_func;
 }
 
+#ifdef OPT_F16C
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2);
+    return ret_dist_func;
+}
+#endif
+
 #include "implementation_chooser_cleanup.h"
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2.h b/src/VecSim/spaces/functions/AVX2.h
index 081c42a4e..45fa2c951 100644
--- a/src/VecSim/spaces/functions/AVX2.h
+++ b/src/VecSim/spaces/functions/AVX2.h
@@ -19,4 +19,10 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim);
 dist_func_t<float> Choose_BF16_IP_implementation_AVX2(size_t dim);
 dist_func_t<float> Choose_BF16_L2_implementation_AVX2(size_t dim);
 
+#ifdef OPT_F16C
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim);
+#endif
+
 } // namespace spaces
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 32f5e6991..968294eac 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3118,6 +3118,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
         optimization.fma3 = 0;
     }
 #endif
+#endif
+#ifdef OPT_AVX2
+#ifdef OPT_F16C
+    if (optimization.avx2 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX2(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2 with dim " << dim;
+        optimization.avx2 = 0;
+    }
+#endif
 #endif
 
     // Scalar fallback.
@@ -3171,6 +3184,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
         optimization.fma3 = 0;
     }
 #endif
+#endif
+#ifdef OPT_AVX2
+#ifdef OPT_F16C
+    if (optimization.avx2 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX2(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2 with dim " << dim;
+        optimization.avx2 = 0;
+    }
+#endif
 #endif
 
     unsigned char alignment = 0;
@@ -3223,6 +3249,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
         optimization.fma3 = 0;
     }
 #endif
+#endif
+#ifdef OPT_AVX2
+#ifdef OPT_F16C
+    if (optimization.avx2 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX2(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2 with dim " << dim;
+        optimization.avx2 = 0;
+    }
+#endif
 #endif
 
     unsigned char alignment = 0;

From 25c5a96d6cfe1deac8cd3275633f247bb0c39e52 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 11:41:10 +0300
Subject: [PATCH 07/24] =?UTF-8?q?Add=20SSE4+F16C=20SQ8=E2=86=94FP16=20SIMD?=
 =?UTF-8?q?=20distance=20kernels=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps
and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit
FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised
via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing
SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires
features.sse4_1 && features.f16c && features.avx since F16C is
VEX-encoded — matches the existing F16C/FP16 dispatcher gate.
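
For reviewers, the algebra the kernels rely on can be sketched in scalar
form. This is a hypothetical reference, not code from this patch: the
function names are made up, and the query lanes are plain float so the
sketch stays portable (the real kernels read FP16 lanes and take y_sum
from the precomputed query metadata). It shows that factoring min and
delta out of the loop gives the same inner product as dequantizing each
lane first.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference; names are illustrative, not from the patch.
// Factored form: IP = min * sum(y_i) + delta * sum(q_i * y_i).
float sq8_ip_factored(const uint8_t *q, const float *y, size_t dim, float min_val, float delta) {
    assert(q != nullptr && y != nullptr);
    float quantized_dot = 0.0f; // sum of q_i * y_i, the part the SIMD loop accumulates
    float y_sum = 0.0f;         // assumed to be precomputed metadata in the real layout
    for (size_t i = 0; i < dim; ++i) {
        quantized_dot += static_cast<float>(q[i]) * y[i];
        y_sum += y[i];
    }
    return min_val * y_sum + delta * quantized_dot;
}

// Direct form: dequantize each lane (x_i = min + delta * q_i), then multiply.
float sq8_ip_direct(const uint8_t *q, const float *y, size_t dim, float min_val, float delta) {
    assert(q != nullptr && y != nullptr);
    float ip = 0.0f;
    for (size_t i = 0; i < dim; ++i) {
        ip += (min_val + delta * static_cast<float>(q[i])) * y[i];
    }
    return ip;
}
```

The IP and Cosine entry points then return 1 - IP, and the L2 entry
point combines the same IP with the stored sums of squares.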

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h | 109 ++++++++++++++++++++++++
 src/VecSim/spaces/IP_space.cpp          |  19 +++++
 src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h |  35 ++++++++
 src/VecSim/spaces/L2_space.cpp          |   9 ++
 src/VecSim/spaces/functions/SSE4.cpp    |  23 +++++
 src/VecSim/spaces/functions/SSE4.h      |   6 ++
 tests/unit/test_spaces.cpp              |  39 +++++++++
 7 files changed, 240 insertions(+)
 create mode 100644 src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
 create mode 100644 src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h

diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
new file mode 100644
index 000000000..8fd0e56c1
--- /dev/null
+++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
@@ -0,0 +1,109 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+// 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum.
+static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1,
+                                                  const float16 *&pVect2,
+                                                  __m128 &sum) {
+    __m128i v1_i =
+        _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*reinterpret_cast<const int32_t *>(pVect1)));
+    pVect1 += 4;
+    __m128 v1_f = _mm_cvtepi32_ps(v1_i);
+
+    __m128i v2_8 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect2));
+    __m128 v2_f = _mm_cvtph_ps(v2_8);
+    pVect2 += 4;
+
+    sum = _mm_add_ps(sum, _mm_mul_ps(v1_f, v2_f));
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2v,
+                                           size_t dimension) {
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
+    const uint8_t *pEnd1 = pVec1 + dimension;
+
+    __m128 sum = _mm_setzero_ps();
+
+    if constexpr (residual % 4) {
+        __m128 v1_f;
+        __m128 v2_f;
+
+        if constexpr (residual % 4 == 3) {
+            v1_f = _mm_set_ps(0.0f, static_cast<float>(pVec1[2]),
+                              static_cast<float>(pVec1[1]),
+                              static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, vecsim_types::FP16_to_FP32(pVec2[2]),
+                              vecsim_types::FP16_to_FP32(pVec2[1]),
+                              vecsim_types::FP16_to_FP32(pVec2[0]));
+        } else if constexpr (residual % 4 == 2) {
+            v1_f = _mm_set_ps(0.0f, 0.0f, static_cast<float>(pVec1[1]),
+                              static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, 0.0f, vecsim_types::FP16_to_FP32(pVec2[1]),
+                              vecsim_types::FP16_to_FP32(pVec2[0]));
+        } else if constexpr (residual % 4 == 1) {
+            v1_f = _mm_set_ps(0.0f, 0.0f, 0.0f, static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, 0.0f, 0.0f, vecsim_types::FP16_to_FP32(pVec2[0]));
+        }
+
+        pVec1 += residual % 4;
+        pVec2 += residual % 4;
+
+        sum = _mm_mul_ps(v1_f, v2_f);
+    }
+
+    if constexpr (residual >= 4) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+    }
+    if constexpr (residual >= 8) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+    }
+    if constexpr (residual >= 12) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+    }
+
+    do {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+    } while (pVec1 < pEnd1);
+
+    float PORTABLE_ALIGN16 TmpRes[4];
+    _mm_store_ps(TmpRes, sum);
+    float quantized_dot = TmpRes[0] + TmpRes[1] + TmpRes[2] + TmpRes[3];
+
+    const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
+    const uint8_t *params_bytes = pVec1Base + dimension;
+    const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float));
+    const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float));
+
+    const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
+    const float y_sum =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+
+    return min_val * y_sum + delta * quantized_dot;
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_SSE4(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return 1.0f - SQ8_FP16_InnerProductSIMD16_SSE4_IMP<residual>(pVec1v, pVec2v, dimension);
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_CosineSIMD16_SSE4(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return SQ8_FP16_InnerProductSIMD16_SSE4<residual>(pVec1v, pVec2v, dimension);
+}
diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index 68308d0b0..37ffc9ed4 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -212,6 +212,16 @@ dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
     }
 #endif
 #endif
+#ifdef OPT_SSE4
+#ifdef OPT_F16C
+    // F16C is VEX-encoded — require AVX as well, matching the existing F16C/FP16 dispatcher.
+    if (features.sse4_1 && features.f16c && features.avx) {
+        if (dim % 4 == 0)
+            *alignment = 4 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_IP_implementation_SSE4(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
@@ -256,6 +266,15 @@ dist_func_t<float> Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm
     }
 #endif
 #endif
+#ifdef OPT_SSE4
+#ifdef OPT_F16C
+    if (features.sse4_1 && features.f16c && features.avx) {
+        if (dim % 4 == 0)
+            *alignment = 4 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_Cosine_implementation_SSE4(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
diff --git a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
new file mode 100644
index 000000000..b43492858
--- /dev/null
+++ b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_L2SqrSIMD16_SSE4(const void *pVect1v, const void *pVect2v, size_t dimension) {
+    const float ip =
+        SQ8_FP16_InnerProductSIMD16_SSE4_IMP<residual>(pVect1v, pVect2v, dimension);
+
+    const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
+    const uint8_t *params_bytes = pVect1 + dimension;
+    const float x_sum_sq =
+        load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+
+    const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
+    const float y_sum_sq =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float));
+
+    return x_sum_sq + y_sum_sq - 2.0f * ip;
+}
diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp
index 2b6a31166..ab5188800 100644
--- a/src/VecSim/spaces/L2_space.cpp
+++ b/src/VecSim/spaces/L2_space.cpp
@@ -144,6 +144,15 @@ dist_func_t<float> L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
     }
 #endif
 #endif
+#ifdef OPT_SSE4
+#ifdef OPT_F16C
+    if (features.sse4_1 && features.f16c && features.avx) {
+        if (dim % 4 == 0)
+            *alignment = 4 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_L2_implementation_SSE4(dim);
+    }
+#endif
+#endif
 #endif // x86_64
     return ret_dist_func;
 }
diff --git a/src/VecSim/spaces/functions/SSE4.cpp b/src/VecSim/spaces/functions/SSE4.cpp
index 5f5bbc1ba..e41762955 100644
--- a/src/VecSim/spaces/functions/SSE4.cpp
+++ b/src/VecSim/spaces/functions/SSE4.cpp
@@ -10,6 +10,11 @@
 #include "VecSim/spaces/IP/IP_SSE4_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_SSE4_SQ8_FP32.h"
 
+#ifdef OPT_F16C
+#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h"
+#endif
+
 namespace spaces {
 
 #include "implementation_chooser.h"
@@ -32,6 +37,24 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim) {
     return ret_dist_func;
 }
 
+#ifdef OPT_F16C
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_SSE4);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_SSE4);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_SSE4);
+    return ret_dist_func;
+}
+#endif
+
 #include "implementation_chooser_cleanup.h"
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4.h b/src/VecSim/spaces/functions/SSE4.h
index e47948137..c33187983 100644
--- a/src/VecSim/spaces/functions/SSE4.h
+++ b/src/VecSim/spaces/functions/SSE4.h
@@ -16,4 +16,10 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_SSE4(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_SSE4(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim);
 
+#ifdef OPT_F16C
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim);
+#endif
+
 } // namespace spaces
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 968294eac..61d3ce4af 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3131,6 +3131,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
         optimization.avx2 = 0;
     }
 #endif
+#endif
+#ifdef OPT_SSE4
+#ifdef OPT_F16C
+    if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
+        unsigned char alignment = 0;
+        arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SSE4(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "SSE4 with dim " << dim;
+        optimization.sse4_1 = 0;
+    }
+#endif
 #endif
 
     // Scalar fallback.
@@ -3197,6 +3210,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
         optimization.avx2 = 0;
     }
 #endif
+#endif
+#ifdef OPT_SSE4
+#ifdef OPT_F16C
+    if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
+        unsigned char alignment = 0;
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_SSE4(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "SSE4 with dim " << dim;
+        optimization.sse4_1 = 0;
+    }
+#endif
 #endif
 
     unsigned char alignment = 0;
@@ -3262,6 +3288,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
         optimization.avx2 = 0;
     }
 #endif
+#endif
+#ifdef OPT_SSE4
+#ifdef OPT_F16C
+    if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
+        unsigned char alignment = 0;
+        arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_SSE4(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "SSE4 with dim " << dim;
+        optimization.sse4_1 = 0;
+    }
+#endif
 #endif
 
     unsigned char alignment = 0;

From 4b7f3eb537c2c2c9e18aca79230d660b83a27639 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 11:42:52 +0300
Subject: [PATCH 08/24] Update SQ8_FP16 dispatcher assertions to walk SIMD
 tiers [MOD-14954]

The SQ8_FP16 GetDistFunc dispatchers now return the AVX-512 / AVX2+FMA /
AVX2 / SSE4 SIMD kernels when the corresponding feature flags are set;
previously they always returned the scalar implementations. Update the
GetDistFuncSQ8FP16Asymmetric assertions to compute the expected function
for the host's highest supported tier.
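
The tier order the test walks can be sketched as a plain function. This
is a hypothetical model, not code from this patch: the Features struct,
the Tier enum and pick_tier are illustrative names, and the sketch
ignores the compile-time OPT_* gates that also decide whether a tier is
built at all.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the runtime feature flags the dispatcher consults.
struct Features {
    bool avx512f, avx512bw, avx512vl, avx512vnni;
    bool avx2, fma3, f16c, avx, sse4_1;
};

enum class Tier { Scalar, SSE4, AVX2, AVX2_FMA, AVX512 };

// Returns the highest tier whose runtime gate passes, checked from widest to narrowest.
Tier pick_tier(const Features &f, size_t dim) {
    if (dim < 16) // the test only expects a SIMD kernel when dim >= 16
        return Tier::Scalar;
    if (f.avx512f && f.avx512bw && f.avx512vl && f.avx512vnni)
        return Tier::AVX512; // no F16C flag in this gate
    if (f.avx2 && f.fma3 && f.f16c)
        return Tier::AVX2_FMA;
    if (f.avx2 && f.f16c)
        return Tier::AVX2;
    if (f.sse4_1 && f.f16c && f.avx) // F16C is VEX-encoded, so AVX is required too
        return Tier::SSE4;
    return Tier::Scalar;
}
```

A host with AVX2 and F16C but no FMA3 lands on the plain AVX2 tier, and
a host with SSE4.1 and F16C but no AVX falls back to scalar.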

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 tests/unit/test_spaces.cpp | 54 +++++++++++++++++++++++++++++++++-----
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 61d3ce4af..53c3a011b 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -560,19 +560,61 @@ TEST_F(SpacesTest, GetDistFuncSQ8Asymmetric) {
 }
 
 TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) {
-    // SQ8 storage with FP16 query (asymmetric) - should return scalar SQ8_FP16 functions.
-    // SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now the
-    // dispatcher returns the scalar implementations regardless of dim or arch.
+    // SQ8 storage with FP16 query (asymmetric). The dispatcher now returns the highest SIMD
+    // tier available at runtime; assert that and fall back to scalar only if no tier matches.
     size_t dim = 128;
     auto l2_func = spaces::GetDistFunc<sq8, float, float16>(VecSimMetric_L2, dim, nullptr);
     auto ip_func = spaces::GetDistFunc<sq8, float, float16>(VecSimMetric_IP, dim, nullptr);
     auto cosine_func = spaces::GetDistFunc<sq8, float, float16>(VecSimMetric_Cosine, dim, nullptr);
+
+    auto optimization = getCpuOptimizationFeatures();
+    dist_func_t<float> expected_l2 = SQ8_FP16_L2Sqr;
+    dist_func_t<float> expected_ip = SQ8_FP16_InnerProduct;
+    dist_func_t<float> expected_cos = SQ8_FP16_Cosine;
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+    if (dim >= 16) {
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+        if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
+            optimization.avx512vnni) {
+            expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim);
+            expected_ip = Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim);
+            expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
+        } else
+#endif
+#if defined(OPT_AVX2_FMA) && defined(OPT_F16C)
+            if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
+            expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim);
+            expected_ip = Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim);
+            expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim);
+        } else
+#endif
+#if defined(OPT_AVX2) && defined(OPT_F16C)
+            if (optimization.avx2 && optimization.f16c) {
+            expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX2(dim);
+            expected_ip = Choose_SQ8_FP16_IP_implementation_AVX2(dim);
+            expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX2(dim);
+        } else
+#endif
+#if defined(OPT_SSE4) && defined(OPT_F16C)
+            if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
+            expected_l2 = Choose_SQ8_FP16_L2_implementation_SSE4(dim);
+            expected_ip = Choose_SQ8_FP16_IP_implementation_SSE4(dim);
+            expected_cos = Choose_SQ8_FP16_Cosine_implementation_SSE4(dim);
+        } else
+#endif
+        {
+            // Falls through to scalar.
+        }
+    }
+#endif // x86_64
+
     ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr));
-    ASSERT_EQ(l2_func, SQ8_FP16_L2Sqr);
-    ASSERT_EQ(ip_func, SQ8_FP16_InnerProduct);
-    ASSERT_EQ(cosine_func, SQ8_FP16_Cosine);
+    ASSERT_EQ(l2_func, expected_l2);
+    ASSERT_EQ(ip_func, expected_ip);
+    ASSERT_EQ(cosine_func, expected_cos);
 }
 
 #ifdef CPU_FEATURES_ARCH_X86_64

From e21cb3b1566bc418870090b66ff823d5a4e68885 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 11:44:10 +0300
Subject: [PATCH 09/24] =?UTF-8?q?Register=20per-ISA=20SQ8=E2=86=94FP16=20m?=
 =?UTF-8?q?icrobenchmarks=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to
bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier
on the corresponding OPT_* defines plus the runtime feature checks
that mirror the dispatcher in IP_space.cpp / L2_space.cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp  | 38 ++++++++++++++++++-
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
index 2133a047e..75ede0eb8 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
@@ -50,8 +50,42 @@ class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture {
     }
 };
 
-// Naive (scalar) algorithms. SIMD chooser slots will be added by P1b (MOD-15152) and
-// P1c (MOD-15153), following the SQ8_FP32 layout in bm_spaces_sq8_fp32.cpp.
+#ifdef CPU_FEATURES_ARCH_X86_64
+cpu_features::X86Features opt = cpu_features::GetX86Info().features;
+
+// AVX-512 F+BW+VL+VNNI (no F16C requirement — _mm512_cvtph_ps is part of AVX512F).
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16,
+                                avx512_f_bw_vl_vnni_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16,
+                                 avx512_f_bw_vl_vnni_supported);
+#endif
+
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+bool avx2_fma3_f16c_supported = opt.avx2 && opt.fma3 && opt.f16c;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16,
+                                avx2_fma3_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16,
+                                 avx2_fma3_f16c_supported);
+#endif
+
+#ifdef OPT_AVX2
+bool avx2_f16c_supported = opt.avx2 && opt.f16c;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported);
+#endif
+
+#ifdef OPT_SSE4
+bool sse4_f16c_supported = opt.sse4_1 && opt.f16c && opt.avx;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported);
+#endif
+#endif // OPT_F16C
+#endif // x86_64
+
+// Naive (scalar) baseline — always registered as the comparison anchor.
 
 INITIALIZE_NAIVE_BM(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, InnerProduct, 16);
 INITIALIZE_NAIVE_BM(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, Cosine, 16);

From 4c8828e8d7d76b6b388288d4b10805c1488e6b9b Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Tue, 26 May 2026 14:07:31 +0300
Subject: [PATCH 10/24] =?UTF-8?q?Reformat=20SQ8=E2=86=94FP16=20SIMD=20kern?=
 =?UTF-8?q?els=20for=20consistent=20line=20breaks?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h      |  6 ++----
 src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h          |  6 ++----
 .../spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h   |  6 ++----
 src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h          | 16 ++++++----------
 src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h      |  3 +--
 src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h          |  3 +--
 .../spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h   |  3 +--
 src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h          |  6 ++----
 8 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
index 1d6b4e676..130fe4eb0 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
@@ -18,8 +18,7 @@ using float16 = vecsim_types::float16;
 
 // 8-wide AVX2+FMA step: 8 SQ8 lanes + 8 FP16 lanes -> 8 FP32 fused-multiply-add.
 static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1,
-                                                      const float16 *&pVect2,
-                                                      __m256 &sum256) {
+                                                      const float16 *&pVect2, __m256 &sum256) {
     __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect1));
     pVect1 += 8;
     __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
@@ -77,8 +76,7 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v,
 
     const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
-    const float y_sum =
-        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+    const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
 
     return min_val * y_sum + delta * quantized_dot;
 }
diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
index e68e5fa11..1e29fe63d 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
@@ -17,8 +17,7 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
 // 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum.
-static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1,
-                                                  const float16 *&pVect2,
+static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const float16 *&pVect2,
                                                   __m256 &sum256) {
     __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect1));
     pVect1 += 8;
@@ -74,8 +73,7 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size
 
     const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
-    const float y_sum =
-        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+    const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
 
     return min_val * y_sum + delta * quantized_dot;
 }
diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
index 55d63d711..62532c56c 100644
--- a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
@@ -17,8 +17,7 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
 // Helper: load 16 SQ8 + 16 FP16 lanes, widen both to FP32, fused-multiply-add into sum.
-static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1,
-                                                    const float16 *&pVec2,
+static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const float16 *&pVec2,
                                                     __m512 &sum) {
     // 16 uint8 -> 16 fp32
     __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1));
@@ -81,8 +80,7 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
     // 2-byte aligned only.
     const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
-    const float y_sum =
-        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+    const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
 
     return min_val * y_sum + delta * quantized_dot;
 }
diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
index 8fd0e56c1..43b61fd25 100644
--- a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
@@ -16,11 +16,9 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
 // 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum.
-static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1,
-                                                  const float16 *&pVect2,
+static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const float16 *&pVect2,
                                                   __m128 &sum) {
-    __m128i v1_i =
-        _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*reinterpret_cast<const int32_t *>(pVect1)));
+    __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*reinterpret_cast<const int32_t *>(pVect1)));
     pVect1 += 4;
     __m128 v1_f = _mm_cvtepi32_ps(v1_i);
 
@@ -45,15 +43,14 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
         __m128 v2_f;
 
         if constexpr (residual % 4 == 3) {
-            v1_f = _mm_set_ps(0.0f, static_cast<float>(pVec1[2]),
-                              static_cast<float>(pVec1[1]),
+            v1_f = _mm_set_ps(0.0f, static_cast<float>(pVec1[2]), static_cast<float>(pVec1[1]),
                               static_cast<float>(pVec1[0]));
             v2_f = _mm_set_ps(0.0f, vecsim_types::FP16_to_FP32(pVec2[2]),
                               vecsim_types::FP16_to_FP32(pVec2[1]),
                               vecsim_types::FP16_to_FP32(pVec2[0]));
         } else if constexpr (residual % 4 == 2) {
-            v1_f = _mm_set_ps(0.0f, 0.0f, static_cast<float>(pVec1[1]),
-                              static_cast<float>(pVec1[0]));
+            v1_f =
+                _mm_set_ps(0.0f, 0.0f, static_cast<float>(pVec1[1]), static_cast<float>(pVec1[0]));
             v2_f = _mm_set_ps(0.0f, 0.0f, vecsim_types::FP16_to_FP32(pVec2[1]),
                               vecsim_types::FP16_to_FP32(pVec2[0]));
         } else if constexpr (residual % 4 == 1) {
@@ -92,8 +89,7 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
 
     const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
-    const float y_sum =
-        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+    const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
 
     return min_val * y_sum + delta * quantized_dot;
 }
diff --git a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
index 5f9ad0db6..38809e9c2 100644
--- a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
@@ -23,8 +23,7 @@ float SQ8_FP16_L2SqrSIMD16_AVX2_FMA(const void *pVect1v, const void *pVect2v, si
 
     const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
     const uint8_t *params_bytes = pVect1 + dimension;
-    const float x_sum_sq =
-        load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+    const float x_sum_sq = load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
 
     const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
diff --git a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
index 86ec4b66e..98bb29c05 100644
--- a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
@@ -23,8 +23,7 @@ float SQ8_FP16_L2SqrSIMD16_AVX2(const void *pVect1v, const void *pVect2v, size_t
 
     const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
     const uint8_t *params_bytes = pVect1 + dimension;
-    const float x_sum_sq =
-        load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+    const float x_sum_sq = load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
 
     const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
diff --git a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
index 101bf285e..635f30904 100644
--- a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
@@ -24,8 +24,7 @@ float SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI(const void *pVect1v, const void *p
 
     const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
     const uint8_t *params_bytes = pVect1 + dimension;
-    const float x_sum_sq =
-        load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+    const float x_sum_sq = load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
 
     const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
diff --git a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
index b43492858..75bbd46f8 100644
--- a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
@@ -18,13 +18,11 @@ using float16 = vecsim_types::float16;
 
 template <unsigned char residual> // 0..15
 float SQ8_FP16_L2SqrSIMD16_SSE4(const void *pVect1v, const void *pVect2v, size_t dimension) {
-    const float ip =
-        SQ8_FP16_InnerProductSIMD16_SSE4_IMP<residual>(pVect1v, pVect2v, dimension);
+    const float ip = SQ8_FP16_InnerProductSIMD16_SSE4_IMP<residual>(pVect1v, pVect2v, dimension);
 
     const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
     const uint8_t *params_bytes = pVect1 + dimension;
-    const float x_sum_sq =
-        load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+    const float x_sum_sq = load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
 
     const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);

From fdc5c1cd04603cd6d3d007b48b52126347736cf0 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 10:56:09 +0300
Subject: [PATCH 11/24] =?UTF-8?q?Address=20PR=20review=20findings=20for=20?=
 =?UTF-8?q?SQ8=E2=86=94FP16=20x86=20kernels=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- CMake: gate `-mf16c` on CXX_F16C AND CXX_FMA AND CXX_AVX (matches OPT_F16C
  macro) and append `-mavx` to the SSE4 dispatcher when adding -mf16c, since
  F16C is VEX-encoded and requires AVX state. Mirrors the existing F16C.cpp
  recipe and prevents miscompiles on toolchains with F16C but without AVX.
- IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast<const int32_t *>(pVect1)`
  with `load_unaligned<int32_t>(pVect1)` to remove strict-aliasing UB on
  the uint8_t SQ8 lane load.
- IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell out
  the asymmetric-mask reasoning (SQ8 unmasked is safe because the FP16
  query blend forces those FP32 query lanes to 0 → garbage·0=0).
- IP_{AVX512F_BW_VL_VNNI,AVX2,AVX2_FMA,SSE4}_SQ8_FP16.h: add the
  `IP = min·y_sum + delta·Σ(q·y)` algebraic-identity comment header to the
  AVX2, AVX2+FMA, and SSE4 kernels (AVX-512 already carried it), plus a
  precondition note on all four that callers must enforce dim >= 16
  (matches the established SQ8_FP32 convention; no runtime assert because
  sibling SQ8_FP32 SIMD kernels also rely on the dispatcher gate).
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery,
  ConstantStorage, MixedSignQuery) through {IP,Cosine,L2}_SQ8_FP16_GetDistFunc
  so the runtime-selected SIMD tier is actually exercised on those inputs,
  not just the scalar reference.
- test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims {64, 128, 256,
  512, 1024} so multi-iteration do-while loop bugs would fire (the existing
  [16, 32] range covers at most two AVX-512 chunk iterations).
- test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised — a
  single test that emits per-tier coverage to stderr and GTEST_SKIPs when
  no SIMD tier is available, so CI runners without AVX-512 do not silently
  report zero tier-1 coverage.
- test_spaces.cpp: scalar-fallback `alignment` checks now seed the value
  with 0xFF and assert it remains 0xFF, verifying the dispatcher contract
  ("scalar leaves caller's value untouched") instead of just measuring
  that the variable's pre-zeroed init survived.
- test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment
  on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired.
- bm_spaces_sq8_fp16.cpp: drop the matching stale comment.

Out of scope (separate ticket): two-accumulator FMA refactor (also affects
SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity.
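
For the strict-aliasing fix listed above, a memcpy-based load is the usual
way to read a wider integer out of a byte buffer without casting the
pointer. The sketch below is an assumption about the general shape of such
a helper; `load_unaligned_sketch` is a hypothetical stand-in, and the
project's actual `load_unaligned<T>` may be implemented differently.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for load_unaligned<T>: copies sizeof(T) bytes into a
// properly typed object, so no misaligned or type-punned dereference occurs.
template <typename T>
T load_unaligned_sketch(const void *p) {
    T value;
    std::memcpy(&value, p, sizeof(T));
    return value;
}
```

Compilers typically lower a fixed-size memcpy like this to a single
unaligned load, so the safer form is not expected to cost anything in the
SSE4 step.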

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 src/VecSim/spaces/CMakeLists.txt              |  19 ++-
 src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h   |  19 ++-
 src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h       |  18 +++
 .../IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h       |   3 +
 src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h       |  16 ++-
 .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp  |   5 +-
 tests/unit/test_spaces.cpp                    | 128 ++++++++++++++----
 7 files changed, 173 insertions(+), 35 deletions(-)

diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt
index 8babf844b..a580916d2 100644
--- a/src/VecSim/spaces/CMakeLists.txt
+++ b/src/VecSim/spaces/CMakeLists.txt
@@ -50,9 +50,16 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 		list(APPEND OPTIMIZATIONS functions/AVX512F_BW_VL_VNNI.cpp)
 	endif()
 
+	# F16C is VEX-encoded and requires AVX state, so it is only meaningful when the toolchain
+	# can also emit AVX/FMA. Mirrors the OPT_F16C macro condition in x86_64InstructionFlags.cmake.
+	set(_has_full_f16c FALSE)
+	if(CXX_F16C AND CXX_FMA AND CXX_AVX)
+		set(_has_full_f16c TRUE)
+	endif()
+
 	if(CXX_AVX2)
 		set(_avx2_flags "-mavx2")
-		if(CXX_F16C)
+		if(_has_full_f16c)
 			message("Building functions/AVX2.cpp with AVX2 and F16C")
 			set(_avx2_flags "${_avx2_flags} -mf16c")
 		else()
@@ -64,7 +71,7 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 
 	if(CXX_AVX2 AND CXX_FMA)
 		set(_avx2_fma_flags "-mavx2 -mfma")
-		if(CXX_F16C)
+		if(_has_full_f16c)
 			message("Building functions/AVX2_FMA.cpp with AVX2, FMA, and F16C")
 			set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c")
 		else()
@@ -94,9 +101,11 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 
 	if(CXX_SSE4)
 		set(_sse4_flags "-msse4.1")
-		if(CXX_F16C)
-			message("Building functions/SSE4.cpp with SSE4.1 and F16C")
-			set(_sse4_flags "${_sse4_flags} -mf16c")
+		if(_has_full_f16c)
+			# F16C is VEX-encoded → must compile with -mavx alongside -mf16c, matching the
+			# F16C.cpp recipe above.
+			message("Building functions/SSE4.cpp with SSE4.1, AVX, and F16C")
+			set(_sse4_flags "${_sse4_flags} -mavx -mf16c")
 		else()
 			message("Building functions/SSE4.cpp with SSE4.1 (no F16C)")
 		endif()
diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
index 130fe4eb0..eda8b393e 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
@@ -16,6 +16,17 @@
 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
+/*
+ * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = Σ(x_i * y_i)
+ *            ≈ Σ((min + delta * q_i) * y_i)
+ *            = min * Σy_i + delta * Σ(q_i * y_i)
+ *            = min * y_sum + delta * quantized_dot_product
+ *
+ * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C);
+ * inner-loop arithmetic runs in FP32 with _mm256_fmadd_ps.
+ */
+
 // 8-wide AVX2+FMA step: 8 SQ8 lanes + 8 FP16 lanes -> 8 FP32 fused-multiply-add.
 static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1,
                                                       const float16 *&pVect2, __m256 &sum256) {
@@ -31,6 +42,9 @@ static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1,
     sum256 = _mm256_fmadd_ps(v1_f, v2_f, sum256);
 }
 
+// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
+// The residual block reads 8 SQ8 bytes and 16 FP16 bytes unconditionally, so a shorter blob
+// could be read out of bounds.
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) {
     const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
@@ -42,8 +56,9 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v,
     if constexpr (residual % 8) {
         constexpr int mask = (1 << (residual % 8)) - 1;
 
-        // SQ8 side: load 8 bytes regardless of residual; unused lanes are zeroed by the blend on
-        // the FP32 query.
+        // Single-side mask is sufficient: SQ8 lanes beyond `residual` may hold garbage, but the
+        // FP16 query blend below forces those FP32 query lanes to 0, so garbage·0=0 contributes
+        // nothing to the dot product. SQ8 load is intentionally unmasked.
         __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1));
         pVec1 += residual % 8;
         __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
index 1e29fe63d..028d7d3e0 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
@@ -16,6 +16,18 @@
 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
+/*
+ * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = Σ(x_i * y_i)
+ *            ≈ Σ((min + delta * q_i) * y_i)
+ *            = min * Σy_i + delta * Σ(q_i * y_i)
+ *            = min * y_sum + delta * quantized_dot_product
+ *
+ * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C);
+ * inner-loop arithmetic runs in FP32 with separate _mm256_mul_ps + _mm256_add_ps
+ * (the no-FMA tier, for AVX2 hosts that do not report FMA support).
+ */
+
 // 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum.
 static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const float16 *&pVect2,
                                                   __m256 &sum256) {
@@ -31,6 +43,9 @@ static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const
     sum256 = _mm256_add_ps(sum256, _mm256_mul_ps(v1_f, v2_f));
 }
 
+// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
+// The residual block reads 8 SQ8 bytes and 16 FP16 bytes unconditionally, so a shorter blob
+// could be read out of bounds.
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) {
     const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
@@ -42,6 +57,9 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size
     if constexpr (residual % 8) {
         constexpr int mask = (1 << (residual % 8)) - 1;
 
+        // Single-side mask is sufficient: SQ8 lanes beyond `residual` may hold garbage, but the
+        // FP16 query blend below forces those FP32 query lanes to 0, so garbage·0=0 contributes
+        // nothing to the dot product. SQ8 load is intentionally unmasked.
         __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1));
         pVec1 += residual % 8;
         __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
index 62532c56c..07f5d3456 100644
--- a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
@@ -36,6 +36,9 @@ static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const
 
 // Raw inner product Σ((min + delta * q_i) * y_i). Used by both InnerProduct/Cosine wrappers
 // and by the L2 kernel.
+// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp, which gates
+// this. The residual block reads 16 SQ8 bytes and 32 FP16 bytes unconditionally, so a shorter
+// blob could be read out of bounds.
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, size_t dimension) {
     const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v); // SQ8 storage
diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
index 43b61fd25..e5ca51860 100644
--- a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
@@ -15,10 +15,22 @@
 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
+/*
+ * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = Σ(x_i * y_i)
+ *            ≈ Σ((min + delta * q_i) * y_i)
+ *            = min * Σy_i + delta * Σ(q_i * y_i)
+ *            = min * y_sum + delta * quantized_dot_product
+ *
+ * FP16 query lanes are widened to FP32 per 4-lane chunk via _mm_cvtph_ps (F16C);
+ * inner-loop arithmetic runs in FP32 with separate _mm_mul_ps + _mm_add_ps (SSE4 has no FMA).
+ */
+
 // 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum.
 static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const float16 *&pVect2,
                                                   __m128 &sum) {
-    __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*reinterpret_cast<const int32_t *>(pVect1)));
+    // Alignment-safe 4-byte load of SQ8 lanes via load_unaligned<int32_t> (no strict-aliasing UB).
+    __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(load_unaligned<int32_t>(pVect1)));
     pVect1 += 4;
     __m128 v1_f = _mm_cvtepi32_ps(v1_i);
 
@@ -29,6 +41,8 @@ static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const
     sum = _mm_add_ps(sum, _mm_mul_ps(v1_f, v2_f));
 }
 
+// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
+// Shorter blobs would be over-read by the residual ladder and the final do-while loop.
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2v,
                                            size_t dimension) {
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
index 75ede0eb8..f81a9d89d 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
@@ -15,8 +15,9 @@ using float16 = vecsim_types::float16;
 
 /**
  * SQ8-to-FP16 benchmarks: SQ8 quantized storage with FP16 query.
- * Only naive (scalar) benchmarks are registered for now; SIMD chooser symbols are added
- * by P1b (MOD-15152, x86) and P1c (MOD-15153, ARM).
+ * Registers the naive (scalar) baseline plus per-ISA SIMD variants (x86: AVX-512 / AVX2+FMA /
+ * AVX2 / SSE4 — gated on the matching OPT_* defines and runtime CPU features). ARM kernels
+ * land via MOD-14972.
  */
 class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture {
 protected:
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 53c3a011b..2cccd1183 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3042,8 +3042,9 @@ TEST(SQ8_FP32_EdgeCases, CosineExtremeValuesTest) {
 
 // Parameterized tests that verify the scalar SQ8_FP16 kernels against the not-optimized
 // baseline across multiple dimensions, including odd dimensions and SIMD-boundary residues.
-// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); the dispatcher always
-// returns the scalar implementation for now.
+// The SIMD-tier dispatcher coverage lives in SQ8_FP16_SpacesOptimizationTest below; this
+// suite intentionally exercises the scalar reference directly to keep it as a fixed baseline
+// the SIMD tiers are compared against.
 class SQ8_FP16_NoOptimizationSpacesTest : public testing::TestWithParam<size_t> {};
 
 TEST_P(SQ8_FP16_NoOptimizationSpacesTest, SQ8_FP16_L2SqrTest) {
@@ -3188,14 +3189,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
 #endif
 #endif
 
-    // Scalar fallback.
-    unsigned char alignment = 0;
+    // Scalar fallback. Init alignment to a sentinel (0xFF) so the assert below actually verifies
+    // that the dispatcher LEAVES THE VALUE UNTOUCHED on the scalar path — initialising to 0 then
+    // asserting `== 0` would pass even if the dispatcher were a no-op.
+    unsigned char alignment = 0xFF;
     arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_L2Sqr)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim;
+    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
+                                  "(dim "
+                               << dim << ")";
 }
 
 TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
@@ -3267,13 +3272,16 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
 #endif
 #endif
 
-    unsigned char alignment = 0;
+    // Scalar fallback — see L2 test for the 0xFF sentinel rationale.
+    unsigned char alignment = 0xFF;
     arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_InnerProduct)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim;
+    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
+                                  "(dim "
+                               << dim << ")";
 }
 
 TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
@@ -3345,22 +3353,80 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
 #endif
 #endif
 
-    unsigned char alignment = 0;
+    // Scalar fallback — see L2 test for the 0xFF sentinel rationale.
+    unsigned char alignment = 0xFF;
     arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_Cosine)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim;
+    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
+                                  "(dim "
+                               << dim << ")";
 }
 
+// Dim range [16, 32] covers every residual class for the 16-element chunk used by every tier.
 INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest,
                          testing::Range(16UL, 16 * 2UL + 1));
 
+// Higher dimensions surface multi-iteration loop bugs (pointer stride, do-while termination
+// off-by-one) that the [16, 32] range does not exercise because the AVX-512 inner loop runs at
+// most twice in that range.
+INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD_HighDim, SQ8_FP16_SpacesOptimizationTest,
+                         testing::Values(64UL, 128UL, 256UL, 512UL, 1024UL));
+
+// Surfaces which SIMD tiers were actually exercised on the current host. Without this, a CI
+// runner that lacks AVX-512 silently passes with zero tier-1 coverage. Logs per-tier presence
+// to stderr and GTEST_SKIPs only when no SIMD tier is available at all.
+TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) {
+    auto opt = getCpuOptimizationFeatures();
+    bool any_simd = false;
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+#ifdef OPT_AVX512_F_BW_VL_VNNI
+    if (opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni) {
+        std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier exercised\n";
+        any_simd = true;
+    } else {
+        std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier NOT exercised on this host\n";
+    }
+#endif
+#if defined(OPT_AVX2_FMA) && defined(OPT_F16C)
+    if (opt.avx2 && opt.fma3 && opt.f16c) {
+        std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier exercised\n";
+        any_simd = true;
+    } else {
+        std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier NOT exercised on this host\n";
+    }
+#endif
+#if defined(OPT_AVX2) && defined(OPT_F16C)
+    if (opt.avx2 && opt.f16c) {
+        std::cerr << "[SQ8_FP16] AVX2+F16C tier exercised\n";
+        any_simd = true;
+    } else {
+        std::cerr << "[SQ8_FP16] AVX2+F16C tier NOT exercised on this host\n";
+    }
+#endif
+#if defined(OPT_SSE4) && defined(OPT_F16C)
+    if (opt.sse4_1 && opt.f16c && opt.avx) {
+        std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier exercised\n";
+        any_simd = true;
+    } else {
+        std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier NOT exercised on this host\n";
+    }
+#endif
+#endif // x86_64
+
+    if (!any_simd) {
+        GTEST_SKIP() << "No SQ8_FP16 SIMD tier available on this host — scalar fallback only.";
+    }
+}
+
 /* ======================== Tests SQ8_FP16 (edge cases) ========================= */
 
 // Zero FP16 query against a non-zero SQ8 storage. IP must be exactly 1.0 (1 - 0),
-// L2² must equal Σ dequantized².
+// L2² must equal Σ dequantized². Routes through the dispatcher so the runtime-selected
+// SIMD tier (AVX-512 / AVX2+FMA / AVX2 / SSE4 / scalar) is exercised, not just scalar.
 TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) {
     size_t dim = 64;
 
@@ -3375,20 +3441,24 @@ TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) {
     test_utils::populate_float_vec_to_sq8_with_metadata(v_nonzero_quantized.data(), dim, false,
                                                         1234);
 
+    auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr);
+    auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr);
+
     float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_nonzero_quantized.data(),
                                                                        v_zero_query.data(), dim);
-    float ip = SQ8_FP16_InnerProduct(v_nonzero_quantized.data(), v_zero_query.data(), dim);
-    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Zero-query SQ8_FP16_InnerProduct mismatch";
+    float ip = ip_func(v_nonzero_quantized.data(), v_zero_query.data(), dim);
+    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Zero-query SQ8_FP16 IP mismatch";
     ASSERT_NEAR(ip, 1.0f, 0.01f) << "Zero-query IP must equal 1.0 (1 - 0)";
 
     float l2_baseline = test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_nonzero_quantized.data(),
                                                                 v_zero_query.data(), dim);
-    float l2 = SQ8_FP16_L2Sqr(v_nonzero_quantized.data(), v_zero_query.data(), dim);
-    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Zero-query SQ8_FP16_L2Sqr mismatch";
+    float l2 = l2_func(v_nonzero_quantized.data(), v_zero_query.data(), dim);
+    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Zero-query SQ8_FP16 L2 mismatch";
 }
 
 // Constant SQ8 storage (all values identical => delta = 0). Storage quantizer sets delta to 1.0
-// to avoid div-by-zero, so verify the kernels still match the dequantization baseline.
+// to avoid div-by-zero, so verify the kernels still match the dequantization baseline. Routes
+// through the dispatcher so the runtime-selected SIMD tier sees the edge input.
 TEST(SQ8_FP16_EdgeCases, ConstantStorageTest) {
     size_t dim = 64;
 
@@ -3404,19 +3474,23 @@ TEST(SQ8_FP16_EdgeCases, ConstantStorageTest) {
     test_utils::quantize_float_vec_to_sq8_with_metadata(v_const.data(), dim,
                                                         v_const_quantized.data());
 
+    auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr);
+    auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr);
+
     float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_const_quantized.data(),
                                                                        v_query.data(), dim);
-    float ip = SQ8_FP16_InnerProduct(v_const_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Constant-storage SQ8_FP16_InnerProduct mismatch";
+    float ip = ip_func(v_const_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Constant-storage SQ8_FP16 IP mismatch";
 
     float l2_baseline =
         test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_const_quantized.data(), v_query.data(), dim);
-    float l2 = SQ8_FP16_L2Sqr(v_const_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Constant-storage SQ8_FP16_L2Sqr mismatch";
+    float l2 = l2_func(v_const_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Constant-storage SQ8_FP16 L2 mismatch";
 }
 
 // Mixed-sign FP16 query (alternating positive/negative values) verifies sign handling
 // in the FP16->FP32 widening path and in the algebraic identity used by the kernels.
+// Routes through the dispatcher so the runtime-selected SIMD tier sees the edge input.
 TEST(SQ8_FP16_EdgeCases, MixedSignQueryTest) {
     size_t dim = 64;
 
@@ -3436,20 +3510,24 @@ TEST(SQ8_FP16_EdgeCases, MixedSignQueryTest) {
     std::vector<uint8_t> v_quantized(quantized_size);
     test_utils::populate_float_vec_to_sq8_with_metadata(v_quantized.data(), dim, false, 9876);
 
+    auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr);
+    auto cos_func = Cosine_SQ8_FP16_GetDistFunc(dim, nullptr);
+    auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr);
+
     float ip_baseline =
         test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_quantized.data(), v_query.data(), dim);
-    float ip = SQ8_FP16_InnerProduct(v_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Mixed-sign SQ8_FP16_InnerProduct mismatch";
+    float ip = ip_func(v_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Mixed-sign SQ8_FP16 IP mismatch";
 
     float cos_baseline =
         test_utils::SQ8_FP16_NotOptimized_Cosine(v_quantized.data(), v_query.data(), dim);
-    float cos = SQ8_FP16_Cosine(v_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(cos, cos_baseline, 0.01f) << "Mixed-sign SQ8_FP16_Cosine mismatch";
+    float cos = cos_func(v_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(cos, cos_baseline, 0.01f) << "Mixed-sign SQ8_FP16 Cosine mismatch";
 
     float l2_baseline =
         test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_quantized.data(), v_query.data(), dim);
-    float l2 = SQ8_FP16_L2Sqr(v_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Mixed-sign SQ8_FP16_L2Sqr mismatch";
+    float l2 = l2_func(v_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Mixed-sign SQ8_FP16 L2 mismatch";
 }
 
 /* ======================== Tests SQ8_SQ8 ========================= */

From ce16f6be01abe39e59bc27aa41517e536f53e9d2 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 11:24:43 +0300
Subject: [PATCH 12/24] =?UTF-8?q?Add=20multi-accumulator=20ILP=20to=20SQ8?=
 =?UTF-8?q?=E2=86=94FP16=20x86=20kernels=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels
by widening the inner loop to use multiple independent accumulators.
L2 kernels inherit the change through their `…InnerProductImp_…` call.

- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main loop
  (64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks.
- IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the
  existing 2-step unrolled body now routes each step to an independent
  accumulator. The `residual >= 8` half-chunk feeds the second accumulator
  so the prologue also breaks the dependency chain.
- IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps
  per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate
  between sum_a and sum_b for prologue ILP.

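Correctness invariant: the residual block consumes exactly `residual`
lanes (0..15), so the remaining lane count is always a nonzero multiple
of 16 (given dim >= 16). The 8-lane (SSE4) and 16-lane (AVX2) unrolled
loops therefore terminate exactly, and the 64-lane AVX-512 loop leaves
0..3 whole 16-lane chunks for its tail. Verified by 131 SQ8_FP16 unit
tests + 115 under ASan.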
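
The loop shape described above can be sketched in portable scalar C++.
This is a minimal model, not the kernel code: `kChunk` and
`dot_two_accumulators` are hypothetical names, the query is plain
`float` rather than FP16, and the masked residual load is replaced by a
scalar loop. It only illustrates why consuming the residual first lets
a two-step unrolled do-while land exactly on the end pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the SIMD chunk width (16 lanes in the kernels).
constexpr std::size_t kChunk = 16;

// Scalar model of the residual-first, two-accumulator dot product.
float dot_two_accumulators(const uint8_t *a, const float *b, std::size_t dim) {
    assert(dim >= kChunk); // same contract as the SIMD paths
    const uint8_t *end = a + dim;
    float sum_a = 0.0f;
    float sum_b = 0.0f;

    // Residual block: consume dim % kChunk lanes up front.
    const std::size_t residual = dim % kChunk;
    for (std::size_t i = 0; i < residual; ++i) {
        sum_a += static_cast<float>(a[i]) * b[i];
    }
    a += residual;
    b += residual;

    // Main loop: the remaining count is a nonzero multiple of kChunk, so two
    // half-chunk steps per iteration stop exactly at `end`.
    constexpr std::size_t half = kChunk / 2;
    do {
        for (std::size_t i = 0; i < half; ++i) {
            sum_a += static_cast<float>(a[i]) * b[i]; // step 1 -> accumulator A
        }
        a += half;
        b += half;
        for (std::size_t i = 0; i < half; ++i) {
            sum_b += static_cast<float>(a[i]) * b[i]; // step 2 -> accumulator B
        }
        a += half;
        b += half;
    } while (a < end);

    // Fold the independent accumulators once, after the loop.
    return sum_a + sum_b;
}
```

Because the two accumulators are summed in a different order than a
single-accumulator loop, results may differ in the last bits for general
inputs; the unit tests compare against the baseline with a tolerance.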
---
 src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h   | 15 +++++++----
 src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h       | 15 +++++++----
 .../IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h       | 27 ++++++++++++++++---
 src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h       | 19 ++++++++-----
 4 files changed, 56 insertions(+), 20 deletions(-)

diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
index eda8b393e..a4c1612ea 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
@@ -51,7 +51,10 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v,
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;
 
-    __m256 sum256 = _mm256_setzero_ps();
+    // Two independent accumulators break the FMA dependency chain so consecutive iterations
+    // can issue in parallel through both FMA ports.
+    __m256 sum_a = _mm256_setzero_ps();
+    __m256 sum_b = _mm256_setzero_ps();
 
     if constexpr (residual % 8) {
         constexpr int mask = (1 << (residual % 8)) - 1;
@@ -70,18 +73,20 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v,
         v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask);
         pVec2 += residual % 8;
 
-        sum256 = _mm256_mul_ps(v1_f, v2_f);
+        sum_a = _mm256_mul_ps(v1_f, v2_f);
     }
 
     if constexpr (residual >= 8) {
-        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256);
+        // Route the half-residual chunk to the second accumulator for ILP.
+        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_b);
     }
 
     do {
-        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256);
-        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256);
+        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_a);
+        SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_b);
     } while (pVec1 < pEnd1);
 
+    __m256 sum256 = _mm256_add_ps(sum_a, sum_b);
     float quantized_dot = my_mm256_reduce_add_ps(sum256);
 
     const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
index 028d7d3e0..3a01d80f2 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
@@ -52,7 +52,10 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;
 
-    __m256 sum256 = _mm256_setzero_ps();
+    // Two independent accumulators break the mul→add dependency chain on the AVX2 path
+    // without FMA, where each add must wait for the previous add into the same accumulator.
+    __m256 sum_a = _mm256_setzero_ps();
+    __m256 sum_b = _mm256_setzero_ps();
 
     if constexpr (residual % 8) {
         constexpr int mask = (1 << (residual % 8)) - 1;
@@ -70,18 +73,20 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size
         v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask);
         pVec2 += residual % 8;
 
-        sum256 = _mm256_mul_ps(v1_f, v2_f);
+        sum_a = _mm256_mul_ps(v1_f, v2_f);
     }
 
     if constexpr (residual >= 8) {
-        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256);
+        // Route the half-residual chunk to the second accumulator for ILP.
+        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_b);
     }
 
     do {
-        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256);
-        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256);
+        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_a);
+        SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_b);
     } while (pVec1 < pEnd1);
 
+    __m256 sum256 = _mm256_add_ps(sum_a, sum_b);
     float quantized_dot = my_mm256_reduce_add_ps(sum256);
 
     const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
index 07f5d3456..fa0d508b4 100644
--- a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
@@ -45,7 +45,12 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v); // FP16 query
     const uint8_t *pEnd1 = pVec1 + dimension;
 
-    __m512 sum = _mm512_setzero_ps();
+    // Four independent accumulators break the FMA dependency chain so the inner loop can
+    // saturate both FMA ports on Sapphire Rapids / Zen 4.
+    __m512 sum0 = _mm512_setzero_ps();
+    __m512 sum1 = _mm512_setzero_ps();
+    __m512 sum2 = _mm512_setzero_ps();
+    __m512 sum3 = _mm512_setzero_ps();
 
     if constexpr (residual > 0) {
         __mmask16 mask = (1U << residual) - 1;
@@ -60,15 +65,29 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
         __m512 v2_f = _mm512_cvtph_ps(v2_16);
 
         // Mask out unused lanes by folding the mask into the multiply.
-        sum = _mm512_maskz_mul_ps(mask, v1_f, v2_f);
+        sum0 = _mm512_maskz_mul_ps(mask, v1_f, v2_f);
 
         pVec1 += residual;
         pVec2 += residual;
     }
 
-    do {
+    // Main unrolled loop: 4 chunks of 16 lanes per iteration, one chunk per accumulator.
+    // Residual leaves `dim - residual` lanes remaining (a multiple of 16), so the
+    // pointer comparison stays exact.
+    while (pVec1 + 64 <= pEnd1) {
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0);
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1);
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum2);
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum3);
+    }
+
+    // Reduce the four accumulators into one.
+    __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3));
+
+    // Tail: at most three remaining 16-lane chunks.
+    while (pVec1 < pEnd1) {
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum);
-    } while (pVec1 < pEnd1);
+    }
 
     float quantized_dot = _mm512_reduce_add_ps(sum);
 
diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
index e5ca51860..871a189dc 100644
--- a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
@@ -50,7 +50,9 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;
 
-    __m128 sum = _mm_setzero_ps();
+    // Two independent accumulators break the mul→add dependency chain (SSE4 lacks FMA).
+    __m128 sum_a = _mm_setzero_ps();
+    __m128 sum_b = _mm_setzero_ps();
 
     if constexpr (residual % 4) {
         __m128 v1_f;
@@ -75,23 +77,28 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
         pVec1 += residual % 4;
         pVec2 += residual % 4;
 
-        sum = _mm_mul_ps(v1_f, v2_f);
+        sum_a = _mm_mul_ps(v1_f, v2_f);
     }
 
+    // Alternate the residual-ladder steps across the two accumulators for ILP.
     if constexpr (residual >= 4) {
-        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
     }
     if constexpr (residual >= 8) {
-        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_a);
     }
     if constexpr (residual >= 12) {
-        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
     }
 
+    // Remaining lanes after the residual block are a multiple of 16, hence a multiple of 8,
+    // so two 4-lane steps per iteration consume the tail exactly.
     do {
-        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_a);
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
     } while (pVec1 < pEnd1);
 
+    __m128 sum = _mm_add_ps(sum_a, sum_b);
     float PORTABLE_ALIGN16 TmpRes[4];
     _mm_store_ps(TmpRes, sum);
     float quantized_dot = TmpRes[0] + TmpRes[1] + TmpRes[2] + TmpRes[3];

From 658c485b9e3d601fa702ff21b13ee7e3c4eb48cb Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 13:47:29 +0300
Subject: [PATCH 13/24] =?UTF-8?q?Drop=20misleading=20VNNI=20suffix=20from?=
 =?UTF-8?q?=20SQ8=E2=86=94FP16=20AVX-512=20kernel=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction
— the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from
SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16
(`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded
query, which is a different kernel entirely.

The file/function names are renamed to match what the kernel actually
uses (AVX-512F). The dispatcher .cpp/.h files stay named after the
runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still
registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 /
INT8 / UINT8 kernels — the gate is a CPU-feature gate, not an ISA claim.

The same misnomer exists for SQ8↔FP32; tracked separately so the rename
there can ship as its own commit.

Also: fix a pointer-arithmetic UB introduced by the AVX-512 unroll-4
loop. `while (pVec1 + 64 <= pEnd1)` can form a pointer beyond
one-past-the-end of the SQ8 storage object when fewer than 64 lane bytes
remain, which is UB in C++ even if the pointer is never dereferenced.
Switched to pointer subtraction
(`static_cast<size_t>(pEnd1 - pVec1) >= 64`).
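
A minimal standalone sketch of the guard pattern follows.
`count_full_blocks` is a hypothetical helper written for this note, not
code from the kernel. It assumes `cur` and `end` point into (or one past
the end of) the same array with `cur <= end`, which is what makes the
subtraction well-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Counts how many whole `step`-byte blocks fit between `cur` and `end`.
// The guard subtracts two in-range pointers instead of computing
// `cur + step`, so no out-of-range pointer value is ever formed.
std::size_t count_full_blocks(const uint8_t *cur, const uint8_t *end, std::size_t step) {
    assert(step > 0 && cur <= end);
    std::size_t blocks = 0;
    while (static_cast<std::size_t>(end - cur) >= step) {
        cur += step; // safe: at least `step` bytes remain before `end`
        ++blocks;
    }
    return blocks;
}
```

With a 150-byte range and `step = 64` the loop runs twice and leaves 22
bytes for a tail; with 48 bytes it does not run at all, whereas the
`cur + 64 <= end` form would already have computed an out-of-range
pointer in that case.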

Renames:
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h
- L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h
- SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F
- Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F

Verified: 131 SQ8_FP16 unit tests + 115 under ASan.
---
 ..._VNNI_SQ8_FP16.h => IP_AVX512F_SQ8_FP16.h} | 12 +++++++-----
 src/VecSim/spaces/IP_space.cpp                |  4 ++--
 ..._VNNI_SQ8_FP16.h => L2_AVX512F_SQ8_FP16.h} |  5 ++---
 src/VecSim/spaces/L2_space.cpp                |  2 +-
 .../spaces/functions/AVX512F_BW_VL_VNNI.cpp   | 19 ++++++++++---------
 .../spaces/functions/AVX512F_BW_VL_VNNI.h     |  8 +++++---
 .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp  |  6 ++++--
 tests/unit/test_spaces.cpp                    | 12 ++++++------
 8 files changed, 37 insertions(+), 31 deletions(-)
 rename src/VecSim/spaces/IP/{IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h => IP_AVX512F_SQ8_FP16.h} (90%)
 rename src/VecSim/spaces/L2/{L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h => L2_AVX512F_SQ8_FP16.h} (85%)

diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
similarity index 90%
rename from src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
rename to src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
index fa0d508b4..955f431f6 100644
--- a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
@@ -73,8 +73,10 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
 
     // Main unrolled loop: 4 chunks of 16 lanes per iteration, one chunk per accumulator.
     // Residual leaves `dim - residual` lanes remaining (a multiple of 16), so the
-    // pointer comparison stays exact.
-    while (pVec1 + 64 <= pEnd1) {
+    // pointer comparison stays exact. Compare via pointer subtraction (not
+    // `pVec1 + 64 <= pEnd1`) so we never form a pointer past one-past-the-end,
+    // which would be UB in C++ regardless of dereference.
+    while (static_cast<size_t>(pEnd1 - pVec1) >= 64) {
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0);
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1);
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum2);
@@ -108,15 +110,15 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
 }
 
 template <unsigned char residual> // 0..15
-float SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI(const void *pVec1v, const void *pVec2v,
+float SQ8_FP16_InnerProductSIMD16_AVX512F(const void *pVec1v, const void *pVec2v,
                                                      size_t dimension) {
     return 1.0f - SQ8_FP16_InnerProductImp_AVX512<residual>(pVec1v, pVec2v, dimension);
 }
 
 template <unsigned char residual> // 0..15
-float SQ8_FP16_CosineSIMD16_AVX512F_BW_VL_VNNI(const void *pVec1v, const void *pVec2v,
+float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v,
                                                size_t dimension) {
     // Cosine distance = 1 - IP for pre-normalised vectors. Aliases InnerProduct, matching the
     // SQ8_FP32 pattern.
-    return SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI<residual>(pVec1v, pVec2v, dimension);
+    return SQ8_FP16_InnerProductSIMD16_AVX512F<residual>(pVec1v, pVec2v, dimension);
 }
diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index 37ffc9ed4..1fd7381b7 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -191,7 +191,7 @@ dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
     if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
         if (dim % 16 == 0) // SQ8 chunk = 16 bytes
             *alignment = 16 * sizeof(uint8_t);
-        return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim);
+        return Choose_SQ8_FP16_IP_implementation_AVX512F(dim);
     }
 #endif
 #ifdef OPT_AVX2_FMA
@@ -245,7 +245,7 @@ dist_func_t<float> Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm
     if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
         if (dim % 16 == 0)
             *alignment = 16 * sizeof(uint8_t);
-        return Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
+        return Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim);
     }
 #endif
 #ifdef OPT_AVX2_FMA
diff --git a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
similarity index 85%
rename from src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
rename to src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
index 635f30904..384870b21 100644
--- a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
@@ -8,7 +8,7 @@
  */
 #pragma once
 #include "VecSim/spaces/space_includes.h"
-#include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h"
+#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
 #include "VecSim/types/float16.h"
 #include "VecSim/utils/alignment.h"
@@ -18,8 +18,7 @@ using float16 = vecsim_types::float16;
 
 // L2² = x_sum_squares + y_sum_squares - 2 * IP(x, y), computed via the AVX-512 IP impl above.
 template <unsigned char residual> // 0..15
-float SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI(const void *pVect1v, const void *pVect2v,
-                                              size_t dimension) {
+float SQ8_FP16_L2SqrSIMD16_AVX512F(const void *pVect1v, const void *pVect2v, size_t dimension) {
     const float ip = SQ8_FP16_InnerProductImp_AVX512<residual>(pVect1v, pVect2v, dimension);
 
     const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp
index ab5188800..0ada05f76 100644
--- a/src/VecSim/spaces/L2_space.cpp
+++ b/src/VecSim/spaces/L2_space.cpp
@@ -123,7 +123,7 @@ dist_func_t<float> L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
     if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
         if (dim % 16 == 0)
             *alignment = 16 * sizeof(uint8_t);
-        return Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim);
+        return Choose_SQ8_FP16_L2_implementation_AVX512F(dim);
     }
 #endif
 #ifdef OPT_AVX2_FMA
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
index e5e8bb1c2..145300f24 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
@@ -17,8 +17,8 @@
 #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP32.h"
 
-#include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h"
-#include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h"
+#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h"
 
 #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_SQ8.h"
 #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_SQ8.h"
@@ -79,20 +79,21 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d
     return ret_dist_func;
 }
 
-// SQ8-to-FP16 distance functions
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
+// SQ8-to-FP16 distance functions. The kernels themselves only use AVX-512F (cvtph_ps + FMA);
+// they register under the VNNI tier solely for CPU-feature dispatch.
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim) {
     dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI);
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F);
     return ret_dist_func;
 }
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim) {
     dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F_BW_VL_VNNI);
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F);
     return ret_dist_func;
 }
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim) {
     dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI);
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F);
     return ret_dist_func;
 }
 
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
index b68bfd0a4..13dd9e8a8 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
@@ -24,9 +24,11 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(size_t d
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim);
+// SQ8-to-FP16 kernels only use AVX-512F instructions; they are declared here because
+// they register under the VNNI tier for CPU-feature dispatch.
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim);
 
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim);
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
index f81a9d89d..04cb13eea 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
@@ -57,9 +57,11 @@ cpu_features::X86Features opt = cpu_features::GetX86Info().features;
 // AVX-512 F+BW+VL+VNNI (no F16C requirement — _mm512_cvtph_ps is part of AVX512F).
 #ifdef OPT_AVX512_F_BW_VL_VNNI
 bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni;
-INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16,
+// Kernel itself only needs AVX-512F (cvtph_ps + FMA); the VNNI feature check keeps it on the
+// same dispatch tier as the rest of this file.
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
                                 avx512_f_bw_vl_vnni_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16,
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
                                  avx512_f_bw_vl_vnni_supported);
 #endif
 
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 2cccd1183..04618672a 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -577,9 +577,9 @@ TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) {
 #ifdef OPT_AVX512_F_BW_VL_VNNI
         if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
             optimization.avx512vnni) {
-            expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim);
-            expected_ip = Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim);
-            expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
+            expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX512F(dim);
+            expected_ip = Choose_SQ8_FP16_IP_implementation_AVX512F(dim);
+            expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim);
         } else
 #endif
 #if defined(OPT_AVX2_FMA) && defined(OPT_F16C)
@@ -3142,7 +3142,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
         optimization.avx512vnni) {
         unsigned char alignment = 0;
         arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
-        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim))
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F(dim))
             << "Unexpected distance function chosen for dim " << dim;
         ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
             << "AVX512 with dim " << dim;
@@ -3225,7 +3225,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
         optimization.avx512vnni) {
         unsigned char alignment = 0;
         arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
-        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim))
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX512F(dim))
             << "Unexpected distance function chosen for dim " << dim;
         ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
             << "AVX512 with dim " << dim;
@@ -3306,7 +3306,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
         optimization.avx512vnni) {
         unsigned char alignment = 0;
         arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
-        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim))
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim))
             << "Unexpected distance function chosen for dim " << dim;
         ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
             << "AVX512 with dim " << dim;

From fe69f8588eb07fded243674ac4fc470fe50f6dfa Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 13:57:22 +0300
Subject: [PATCH 14/24] =?UTF-8?q?Remove=20SQ8=E2=86=94FP16=20design=20doc?=
 =?UTF-8?q?=20from=20PR=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The design doc was added in ad941b8f for planning; it is not appropriate
as a long-lived in-repo artifact. Keep it externally (Confluence /
scratch) rather than shipping it with the kernel commits.
---
 .../2026-05-26-sq8-fp16-x86-kernels-design.md | 385 ------------------
 1 file changed, 385 deletions(-)
 delete mode 100644 docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md

diff --git a/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md b/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md
deleted file mode 100644
index 1ef7a787a..000000000
--- a/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md
+++ /dev/null
@@ -1,385 +0,0 @@
-# SQ8↔FP16 SIMD distance kernels — Intel x86 (MOD-14954)
-
-## Goal
-
-Add asymmetric SQ8 (storage) ↔ FP16 (query) distance kernels for Inner
-Product, Cosine, and L2² on Intel x86 across four ISA tiers:
-
-- AVX-512 (F + BW + VL + VNNI bundle already used for SQ8_FP32)
-- AVX2 + FMA
-- AVX2 (no FMA)
-- SSE4.1
-
-Each kernel converts FP16 query lanes to FP32 per SIMD chunk; the inner
-multiply-accumulate runs in FP32. SQ8 metadata and FP32 query metadata
-(precomputed sums) stay scalar and are read with the same algebraic
-identity used by the SQ8_FP32 kernels:
-
-```text
-IP(x, y) = min · y_sum + delta · Σ(q_i · y_i)
-L2²(x, y) = x_sum_squares + y_sum_squares − 2 · IP(x, y)
-```
-
-Wire the new kernels into the dispatcher tables so
-`{IP,Cosine,L2}_SQ8_FP16_GetDistFunc` returns the best SIMD path
-available at runtime instead of the scalar fallback delivered by
-MOD-15141.
-
-## Non-goals
-
-- No new metric (only IP / Cosine / L2²).
-- No change to scalar `SQ8_FP16_*` reference; existing tests against
-  `SQ8_FP16_NotOptimized_*` remain the correctness baseline.
-- No ARM kernels (MOD-14972 covers ARM).
-- No SQ8↔FP32 changes; existing kernels untouched.
-
-## Scope and constraints
-
-- FP16 query layout is `[float16 values (dim)] [y_sum (float)]
-  [y_sum_squares (float, L2 only)]`. Trailing metadata is FP32 and may
-  sit at an offset that is not a multiple of 4 when `dim` is odd; use
-  `load_unaligned<float>` to read it (mirrors scalar `SQ8_FP16_Impl`).
-- All four ISA tiers need a way to widen FP16 → FP32. The 512-bit
-  variant (`_mm512_cvtph_ps`) is in AVX512F. The 256-bit and 128-bit
-  variants (`_mm256_cvtph_ps`, `_mm_cvtph_ps`) require the F16C
-  extension. F16C is its own ISA flag; AVX2/SSE4.1 do not imply it.
-- Existing dispatcher source files (`AVX2_FMA.cpp`, `AVX2.cpp`,
-  `SSE4.cpp`) are compiled without `-mf16c`. We add `-mf16c` to those
-  files in CMake (conditional on `CXX_F16C`), guard the new SQ8_FP16
-  symbols behind `#ifdef OPT_F16C`, and add `features.f16c &&` to the
-  dispatch gates for the AVX2/SSE4 tiers. The AVX-512 tier needs no
-  F16C gate.
-- `dim` must be ≥ 16 for the AVX-512/AVX2 SIMD paths and ≥ 16 for SSE4
-  (matches existing SQ8_FP32 contract).
-- SQ8 storage is read as `uint8_t`; alignment hint returned by
-  `*_GetDistFunc` continues to refer to the SQ8 (first) operand. Hints:
-  16 / 8 / 8 / 4 bytes for AVX-512 / AVX2+FMA / AVX2 / SSE4 when
-  `dim % chunk == 0`, else 0.
-
-## File-level design
-
-### New SIMD headers (8 files)
-
-Per ISA tier × {IP, L2}:
-
-```text
-src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
-src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
-src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
-src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
-src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h
-src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
-src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
-src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
-```
-
-Each IP header exposes:
-
-- `template <unsigned char residual> float SQ8_FP16_InnerProductImp_<tier>(const void*, const void*, size_t)` — raw inner product (no `1 -`), used by both InnerProduct/Cosine wrappers and the L2 kernel.
-- `template <unsigned char residual> float SQ8_FP16_InnerProductSIMD16_<tier>(...)` — returns `1.0f - Imp`.
-- `template <unsigned char residual> float SQ8_FP16_CosineSIMD16_<tier>(...)` — aliases InnerProduct (vectors are pre-normalised, mirrors SQ8_FP32 pattern).
-
-Each L2 header `#include`s the matching IP header and exposes:
-
-- `template <unsigned char residual> float SQ8_FP16_L2SqrSIMD16_<tier>(...)` — computes `x_sum_sq + y_sum_sq − 2·Imp(...)`.
-
-`<tier>` strings:
-
-- `AVX512F_BW_VL_VNNI`
-- `AVX2_FMA`
-- `AVX2`
-- `SSE4`
-
-All four headers' inner loops:
-
-1. Load 16 SQ8 bytes (one chunk) and widen to 16×FP32.
-2. Load 16 FP16 query lanes and widen to 16×FP32 (`_mm512_cvtph_ps`,
-   two `_mm256_cvtph_ps` calls, two `_mm256_cvtph_ps` for plain AVX2,
-   or four `_mm_cvtph_ps` for SSE4 — chunk granularity matches the
-   existing SQ8_FP32 layout for that tier).
-3. Fuse-multiply-add (or mul + add for SSE4 and plain AVX2) into the
-   FP32 accumulator(s).
-4. After the loop, horizontal-reduce and apply
-   `min_val · y_sum + delta · quantized_dot`.
-
-L2 kernels additionally read `x_sum_squares` from SQ8 metadata and
-`y_sum_squares` from query metadata, return
-`x_sum_sq + y_sum_sq − 2·ip`. **Both** the SQ8 storage metadata
-(`min_val`, `delta`, `x_sum_squares`) and the FP16 query metadata
-(`y_sum`, `y_sum_squares`) are read with `load_unaligned<float>`. SQ8
-metadata starts at byte offset `dim` after the quantised lanes — for
-odd `dim` that offset is not 4-byte aligned. FP16 query metadata
-starts at byte offset `2*dim` after the FP16 lanes — odd `dim` leaves
-it 2-byte aligned. Mirrors the scalar `SQ8_FP16_InnerProduct_Impl`
-pattern in `src/VecSim/spaces/IP/IP.cpp`.
-
-Residual handling:
-
-- **AVX-512** (residual 0..15): load the full 256-bit FP16 chunk
-  (`_mm256_loadu_si256` over 32 bytes; the chunk is always within the
-  query blob since `dim >= 16` and the FP16 metadata follows), convert with
-  `_mm512_cvtph_ps`, then mask away unused lanes via
-  `_mm512_maskz_mov_ps(mask, v2_f)` (or fold the mask into the
-  FP32 multiply with `_mm512_maskz_mul_ps`). The SQ8 side uses
-  `_mm_loadu_si128` + `_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`
-  and is also masked.
-- **AVX2+FMA / AVX2** (residual 0..15, split into a 0..7 head plus a
-  conditional 8-wide pre-step): for the 0..7 head, load the full
-  128-bit FP16 block (`_mm_loadu_si128`), convert with
-  `_mm256_cvtph_ps`, then zero out unused lanes via
-  `_mm256_blend_ps(_mm256_setzero_ps(), v2_f, residuals_mask)` —
-  mirroring the existing F16C `FP16_InnerProductSIMD32_F16C` blend
-  pattern. The SQ8 side uses `_mm_loadl_epi64` (8 bytes) +
-  `_mm256_cvtepu8_epi32` + `_mm256_cvtepi32_ps`. When residual ≥ 8,
-  one extra full 8-wide step runs before the do-while loop, matching
-  the SQ8_FP32 AVX2[+FMA] residual layout.
-- **SSE4** (residual 0..15, split into 4-wide pre-steps): for the
-  0..3 head, materialise the FP32 lanes via `_mm_set_ps(0, ..., 0,
-  FP16_to_FP32(pVec2[k]), ...)` paired with `_mm_set_ps` on the SQ8
-  side — mirrors the existing SSE4 SQ8_FP32 `_mm_set_ps` residual
-  path. For residual ≥ 4 / ≥ 8 / ≥ 12, run 1 / 2 / 3 extra 4-wide
-  steps before the do-while loop. Each 4-wide step loads 8 bytes of
-  FP16 (`_mm_loadl_epi64`), converts with `_mm_cvtph_ps`, and loads
-  4 SQ8 bytes via `_mm_cvtsi32_si128` + `_mm_cvtepu8_epi32` +
-  `_mm_cvtepi32_ps`.
-
-### Dispatcher edits
-
-Per existing ISA dispatcher (no new dispatcher files):
-
-| File | Add declarations / definitions |
-| --- | --- |
-| `src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI` |
-| `src/VecSim/spaces/functions/AVX2_FMA.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX2_FMA`, guarded by `#ifdef OPT_F16C` |
-| `src/VecSim/spaces/functions/AVX2.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX2`, guarded by `#ifdef OPT_F16C` |
-| `src/VecSim/spaces/functions/SSE4.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_SSE4`, guarded by `#ifdef OPT_F16C` |
-
-Each `Choose_*` uses the existing `CHOOSE_IMPLEMENTATION(out, dim, 16,
-func)` macro (16-element residual table — matches SQ8_FP32 contract).
-
-`src/VecSim/spaces/IP_space.cpp` — extend `IP_SQ8_FP16_GetDistFunc` and
-`Cosine_SQ8_FP16_GetDistFunc`. `L2_space.cpp` — extend
-`L2_SQ8_FP16_GetDistFunc`. New body shape (IP shown; L2/Cosine
-identical):
-
-```cpp
-dist_func_t<float> ret_dist_func = SQ8_FP16_InnerProduct;
-[[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt);
-
-#ifdef CPU_FEATURES_ARCH_X86_64
-if (dim < 16) {
-    return ret_dist_func;
-}
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
-    if (dim % 16 == 0) *alignment = 16 * sizeof(uint8_t);
-    return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim);
-}
-#endif
-#ifdef OPT_AVX2_FMA
-#ifdef OPT_F16C
-if (features.avx2 && features.fma3 && features.f16c) {
-    if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t);
-    return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim);
-}
-#endif
-#endif
-#ifdef OPT_AVX2
-#ifdef OPT_F16C
-if (features.avx2 && features.f16c) {
-    if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t);
-    return Choose_SQ8_FP16_IP_implementation_AVX2(dim);
-}
-#endif
-#endif
-#ifdef OPT_SSE4
-#ifdef OPT_F16C
-// F16C instructions are VEX-encoded — require AVX as well, matching the
-// existing FP16/F16C dispatcher gate in IP_space.cpp.
-if (features.sse4_1 && features.f16c && features.avx) {
-    if (dim % 4 == 0) *alignment = 4 * sizeof(uint8_t);
-    return Choose_SQ8_FP16_IP_implementation_SSE4(dim);
-}
-#endif
-#endif
-#endif // x86_64
-return ret_dist_func;
-```
-
-ARM block (`OPT_SVE2` / `OPT_SVE` / `OPT_NEON`) is left as-is — the
-SQ8_FP16 ARM kernels arrive via MOD-14972.
-
-### CMake change
-
-`src/VecSim/spaces/CMakeLists.txt` — when both `CXX_F16C` and the
-parent ISA flag are present, add `-mf16c` to the dispatcher file:
-
-```cmake
-if(CXX_AVX2 AND CXX_FMA)
-    set(_avx2_fma_flags "-mavx2 -mfma")
-    if(CXX_F16C)
-        set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c")
-    endif()
-    set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "${_avx2_fma_flags}")
-    list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp)
-endif()
-
-if(CXX_AVX2)
-    set(_avx2_flags "-mavx2")
-    if(CXX_F16C)
-        set(_avx2_flags "${_avx2_flags} -mf16c")
-    endif()
-    set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "${_avx2_flags}")
-    list(APPEND OPTIMIZATIONS functions/AVX2.cpp)
-endif()
-
-if(CXX_SSE4)
-    set(_sse4_flags "-msse4.1")
-    if(CXX_F16C)
-        set(_sse4_flags "${_sse4_flags} -mf16c")
-    endif()
-    set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "${_sse4_flags}")
-    list(APPEND OPTIMIZATIONS functions/SSE4.cpp)
-endif()
-```
-
-AVX-512 dispatcher (`AVX512F_BW_VL_VNNI.cpp`) needs no flag change —
-`-mavx512f` already enables `_mm512_cvtph_ps`.
-
-`-mf16c` does not alter the emitted code for the existing SQ8_FP32
-sources, since those sources contain no F16C intrinsics.
-
-### Tests (`tests/unit/test_spaces.cpp`)
-
-1. New parameterised class `SQ8_FP16_SpacesOptimizationTest` mirroring
-   `SQ8_FP32_SpacesOptimizationTest`. Three test bodies for L2 / IP /
-   Cosine, each comparing the chosen optimised function against the
-   scalar `SQ8_FP16_*` baseline (`ASSERT_NEAR ... 0.01`). Walks down
-   AVX512 → AVX2_FMA → AVX2 → SSE4 → scalar by zeroing feature flags
-   between assertions, exactly like `SQ8_FP32_SpacesOptimizationTest`.
-   `INSTANTIATE_TEST_SUITE_P` with `testing::Range(16UL, 16 * 2UL + 1)`.
-
-2. Update existing `SpacesTest.GetDistFunc_*_SQ8_FP16` assertions at
-   lines ~563–575: when running on x86, the dispatcher now returns the
-   SIMD `Choose_*` symbol instead of the scalar. AVX-512 selection
-   depends on `avx512f && avx512bw && avx512vl && avx512vnni` only
-   (no F16C requirement — 512-bit `_mm512_cvtph_ps` is part of
-   AVX512F). AVX2+FMA, AVX2, and SSE4 selection additionally requires
-   `features.f16c` (and `features.avx` for the SSE4 gate). The tests
-   should call `getCpuOptimizationFeatures()` and assert the expected
-   `Choose_*` for the host's highest supported tier (same shape used
-   by `SQ8_FP32_SpacesOptimizationTest`).
-
-3. Reuse existing helpers: `populate_sq8_fp16_query`,
-   `populate_float_vec_to_sq8_with_metadata`,
-   `SQ8_FP16_NotOptimized_{InnerProduct,Cosine,L2Sqr}`.
-
-### Benchmarks (`tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp`)
-
-Add per-ISA benches mirroring `bm_spaces_sq8_fp32.cpp`:
-
-```cpp
-#ifdef CPU_FEATURES_ARCH_X86_64
-cpu_features::X86Features opt = cpu_features::GetX86Info().features;
-
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni;
-INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, avx512_f_bw_vl_vnni_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, avx512_f_bw_vl_vnni_supported);
-#endif
-
-#ifdef OPT_F16C
-#ifdef OPT_AVX2_FMA
-bool avx2_fma3_f16c_supported = opt.avx2 && opt.fma3 && opt.f16c;
-INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, avx2_fma3_f16c_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, avx2_fma3_f16c_supported);
-#endif
-
-#ifdef OPT_AVX2
-bool avx2_f16c_supported = opt.avx2 && opt.f16c;
-INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported);
-#endif
-
-#ifdef OPT_SSE4
-bool sse4_f16c_supported = opt.sse4_1 && opt.f16c && opt.avx;
-INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported);
-#endif
-#endif // OPT_F16C
-#endif // x86_64
-```
-
-Naive bench lines stay (covers the scalar fallback case).
-
-## Validation strategy
-
-1. Unit tests (`SQ8_FP16_SpacesOptimizationTest`) assert numerical
-   parity against the scalar baseline for all dims in `[16, 32]`
-   (covers every residual class for the 16-wide chunk). Existing
-   `SQ8_FP16_NoOpt` parameterised suite continues to exercise small
-   and odd dims for the scalar reference; combined with the new
-   optimisation tests this covers each SIMD residual class plus the
-   scalar fallback.
-2. Existing edge-case tests (`SQ8_FP16_EdgeCases.ZeroQueryTest`,
-   `SQ8_FP16_l2sqr_odd_dim_unaligned_metadata_test`) keep running
-   against the scalar implementation directly — they exercise
-   alignment-sensitive paths that are deliberately scalar-only.
-3. Microbenchmarks compare per-ISA SQ8_FP16 throughput to the matching
-   SQ8_FP32 throughput on the same machine. Acceptance: SQ8_FP16
-   should be within ~1.0–1.5× of SQ8_FP32 (one extra widening per
-   chunk, no extra memory pressure since the FP16 query is half the
-   size of FP32).
-4. CI: x86 jobs already exist; verifies the CMake change keeps
-   building. No new toolchain requirement (binutils 2.34+ already
-   covers F16C, no AVX-512 FP16 dependency).
-
-## Risk register
-
-| Risk | Likelihood | Mitigation |
-| --- | --- | --- |
-| Adding `-mf16c` to AVX2_FMA.cpp / AVX2.cpp / SSE4.cpp accidentally enables F16C codegen elsewhere | Low | Those sources contain only SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 code; no F16C intrinsics — compiler cannot synthesise F16C without an explicit intrinsic. |
-| Older toolchain without F16C support | Low | `CXX_F16C` already detected; `-mf16c` only appended when present. Dispatcher symbols guarded by `#ifdef OPT_F16C`; missing → falls through to scalar. |
-| Backport branches diverge in dispatcher | Medium | Change is additive (new headers, new symbols, new gates). No SQ8_FP32 path touched. CMake change is conditional. Backport just cherry-picks the commit. |
-| Pre-Ivy Bridge SSE4-only CPUs lose a SIMD tier (no F16C) | Negligible | Fall through to scalar SQ8_FP16. Such CPUs are out of practical support anyway. |
-| Numerical drift between FP16→FP32 widening and the scalar `FP16_to_FP32` software path | Low | `vcvtph2ps` follows IEEE 754 half→single conversion exactly; the scalar `FP16_to_FP32` in `float16.h` is bit-faithful for finite values. Tests use `ASSERT_NEAR ... 0.01` slack. |
-
-## Out-of-scope follow-ups
-
-- AVX512FP16-native kernels (would use `__m512h` and `vfmadd*ph`
-  directly on 32 FP16 lanes per 512-bit register, skipping the
-  widen-to-FP32 step). Deferred for four concrete reasons, not just
-  "lower priority":
-    1. **Deployment baseline.** AVX512FP16 is Sapphire Rapids and
-       newer (Intel server 2023+) plus very recent AMD parts. Most
-       production hosts running this library do not have it. The
-       AVX-512F path delivered here is the right default for the
-       widely-deployed AVX-512 tier, and a Sapphire-Rapids-only
-       variant would land underneath the same gating tree, not as a
-       replacement.
-    2. **Numerical fit is awkward for SQ8↔FP16.** The kernel computes
-       `Σ(q_i · y_i)` where `q_i ∈ [0,255]` (uint8) and `y_i` is
-       FP16. Each lane product can be as large as
-       `255 · 65504 ≈ 1.67e7`, which is well above the FP16 finite
-       range (`±65504`). A pure FP16 accumulator would overflow on
-       realistic data; the only safe path is to accumulate in FP32
-       after a per-chunk `vcvtph2ps`-equivalent — which is exactly
-       what the AVX-512F path already does. AVX512FP16 mainly buys
-       FP16-native multiply-add, which we cannot safely use here.
-    3. **Marginal speedup over the AVX-512F path proposed here.**
-       The widening cost is one `_mm512_cvtph_ps` per 16-element
-       chunk against a kernel that is already memory-bandwidth-bound
-       (16 bytes of SQ8 storage + 32 bytes of FP16 query per chunk).
-       Eliminating that one conversion saves a few cycles per chunk
-       on a path that is gated on memory, not arithmetic throughput.
-    4. **Ticket scope.** MOD-14954 enumerates AVX-512, AVX2+FMA, and
-       SSE4; the plain-AVX2 tier was added during brainstorming as
-       free coverage. An AVX512FP16 variant is its own ISA tier with
-       its own gating column in the dispatcher and its own residual
-       table, and warrants a separate design / benchmarking pass
-       once the deployment baseline justifies the maintenance cost.
-  Pure FP16↔FP16 (no SQ8 involved) already has an AVX512FP16_VL path
-  at `src/VecSim/spaces/functions/AVX512FP16_VL.cpp`; that file is the
-  natural home should we revisit this later.
-- ARM SQ8_FP16 (MOD-14972).
-- Reranking flow integration tests under HNSW (separate ticket).

From 2a4ef92bc91a1e17e5cc590908a1c10c5aa12127 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 14:12:19 +0300
Subject: [PATCH 15/24] =?UTF-8?q?Simplify=20SQ8=E2=86=94FP16=20tests=20to?=
 =?UTF-8?q?=20match=20sister=20conventions=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two trims, both restoring pre-existing patterns elsewhere in the file:

1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that
   duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`.
   Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8
   sister tests at lines 540-548 and 551-559.

2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`,
   `MixedSignQueryTest`) were routed through
   `{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD
   dispatch on adversarial inputs. Reverted to direct scalar calls
   (`SQ8_FP16_InnerProduct`, etc.) — the original pre-fdc5c1cd shape.

   Coverage rationale: the SIMD kernels are branchless on input values
   (verified by grep — no value-dependent `if` in any tier). Every code
   path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s
   random inputs at multiple dims. The edge-case tests verify the
   *algebraic identity* (IP distance of a zero query = 1.0, constant
   storage matches the dequant baseline, mixed-sign handling). Scalar
   correctness on these inputs is what was actually being checked, and
   the SIMD path matches scalar via the SpacesOptimizationTest tier walk.
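
The zero-query case can be sketched in scalar C++ as below. This is an
illustration only, not the library code: `sq8_ip_distance` is a
hypothetical helper, the query is plain `float` rather than FP16, and
`min_val` / `delta` are passed as arguments instead of being read from
the storage blob. It applies the identity
`IP = min · y_sum + delta · Σ(q_i · y_i)` and returns `1 - IP`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the scalar reference. Assumptions: float query
// (the real kernels take FP16) and metadata passed in as plain arguments.
float sq8_ip_distance(const std::vector<uint8_t> &q, float min_val, float delta,
                      const std::vector<float> &y) {
    float y_sum = 0.0f;         // sum of the query lanes
    float quantized_dot = 0.0f; // sum of q_i * y_i over the raw SQ8 bytes
    for (std::size_t i = 0; i < q.size(); ++i) {
        y_sum += y[i];
        quantized_dot += static_cast<float>(q[i]) * y[i];
    }
    // Dequantized x_i = min_val + delta * q_i, so the inner product is
    // min_val * y_sum + delta * quantized_dot.
    float ip = min_val * y_sum + delta * quantized_dot;
    return 1.0f - ip; // IP distance
}
```

With an all-zero query both `y_sum` and the dot product are 0, so the
result is 1.0 whatever the stored bytes are, on any code path that
implements the same identity.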

Net: 77 lines deleted and 20 added in the test file (57 fewer overall),
matching sister conventions with no coverage gap.
---
 tests/unit/test_spaces.cpp | 97 ++++++++------------------------------
 1 file changed, 20 insertions(+), 77 deletions(-)

diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 04618672a..a4da4abb8 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -560,61 +560,15 @@ TEST_F(SpacesTest, GetDistFuncSQ8Asymmetric) {
 }
 
 TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) {
-    // SQ8 storage with FP16 query (asymmetric). The dispatcher now returns the highest SIMD
-    // tier available at runtime; assert that and fall back to scalar only if no tier matches.
+    // SQ8 storage with FP16 query (asymmetric) - should return SQ8_FP16 functions.
+    // Per-ISA dispatcher walk coverage lives in the SQ8_FP16 SpacesOptimizationTest below.
     size_t dim = 128;
     auto l2_func = spaces::GetDistFunc<sq8, float, float16>(VecSimMetric_L2, dim, nullptr);
     auto ip_func = spaces::GetDistFunc<sq8, float, float16>(VecSimMetric_IP, dim, nullptr);
     auto cosine_func = spaces::GetDistFunc<sq8, float, float16>(VecSimMetric_Cosine, dim, nullptr);
-
-    auto optimization = getCpuOptimizationFeatures();
-    dist_func_t<float> expected_l2 = SQ8_FP16_L2Sqr;
-    dist_func_t<float> expected_ip = SQ8_FP16_InnerProduct;
-    dist_func_t<float> expected_cos = SQ8_FP16_Cosine;
-
-#ifdef CPU_FEATURES_ARCH_X86_64
-    if (dim >= 16) {
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-        if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
-            optimization.avx512vnni) {
-            expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX512F(dim);
-            expected_ip = Choose_SQ8_FP16_IP_implementation_AVX512F(dim);
-            expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim);
-        } else
-#endif
-#if defined(OPT_AVX2_FMA) && defined(OPT_F16C)
-            if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
-            expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim);
-            expected_ip = Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim);
-            expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim);
-        } else
-#endif
-#if defined(OPT_AVX2) && defined(OPT_F16C)
-            if (optimization.avx2 && optimization.f16c) {
-            expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX2(dim);
-            expected_ip = Choose_SQ8_FP16_IP_implementation_AVX2(dim);
-            expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX2(dim);
-        } else
-#endif
-#if defined(OPT_SSE4) && defined(OPT_F16C)
-            if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
-            expected_l2 = Choose_SQ8_FP16_L2_implementation_SSE4(dim);
-            expected_ip = Choose_SQ8_FP16_IP_implementation_SSE4(dim);
-            expected_cos = Choose_SQ8_FP16_Cosine_implementation_SSE4(dim);
-        } else
-#endif
-        {
-            // Falls through to scalar.
-        }
-    }
-#endif // x86_64
-
     ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr));
-    ASSERT_EQ(l2_func, expected_l2);
-    ASSERT_EQ(ip_func, expected_ip);
-    ASSERT_EQ(cosine_func, expected_cos);
 }
 
 #ifdef CPU_FEATURES_ARCH_X86_64
@@ -3425,8 +3379,9 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) {
 /* ======================== Tests SQ8_FP16 (edge cases) ========================= */
 
 // Zero FP16 query against a non-zero SQ8 storage. IP must be exactly 1.0 (1 - 0),
-// L2² must equal Σ dequantized². Routes through the dispatcher so the runtime-selected
-// SIMD tier (AVX-512 / AVX2+FMA / AVX2 / SSE4 / scalar) is exercised, not just scalar.
+// L2² must equal Σ dequantized². Math correctness on adversarial inputs is verified
+// against the scalar reference; SIMD tier coverage with branchless kernels is provided
+// separately by SQ8_FP16_SpacesOptimizationTest.
 TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) {
     size_t dim = 64;
 
@@ -3441,24 +3396,20 @@ TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) {
     test_utils::populate_float_vec_to_sq8_with_metadata(v_nonzero_quantized.data(), dim, false,
                                                         1234);
 
-    auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr);
-    auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr);
-
     float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_nonzero_quantized.data(),
                                                                        v_zero_query.data(), dim);
-    float ip = ip_func(v_nonzero_quantized.data(), v_zero_query.data(), dim);
-    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Zero-query SQ8_FP16 IP mismatch";
+    float ip = SQ8_FP16_InnerProduct(v_nonzero_quantized.data(), v_zero_query.data(), dim);
+    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Zero-query SQ8_FP16_InnerProduct mismatch";
     ASSERT_NEAR(ip, 1.0f, 0.01f) << "Zero-query IP must equal 1.0 (1 - 0)";
 
     float l2_baseline = test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_nonzero_quantized.data(),
                                                                 v_zero_query.data(), dim);
-    float l2 = l2_func(v_nonzero_quantized.data(), v_zero_query.data(), dim);
-    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Zero-query SQ8_FP16 L2 mismatch";
+    float l2 = SQ8_FP16_L2Sqr(v_nonzero_quantized.data(), v_zero_query.data(), dim);
+    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Zero-query SQ8_FP16_L2Sqr mismatch";
 }
 
 // Constant SQ8 storage (all values identical => delta = 0). Storage quantizer sets delta to 1.0
-// to avoid div-by-zero, so verify the kernels still match the dequantization baseline. Routes
-// through the dispatcher so the runtime-selected SIMD tier sees the edge input.
+// to avoid div-by-zero, so verify the kernels still match the dequantization baseline.
 TEST(SQ8_FP16_EdgeCases, ConstantStorageTest) {
     size_t dim = 64;
 
@@ -3474,23 +3425,19 @@ TEST(SQ8_FP16_EdgeCases, ConstantStorageTest) {
     test_utils::quantize_float_vec_to_sq8_with_metadata(v_const.data(), dim,
                                                         v_const_quantized.data());
 
-    auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr);
-    auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr);
-
     float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_const_quantized.data(),
                                                                        v_query.data(), dim);
-    float ip = ip_func(v_const_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Constant-storage SQ8_FP16 IP mismatch";
+    float ip = SQ8_FP16_InnerProduct(v_const_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Constant-storage SQ8_FP16_InnerProduct mismatch";
 
     float l2_baseline =
         test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_const_quantized.data(), v_query.data(), dim);
-    float l2 = l2_func(v_const_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Constant-storage SQ8_FP16 L2 mismatch";
+    float l2 = SQ8_FP16_L2Sqr(v_const_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Constant-storage SQ8_FP16_L2Sqr mismatch";
 }
 
 // Mixed-sign FP16 query (alternating positive/negative values) verifies sign handling
 // in the FP16->FP32 widening path and in the algebraic identity used by the kernels.
-// Routes through the dispatcher so the runtime-selected SIMD tier sees the edge input.
 TEST(SQ8_FP16_EdgeCases, MixedSignQueryTest) {
     size_t dim = 64;
 
@@ -3510,24 +3457,20 @@ TEST(SQ8_FP16_EdgeCases, MixedSignQueryTest) {
     std::vector<uint8_t> v_quantized(quantized_size);
     test_utils::populate_float_vec_to_sq8_with_metadata(v_quantized.data(), dim, false, 9876);
 
-    auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr);
-    auto cos_func = Cosine_SQ8_FP16_GetDistFunc(dim, nullptr);
-    auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr);
-
     float ip_baseline =
         test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_quantized.data(), v_query.data(), dim);
-    float ip = ip_func(v_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Mixed-sign SQ8_FP16 IP mismatch";
+    float ip = SQ8_FP16_InnerProduct(v_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Mixed-sign SQ8_FP16_InnerProduct mismatch";
 
     float cos_baseline =
         test_utils::SQ8_FP16_NotOptimized_Cosine(v_quantized.data(), v_query.data(), dim);
-    float cos = cos_func(v_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(cos, cos_baseline, 0.01f) << "Mixed-sign SQ8_FP16 Cosine mismatch";
+    float cos = SQ8_FP16_Cosine(v_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(cos, cos_baseline, 0.01f) << "Mixed-sign SQ8_FP16_Cosine mismatch";
 
     float l2_baseline =
         test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_quantized.data(), v_query.data(), dim);
-    float l2 = l2_func(v_quantized.data(), v_query.data(), dim);
-    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Mixed-sign SQ8_FP16 L2 mismatch";
+    float l2 = SQ8_FP16_L2Sqr(v_quantized.data(), v_query.data(), dim);
+    ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Mixed-sign SQ8_FP16_L2Sqr mismatch";
 }
 
 /* ======================== Tests SQ8_SQ8 ========================= */

From 929f694cdfd106eaba30c94cee87d8ef3564b18d Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 14:26:38 +0300
Subject: [PATCH 16/24] =?UTF-8?q?Split=20SQ8=E2=86=94FP16=20F16C=20kernels?=
 =?UTF-8?q?=20into=20sibling=20TUs=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on
F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in
those dispatcher TUs is F16C-clean. The previous arrangement kept both
in the base dispatcher .cpp/.h files, with the F16C-dependent kernels
wrapped in `#ifdef OPT_F16C` blocks.

Split each tier's F16C-dependent kernels off into a sibling TU:

  functions/SSE4.cpp           → SSE4 + SQ8_FP32 (no F16C)
  functions/SSE4_F16C.cpp      → SQ8_FP16 only (requires -mavx -mf16c)

  functions/AVX2.cpp           → AVX2 + BF16 + SQ8_FP32 (no F16C)
  functions/AVX2_F16C.cpp      → SQ8_FP16 only (requires -mf16c)

  functions/AVX2_FMA.cpp       → SQ8_FP32 (no F16C)
  functions/AVX2_FMA_F16C.cpp  → SQ8_FP16 only (requires -mf16c)

The AVX-512 tier is unaffected — its SQ8_FP16 kernel uses
`_mm512_cvtph_ps`, which is part of AVX-512F and not F16C.

CMake now compiles each sibling TU conditionally on
`_has_full_f16c` and applies the F16C flags only there. Base TUs no
longer carry `-mf16c`, since they no longer reference F16C intrinsics.

Result:
- No `#ifdef OPT_F16C` directives in `functions/*.cpp` or `functions/*.h`.
- Compile-time isolation: an F16C intrinsic accidentally added to a base
  TU now fails to build instead of being compiled in unnoticed.
- Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`,
  `bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new
  sibling .h includes are unconditional, since a declaration alone cannot
  cause a link error and the calls remain guarded.
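
A minimal sketch of why the unconditional includes are safe, assuming a
single TU built without the gating macro. All names here
(`Choose_demo_F16C`, `DEMO_OPT_F16C`, `scalar_kernel`, `get_dist_func`)
are hypothetical stand-ins, not symbols from this repository:

```cpp
#include <cstddef>

using dist_func_t = float (*)(const void *, const void *, std::size_t);

// Always-available scalar fallback (stand-in for the scalar SQ8_FP16 kernel).
float scalar_kernel(const void *, const void *, std::size_t) { return 0.0f; }

// Stand-in for a declaration pulled in from a sibling header. It is declared
// unconditionally, but no definition exists unless the sibling TU is built.
dist_func_t Choose_demo_F16C(std::size_t dim);

// Hypothetical dispatcher: the only reference to the sibling symbol sits
// behind the macro, so without DEMO_OPT_F16C the linker never looks for it.
dist_func_t get_dist_func(std::size_t dim) {
#ifdef DEMO_OPT_F16C
    if (dim >= 16) return Choose_demo_F16C(dim);
#endif
    (void)dim; // silences the unused-parameter warning when the macro is off
    return scalar_kernel;
}
```

Built without `DEMO_OPT_F16C`, this links cleanly even though
`Choose_demo_F16C` has no definition, and `get_dist_func` falls through
to the scalar kernel.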

Verified: 131 SQ8_FP16 unit tests, 115 ASan tests, and the full
1166-test test_spaces suite (which also covers the other kernels in
these ISA tiers, SQ8_FP32 / BF16 / INT8 / UINT8, to confirm no
regression from the dispatcher restructure).
---
 src/VecSim/spaces/CMakeLists.txt              | 55 ++++++++++---------
 src/VecSim/spaces/IP_space.cpp                |  3 +
 src/VecSim/spaces/L2_space.cpp                |  3 +
 src/VecSim/spaces/functions/AVX2.cpp          | 22 --------
 src/VecSim/spaces/functions/AVX2.h            |  5 --
 src/VecSim/spaces/functions/AVX2_F16C.cpp     | 35 ++++++++++++
 src/VecSim/spaces/functions/AVX2_F16C.h       | 23 ++++++++
 src/VecSim/spaces/functions/AVX2_FMA.cpp      | 22 --------
 src/VecSim/spaces/functions/AVX2_FMA.h        |  5 --
 src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp | 35 ++++++++++++
 src/VecSim/spaces/functions/AVX2_FMA_F16C.h   | 23 ++++++++
 src/VecSim/spaces/functions/SSE4.cpp          | 22 --------
 src/VecSim/spaces/functions/SSE4.h            |  5 --
 src/VecSim/spaces/functions/SSE4_F16C.cpp     | 35 ++++++++++++
 src/VecSim/spaces/functions/SSE4_F16C.h       | 23 ++++++++
 tests/benchmark/spaces_benchmarks/bm_spaces.h |  3 +
 tests/unit/test_spaces.cpp                    |  3 +
 17 files changed, 215 insertions(+), 107 deletions(-)
 create mode 100644 src/VecSim/spaces/functions/AVX2_F16C.cpp
 create mode 100644 src/VecSim/spaces/functions/AVX2_F16C.h
 create mode 100644 src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp
 create mode 100644 src/VecSim/spaces/functions/AVX2_FMA_F16C.h
 create mode 100644 src/VecSim/spaces/functions/SSE4_F16C.cpp
 create mode 100644 src/VecSim/spaces/functions/SSE4_F16C.h

diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt
index a580916d2..9b7477837 100644
--- a/src/VecSim/spaces/CMakeLists.txt
+++ b/src/VecSim/spaces/CMakeLists.txt
@@ -57,30 +57,33 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 		set(_has_full_f16c TRUE)
 	endif()
 
+	# Base AVX2 / AVX2+FMA dispatcher TUs hold only kernels with no F16C dependency.
+	# SQ8↔FP16 kernels (which require F16C) live in sibling TUs functions/AVX2_F16C.cpp and
+	# functions/AVX2_FMA_F16C.cpp, compiled only when _has_full_f16c is true.
 	if(CXX_AVX2)
-		set(_avx2_flags "-mavx2")
-		if(_has_full_f16c)
-			message("Building functions/AVX2.cpp with AVX2 and F16C")
-			set(_avx2_flags "${_avx2_flags} -mf16c")
-		else()
-			message("Building functions/AVX2.cpp with AVX2 (no F16C)")
-		endif()
-		set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "${_avx2_flags}")
+		message("Building functions/AVX2.cpp with AVX2")
+		set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "-mavx2")
 		list(APPEND OPTIMIZATIONS functions/AVX2.cpp)
 	endif()
 
+	if(CXX_AVX2 AND _has_full_f16c)
+		message("Building functions/AVX2_F16C.cpp with AVX2 and F16C")
+		set_source_files_properties(functions/AVX2_F16C.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mf16c")
+		list(APPEND OPTIMIZATIONS functions/AVX2_F16C.cpp)
+	endif()
+
 	if(CXX_AVX2 AND CXX_FMA)
-		set(_avx2_fma_flags "-mavx2 -mfma")
-		if(_has_full_f16c)
-			message("Building functions/AVX2_FMA.cpp with AVX2, FMA, and F16C")
-			set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c")
-		else()
-			message("Building functions/AVX2_FMA.cpp with AVX2 and FMA (no F16C)")
-		endif()
-		set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "${_avx2_fma_flags}")
+		message("Building functions/AVX2_FMA.cpp with AVX2 and FMA")
+		set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma")
 		list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp)
 	endif()
 
+	if(CXX_AVX2 AND CXX_FMA AND _has_full_f16c)
+		message("Building functions/AVX2_FMA_F16C.cpp with AVX2, FMA, and F16C")
+		set_source_files_properties(functions/AVX2_FMA_F16C.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma -mf16c")
+		list(APPEND OPTIMIZATIONS functions/AVX2_FMA_F16C.cpp)
+	endif()
+
 	if(CXX_F16C AND CXX_FMA AND CXX_AVX)
 		message("Building with CXX_F16C")
 		set_source_files_properties(functions/F16C.cpp PROPERTIES COMPILE_FLAGS "-mf16c -mfma -mavx")
@@ -100,19 +103,19 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 	endif()
 
 	if(CXX_SSE4)
-		set(_sse4_flags "-msse4.1")
-		if(_has_full_f16c)
-			# F16C is VEX-encoded → must compile with -mavx alongside -mf16c, matching the
-			# F16C.cpp recipe above.
-			message("Building functions/SSE4.cpp with SSE4.1, AVX, and F16C")
-			set(_sse4_flags "${_sse4_flags} -mavx -mf16c")
-		else()
-			message("Building functions/SSE4.cpp with SSE4.1 (no F16C)")
-		endif()
-		set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "${_sse4_flags}")
+		message("Building functions/SSE4.cpp with SSE4.1")
+		set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "-msse4.1")
 		list(APPEND OPTIMIZATIONS functions/SSE4.cpp)
 	endif()
 
+	# SSE4 SQ8↔FP16 kernels need F16C, which is VEX-encoded → require -mavx alongside -mf16c
+	# (mirrors the F16C.cpp recipe above).
+	if(CXX_SSE4 AND _has_full_f16c)
+		message("Building functions/SSE4_F16C.cpp with SSE4.1, AVX, and F16C")
+		set_source_files_properties(functions/SSE4_F16C.cpp PROPERTIES COMPILE_FLAGS "-msse4.1 -mavx -mf16c")
+		list(APPEND OPTIMIZATIONS functions/SSE4_F16C.cpp)
+	endif()
+
 	if(CXX_SSE)
 		message("Building with SSE")
 		set_source_files_properties(functions/SSE.cpp PROPERTIES COMPILE_FLAGS -msse)
diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index 1fd7381b7..cdc086683 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -20,9 +20,12 @@
 #include "VecSim/spaces/functions/AVX512BF16_VL.h"
 #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h"
 #include "VecSim/spaces/functions/AVX2.h"
+#include "VecSim/spaces/functions/AVX2_F16C.h"
 #include "VecSim/spaces/functions/AVX2_FMA.h"
+#include "VecSim/spaces/functions/AVX2_FMA_F16C.h"
 #include "VecSim/spaces/functions/SSE3.h"
 #include "VecSim/spaces/functions/SSE4.h"
+#include "VecSim/spaces/functions/SSE4_F16C.h"
 #include "VecSim/spaces/functions/NEON.h"
 #include "VecSim/spaces/functions/NEON_DOTPROD.h"
 #include "VecSim/spaces/functions/NEON_HP.h"
diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp
index 0ada05f76..dcf8b2376 100644
--- a/src/VecSim/spaces/L2_space.cpp
+++ b/src/VecSim/spaces/L2_space.cpp
@@ -19,9 +19,12 @@
 #include "VecSim/spaces/functions/AVX512FP16_VL.h"
 #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h"
 #include "VecSim/spaces/functions/AVX2.h"
+#include "VecSim/spaces/functions/AVX2_F16C.h"
 #include "VecSim/spaces/functions/AVX2_FMA.h"
+#include "VecSim/spaces/functions/AVX2_FMA_F16C.h"
 #include "VecSim/spaces/functions/SSE3.h"
 #include "VecSim/spaces/functions/SSE4.h"
+#include "VecSim/spaces/functions/SSE4_F16C.h"
 #include "VecSim/spaces/functions/NEON.h"
 #include "VecSim/spaces/functions/NEON_DOTPROD.h"
 #include "VecSim/spaces/functions/NEON_HP.h"
diff --git a/src/VecSim/spaces/functions/AVX2.cpp b/src/VecSim/spaces/functions/AVX2.cpp
index 7e229b003..0e6737f30 100644
--- a/src/VecSim/spaces/functions/AVX2.cpp
+++ b/src/VecSim/spaces/functions/AVX2.cpp
@@ -13,10 +13,6 @@
 #include "VecSim/spaces/IP/IP_AVX2_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX2_SQ8_FP32.h"
 
-#ifdef OPT_F16C
-#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h"
-#include "VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h"
-#endif
 
 namespace spaces {
 
@@ -52,24 +48,6 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim) {
     return ret_dist_func;
 }
 
-#ifdef OPT_F16C
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2);
-    return ret_dist_func;
-}
-#endif
-
 #include "implementation_chooser_cleanup.h"
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2.h b/src/VecSim/spaces/functions/AVX2.h
index 45fa2c951..283f6b95e 100644
--- a/src/VecSim/spaces/functions/AVX2.h
+++ b/src/VecSim/spaces/functions/AVX2.h
@@ -19,10 +19,5 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim);
 dist_func_t<float> Choose_BF16_IP_implementation_AVX2(size_t dim);
 dist_func_t<float> Choose_BF16_L2_implementation_AVX2(size_t dim);
 
-#ifdef OPT_F16C
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim);
-#endif
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_F16C.cpp b/src/VecSim/spaces/functions/AVX2_F16C.cpp
new file mode 100644
index 000000000..3d298e81b
--- /dev/null
+++ b/src/VecSim/spaces/functions/AVX2_F16C.cpp
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#include "AVX2_F16C.h"
+#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h"
+
+namespace spaces {
+
+#include "implementation_chooser.h"
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2);
+    return ret_dist_func;
+}
+
+#include "implementation_chooser_cleanup.h"
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_F16C.h b/src/VecSim/spaces/functions/AVX2_F16C.h
new file mode 100644
index 000000000..95a171199
--- /dev/null
+++ b/src/VecSim/spaces/functions/AVX2_F16C.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+
+#include "VecSim/spaces/spaces.h"
+
+// SQ8↔FP16 kernels for the AVX2 (no FMA) tier. Live in a sibling TU compiled only when the
+// toolchain supports F16C (via `-mf16c`), so this header has no preprocessor guard. Callers
+// still gate the calls themselves with `#ifdef OPT_F16C`.
+
+namespace spaces {
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim);
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_FMA.cpp b/src/VecSim/spaces/functions/AVX2_FMA.cpp
index 5745a4ddf..288b8c6cb 100644
--- a/src/VecSim/spaces/functions/AVX2_FMA.cpp
+++ b/src/VecSim/spaces/functions/AVX2_FMA.cpp
@@ -10,10 +10,6 @@
 #include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP32.h"
 #include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP32.h"
 
-#ifdef OPT_F16C
-#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h"
-#include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h"
-#endif
 
 namespace spaces {
 
@@ -36,24 +32,6 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim) {
     return ret_dist_func;
 }
 
-#ifdef OPT_F16C
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2_FMA);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2_FMA);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2_FMA);
-    return ret_dist_func;
-}
-#endif
-
 #include "implementation_chooser_cleanup.h"
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_FMA.h b/src/VecSim/spaces/functions/AVX2_FMA.h
index 413f55081..21a364177 100644
--- a/src/VecSim/spaces/functions/AVX2_FMA.h
+++ b/src/VecSim/spaces/functions/AVX2_FMA.h
@@ -16,10 +16,5 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX2_FMA(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX2_FMA(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim);
 
-#ifdef OPT_F16C
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim);
-#endif
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp b/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp
new file mode 100644
index 000000000..4e9dd8131
--- /dev/null
+++ b/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#include "AVX2_FMA_F16C.h"
+#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h"
+
+namespace spaces {
+
+#include "implementation_chooser.h"
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+
+#include "implementation_chooser_cleanup.h"
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_FMA_F16C.h b/src/VecSim/spaces/functions/AVX2_FMA_F16C.h
new file mode 100644
index 000000000..7943ff4eb
--- /dev/null
+++ b/src/VecSim/spaces/functions/AVX2_FMA_F16C.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+
+#include "VecSim/spaces/spaces.h"
+
+// SQ8↔FP16 kernels for the AVX2+FMA tier. Live in a sibling TU compiled only when the
+// toolchain supports F16C (via `-mf16c`), so this header has no preprocessor guard. Callers
+// still gate the calls themselves with `#ifdef OPT_F16C`.
+
+namespace spaces {
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim);
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4.cpp b/src/VecSim/spaces/functions/SSE4.cpp
index e41762955..1a21d0000 100644
--- a/src/VecSim/spaces/functions/SSE4.cpp
+++ b/src/VecSim/spaces/functions/SSE4.cpp
@@ -10,10 +10,6 @@
 #include "VecSim/spaces/IP/IP_SSE4_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_SSE4_SQ8_FP32.h"
 
-#ifdef OPT_F16C
-#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h"
-#include "VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h"
-#endif
 
 namespace spaces {
 
@@ -37,24 +33,6 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim) {
     return ret_dist_func;
 }
 
-#ifdef OPT_F16C
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_SSE4);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_SSE4);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_SSE4);
-    return ret_dist_func;
-}
-#endif
-
 #include "implementation_chooser_cleanup.h"
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4.h b/src/VecSim/spaces/functions/SSE4.h
index c33187983..b1d49c32a 100644
--- a/src/VecSim/spaces/functions/SSE4.h
+++ b/src/VecSim/spaces/functions/SSE4.h
@@ -16,10 +16,5 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_SSE4(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_SSE4(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim);
 
-#ifdef OPT_F16C
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim);
-#endif
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4_F16C.cpp b/src/VecSim/spaces/functions/SSE4_F16C.cpp
new file mode 100644
index 000000000..91a11885f
--- /dev/null
+++ b/src/VecSim/spaces/functions/SSE4_F16C.cpp
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#include "SSE4_F16C.h"
+#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h"
+
+namespace spaces {
+
+#include "implementation_chooser.h"
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_SSE4);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_SSE4);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_SSE4);
+    return ret_dist_func;
+}
+
+#include "implementation_chooser_cleanup.h"
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4_F16C.h b/src/VecSim/spaces/functions/SSE4_F16C.h
new file mode 100644
index 000000000..2459c216c
--- /dev/null
+++ b/src/VecSim/spaces/functions/SSE4_F16C.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+
+#include "VecSim/spaces/spaces.h"
+
+// SQ8↔FP16 kernels for the SSE4 tier. Live in a sibling TU compiled only when the toolchain
+// supports F16C (via `-mf16c -mavx`), so this header has no preprocessor guard. Callers
+// still gate the calls themselves with `#ifdef OPT_F16C`.
+
+namespace spaces {
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim);
+
+} // namespace spaces
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces.h b/tests/benchmark/spaces_benchmarks/bm_spaces.h
index d99bcc4ca..2303eac0a 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces.h
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces.h
@@ -24,9 +24,12 @@
 #include "VecSim/spaces/functions/AVX512BF16_VL.h"
 #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h"
 #include "VecSim/spaces/functions/AVX2.h"
+#include "VecSim/spaces/functions/AVX2_F16C.h"
 #include "VecSim/spaces/functions/AVX2_FMA.h"
+#include "VecSim/spaces/functions/AVX2_FMA_F16C.h"
 #include "VecSim/spaces/functions/F16C.h"
 #include "VecSim/spaces/functions/SSE4.h"
+#include "VecSim/spaces/functions/SSE4_F16C.h"
 #include "VecSim/spaces/functions/SSE3.h"
 #include "VecSim/spaces/functions/SSE.h"
 #include "VecSim/spaces/functions/NEON.h"
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index a4da4abb8..9d082a315 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -32,9 +32,12 @@
 #include "VecSim/spaces/functions/AVX512FP16_VL.h"
 #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h"
 #include "VecSim/spaces/functions/AVX2.h"
+#include "VecSim/spaces/functions/AVX2_F16C.h"
 #include "VecSim/spaces/functions/AVX2_FMA.h"
+#include "VecSim/spaces/functions/AVX2_FMA_F16C.h"
 #include "VecSim/spaces/functions/SSE3.h"
 #include "VecSim/spaces/functions/SSE4.h"
+#include "VecSim/spaces/functions/SSE4_F16C.h"
 #include "VecSim/spaces/functions/F16C.h"
 #include "VecSim/spaces/functions/NEON.h"
 #include "VecSim/spaces/functions/NEON_DOTPROD.h"

From b689840f946d43f70023b7eb7f3cc0536ca721ea Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 14:38:12 +0300
Subject: [PATCH 17/24] =?UTF-8?q?Move=20SQ8=E2=86=94FP16=20AVX-512=20dispa?=
 =?UTF-8?q?tch=20to=20AVX512F=20tier=20+=20flatten=20F16C=20guards=20[MOD-?=
 =?UTF-8?q?14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two related cleanups in the SQ8↔FP16 dispatch path:

1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions
   (`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) but was registered under
   the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` plus a runtime check of
   avx512f/bw/vl/vnni). As a result, CPUs with AVX-512F but no VNNI
   (e.g. Skylake-X / Skylake-SP) fell through to AVX2_FMA even though
   they can run the AVX-512 kernel.

   Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F`
   definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to
   `functions/AVX512F.cpp`, with matching header reshuffle. Dispatch
   sites now gate on `OPT_AVX512F` + `features.avx512f`.

2. F16C is a cross-cutting requirement of the non-AVX-512 SQ8↔FP16
   tiers (SSE4, AVX2, AVX2+FMA): each of them widens FP16 query lanes
   via `vcvtph2ps`. The per-tier nested `#ifdef OPT_F16C` guards are
   now hoisted into a single outer block around the three ISA branches
   in `IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and
   `L2_SQ8_FP16_GetDistFunc`.

Verified: 131 SQ8_FP16 tests in release, 115 under ASan, and the full
test_spaces suite (1166 tests).
---
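Reviewer note (not part of the commit message): the tier order the
SQ8↔FP16 dispatchers follow after this patch can be sketched as below.
This is a minimal model, not the code in the diff. `CpuFeatures`,
`Tier`, `Choice`, and `pick_sq8_fp16_tier` are hypothetical names
introduced for illustration, and the compile-time `OPT_*` guards are
left out so that only the runtime checks and alignment hints remain.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the runtime CPU-feature flags the dispatcher reads.
struct CpuFeatures {
    bool avx512f = false;
    bool avx512vnni = false; // no longer consulted for SQ8<->FP16
    bool avx2 = false;
    bool fma3 = false;
    bool f16c = false;
    bool avx = false;
    bool sse4_1 = false;
};

enum class Tier { Scalar, SSE4, AVX2, AVX2_FMA, AVX512F };

struct Choice {
    Tier tier;
    unsigned char alignment; // SQ8 operand alignment hint in bytes, 0 = no hint
};

// Alignment hint is set only when dim is a whole number of SQ8 chunks.
inline unsigned char hint(std::size_t dim, unsigned char chunk) {
    return dim % chunk == 0 ? chunk : static_cast<unsigned char>(0);
}

inline Choice pick_sq8_fp16_tier(std::size_t dim, const CpuFeatures &f) {
    if (dim < 16)
        return {Tier::Scalar, 0}; // small dims keep the scalar fallback
    if (f.avx512f) // AVX-512F alone is enough; VNNI/BW/VL are not checked
        return {Tier::AVX512F, hint(dim, 16)};
    if (f.f16c) { // single hoisted F16C requirement for the three tiers below
        if (f.avx2 && f.fma3)
            return {Tier::AVX2_FMA, hint(dim, 8)};
        if (f.avx2)
            return {Tier::AVX2, hint(dim, 8)};
        if (f.sse4_1 && f.avx) // F16C is VEX-encoded, so AVX is required too
            return {Tier::SSE4, hint(dim, 4)};
    }
    return {Tier::Scalar, 0};
}
```

With this model, a feature set that has `avx512f` but not `avx512vnni`
selects `Tier::AVX512F`, whereas the pre-patch check would have skipped
that branch and fallen through to the AVX2+FMA tier.
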
 src/VecSim/spaces/IP_space.cpp                | 27 ++++++++-----------
 src/VecSim/spaces/L2_space.cpp                | 15 +++++------
 src/VecSim/spaces/functions/AVX2.cpp          |  1 -
 src/VecSim/spaces/functions/AVX2.h            |  1 -
 src/VecSim/spaces/functions/AVX512F.cpp       | 21 +++++++++++++++
 src/VecSim/spaces/functions/AVX512F.h         |  5 ++++
 .../spaces/functions/AVX512F_BW_VL_VNNI.cpp   | 21 +--------------
 .../spaces/functions/AVX512F_BW_VL_VNNI.h     |  6 +----
 src/VecSim/spaces/functions/SSE4.h            |  1 -
 .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp  | 14 ++++------
 tests/unit/test_spaces.cpp                    | 15 +++++------
 11 files changed, 57 insertions(+), 70 deletions(-)

diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index cdc086683..b57971b60 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -190,33 +190,32 @@ dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
         return ret_dist_func;
     }
     // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-    if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
+    // AVX-512 tier only needs AVX-512F (cvtph_ps is part of AVX-512F, no VNNI/BW/VL required).
+#ifdef OPT_AVX512F
+    if (features.avx512f) {
         if (dim % 16 == 0) // SQ8 chunk = 16 bytes
             *alignment = 16 * sizeof(uint8_t);
         return Choose_SQ8_FP16_IP_implementation_AVX512F(dim);
     }
 #endif
-#ifdef OPT_AVX2_FMA
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three.
 #ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
     if (features.avx2 && features.fma3 && features.f16c) {
         if (dim % 8 == 0) // SQ8 chunk = 8 bytes
             *alignment = 8 * sizeof(uint8_t);
         return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim);
     }
 #endif
-#endif
 #ifdef OPT_AVX2
-#ifdef OPT_F16C
     if (features.avx2 && features.f16c) {
         if (dim % 8 == 0)
             *alignment = 8 * sizeof(uint8_t);
         return Choose_SQ8_FP16_IP_implementation_AVX2(dim);
     }
 #endif
-#endif
 #ifdef OPT_SSE4
-#ifdef OPT_F16C
     // F16C is VEX-encoded — require AVX as well, matching the existing F16C/FP16 dispatcher.
     if (features.sse4_1 && features.f16c && features.avx) {
         if (dim % 4 == 0)
@@ -224,7 +223,7 @@ dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
         return Choose_SQ8_FP16_IP_implementation_SSE4(dim);
     }
 #endif
-#endif
+#endif // OPT_F16C
 #endif // x86_64
     return ret_dist_func;
 }
@@ -244,40 +243,36 @@ dist_func_t<float> Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm
     if (dim < 16) {
         return ret_dist_func;
     }
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-    if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
+#ifdef OPT_AVX512F
+    if (features.avx512f) {
         if (dim % 16 == 0)
             *alignment = 16 * sizeof(uint8_t);
         return Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim);
     }
 #endif
-#ifdef OPT_AVX2_FMA
 #ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
     if (features.avx2 && features.fma3 && features.f16c) {
         if (dim % 8 == 0)
             *alignment = 8 * sizeof(uint8_t);
         return Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim);
     }
 #endif
-#endif
 #ifdef OPT_AVX2
-#ifdef OPT_F16C
     if (features.avx2 && features.f16c) {
         if (dim % 8 == 0)
             *alignment = 8 * sizeof(uint8_t);
         return Choose_SQ8_FP16_Cosine_implementation_AVX2(dim);
     }
 #endif
-#endif
 #ifdef OPT_SSE4
-#ifdef OPT_F16C
     if (features.sse4_1 && features.f16c && features.avx) {
         if (dim % 4 == 0)
             *alignment = 4 * sizeof(uint8_t);
         return Choose_SQ8_FP16_Cosine_implementation_SSE4(dim);
     }
 #endif
-#endif
+#endif // OPT_F16C
 #endif // x86_64
     return ret_dist_func;
 }
diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp
index dcf8b2376..43020399f 100644
--- a/src/VecSim/spaces/L2_space.cpp
+++ b/src/VecSim/spaces/L2_space.cpp
@@ -122,40 +122,39 @@ dist_func_t<float> L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
         return ret_dist_func;
     }
     // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-    if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) {
+    // AVX-512 tier only needs AVX-512F (cvtph_ps is part of AVX-512F, no VNNI/BW/VL required).
+#ifdef OPT_AVX512F
+    if (features.avx512f) {
         if (dim % 16 == 0)
             *alignment = 16 * sizeof(uint8_t);
         return Choose_SQ8_FP16_L2_implementation_AVX512F(dim);
     }
 #endif
-#ifdef OPT_AVX2_FMA
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three.
 #ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
     if (features.avx2 && features.fma3 && features.f16c) {
         if (dim % 8 == 0)
             *alignment = 8 * sizeof(uint8_t);
         return Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim);
     }
 #endif
-#endif
 #ifdef OPT_AVX2
-#ifdef OPT_F16C
     if (features.avx2 && features.f16c) {
         if (dim % 8 == 0)
             *alignment = 8 * sizeof(uint8_t);
         return Choose_SQ8_FP16_L2_implementation_AVX2(dim);
     }
 #endif
-#endif
 #ifdef OPT_SSE4
-#ifdef OPT_F16C
     if (features.sse4_1 && features.f16c && features.avx) {
         if (dim % 4 == 0)
             *alignment = 4 * sizeof(uint8_t);
         return Choose_SQ8_FP16_L2_implementation_SSE4(dim);
     }
 #endif
-#endif
+#endif // OPT_F16C
 #endif // x86_64
     return ret_dist_func;
 }
diff --git a/src/VecSim/spaces/functions/AVX2.cpp b/src/VecSim/spaces/functions/AVX2.cpp
index 0e6737f30..322ed0aec 100644
--- a/src/VecSim/spaces/functions/AVX2.cpp
+++ b/src/VecSim/spaces/functions/AVX2.cpp
@@ -13,7 +13,6 @@
 #include "VecSim/spaces/IP/IP_AVX2_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX2_SQ8_FP32.h"
 
-
 namespace spaces {
 
 #include "implementation_chooser.h"
diff --git a/src/VecSim/spaces/functions/AVX2.h b/src/VecSim/spaces/functions/AVX2.h
index 283f6b95e..081c42a4e 100644
--- a/src/VecSim/spaces/functions/AVX2.h
+++ b/src/VecSim/spaces/functions/AVX2.h
@@ -19,5 +19,4 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim);
 dist_func_t<float> Choose_BF16_IP_implementation_AVX2(size_t dim);
 dist_func_t<float> Choose_BF16_L2_implementation_AVX2(size_t dim);
 
-
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX512F.cpp b/src/VecSim/spaces/functions/AVX512F.cpp
index e765f4c8b..feb261fb4 100644
--- a/src/VecSim/spaces/functions/AVX512F.cpp
+++ b/src/VecSim/spaces/functions/AVX512F.cpp
@@ -11,10 +11,12 @@
 #include "VecSim/spaces/L2/L2_AVX512F_FP16.h"
 #include "VecSim/spaces/L2/L2_AVX512F_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX512F_FP64.h"
+#include "VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h"
 
 #include "VecSim/spaces/IP/IP_AVX512F_FP16.h"
 #include "VecSim/spaces/IP/IP_AVX512F_FP32.h"
 #include "VecSim/spaces/IP/IP_AVX512F_FP64.h"
+#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
 
 namespace spaces {
 
@@ -56,6 +58,25 @@ dist_func_t<float> Choose_FP16_L2_implementation_AVX512F(size_t dim) {
     return ret_dist_func;
 }
 
+// SQ8↔FP16 kernels only use AVX-512F (cvtph_ps + FMA), so they register here rather than under
+// the VNNI tier. CPUs with AVX-512F but no VNNI (e.g. Skylake-X / Skylake-SP) can then use
+// these kernels.
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F);
+    return ret_dist_func;
+}
+
 #include "implementation_chooser_cleanup.h"
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX512F.h b/src/VecSim/spaces/functions/AVX512F.h
index fd36f312f..8d600f961 100644
--- a/src/VecSim/spaces/functions/AVX512F.h
+++ b/src/VecSim/spaces/functions/AVX512F.h
@@ -20,4 +20,9 @@ dist_func_t<float> Choose_FP16_L2_implementation_AVX512F(size_t dim);
 dist_func_t<float> Choose_FP32_L2_implementation_AVX512F(size_t dim);
 dist_func_t<double> Choose_FP64_L2_implementation_AVX512F(size_t dim);
 
+// SQ8↔FP16 kernels — only need AVX-512F, not VNNI/BW/VL.
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim);
+
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
index 145300f24..090b192bf 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
@@ -17,9 +17,6 @@
 #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP32.h"
 
-#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
-#include "VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h"
-
 #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_SQ8.h"
 #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_SQ8.h"
 
@@ -79,23 +76,7 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d
     return ret_dist_func;
 }
 
-// SQ8-to-FP16 distance functions. The kernels themselves only use AVX-512F (cvtph_ps + FMA);
-// they register under the VNNI tier solely for CPU-feature dispatch.
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F);
-    return ret_dist_func;
-}
+// SQ8-to-FP16 dispatch lives in functions/AVX512F.cpp — the kernel only needs AVX-512F.
 
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
index 13dd9e8a8..13cf06264 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
@@ -24,11 +24,7 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(size_t d
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 
-// SQ8-to-FP16 kernels only use AVX-512F instructions; they are declared here because
-// they register under the VNNI tier for CPU-feature dispatch.
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim);
+// SQ8-to-FP16 dispatch declared in AVX512F.h — kernel only needs AVX-512F.
 
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim);
diff --git a/src/VecSim/spaces/functions/SSE4.h b/src/VecSim/spaces/functions/SSE4.h
index b1d49c32a..e47948137 100644
--- a/src/VecSim/spaces/functions/SSE4.h
+++ b/src/VecSim/spaces/functions/SSE4.h
@@ -16,5 +16,4 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_SSE4(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_SSE4(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim);
 
-
 } // namespace spaces
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
index 04cb13eea..d6e49a180 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
@@ -54,15 +54,11 @@ class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture {
 #ifdef CPU_FEATURES_ARCH_X86_64
 cpu_features::X86Features opt = cpu_features::GetX86Info().features;
 
-// AVX-512 F+BW+VL+VNNI (no F16C requirement — _mm512_cvtph_ps is part of AVX512F).
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni;
-// Kernel itself only needs AVX-512F (cvtph_ps + FMA); the VNNI feature check keeps it on the
-// same dispatch tier as the rest of this file.
-INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
-                                avx512_f_bw_vl_vnni_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
-                                 avx512_f_bw_vl_vnni_supported);
+// AVX-512F is sufficient — _mm512_cvtph_ps is part of AVX-512F, no F16C/VNNI/BW/VL needed.
+#ifdef OPT_AVX512F
+bool avx512f_supported = opt.avx512f;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported);
 #endif
 
 #ifdef OPT_F16C
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 9d082a315..b880b6f13 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3094,9 +3094,8 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
     dist_func_t<float> arch_opt_func;
     float baseline = SQ8_FP16_L2Sqr(v2_compressed.data(), v1_query.data(), dim);
 
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-    if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
-        optimization.avx512vnni) {
+#ifdef OPT_AVX512F
+    if (optimization.avx512f) {
         unsigned char alignment = 0;
         arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
         ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F(dim))
@@ -3177,9 +3176,8 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
     dist_func_t<float> arch_opt_func;
     float baseline = SQ8_FP16_InnerProduct(v2_compressed.data(), v1_query.data(), dim);
 
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-    if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
-        optimization.avx512vnni) {
+#ifdef OPT_AVX512F
+    if (optimization.avx512f) {
         unsigned char alignment = 0;
         arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
         ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX512F(dim))
@@ -3258,9 +3256,8 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
     dist_func_t<float> arch_opt_func;
     float baseline = SQ8_FP16_Cosine(v2_compressed.data(), v1_query.data(), dim);
 
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-    if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl &&
-        optimization.avx512vnni) {
+#ifdef OPT_AVX512F
+    if (optimization.avx512f) {
         unsigned char alignment = 0;
         arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
         ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim))

From 839fe3c669f4ed9371b1e0663a0e6da6a73e6321 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 14:47:32 +0300
Subject: [PATCH 18/24] Clean up whitespace and formatting inconsistencies

Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix
indentation in AVX512F SQ8_FP16 function signatures, and reformat
benchmark macro invocation to fit line length conventions.
---
 src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h               | 5 ++---
 src/VecSim/spaces/functions/AVX2_FMA.cpp                 | 1 -
 src/VecSim/spaces/functions/AVX2_FMA.h                   | 1 -
 src/VecSim/spaces/functions/SSE4.cpp                     | 1 -
 tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 3 ++-
 5 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
index 955f431f6..7ba9c0412 100644
--- a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
@@ -111,13 +111,12 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
 
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductSIMD16_AVX512F(const void *pVec1v, const void *pVec2v,
-                                                     size_t dimension) {
+                                          size_t dimension) {
     return 1.0f - SQ8_FP16_InnerProductImp_AVX512<residual>(pVec1v, pVec2v, dimension);
 }
 
 template <unsigned char residual> // 0..15
-float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v,
-                                               size_t dimension) {
+float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, size_t dimension) {
     // Cosine distance = 1 - IP for pre-normalised vectors. Aliases InnerProduct, matching the
     // SQ8_FP32 pattern.
     return SQ8_FP16_InnerProductSIMD16_AVX512F<residual>(pVec1v, pVec2v, dimension);
diff --git a/src/VecSim/spaces/functions/AVX2_FMA.cpp b/src/VecSim/spaces/functions/AVX2_FMA.cpp
index 288b8c6cb..c859128b2 100644
--- a/src/VecSim/spaces/functions/AVX2_FMA.cpp
+++ b/src/VecSim/spaces/functions/AVX2_FMA.cpp
@@ -10,7 +10,6 @@
 #include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP32.h"
 #include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP32.h"
 
-
 namespace spaces {
 
 #include "implementation_chooser.h"
diff --git a/src/VecSim/spaces/functions/AVX2_FMA.h b/src/VecSim/spaces/functions/AVX2_FMA.h
index 21a364177..b20b1a588 100644
--- a/src/VecSim/spaces/functions/AVX2_FMA.h
+++ b/src/VecSim/spaces/functions/AVX2_FMA.h
@@ -16,5 +16,4 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX2_FMA(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX2_FMA(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim);
 
-
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4.cpp b/src/VecSim/spaces/functions/SSE4.cpp
index 1a21d0000..5f5bbc1ba 100644
--- a/src/VecSim/spaces/functions/SSE4.cpp
+++ b/src/VecSim/spaces/functions/SSE4.cpp
@@ -10,7 +10,6 @@
 #include "VecSim/spaces/IP/IP_SSE4_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_SSE4_SQ8_FP32.h"
 
-
 namespace spaces {
 
 #include "implementation_chooser.h"
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
index d6e49a180..ba3030064 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
@@ -58,7 +58,8 @@ cpu_features::X86Features opt = cpu_features::GetX86Info().features;
 #ifdef OPT_AVX512F
 bool avx512f_supported = opt.avx512f;
 INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
+                                 avx512f_supported);
 #endif
 
 #ifdef OPT_F16C

From 3565985ef9849e51f567707dbdb71936dccce984 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Thu, 28 May 2026 15:41:43 +0300
Subject: [PATCH 19/24] Remove obsolete SQ8-to-FP16 dispatch comments

The comments referencing the SQ8-to-FP16 dispatch location are no longer
accurate after the recent refactoring that moved the dispatch logic.
Remove these stale comments from the AVX512F_BW_VL_VNNI files.
---
 src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp | 2 --
 src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h   | 2 --
 2 files changed, 4 deletions(-)

diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
index 090b192bf..712bdda4e 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
@@ -76,8 +76,6 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d
     return ret_dist_func;
 }
 
-// SQ8-to-FP16 dispatch lives in functions/AVX512F.cpp — the kernel only needs AVX-512F.
-
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
     dist_func_t<float> ret_dist_func;
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
index 13cf06264..fe1583491 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
@@ -24,8 +24,6 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(size_t d
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 
-// SQ8-to-FP16 dispatch declared in AVX512F.h — kernel only needs AVX-512F.
-
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_SQ8_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);

From 771bb39e5754c715e3cc5c2fba767c7bd183581f Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Sun, 31 May 2026 11:58:54 +0300
Subject: [PATCH 20/24] =?UTF-8?q?Hoist=20OPT=5FF16C=20guard=20around=20low?=
 =?UTF-8?q?er=20SIMD=20tiers=20in=20SQ8=E2=86=94FP16=20tests=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mirrors the dispatcher layout in IP_space.cpp / L2_space.cpp where a single
OPT_F16C guard wraps the AVX2+FMA, AVX2, and SSE4 branches. Each test body
(L2/IP/Cosine) and the TierCoverage report now use the same single-guard
shape. Also retargets the TierCoverage AVX-512 check from
OPT_AVX512_F_BW_VL_VNNI to OPT_AVX512F, matching the dispatcher's new
AVX-512F-only gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
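Note (illustrative only, not part of the change): the single-guard shape
described in the message above can be sketched as a standalone tier picker.
This is a minimal sketch under stated assumptions: the Features struct,
pick_tier() and walk_tiers() are hypothetical names, and the OPT_* macros
are defined locally here, whereas the real build derives them from CMake.

```cpp
#include <cassert>
#include <string>

// Assumption: all tiers are compiled in. The real build sets these from CMake.
#define OPT_AVX512F
#define OPT_F16C
#define OPT_AVX2_FMA
#define OPT_AVX2
#define OPT_SSE4

// Hypothetical stand-in for the runtime CPU feature flags.
struct Features {
    bool avx512f, avx2, fma3, f16c, sse4_1, avx;
};

// Returns the first tier whose build guard and runtime flags both hold.
inline std::string pick_tier(const Features &f) {
#ifdef OPT_AVX512F
    if (f.avx512f) return "AVX512F";
#endif
#ifdef OPT_F16C // one guard around every tier that needs vcvtph2ps
#ifdef OPT_AVX2_FMA
    if (f.avx2 && f.fma3 && f.f16c) return "AVX2_FMA";
#endif
#ifdef OPT_AVX2
    if (f.avx2 && f.f16c) return "AVX2";
#endif
#ifdef OPT_SSE4
    if (f.sse4_1 && f.f16c && f.avx) return "SSE4";
#endif
#endif // OPT_F16C
    return "scalar";
}

// Mirrors how the unit tests walk the tiers: check the best tier, clear the
// flag that selected it, and expect the next tier down.
inline void walk_tiers() {
    Features f{true, true, true, true, true, true};
    assert(pick_tier(f) == "AVX512F");
    f.avx512f = false;
    assert(pick_tier(f) == "AVX2_FMA");
    f.fma3 = false;
    assert(pick_tier(f) == "AVX2");
    f.avx2 = false;
    assert(pick_tier(f) == "SSE4");
    f.sse4_1 = false;
    assert(pick_tier(f) == "scalar");
}
```

With the guard hoisted, a build without OPT_F16C compiles out all three
lower tiers at once, so the picker falls straight from AVX-512F to scalar.
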
 tests/unit/test_spaces.cpp | 47 ++++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index b880b6f13..71573ed03 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3105,8 +3105,10 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
         optimization.avx512f = 0;
     }
 #endif
-#ifdef OPT_AVX2_FMA
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three — matches the dispatcher layout in L2_space.cpp.
 #ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
     if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
         unsigned char alignment = 0;
         arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3117,9 +3119,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
         optimization.fma3 = 0;
     }
 #endif
-#endif
 #ifdef OPT_AVX2
-#ifdef OPT_F16C
     if (optimization.avx2 && optimization.f16c) {
         unsigned char alignment = 0;
         arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3130,9 +3130,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
         optimization.avx2 = 0;
     }
 #endif
-#endif
 #ifdef OPT_SSE4
-#ifdef OPT_F16C
     if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
         unsigned char alignment = 0;
         arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3143,7 +3141,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
         optimization.sse4_1 = 0;
     }
 #endif
-#endif
+#endif // OPT_F16C
 
     // Scalar fallback. Init alignment to a sentinel (0xFF) so the assert below actually verifies
     // that the dispatcher LEAVES THE VALUE UNTOUCHED on the scalar path — initialising to 0 then
@@ -3187,8 +3185,10 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
         optimization.avx512f = 0;
     }
 #endif
-#ifdef OPT_AVX2_FMA
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three — matches the dispatcher layout in IP_space.cpp.
 #ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
     if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
         unsigned char alignment = 0;
         arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3199,9 +3199,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
         optimization.fma3 = 0;
     }
 #endif
-#endif
 #ifdef OPT_AVX2
-#ifdef OPT_F16C
     if (optimization.avx2 && optimization.f16c) {
         unsigned char alignment = 0;
         arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3212,9 +3210,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
         optimization.avx2 = 0;
     }
 #endif
-#endif
 #ifdef OPT_SSE4
-#ifdef OPT_F16C
     if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
         unsigned char alignment = 0;
         arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3225,7 +3221,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
         optimization.sse4_1 = 0;
     }
 #endif
-#endif
+#endif // OPT_F16C
 
     // Scalar fallback — see L2 test for the 0xFF sentinel rationale.
     unsigned char alignment = 0xFF;
@@ -3267,8 +3263,10 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
         optimization.avx512f = 0;
     }
 #endif
-#ifdef OPT_AVX2_FMA
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three — matches the dispatcher layout in IP_space.cpp.
 #ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
     if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
         unsigned char alignment = 0;
         arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3279,9 +3277,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
         optimization.fma3 = 0;
     }
 #endif
-#endif
 #ifdef OPT_AVX2
-#ifdef OPT_F16C
     if (optimization.avx2 && optimization.f16c) {
         unsigned char alignment = 0;
         arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3292,9 +3288,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
         optimization.avx2 = 0;
     }
 #endif
-#endif
 #ifdef OPT_SSE4
-#ifdef OPT_F16C
     if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
         unsigned char alignment = 0;
         arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
@@ -3305,7 +3299,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
         optimization.sse4_1 = 0;
     }
 #endif
-#endif
+#endif // OPT_F16C
 
     // Scalar fallback — see L2 test for the 0xFF sentinel rationale.
     unsigned char alignment = 0xFF;
@@ -3337,15 +3331,17 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) {
     bool any_simd = false;
 
 #ifdef CPU_FEATURES_ARCH_X86_64
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-    if (opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni) {
-        std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier exercised\n";
+#ifdef OPT_AVX512F
+    if (opt.avx512f) {
+        std::cerr << "[SQ8_FP16] AVX-512F tier exercised\n";
         any_simd = true;
     } else {
-        std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier NOT exercised on this host\n";
+        std::cerr << "[SQ8_FP16] AVX-512F tier NOT exercised on this host\n";
     }
 #endif
-#if defined(OPT_AVX2_FMA) && defined(OPT_F16C)
+    // F16C guards all non-AVX-512 SQ8↔FP16 tiers — matches the dispatcher layout.
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
     if (opt.avx2 && opt.fma3 && opt.f16c) {
         std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier exercised\n";
         any_simd = true;
@@ -3353,7 +3349,7 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) {
         std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier NOT exercised on this host\n";
     }
 #endif
-#if defined(OPT_AVX2) && defined(OPT_F16C)
+#ifdef OPT_AVX2
     if (opt.avx2 && opt.f16c) {
         std::cerr << "[SQ8_FP16] AVX2+F16C tier exercised\n";
         any_simd = true;
@@ -3361,7 +3357,7 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) {
         std::cerr << "[SQ8_FP16] AVX2+F16C tier NOT exercised on this host\n";
     }
 #endif
-#if defined(OPT_SSE4) && defined(OPT_F16C)
+#ifdef OPT_SSE4
     if (opt.sse4_1 && opt.f16c && opt.avx) {
         std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier exercised\n";
         any_simd = true;
@@ -3369,6 +3365,7 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) {
         std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier NOT exercised on this host\n";
     }
 #endif
+#endif // OPT_F16C
 #endif // x86_64
 
     if (!any_simd) {

From 8fe3d7431fd0c3f54bd000461be4623048b2aa79 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Sun, 31 May 2026 12:24:56 +0300
Subject: [PATCH 21/24] =?UTF-8?q?Drop=20non-idiomatic=20SQ8=E2=86=94FP16?=
 =?UTF-8?q?=20tier-coverage=20reporter=20test=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised was an outlier — no
other data type has a std::cerr-based coverage reporter. Per-tier
coverage is already provided by SQ8_FP16_SpacesOptimizationTest (which
walks AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar by clearing feature
flags), and ISA-lane presence is handled by the CI matrix, matching
the convention used by every other type's SpacesOptimizationTest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/unit/test_spaces.cpp | 50 --------------------------------------
 1 file changed, 50 deletions(-)

diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 71573ed03..ae54e931e 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3323,56 +3323,6 @@ INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest,
 INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD_HighDim, SQ8_FP16_SpacesOptimizationTest,
                          testing::Values(64UL, 128UL, 256UL, 512UL, 1024UL));
 
-// Surfaces which SIMD tiers were actually exercised on the current host. Without this, a CI
-// runner that lacks AVX-512 silently passes with zero tier-1 coverage. Logs per-tier presence
-// to stderr and GTEST_SKIPs only when no SIMD tier is available at all.
-TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) {
-    auto opt = getCpuOptimizationFeatures();
-    bool any_simd = false;
-
-#ifdef CPU_FEATURES_ARCH_X86_64
-#ifdef OPT_AVX512F
-    if (opt.avx512f) {
-        std::cerr << "[SQ8_FP16] AVX-512F tier exercised\n";
-        any_simd = true;
-    } else {
-        std::cerr << "[SQ8_FP16] AVX-512F tier NOT exercised on this host\n";
-    }
-#endif
-    // F16C guards all non-AVX-512 SQ8↔FP16 tiers — matches the dispatcher layout.
-#ifdef OPT_F16C
-#ifdef OPT_AVX2_FMA
-    if (opt.avx2 && opt.fma3 && opt.f16c) {
-        std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier exercised\n";
-        any_simd = true;
-    } else {
-        std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier NOT exercised on this host\n";
-    }
-#endif
-#ifdef OPT_AVX2
-    if (opt.avx2 && opt.f16c) {
-        std::cerr << "[SQ8_FP16] AVX2+F16C tier exercised\n";
-        any_simd = true;
-    } else {
-        std::cerr << "[SQ8_FP16] AVX2+F16C tier NOT exercised on this host\n";
-    }
-#endif
-#ifdef OPT_SSE4
-    if (opt.sse4_1 && opt.f16c && opt.avx) {
-        std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier exercised\n";
-        any_simd = true;
-    } else {
-        std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier NOT exercised on this host\n";
-    }
-#endif
-#endif // OPT_F16C
-#endif // x86_64
-
-    if (!any_simd) {
-        GTEST_SKIP() << "No SQ8_FP16 SIMD tier available on this host — scalar fallback only.";
-    }
-}
-
 /* ======================== Tests SQ8_FP16 (edge cases) ========================= */
 
 // Zero FP16 query against a non-zero SQ8 storage. IP must be exactly 1.0 (1 - 0),

From 999580f1c780b3d518201d332e4125474357a918 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Sun, 31 May 2026 14:26:19 +0300
Subject: [PATCH 22/24] =?UTF-8?q?Simplify=20SQ8=E2=86=94FP16=20kernels=20a?=
 =?UTF-8?q?nd=20trim=20PR=20churn=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- AVX512F IP: keep the <=3 tail chunks on distinct accumulators
  (sum0/sum1/sum2) instead of serializing into one, preserving ILP when
  the main 64-lane loop runs few or zero times.
- Condense kernel header comments; drop redundant float16.h/alignment.h
  includes (pulled in transitively) and the direct <immintrin.h> include
  (provided via space_includes.h, matching the other AVX512F headers).
- test_spaces: align the SQ8_FP16 scalar-fallback alignment assertion with
  the convention used by the other SpacesOptimizationTest suites.
- Revert unrelated CMake message/quote churn on the base AVX2/SSE4 TUs and
  the stray blank line in AVX512F_BW_VL_VNNI.cpp, leaving only the additive
  F16C build blocks in this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
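Note (illustrative only, not part of the change): the first bullet above,
keeping up to three tail chunks on distinct accumulators, can be modelled
in portable scalar C++. This is a rough sketch under stated assumptions:
step16() and quantized_dot() are hypothetical names, the query is plain
float rather than FP16, dim is assumed to be a multiple of 16, and the
order of tail versus main loop is a guess rather than the kernel's layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of one 16-lane step: adds sum(q_i * y_i) for one
// chunk into the given accumulator and advances both pointers.
inline void step16(const uint8_t *&q, const float *&y, float &sum) {
    for (int i = 0; i < 16; ++i) {
        sum += static_cast<float>(q[i]) * y[i];
    }
    q += 16;
    y += 16;
}

// Quantized dot product sum(q_i * y_i) over dim lanes.
inline float quantized_dot(const uint8_t *q, const float *y, size_t dim) {
    assert(dim % 16 == 0); // simplification: no sub-chunk residual in this model
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    const size_t chunks = dim / 16;
    const size_t tail = chunks % 4; // 0..3 chunks not covered by the 64-lane loop

    // Each tail chunk gets its own accumulator, so the adds do not form one
    // serial dependency chain when the main loop runs few or zero times.
    if (tail >= 1) step16(q, y, sum0);
    if (tail >= 2) step16(q, y, sum1);
    if (tail >= 3) step16(q, y, sum2);

    // Main loop: 64 lanes per iteration, spread over four accumulators.
    for (size_t i = 0; i < chunks / 4; ++i) {
        step16(q, y, sum0);
        step16(q, y, sum1);
        step16(q, y, sum2);
        step16(q, y, sum3);
    }
    return (sum0 + sum1) + (sum2 + sum3);
}
```

The result would then feed the identity quoted in the kernel headers,
IP(x, y) = min * y_sum + delta * quantized_dot, with the distance being
1 - IP.
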
 src/VecSim/spaces/CMakeLists.txt              | 10 ++--
 src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h   | 19 ++----
 src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h       | 21 ++-----
 src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h    | 58 ++++++++-----------
 src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h       | 18 ++----
 src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h   |  2 -
 src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h       |  2 -
 src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h    |  2 -
 src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h       |  2 -
 .../spaces/functions/AVX512F_BW_VL_VNNI.cpp   |  1 -
 tests/unit/test_spaces.cpp                    | 23 ++------
 11 files changed, 49 insertions(+), 109 deletions(-)

diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt
index 9b7477837..309d3f3a4 100644
--- a/src/VecSim/spaces/CMakeLists.txt
+++ b/src/VecSim/spaces/CMakeLists.txt
@@ -61,8 +61,8 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 	# SQ8↔FP16 kernels (which require F16C) live in sibling TUs functions/AVX2_F16C.cpp and
 	# functions/AVX2_FMA_F16C.cpp, compiled only when _has_full_f16c is true.
 	if(CXX_AVX2)
-		message("Building functions/AVX2.cpp with AVX2")
-		set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "-mavx2")
+		message("Building with AVX2")
+		set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS -mavx2)
 		list(APPEND OPTIMIZATIONS functions/AVX2.cpp)
 	endif()
 
@@ -73,7 +73,7 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 	endif()
 
 	if(CXX_AVX2 AND CXX_FMA)
-		message("Building functions/AVX2_FMA.cpp with AVX2 and FMA")
+		message("Building with AVX2 and FMA")
 		set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma")
 		list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp)
 	endif()
@@ -103,8 +103,8 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
 	endif()
 
 	if(CXX_SSE4)
-		message("Building functions/SSE4.cpp with SSE4.1")
-		set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "-msse4.1")
+		message("Building with SSE4")
+		set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS -msse4.1)
 		list(APPEND OPTIMIZATIONS functions/SSE4.cpp)
 	endif()
 
diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
index a4c1612ea..3800f1e8a 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
@@ -17,11 +17,8 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
 /*
- * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
- *   IP(x, y) = Σ(x_i * y_i)
- *            ≈ Σ((min + delta * q_i) * y_i)
- *            = min * Σy_i + delta * Σ(q_i * y_i)
- *            = min * y_sum + delta * quantized_dot_product
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
  *
  * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C);
  * inner-loop arithmetic runs in FP32 with _mm256_fmadd_ps.
@@ -42,32 +39,25 @@ static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1,
     sum256 = _mm256_fmadd_ps(v1_f, v2_f, sum256);
 }
 
-// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
-// The residual block reads 8 SQ8 bytes and 16 FP16 bytes unconditionally; shorter blobs would
-// under-read.
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) {
     const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;
 
-    // Two independent accumulators break the FMA dependency chain so consecutive iterations
-    // can issue in parallel through both FMA ports.
+    // Two accumulators break the FMA dependency chain across consecutive iterations.
     __m256 sum_a = _mm256_setzero_ps();
     __m256 sum_b = _mm256_setzero_ps();
 
     if constexpr (residual % 8) {
         constexpr int mask = (1 << (residual % 8)) - 1;
 
-        // Single-side mask is sufficient: SQ8 lanes beyond `residual` may hold garbage, but the
-        // FP16 query blend below forces those FP32 query lanes to 0, so garbage·0=0 contributes
-        // nothing to the dot product. SQ8 load is intentionally unmasked.
         __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1));
         pVec1 += residual % 8;
         __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
         __m256 v1_f = _mm256_cvtepi32_ps(v1_256);
 
-        // FP16 side: load full 16-byte block (safe — dim >= 16 and metadata follows).
         __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec2));
         __m256 v2_f = _mm256_cvtph_ps(v2_128);
         v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask);
@@ -77,7 +67,6 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v,
     }
 
     if constexpr (residual >= 8) {
-        // Route the half-residual chunk to the second accumulator for ILP.
         SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_b);
     }
 
diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
index 3a01d80f2..acec6102c 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
@@ -17,15 +17,11 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
 /*
- * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
- *   IP(x, y) = Σ(x_i * y_i)
- *            ≈ Σ((min + delta * q_i) * y_i)
- *            = min * Σy_i + delta * Σ(q_i * y_i)
- *            = min * y_sum + delta * quantized_dot_product
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
  *
  * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C);
- * inner-loop arithmetic runs in FP32 with separate _mm256_mul_ps + _mm256_add_ps
- * (no FMA tier — Haswell-era AVX2 without FMA support).
+ * inner-loop arithmetic runs in FP32 with separate _mm256_mul_ps + _mm256_add_ps (no FMA).
  */
 
 // 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum.
@@ -43,26 +39,20 @@ static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const
     sum256 = _mm256_add_ps(sum256, _mm256_mul_ps(v1_f, v2_f));
 }
 
-// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
-// The residual block reads 8 SQ8 bytes and 16 FP16 bytes unconditionally; shorter blobs would
-// under-read.
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) {
     const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;
 
-    // Two independent accumulators break the mul→add dependency chain on Haswell-class CPUs
-    // without FMA, where the add cannot retire before the prior mul.
+    // Two accumulators break the mul->add dependency chain (no FMA on this tier).
     __m256 sum_a = _mm256_setzero_ps();
     __m256 sum_b = _mm256_setzero_ps();
 
     if constexpr (residual % 8) {
         constexpr int mask = (1 << (residual % 8)) - 1;
 
-        // Single-side mask is sufficient: SQ8 lanes beyond `residual` may hold garbage, but the
-        // FP16 query blend below forces those FP32 query lanes to 0, so garbage·0=0 contributes
-        // nothing to the dot product. SQ8 load is intentionally unmasked.
         __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1));
         pVec1 += residual % 8;
         __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
@@ -77,7 +67,6 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size
     }
 
     if constexpr (residual >= 8) {
-        // Route the half-residual chunk to the second accumulator for ILP.
         SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_b);
     }
 
diff --git a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
index 7ba9c0412..60d0ba719 100644
--- a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
@@ -11,20 +11,25 @@
 #include "VecSim/types/sq8.h"
 #include "VecSim/types/float16.h"
 #include "VecSim/utils/alignment.h"
-#include <immintrin.h>
 
 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
-// Helper: load 16 SQ8 + 16 FP16 lanes, widen both to FP32, fused-multiply-add into sum.
+/*
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
+ *
+ * FP16 query lanes are widened to FP32 per 16-lane chunk via _mm512_cvtph_ps (AVX512F);
+ * inner-loop arithmetic runs in FP32 with _mm512_fmadd_ps.
+ */
+
+// 16-wide AVX512F step: 16 SQ8 lanes + 16 FP16 lanes -> 16 FP32 fused-multiply-add.
 static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const float16 *&pVec2,
                                                     __m512 &sum) {
-    // 16 uint8 -> 16 fp32
     __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1));
     __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128);
     __m512 v1_f = _mm512_cvtepi32_ps(v1_512);
 
-    // 16 fp16 -> 16 fp32. _mm512_cvtph_ps is part of AVX512F.
     __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2));
     __m512 v2_f = _mm512_cvtph_ps(v2_16);
 
@@ -34,19 +39,14 @@ static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const
     pVec2 += 16;
 }
 
-// Raw inner product Σ((min + delta * q_i) * y_i). Used by both InnerProduct/Cosine wrappers
-// and by the L2 kernel.
-// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp, which gates
-// this. The residual block reads 16 SQ8 bytes and 32 FP16 bytes unconditionally; shorter blobs
-// would under-read.
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, size_t dimension) {
-    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v); // SQ8 storage
-    const float16 *pVec2 = static_cast<const float16 *>(pVec2v); // FP16 query
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;
 
-    // Four independent accumulators break the FMA dependency chain so the inner loop can
-    // saturate both FMA ports on Sapphire Rapids / Zen 4.
+    // Four accumulators break the FMA dependency chain to saturate both FMA ports.
     __m512 sum0 = _mm512_setzero_ps();
     __m512 sum1 = _mm512_setzero_ps();
     __m512 sum2 = _mm512_setzero_ps();
@@ -59,23 +59,16 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
         __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128);
         __m512 v1_f = _mm512_cvtepi32_ps(v1_512);
 
-        // Safe to read the full 32-byte FP16 chunk: dim >= 16 and the FP16 metadata follows
-        // the lanes, so the load stays within the query blob.
         __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2));
         __m512 v2_f = _mm512_cvtph_ps(v2_16);
 
-        // Mask out unused lanes by folding the mask into the multiply.
         sum0 = _mm512_maskz_mul_ps(mask, v1_f, v2_f);
 
         pVec1 += residual;
         pVec2 += residual;
     }
 
-    // Main unrolled loop: 4 chunks of 16 lanes per iteration, one chunk per accumulator.
-    // Residual leaves `dim - residual` lanes remaining (a multiple of 16), so the
-    // pointer comparison stays exact. Compare via pointer subtraction (not
-    // `pVec1 + 64 <= pEnd1`) so we never form a pointer past one-past-the-end,
-    // which would be UB in C++ regardless of dereference.
+    // Main loop: 4 chunks of 16 lanes per iteration, one chunk per accumulator.
     while (static_cast<size_t>(pEnd1 - pVec1) >= 64) {
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0);
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1);
@@ -83,25 +76,24 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum3);
     }
 
-    // Reduce the four accumulators into one.
-    __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3));
-
-    // Tail: at most three remaining 16-lane chunks.
-    while (pVec1 < pEnd1) {
-        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum);
-    }
+    // Tail: at most three remaining 16-lane chunks (post-residual remainder is a multiple of 16).
+    // Keep chunks on distinct accumulators to preserve ILP when the main loop did not run.
+    const size_t remaining = pEnd1 - pVec1;
+    if (remaining >= 16)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0);
+    if (remaining >= 32)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1);
+    if (remaining >= 48)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum2);
 
+    __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3));
     float quantized_dot = _mm512_reduce_add_ps(sum);
 
-    // SQ8 metadata starts at byte offset `dimension`; for odd `dimension` it is not
-    // 4-byte aligned, so use load_unaligned. Mirrors the scalar SQ8_FP16_Impl pattern.
     const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
     const uint8_t *params_bytes = pVec1Base + dimension;
     const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float));
     const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float));
 
-    // FP16 query metadata sits at byte offset 2*dimension; for odd `dimension` it is
-    // 2-byte aligned only.
     const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
     const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
     const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
@@ -117,7 +109,5 @@ float SQ8_FP16_InnerProductSIMD16_AVX512F(const void *pVec1v, const void *pVec2v
 
 template <unsigned char residual> // 0..15
 float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, size_t dimension) {
-    // Cosine distance = 1 - IP for pre-normalised vectors. Aliases InnerProduct, matching the
-    // SQ8_FP32 pattern.
     return SQ8_FP16_InnerProductSIMD16_AVX512F<residual>(pVec1v, pVec2v, dimension);
 }
diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
index 871a189dc..1cc3cb153 100644
--- a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
@@ -16,20 +16,16 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
 
 /*
- * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
- *   IP(x, y) = Σ(x_i * y_i)
- *            ≈ Σ((min + delta * q_i) * y_i)
- *            = min * Σy_i + delta * Σ(q_i * y_i)
- *            = min * y_sum + delta * quantized_dot_product
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
  *
  * FP16 query lanes are widened to FP32 per 4-lane chunk via _mm_cvtph_ps (F16C);
- * inner-loop arithmetic runs in FP32 with separate _mm_mul_ps + _mm_add_ps (SSE4 has no FMA).
+ * inner-loop arithmetic runs in FP32 with separate _mm_mul_ps + _mm_add_ps (no FMA).
  */
 
 // 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum.
 static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const float16 *&pVect2,
                                                   __m128 &sum) {
-    // Alignment-safe 4-byte load of SQ8 lanes via load_unaligned<int32_t> (no strict-aliasing UB).
     __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(load_unaligned<int32_t>(pVect1)));
     pVect1 += 4;
     __m128 v1_f = _mm_cvtepi32_ps(v1_i);
@@ -41,8 +37,7 @@ static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const
     sum = _mm_add_ps(sum, _mm_mul_ps(v1_f, v2_f));
 }
 
-// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
-// Shorter blobs would underflow the residual ladder + final do-while loop.
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2v,
                                            size_t dimension) {
@@ -50,7 +45,7 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;
 
-    // Two independent accumulators break the mul→add dependency chain (SSE4 lacks FMA).
+    // Two accumulators break the mul->add dependency chain (no FMA on this tier).
     __m128 sum_a = _mm_setzero_ps();
     __m128 sum_b = _mm_setzero_ps();
 
@@ -80,7 +75,6 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
         sum_a = _mm_mul_ps(v1_f, v2_f);
     }
 
-    // Alternate the residual-ladder steps across the two accumulators for ILP.
     if constexpr (residual >= 4) {
         SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
     }
@@ -91,8 +85,6 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
         SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
     }
 
-    // Remaining lanes after the residual block are a multiple of 16, hence a multiple of 8,
-    // so two 4-lane steps per iteration consume the tail exactly.
     do {
         SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_a);
         SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
diff --git a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
index 38809e9c2..c855b62ca 100644
--- a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
@@ -11,8 +11,6 @@
 #include "VecSim/spaces/AVX_utils.h"
 #include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
-#include "VecSim/types/float16.h"
-#include "VecSim/utils/alignment.h"
 
 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
diff --git a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
index 98bb29c05..7c2cbfcd8 100644
--- a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
@@ -11,8 +11,6 @@
 #include "VecSim/spaces/AVX_utils.h"
 #include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
-#include "VecSim/types/float16.h"
-#include "VecSim/utils/alignment.h"
 
 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
diff --git a/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
index 384870b21..9d7b1569f 100644
--- a/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
@@ -10,8 +10,6 @@
 #include "VecSim/spaces/space_includes.h"
 #include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
-#include "VecSim/types/float16.h"
-#include "VecSim/utils/alignment.h"
 
 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
diff --git a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
index 75bbd46f8..d0a0fea06 100644
--- a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
@@ -10,8 +10,6 @@
 #include "VecSim/spaces/space_includes.h"
 #include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
-#include "VecSim/types/float16.h"
-#include "VecSim/utils/alignment.h"
 
 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
index 712bdda4e..3b8813b89 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
@@ -75,7 +75,6 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d
     CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP32_L2SqrSIMD16_AVX512F_BW_VL_VNNI);
     return ret_dist_func;
 }
-
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
     dist_func_t<float> ret_dist_func;
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index ae54e931e..5a3b94556 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3143,18 +3143,13 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
 #endif
 #endif // OPT_F16C
 
-    // Scalar fallback. Init alignment to a sentinel (0xFF) so the assert below actually verifies
-    // that the dispatcher LEAVES THE VALUE UNTOUCHED on the scalar path — initialising to 0 then
-    // asserting `== 0` would pass even if the dispatcher were a no-op.
-    unsigned char alignment = 0xFF;
+    unsigned char alignment = 0;
     arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_L2Sqr)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
-                                  "(dim "
-                               << dim << ")";
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
 }
 
 TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
@@ -3223,16 +3218,13 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
 #endif
 #endif // OPT_F16C
 
-    // Scalar fallback — see L2 test for the 0xFF sentinel rationale.
-    unsigned char alignment = 0xFF;
+    unsigned char alignment = 0;
     arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_InnerProduct)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
-                                  "(dim "
-                               << dim << ")";
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
 }
 
 TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
@@ -3301,16 +3293,13 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
 #endif
 #endif // OPT_F16C
 
-    // Scalar fallback — see L2 test for the 0xFF sentinel rationale.
-    unsigned char alignment = 0xFF;
+    unsigned char alignment = 0;
     arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_Cosine)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
-                                  "(dim "
-                               << dim << ")";
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
 }
 
 // Dim range [16, 32] covers every residual class for the 16-element chunk used by every tier.

From 91c14e5afb50009ad9bde0027f202b84b847fc5e Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Sun, 31 May 2026 14:37:11 +0300
Subject: [PATCH 23/24] Document why OPT_F16C differs from the other OPT_*
 macros [MOD-14954]

Explain at the definition site that OPT_F16C is a cross-cutting capability
gate (not a 1:1 dispatch tier), why it is a compound CXX_F16C/FMA/AVX guard
(F16C is VEX-encoded and needs AVX state), and why the AVX-512 SQ8<->FP16
path stays outside it (_mm512_cvtph_ps is part of AVX512F).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 cmake/x86_64InstructionFlags.cmake | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
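
Reviewer note (not part of the commit): the capability-gate idea described in the comment below can be pictured with a minimal C++ sketch. Everything here is hypothetical and for illustration only: `CpuFeatures` and `pick_sq8_fp16_tier` are invented names, the tier strings are labels rather than real symbols, and the real dispatchers combine compile-time `OPT_*` guards with runtime feature checks that this sketch does not reproduce. It only shows the shape of the rule: F16C wraps the AVX2_FMA / AVX2 / SSE4 choices, while the AVX-512 choice sits outside it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the runtime CPU feature flags a dispatcher might consult.
struct CpuFeatures {
    bool avx512f = false;
    bool avx2 = false;
    bool fma = false;
    bool sse4_1 = false;
    bool f16c = false;
};

// Illustrative only: returns a label for the tier an SQ8<->FP16 dispatcher could pick.
std::string pick_sq8_fp16_tier(const CpuFeatures &cpu, std::size_t dim) {
    // Dimensions below 16 always take the scalar path.
    if (dim < 16) return "scalar";

    // AVX-512 converts FP16 with an AVX512F instruction, so it is not gated on F16C.
    if (cpu.avx512f) return "AVX512F";

    // F16C is a capability shared by the remaining SIMD tiers, not a tier of its own.
    if (cpu.f16c) {
        if (cpu.avx2 && cpu.fma) return "AVX2_FMA";
        if (cpu.avx2) return "AVX2";
        if (cpu.sse4_1) return "SSE4";
    }

    // Without F16C none of the narrower SIMD tiers can widen the FP16 query.
    return "scalar";
}
```

Under these assumptions, a CPU reporting AVX2 and FMA but not F16C falls through to the scalar label, which is the behaviour the compound guard is meant to guarantee at build time.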

diff --git a/cmake/x86_64InstructionFlags.cmake b/cmake/x86_64InstructionFlags.cmake
index f19ef7662..ff0e43e97 100644
--- a/cmake/x86_64InstructionFlags.cmake
+++ b/cmake/x86_64InstructionFlags.cmake
@@ -73,6 +73,24 @@ if(CXX_AVX512F AND CXX_AVX512BW AND CXX_AVX512VL AND CXX_AVX512VNNI)
 	add_compile_definitions(OPT_AVX512_F_BW_VL_VNNI)
 endif()
 
+# OPT_F16C is unusual compared to the other OPT_* macros above:
+#
+#  1. It is a *capability* gate, not a dispatch tier. Every other OPT_* maps 1:1 to a single
+#     ISA tier that owns its own translation unit (OPT_AVX2 -> AVX2.cpp, OPT_SSE4 -> SSE4.cpp).
+#     F16C owns no tier of its own; it only enables the vcvtph2ps (FP16->FP32) conversion that
+#     several tiers need. So it is hoisted *around* multiple tiers (AVX2_FMA / AVX2 / SSE4 for
+#     the SQ8<->FP16 kernels) rather than selecting one.
+#
+#  2. It is a compound guard (CXX_F16C AND CXX_FMA AND CXX_AVX), not a single flag. F16C is
+#     VEX-encoded, so vcvtph2ps requires AVX state to execute -- emitting it without AVX is
+#     invalid. Defining OPT_F16C therefore implies AVX is present, and the F16C kernels must be
+#     compiled with -mf16c added *on top of* -mavx (see functions/*_F16C.cpp in
+#     src/VecSim/spaces/CMakeLists.txt). The base AVX2.cpp / SSE4.cpp objects stay F16C-free so
+#     they still run on CPUs without F16C.
+#
+#  3. The AVX-512 tier deliberately does NOT use this gate: _mm512_cvtph_ps is part of AVX512F
+#     itself, so the AVX-512 SQ8<->FP16 path needs only OPT_AVX512F and lives outside any
+#     OPT_F16C guard.
 if(CXX_F16C AND CXX_FMA AND CXX_AVX)
 	add_compile_definitions(OPT_F16C)
 endif()

From f5926c230b57e028523688d2f49f36db80204378 Mon Sep 17 00:00:00 2001
From: Dor Forer <dor.forer@redis.com>
Date: Sun, 31 May 2026 17:12:04 +0300
Subject: [PATCH 24/24] Cover AVX512 three-chunk tail and dim<16 dispatcher
 guard in SQ8_FP16 tests [MOD-14954]

Codecov flagged 4 uncovered lines on PR #970:
- The AVX512F `remaining >= 48` third tail step in IP_AVX512F_SQ8_FP16.h was
  never executed: the test dims never satisfied (dim / 16) % 4 == 3. Add 48
  (zero main-loop iterations) and 112 (one main-loop iteration) to exercise it.
- The `dim < 16` scalar early-return in the IP/Cosine/L2 SQ8_FP16 dispatchers
  was never taken. Assert the three dispatchers return the scalar funcs at dim 8.

Test-only change. Local release + ASan: SQ8_FP16 137/137, ASan clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/unit/test_spaces.cpp | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
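
Reviewer note (not part of the commit): the choice of 48 and 112 can be checked with a small model of the AVX-512 kernel's control flow. The sketch below is an assumption-laden illustration: `ChunkPlan` and `plan_avx512_chunks` are hypothetical names that do not exist in the code base, and the model only counts lanes (masked residual of `dim % 16`, 64-lane main loop, 16-lane tail steps); it performs no SIMD work.

```cpp
#include <cstddef>

// Hypothetical helper: describes how the AVX-512 kernel would split `dim` lanes.
struct ChunkPlan {
    std::size_t residual;        // lanes consumed by the masked residual block (dim % 16)
    std::size_t main_iterations; // iterations of the 4 x 16-lane main loop
    std::size_t tail_chunks;     // 16-lane tail steps left after the main loop (0..3)
};

// Models the kernel's loop structure only. Assumes dim >= 16, as the dispatcher enforces.
ChunkPlan plan_avx512_chunks(std::size_t dim) {
    ChunkPlan plan{};
    plan.residual = dim % 16;

    // After the residual block, the remainder is a multiple of 16.
    std::size_t remaining = dim - plan.residual;

    // The main loop runs while at least 64 lanes remain.
    plan.main_iterations = remaining / 64;
    remaining %= 64;

    // At most three 16-lane chunks are left for the tail.
    plan.tail_chunks = remaining / 16;
    return plan;
}
```

In this model, 48 gives zero main-loop iterations and three tail chunks, and 112 gives one main-loop iteration and three tail chunks, while every previously tested dimension (16 through 32, 64, 128, 256, 512, 1024) yields at most two tail chunks.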

diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 5a3b94556..474ac5c75 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -572,6 +572,12 @@ TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) {
     ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr));
+
+    // dim < 16 takes the scalar early-return in every SQ8_FP16 dispatcher (no SIMD tier).
+    size_t small_dim = 8;
+    ASSERT_EQ(L2_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_L2Sqr);
+    ASSERT_EQ(IP_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_InnerProduct);
+    ASSERT_EQ(Cosine_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_Cosine);
 }
 
 #ifdef CPU_FEATURES_ARCH_X86_64
@@ -3308,9 +3314,10 @@ INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest,
 
 // Higher dimensions surface multi-iteration loop bugs (pointer stride, do-while termination
 // off-by-one) that the [16, 32] range does not exercise because the AVX-512 inner loop runs at
-// most twice in that range.
+// most twice in that range. 48 and 112 specifically hit the AVX-512 three-chunk tail
+// (remaining == 48, i.e. (dim / 16) % 4 == 3): 48 with zero main-loop iterations, 112 with one.
 INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD_HighDim, SQ8_FP16_SpacesOptimizationTest,
-                         testing::Values(64UL, 128UL, 256UL, 512UL, 1024UL));
+                         testing::Values(48UL, 64UL, 112UL, 128UL, 256UL, 512UL, 1024UL));
 
 /* ======================== Tests SQ8_FP16 (edge cases) ========================= */