From ad941b8fb86ddff295f009d5c6c4f1d576e993df Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 10:30:25 +0300 Subject: [PATCH 01/24] =?UTF-8?q?Add=20design=20doc=20for=20SQ8=E2=86=94FP?= =?UTF-8?q?16=20SIMD=20x86=20kernels=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Captures the architecture, file-level plan, CMake F16C gating, and risk register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them into the existing dispatcher tables and SQ8_FP16 unit/benchmark scaffolding from MOD-15141. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-05-26-sq8-fp16-x86-kernels-design.md | 385 ++++++++++++++++++ 1 file changed, 385 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md diff --git a/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md b/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md new file mode 100644 index 000000000..1ef7a787a --- /dev/null +++ b/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md @@ -0,0 +1,385 @@ +# SQ8↔FP16 SIMD distance kernels — Intel x86 (MOD-14954) + +## Goal + +Add asymmetric SQ8 (storage) ↔ FP16 (query) distance kernels for Inner +Product, Cosine, and L2² on Intel x86 across four ISA tiers: + +- AVX-512 (F + BW + VL + VNNI bundle already used for SQ8_FP32) +- AVX2 + FMA +- AVX2 (no FMA) +- SSE4.1 + +Each kernel converts FP16 query lanes to FP32 per SIMD chunk; the inner +multiply-accumulate runs in FP32. 
SQ8 metadata and FP32 query metadata +(precomputed sums) stay scalar and are combined through the same algebraic +identity used by the SQ8_FP32 kernels (with `x_i = min + delta · q_i` the +dequantised storage lane): + +```text +IP(x, y) = min · y_sum + delta · Σ(q_i · y_i) +L2²(x, y) = x_sum_squares + y_sum_squares − 2 · IP(x, y) +``` + +Wire the new kernels into the dispatcher tables so +`{IP,Cosine,L2}_SQ8_FP16_GetDistFunc` returns the best SIMD path +available at runtime instead of the scalar fallback delivered by +MOD-15141. + +## Non-goals + +- No new metric (only IP / Cosine / L2²). +- No change to the scalar `SQ8_FP16_*` reference; existing tests against + `SQ8_FP16_NotOptimized_*` remain the correctness baseline. +- No ARM kernels (MOD-14972 covers ARM). +- No SQ8↔FP32 changes; existing kernels untouched. + +## Scope and constraints + +- FP16 query layout is `[float16 values (dim)] [y_sum (float)] + [y_sum_squares (float, L2 only)]`. Trailing metadata is FP32 and may + sit at an offset that is not a multiple of 4 when `dim` is odd; use + `load_unaligned` to read it (mirrors scalar `SQ8_FP16_Impl`). +- All four ISA tiers need a way to widen FP16 → FP32. The 512-bit + variant (`_mm512_cvtph_ps`) is in AVX512F. The 256-bit and 128-bit + variants (`_mm256_cvtph_ps`, `_mm_cvtph_ps`) require the F16C + extension. F16C is its own ISA flag; AVX2/SSE4.1 do not imply it. +- Existing dispatcher source files (`AVX2_FMA.cpp`, `AVX2.cpp`, + `SSE4.cpp`) are compiled without `-mf16c`. We add `-mf16c` to those + files in CMake (conditional on `CXX_F16C`), guard the new SQ8_FP16 + symbols behind `#ifdef OPT_F16C`, and add `features.f16c &&` to the + dispatch gates for the AVX2/SSE4 tiers. The AVX-512 tier needs no + F16C gate. +- `dim` must be ≥ 16 for every SIMD tier (AVX-512, AVX2+FMA, AVX2, and + SSE4), matching the existing SQ8_FP32 contract. +- SQ8 storage is read as `uint8_t`; the alignment hint returned by + `*_GetDistFunc` continues to refer to the SQ8 (first) operand. 
Hints: + 16 / 8 / 8 / 4 bytes for AVX-512 / AVX2+FMA / AVX2 / SSE4 when + `dim % chunk == 0`, else 0. + +## File-level design + +### New SIMD headers (8 files) + +Per ISA tier × {IP, L2}: + +```text +src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h +src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h +src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h +src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h +src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h +src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h +src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h +src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h +``` + +Each IP header exposes: + +- `template <unsigned char residual> float SQ8_FP16_InnerProductImp_<ISA>(const void*, const void*, size_t)`: raw inner product (no `1 -`), used by both InnerProduct/Cosine wrappers and the L2 kernel. +- `template <unsigned char residual> float SQ8_FP16_InnerProductSIMD16_<ISA>(...)`: returns `1.0f - Imp`. +- `template <unsigned char residual> float SQ8_FP16_CosineSIMD16_<ISA>(...)`: aliases InnerProduct (vectors are pre-normalised; mirrors the SQ8_FP32 pattern). + +Each L2 header `#include`s the matching IP header and exposes: + +- `template <unsigned char residual> float SQ8_FP16_L2SqrSIMD16_<ISA>(...)`: computes `x_sum_sq + y_sum_sq − 2·Imp(...)`. + +`<ISA>` strings: + +- `AVX512F_BW_VL_VNNI` +- `AVX2_FMA` +- `AVX2` +- `SSE4` + +All four headers' inner loops: + +1. Load 16 SQ8 bytes (one chunk) and widen to 16×FP32. +2. Load 16 FP16 query lanes and widen to 16×FP32 (one `_mm512_cvtph_ps` + for AVX-512, two `_mm256_cvtph_ps` calls for AVX2+FMA and plain AVX2, + or four `_mm_cvtph_ps` for SSE4; chunk granularity matches the + existing SQ8_FP32 layout for that tier). +3. Fused multiply-add (or mul + add for SSE4 and plain AVX2) into the + FP32 accumulator(s). +4. After the loop, horizontal-reduce and apply + `min_val · y_sum + delta · quantized_dot`. + +L2 kernels additionally read `x_sum_squares` from SQ8 metadata and +`y_sum_squares` from query metadata, and return +`x_sum_sq + y_sum_sq − 2·ip`. 
**Both** the SQ8 storage metadata +(`min_val`, `delta`, `x_sum_squares`) and the FP16 query metadata +(`y_sum`, `y_sum_squares`) are read with `load_unaligned`. SQ8 +metadata starts at byte offset `dim` after the quantised lanes — for +odd `dim` that offset is not 4-byte aligned. FP16 query metadata +starts at byte offset `2*dim` after the FP16 lanes — odd `dim` leaves +it 2-byte aligned. Mirrors the scalar `SQ8_FP16_InnerProduct_Impl` +pattern in `src/VecSim/spaces/IP/IP.cpp`. + +Residual handling: + +- **AVX-512** (residual 0..15): load the full 256-bit FP16 chunk + (`_mm256_loadu_si256` over 32 bytes; the chunk is always within the + query blob since `dim >= 16` and the FP16 metadata follows), convert with + `_mm512_cvtph_ps`, then mask away unused lanes via + `_mm512_maskz_mov_ps(mask, v2_f)` (or fold the mask into the + FP32 multiply with `_mm512_maskz_mul_ps`). The SQ8 side uses + `_mm_loadu_si128` + `_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps` + and is also masked. +- **AVX2+FMA / AVX2** (residual 0..15, split into a 0..7 head plus a + conditional 8-wide pre-step): for the 0..7 head, load the full + 128-bit FP16 block (`_mm_loadu_si128`), convert with + `_mm256_cvtph_ps`, then zero out unused lanes via + `_mm256_blend_ps(_mm256_setzero_ps(), v2_f, residuals_mask)` — + mirroring the existing F16C `FP16_InnerProductSIMD32_F16C` blend + pattern. The SQ8 side uses `_mm_loadl_epi64` (8 bytes) + + `_mm256_cvtepu8_epi32` + `_mm256_cvtepi32_ps`. When residual ≥ 8, + one extra full 8-wide step runs before the do-while loop, matching + the SQ8_FP32 AVX2[+FMA] residual layout. +- **SSE4** (residual 0..15, split into 4-wide pre-steps): for the + 0..3 head, materialise the FP32 lanes via `_mm_set_ps(0, ..., 0, + FP16_to_FP32(pVec2[k]), ...)` paired with `_mm_set_ps` on the SQ8 + side — mirrors the existing SSE4 SQ8_FP32 `_mm_set_ps` residual + path. For residual ≥ 4 / ≥ 8 / ≥ 12, run 1 / 2 / 3 extra 4-wide + steps before the do-while loop. 
Each 4-wide step loads 8 bytes of + FP16 (`_mm_loadl_epi64`), converts with `_mm_cvtph_ps`, and loads + 4 SQ8 bytes via `_mm_cvtsi32_si128` + `_mm_cvtepu8_epi32` + + `_mm_cvtepi32_ps`. + +### Dispatcher edits + +Per existing ISA dispatcher (no new dispatcher files): + +| File | Add declarations / definitions | +| --- | --- | +| `src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI` | +| `src/VecSim/spaces/functions/AVX2_FMA.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX2_FMA`, guarded by `#ifdef OPT_F16C` | +| `src/VecSim/spaces/functions/AVX2.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX2`, guarded by `#ifdef OPT_F16C` | +| `src/VecSim/spaces/functions/SSE4.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_SSE4`, guarded by `#ifdef OPT_F16C` | + +Each `Choose_*` uses the existing `CHOOSE_IMPLEMENTATION(out, dim, 16, +func)` macro (16-element residual table — matches SQ8_FP32 contract). + +`src/VecSim/spaces/IP_space.cpp` — extend `IP_SQ8_FP16_GetDistFunc` and +`Cosine_SQ8_FP16_GetDistFunc`. `L2_space.cpp` — extend +`L2_SQ8_FP16_GetDistFunc`. 
New body shape (IP shown; L2/Cosine +follow the same shape): + +```cpp +dist_func_t<float> ret_dist_func = SQ8_FP16_InnerProduct; +[[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt); + +#ifdef CPU_FEATURES_ARCH_X86_64 +if (dim < 16) { + return ret_dist_func; +} +#ifdef OPT_AVX512_F_BW_VL_VNNI +if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { + if (dim % 16 == 0) *alignment = 16 * sizeof(uint8_t); + return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim); +} +#endif +#ifdef OPT_AVX2_FMA +#ifdef OPT_F16C +if (features.avx2 && features.fma3 && features.f16c) { + if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); + return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim); +} +#endif +#endif +#ifdef OPT_AVX2 +#ifdef OPT_F16C +if (features.avx2 && features.f16c) { + if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); + return Choose_SQ8_FP16_IP_implementation_AVX2(dim); +} +#endif +#endif +#ifdef OPT_SSE4 +#ifdef OPT_F16C +// F16C instructions are VEX-encoded, so require AVX as well, matching the +// existing FP16/F16C dispatcher gate in IP_space.cpp. +if (features.sse4_1 && features.f16c && features.avx) { + if (dim % 4 == 0) *alignment = 4 * sizeof(uint8_t); + return Choose_SQ8_FP16_IP_implementation_SSE4(dim); +} +#endif +#endif +#endif // x86_64 +return ret_dist_func; +``` + +ARM block (`OPT_SVE2` / `OPT_SVE` / `OPT_NEON`) is left as-is; the +SQ8_FP16 ARM kernels arrive via MOD-14972. 
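Editorial aside on the design doc above: the algebraic identity from the Goal section, which every tier's kernel relies on, can be checked with a minimal scalar sketch. The `Sq8Meta` struct and the function names below are hypothetical illustrations, not the library's types or layout; the real kernels read `min_val`, `delta`, and the precomputed sums from the trailing metadata and use FP16 query lanes widened to FP32.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the SQ8 storage metadata (illustration only).
struct Sq8Meta {
    float min_val;
    float delta;
};

// Direct form: dequantise each lane, x_i = min + delta * q_i, then sum x_i * y_i.
inline float ip_direct(const std::vector<std::uint8_t> &q, const std::vector<float> &y,
                       Sq8Meta m) {
    assert(q.size() == y.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        acc += (m.min_val + m.delta * static_cast<float>(q[i])) * y[i];
    }
    return acc;
}

// Factored form: min * y_sum + delta * sum(q_i * y_i). Only the quantised dot
// product needs a per-lane loop; y_sum would be precomputed in the real layout.
inline float ip_factored(const std::vector<std::uint8_t> &q, const std::vector<float> &y,
                         Sq8Meta m) {
    assert(q.size() == y.size());
    float y_sum = 0.0f;
    float quantized_dot = 0.0f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        y_sum += y[i];
        quantized_dot += static_cast<float>(q[i]) * y[i];
    }
    return m.min_val * y_sum + m.delta * quantized_dot;
}

// L2^2 recovered from the raw inner product and the two sums of squares.
inline float l2_from_ip(float x_sum_squares, float y_sum_squares, float ip) {
    return x_sum_squares + y_sum_squares - 2.0f * ip;
}
```

Both forms should agree up to floating-point rounding, which is why the SIMD loop only has to accumulate `Σ(q_i · y_i)` and can apply `min_val` and `delta` once after the horizontal reduce.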
+ +### CMake change + +`src/VecSim/spaces/CMakeLists.txt` — when both `CXX_F16C` and the +parent ISA flag are present, add `-mf16c` to the dispatcher file: + +```cmake +if(CXX_AVX2 AND CXX_FMA) + set(_avx2_fma_flags "-mavx2 -mfma") + if(CXX_F16C) + set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c") + endif() + set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "${_avx2_fma_flags}") + list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp) +endif() + +if(CXX_AVX2) + set(_avx2_flags "-mavx2") + if(CXX_F16C) + set(_avx2_flags "${_avx2_flags} -mf16c") + endif() + set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "${_avx2_flags}") + list(APPEND OPTIMIZATIONS functions/AVX2.cpp) +endif() + +if(CXX_SSE4) + set(_sse4_flags "-msse4.1") + if(CXX_F16C) + set(_sse4_flags "${_sse4_flags} -mf16c") + endif() + set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "${_sse4_flags}") + list(APPEND OPTIMIZATIONS functions/SSE4.cpp) +endif() +``` + +AVX-512 dispatcher (`AVX512F_BW_VL_VNNI.cpp`) needs no flag change — +`-mavx512f` already enables `_mm512_cvtph_ps`. + +`-mf16c` does not alter the emitted code for the existing SQ8_FP32 +sources, since those sources contain no F16C intrinsics. + +### Tests (`tests/unit/test_spaces.cpp`) + +1. New parameterised class `SQ8_FP16_SpacesOptimizationTest` mirroring + `SQ8_FP32_SpacesOptimizationTest`. Three test bodies for L2 / IP / + Cosine, each comparing the chosen optimised function against the + scalar `SQ8_FP16_*` baseline (`ASSERT_NEAR ... 0.01`). Walks down + AVX512 → AVX2_FMA → AVX2 → SSE4 → scalar by zeroing feature flags + between assertions, exactly like `SQ8_FP32_SpacesOptimizationTest`. + `INSTANTIATE_TEST_SUITE_P` with `testing::Range(16UL, 16 * 2UL + 1)`. + +2. Update existing `SpacesTest.GetDistFunc_*_SQ8_FP16` assertions at + lines ~563–575: when running on x86, the dispatcher now returns the + SIMD `Choose_*` symbol instead of the scalar. 
AVX-512 selection + depends on `avx512f && avx512bw && avx512vl && avx512vnni` only + (no F16C requirement — 512-bit `_mm512_cvtph_ps` is part of + AVX512F). AVX2+FMA, AVX2, and SSE4 selection additionally requires + `features.f16c` (and `features.avx` for the SSE4 gate). The tests + should call `getCpuOptimizationFeatures()` and assert the expected + `Choose_*` for the host's highest supported tier (same shape used + by `SQ8_FP32_SpacesOptimizationTest`). + +3. Reuse existing helpers: `populate_sq8_fp16_query`, + `populate_float_vec_to_sq8_with_metadata`, + `SQ8_FP16_NotOptimized_{InnerProduct,Cosine,L2Sqr}`. + +### Benchmarks (`tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp`) + +Add per-ISA benches mirroring `bm_spaces_sq8_fp32.cpp`: + +```cpp +#ifdef CPU_FEATURES_ARCH_X86_64 +cpu_features::X86Features opt = cpu_features::GetX86Info().features; + +#ifdef OPT_AVX512_F_BW_VL_VNNI +bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, avx512_f_bw_vl_vnni_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, avx512_f_bw_vl_vnni_supported); +#endif + +#ifdef OPT_F16C +#ifdef OPT_AVX2_FMA +bool avx2_fma3_f16c_supported = opt.avx2 && opt.fma3 && opt.f16c; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, avx2_fma3_f16c_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, avx2_fma3_f16c_supported); +#endif + +#ifdef OPT_AVX2 +bool avx2_f16c_supported = opt.avx2 && opt.f16c; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported); +#endif + +#ifdef OPT_SSE4 +bool sse4_f16c_supported = opt.sse4_1 && opt.f16c && opt.avx; 
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported); +#endif +#endif // OPT_F16C +#endif // x86_64 +``` + +Naive bench lines stay (covers the scalar fallback case). + +## Validation strategy + +1. Unit tests (`SQ8_FP16_SpacesOptimizationTest`) assert numerical + parity against the scalar baseline for all dims in `[16, 32]` + (covers every residual class for the 16-wide chunk). Existing + `SQ8_FP16_NoOpt` parameterised suite continues to exercise small + and odd dims for the scalar reference; combined with the new + optimisation tests this covers each SIMD residual class plus the + scalar fallback. +2. Existing edge-case tests (`SQ8_FP16_EdgeCases.ZeroQueryTest`, + `SQ8_FP16_l2sqr_odd_dim_unaligned_metadata_test`) keep running + against the scalar implementation directly — they exercise + alignment-sensitive paths that are deliberately scalar-only. +3. Microbenchmarks compare per-ISA SQ8_FP16 throughput to the matching + SQ8_FP32 throughput on the same machine. Acceptance: SQ8_FP16 + should be within ~1.0–1.5× of SQ8_FP32 (one extra widening per + chunk, no extra memory pressure since the FP16 query is half the + size of FP32). +4. CI: x86 jobs already exist; verifies the CMake change keeps + building. No new toolchain requirement (binutils 2.34+ already + covers F16C, no AVX-512 FP16 dependency). + +## Risk register + +| Risk | Likelihood | Mitigation | +| --- | --- | --- | +| Adding `-mf16c` to AVX2_FMA.cpp / AVX2.cpp / SSE4.cpp accidentally enables F16C codegen elsewhere | Low | Those sources contain only SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 code; no F16C intrinsics — compiler cannot synthesise F16C without an explicit intrinsic. | +| Older toolchain without F16C support | Low | `CXX_F16C` already detected; `-mf16c` only appended when present. 
Dispatcher symbols guarded by `#ifdef OPT_F16C`; missing → falls through to scalar. | +| Backport branches diverge in dispatcher | Medium | Change is additive (new headers, new symbols, new gates). No SQ8_FP32 path touched. CMake change is conditional. Backport just cherry-picks the commit. | +| Pre-Ivy Bridge SSE4-only CPUs lose a SIMD tier (no F16C) | Negligible | Fall through to scalar SQ8_FP16. Such CPUs are out of practical support anyway. | +| Numerical drift between FP16→FP32 widening and the scalar `FP16_to_FP32` software path | Low | `vcvtph2ps` follows IEEE 754 half→single conversion exactly; the scalar `FP16_to_FP32` in `float16.h` is bit-faithful for finite values. Tests use `ASSERT_NEAR ... 0.01` slack. | + +## Out-of-scope follow-ups + +- AVX512FP16-native kernels (would use `__m512h` and `vfmadd*ph` + directly on 32 FP16 lanes per 512-bit register, skipping the + widen-to-FP32 step). Deferred for four concrete reasons, not just + "lower priority": + 1. **Deployment baseline.** AVX512FP16 is Sapphire Rapids and + newer (Intel server 2023+) plus very recent AMD parts. Most + production hosts running this library do not have it. The + AVX-512F path delivered here is the right default for the + widely-deployed AVX-512 tier, and a Sapphire-Rapids-only + variant would land underneath the same gating tree, not as a + replacement. + 2. **Numerical fit is awkward for SQ8↔FP16.** The kernel computes + `Σ(q_i · y_i)` where `q_i ∈ [0,255]` (uint8) and `y_i` is + FP16. Each lane product can be as large as + `255 · 65504 ≈ 1.67e7`, which is well above the FP16 finite + range (`±65504`). A pure FP16 accumulator would overflow on + realistic data; the only safe path is to accumulate in FP32 + after a per-chunk `vcvtph2ps`-equivalent — which is exactly + what the AVX-512F path already does. AVX512FP16 mainly buys + FP16-native multiply-add, which we cannot safely use here. + 3. 
**Marginal speedup over the AVX-512F path proposed here.** + The widening cost is one `_mm512_cvtph_ps` per 16-element + chunk against a kernel that is already memory-bandwidth-bound + (16 bytes of SQ8 storage + 32 bytes of FP16 query per chunk). + Eliminating that one conversion saves a few cycles per chunk + on a path that is gated on memory, not arithmetic throughput. + 4. **Ticket scope.** MOD-14954 enumerates AVX-512, AVX2+FMA, and + SSE4; the plain-AVX2 tier was added during brainstorming as + free coverage. An AVX512FP16 variant is its own ISA tier with + its own gating column in the dispatcher and its own residual + table, and warrants a separate design / benchmarking pass + once the deployment baseline justifies the maintenance cost. + Pure FP16↔FP16 (no SQ8 involved) already has an AVX512FP16_VL path + at `src/VecSim/spaces/functions/AVX512FP16_VL.cpp`; that file is the + natural home should we revisit this later. +- ARM SQ8_FP16 (MOD-14972). +- Reranking flow integration tests under HNSW (separate ticket). From 97467b25b42a5408672ea75eaef6551692edcd42 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 11:24:06 +0300 Subject: [PATCH 02/24] Append -mf16c to AVX2_FMA/AVX2/SSE4 dispatcher sources [MOD-14954] MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes to FP32. The flag is appended only when CXX_F16C is detected; existing SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics so emitted code for those kernels is unchanged. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- src/VecSim/spaces/CMakeLists.txt | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt index fe354ded5..8babf844b 100644 --- a/src/VecSim/spaces/CMakeLists.txt +++ b/src/VecSim/spaces/CMakeLists.txt @@ -51,14 +51,26 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") endif() if(CXX_AVX2) - message("Building with AVX2") - set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS -mavx2) + set(_avx2_flags "-mavx2") + if(CXX_F16C) + message("Building functions/AVX2.cpp with AVX2 and F16C") + set(_avx2_flags "${_avx2_flags} -mf16c") + else() + message("Building functions/AVX2.cpp with AVX2 (no F16C)") + endif() + set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "${_avx2_flags}") list(APPEND OPTIMIZATIONS functions/AVX2.cpp) endif() if(CXX_AVX2 AND CXX_FMA) - message("Building with AVX2 and FMA") - set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma") + set(_avx2_fma_flags "-mavx2 -mfma") + if(CXX_F16C) + message("Building functions/AVX2_FMA.cpp with AVX2, FMA, and F16C") + set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c") + else() + message("Building functions/AVX2_FMA.cpp with AVX2 and FMA (no F16C)") + endif() + set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "${_avx2_fma_flags}") list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp) endif() @@ -81,8 +93,14 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") endif() if(CXX_SSE4) - message("Building with SSE4") - set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS -msse4.1) + set(_sse4_flags "-msse4.1") + if(CXX_F16C) + message("Building functions/SSE4.cpp with SSE4.1 and F16C") + set(_sse4_flags "${_sse4_flags} -mf16c") + else() + message("Building functions/SSE4.cpp with SSE4.1 (no F16C)") + endif() + 
set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "${_sse4_flags}") list(APPEND OPTIMIZATIONS functions/SSE4.cpp) endif() From bab74734c1a2a7a856dee66dc609b2468416f699 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 11:26:05 +0300 Subject: [PATCH 03/24] Add SQ8_FP16_SpacesOptimizationTest skeleton [MOD-14954] Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest; currently asserts only the scalar fallback path. Per-tier SIMD assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside the kernel implementations in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/unit/test_spaces.cpp | 95 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 95 insertions(+) diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index a6bb88cef..dfbe81f5d 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3070,6 +3070,101 @@ INSTANTIATE_TEST_SUITE_P(SQ8_FP16_NoOpt, SQ8_FP16_NoOptimizationSpacesTest, testing::Values(1, 5, 7, 8, 9, 15, 16, 17, 31, 32, 33, 47, 48, 49, 63, 64, 65, 127, 128)); +/* ======================== SQ8_FP16 SIMD optimisation tests ========================= */ + +// Walks down the x86 ISA tiers (AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar) and asserts +// that {IP,Cosine,L2}_SQ8_FP16_GetDistFunc returns the expected Choose_* symbol and that +// its output matches the scalar baseline within 0.01. 
+class SQ8_FP16_SpacesOptimizationTest : public testing::TestWithParam {}; + +TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { + auto optimization = getCpuOptimizationFeatures(); + size_t dim = GetParam(); + + size_t query_count = + dim + sq8::query_metadata_count() * (sizeof(float) / sizeof(float16)); + std::vector v1_query(query_count); + test_utils::populate_sq8_fp16_query(v1_query.data(), dim, false, 1234); + + size_t quantized_size = + dim * sizeof(uint8_t) + sq8::storage_metadata_count() * sizeof(float); + std::vector v2_compressed(quantized_size); + test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, false, 5678); + + dist_func_t arch_opt_func; + float baseline = SQ8_FP16_L2Sqr(v2_compressed.data(), v1_query.data(), dim); + + // Per-tier assertion blocks are added by Tasks 3–6. + + // Scalar fallback. + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, SQ8_FP16_L2Sqr) + << "Unexpected scalar fallback function for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "Scalar fallback with dim " << dim; + ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim; +} + +TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { + auto optimization = getCpuOptimizationFeatures(); + size_t dim = GetParam(); + + size_t query_count = + dim + sq8::query_metadata_count() * (sizeof(float) / sizeof(float16)); + std::vector v1_query(query_count); + test_utils::populate_sq8_fp16_query(v1_query.data(), dim, true, 1234); + + size_t quantized_size = + dim * sizeof(uint8_t) + sq8::storage_metadata_count() * sizeof(float); + std::vector v2_compressed(quantized_size); + test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, true, 5678); + + dist_func_t arch_opt_func; + float baseline = SQ8_FP16_InnerProduct(v2_compressed.data(), 
v1_query.data(), dim); + + // Per-tier assertion blocks are added by Tasks 3–6. + + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, SQ8_FP16_InnerProduct) + << "Unexpected scalar fallback function for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "Scalar fallback with dim " << dim; + ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim; +} + +TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { + auto optimization = getCpuOptimizationFeatures(); + size_t dim = GetParam(); + + size_t query_count = + dim + sq8::query_metadata_count() * (sizeof(float) / sizeof(float16)); + std::vector v1_query(query_count); + test_utils::populate_sq8_fp16_query(v1_query.data(), dim, true, 1234); + + size_t quantized_size = + dim * sizeof(uint8_t) + sq8::storage_metadata_count() * sizeof(float); + std::vector v2_compressed(quantized_size); + test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, true, 5678); + + dist_func_t arch_opt_func; + float baseline = SQ8_FP16_Cosine(v2_compressed.data(), v1_query.data(), dim); + + // Per-tier assertion blocks are added by Tasks 3–6. + + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, SQ8_FP16_Cosine) + << "Unexpected scalar fallback function for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "Scalar fallback with dim " << dim; + ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim; +} + +INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest, + testing::Range(16UL, 16 * 2UL + 1)); + /* ======================== Tests SQ8_FP16 (edge cases) ========================= */ // Zero FP16 query against a non-zero SQ8 storage. 
IP must be exactly 1.0 (1 - 0), From 671a7cc3cef3bf313f373ff5e51a899bed12d7bb Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 11:30:13 +0300 Subject: [PATCH 04/24] =?UTF-8?q?Add=20AVX-512=20SQ8=E2=86=94FP16=20SIMD?= =?UTF-8?q?=20distance=20kernels=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product, Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8 storage and FP16 query metadata reads use load_unaligned to tolerate odd dimensions. Dispatcher branches in IP_space.cpp / L2_space.cpp select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI when features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni; otherwise behaviour is unchanged from MOD-15141. A parameterised gtest fixture exercises every residual class in [16, 32] against the scalar baseline. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h | 102 ++++++++++++++++++ src/VecSim/spaces/IP_space.cpp | 45 ++++++-- .../L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h | 36 +++++++ src/VecSim/spaces/L2_space.cpp | 23 +++- .../spaces/functions/AVX512F_BW_VL_VNNI.cpp | 21 ++++ .../spaces/functions/AVX512F_BW_VL_VNNI.h | 4 + tests/unit/test_spaces.cpp | 39 ++++++- 7 files changed, 252 insertions(+), 18 deletions(-) create mode 100644 src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h new file mode 100644 index 000000000..55d63d711 --- /dev/null +++ b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h @@ -0,0 +1,102 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. 
+ * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" +#include + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +// Helper: load 16 SQ8 + 16 FP16 lanes, widen both to FP32, fused-multiply-add into sum. +static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, + const float16 *&pVec2, + __m512 &sum) { + // 16 uint8 -> 16 fp32 + __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1)); + __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128); + __m512 v1_f = _mm512_cvtepi32_ps(v1_512); + + // 16 fp16 -> 16 fp32. _mm512_cvtph_ps is part of AVX512F. + __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2)); + __m512 v2_f = _mm512_cvtph_ps(v2_16); + + sum = _mm512_fmadd_ps(v1_f, v2_f, sum); + + pVec1 += 16; + pVec2 += 16; +} + +// Raw inner product Σ((min + delta * q_i) * y_i). Used by both InnerProduct/Cosine wrappers +// and by the L2 kernel. +template <unsigned char residual> // 0..15 +float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, size_t dimension) { + const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v); // SQ8 storage + const float16 *pVec2 = static_cast<const float16 *>(pVec2v); // FP16 query + const uint8_t *pEnd1 = pVec1 + dimension; + + __m512 sum = _mm512_setzero_ps(); + + if constexpr (residual > 0) { + __mmask16 mask = (1U << residual) - 1; + + __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1)); + __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128); + __m512 v1_f = _mm512_cvtepi32_ps(v1_512); + + // Safe to read the full 32-byte FP16 chunk: dim >= 16 and the FP16 metadata follows + // the lanes, so the load stays within the query blob. 
+ __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2)); + __m512 v2_f = _mm512_cvtph_ps(v2_16); + + // Mask out unused lanes by folding the mask into the multiply. + sum = _mm512_maskz_mul_ps(mask, v1_f, v2_f); + + pVec1 += residual; + pVec2 += residual; + } + + do { + SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum); + } while (pVec1 < pEnd1); + + float quantized_dot = _mm512_reduce_add_ps(sum); + + // SQ8 metadata starts at byte offset `dimension`; whenever `dimension` is not a multiple + // of 4 it is not 4-byte aligned, so use load_unaligned. Mirrors the scalar SQ8_FP16_Impl pattern. + const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v); + const uint8_t *params_bytes = pVec1Base + dimension; + const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float)); + + // FP16 query metadata sits at byte offset 2*dimension; for odd `dimension` it is + // 2-byte aligned only. + const float16 *pVec2Base = static_cast<const float16 *>(pVec2v); + const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension); + const float y_sum = + load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template <unsigned char residual> // 0..15 +float SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI(const void *pVec1v, const void *pVec2v, + size_t dimension) { + return 1.0f - SQ8_FP16_InnerProductImp_AVX512<residual>(pVec1v, pVec2v, dimension); +} + +template <unsigned char residual> // 0..15 +float SQ8_FP16_CosineSIMD16_AVX512F_BW_VL_VNNI(const void *pVec1v, const void *pVec2v, + size_t dimension) { + // Cosine distance = 1 - IP for pre-normalised vectors. Aliases InnerProduct, matching the + // SQ8_FP32 pattern. 
+ return SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI(pVec1v, pVec2v, dimension); +} diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index 55979e25a..a99241180 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -172,31 +172,56 @@ dist_func_t Cosine_SQ8_FP32_GetDistFunc(size_t dim, unsigned char *alignm } // SQ8-FP16: asymmetric inner product distance between SQ8 storage and FP16 query. -// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always -// returns the scalar implementation. dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, const void *arch_opt) { unsigned char dummy_alignment; if (alignment == nullptr) { alignment = &dummy_alignment; } - (void)dim; - (void)arch_opt; - return SQ8_FP16_InnerProduct; + + dist_func_t ret_dist_func = SQ8_FP16_InnerProduct; + [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt); + +#ifdef CPU_FEATURES_ARCH_X86_64 + if (dim < 16) { + return ret_dist_func; + } + // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract. +#ifdef OPT_AVX512_F_BW_VL_VNNI + if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { + if (dim % 16 == 0) // SQ8 chunk = 16 bytes + *alignment = 16 * sizeof(uint8_t); + return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim); + } +#endif +#endif // x86_64 + return ret_dist_func; } // SQ8-FP16: asymmetric cosine distance between SQ8 storage and FP16 query. -// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always -// returns the scalar implementation. 
dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, const void *arch_opt) { unsigned char dummy_alignment; if (alignment == nullptr) { alignment = &dummy_alignment; } - (void)dim; - (void)arch_opt; - return SQ8_FP16_Cosine; + + dist_func_t ret_dist_func = SQ8_FP16_Cosine; + [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt); + +#ifdef CPU_FEATURES_ARCH_X86_64 + if (dim < 16) { + return ret_dist_func; + } +#ifdef OPT_AVX512_F_BW_VL_VNNI + if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { + if (dim % 16 == 0) + *alignment = 16 * sizeof(uint8_t); + return Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim); + } +#endif +#endif // x86_64 + return ret_dist_func; } // SQ8-to-SQ8 Inner Product distance function (both vectors are uint8 quantized with precomputed diff --git a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h new file mode 100644 index 000000000..101bf285e --- /dev/null +++ b/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h @@ -0,0 +1,36 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +// L2² = x_sum_squares + y_sum_squares - 2 * IP(x, y), computed via the AVX-512 IP impl above. 
+template // 0..15 +float SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI(const void *pVect1v, const void *pVect2v, + size_t dimension) { + const float ip = SQ8_FP16_InnerProductImp_AVX512(pVect1v, pVect2v, dimension); + + const uint8_t *pVect1 = static_cast(pVect1v); + const uint8_t *params_bytes = pVect1 + dimension; + const float x_sum_sq = + load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + + const float16 *pVect2 = static_cast(pVect2v); + const auto *query_meta_bytes = reinterpret_cast(pVect2 + dimension); + const float y_sum_sq = + load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index ba3dd7cab..eaf383443 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -104,17 +104,30 @@ dist_func_t L2_SQ8_FP32_GetDistFunc(size_t dim, unsigned char *alignment, } // SQ8-FP16: asymmetric L2 distance between SQ8 storage and FP16 query. -// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always -// returns the scalar implementation. dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, const void *arch_opt) { unsigned char dummy_alignment; if (!alignment) { alignment = &dummy_alignment; } - (void)dim; - (void)arch_opt; - return SQ8_FP16_L2Sqr; + + dist_func_t ret_dist_func = SQ8_FP16_L2Sqr; + [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt); + +#ifdef CPU_FEATURES_ARCH_X86_64 + if (dim < 16) { + return ret_dist_func; + } + // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract. 
+#ifdef OPT_AVX512_F_BW_VL_VNNI + if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { + if (dim % 16 == 0) + *alignment = 16 * sizeof(uint8_t); + return Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim); + } +#endif +#endif // x86_64 + return ret_dist_func; } dist_func_t L2_FP32_GetDistFunc(size_t dim, unsigned char *alignment, const void *arch_opt) { diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp index 3b8813b89..e5e8bb1c2 100644 --- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp +++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp @@ -17,6 +17,9 @@ #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP32.h" #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP32.h" +#include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h" + #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_SQ8.h" #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_SQ8.h" @@ -75,6 +78,24 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP32_L2SqrSIMD16_AVX512F_BW_VL_VNNI); return ret_dist_func; } + +// SQ8-to-FP16 distance functions +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F_BW_VL_VNNI); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI); + return ret_dist_func; +} + // SQ8-to-SQ8 
distance functions (both vectors are uint8 quantized with precomputed sum) dist_func_t Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) { dist_func_t ret_dist_func; diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h index fe1583491..b68bfd0a4 100644 --- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h +++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h @@ -24,6 +24,10 @@ dist_func_t Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(size_t d dist_func_t Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim); dist_func_t Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim); +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim); + // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum) dist_func_t Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim); dist_func_t Choose_SQ8_SQ8_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim); diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index dfbe81f5d..117886dba 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3094,7 +3094,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { dist_func_t arch_opt_func; float baseline = SQ8_FP16_L2Sqr(v2_compressed.data(), v1_query.data(), dim); - // Per-tier assertion blocks are added by Tasks 3–6. 
+#ifdef OPT_AVX512_F_BW_VL_VNNI + if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && + optimization.avx512vnni) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX512 with dim " << dim; + optimization.avx512f = 0; + } +#endif // Scalar fallback. unsigned char alignment = 0; @@ -3123,7 +3134,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { dist_func_t arch_opt_func; float baseline = SQ8_FP16_InnerProduct(v2_compressed.data(), v1_query.data(), dim); - // Per-tier assertion blocks are added by Tasks 3–6. +#ifdef OPT_AVX512_F_BW_VL_VNNI + if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && + optimization.avx512vnni) { + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX512 with dim " << dim; + optimization.avx512f = 0; + } +#endif unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3151,7 +3173,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { dist_func_t arch_opt_func; float baseline = SQ8_FP16_Cosine(v2_compressed.data(), v1_query.data(), dim); - // Per-tier assertion blocks are added by Tasks 3–6. 
+#ifdef OPT_AVX512_F_BW_VL_VNNI + if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && + optimization.avx512vnni) { + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX512 with dim " << dim; + optimization.avx512f = 0; + } +#endif unsigned char alignment = 0; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); From c2f8340efbaacb4e3846a6e8c176e039195fe41b Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 11:33:48 +0300 Subject: [PATCH 05/24] =?UTF-8?q?Add=20AVX2+FMA=20SQ8=E2=86=94FP16=20SIMD?= =?UTF-8?q?=20distance=20kernels=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a 256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte FP16 block, convert, then blend zero across unused lanes — mirroring the existing F16C FP16 kernel pattern. Dispatcher branch in {IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 && features.fma3 && features.f16c. 
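The chunk schedule described above (a masked head of `residual % 8` lanes, an optional single 8-lane step, then 16 lanes per loop iteration) can be sketched in portable scalar C++. This is an illustration only, not the kernel in this patch: FP16 widening is left out (the query is plain `float` here), and `step8`, `chunked_dot` and `sq8_ip` are hypothetical names.

```cpp
// Portable scalar sketch of the 8-wide chunk schedule (illustrative, not the real kernel).
// Assumptions: FP16 -> FP32 widening is omitted, and all names here are hypothetical.
#include <cstddef>
#include <cstdint>

// Emulates one 8-lane multiply-accumulate step and advances both pointers.
static void step8(const uint8_t *&q, const float *&y, float (&acc)[8]) {
    for (int lane = 0; lane < 8; ++lane)
        acc[lane] += static_cast<float>(q[lane]) * y[lane];
    q += 8;
    y += 8;
}

// Computes sum(q_i * y_i) in the same order of work as the 8-wide kernel. Requires dim >= 16.
float chunked_dot(const uint8_t *q, const float *y, size_t dim) {
    const uint8_t *end = q + dim;
    const size_t residual = dim % 16;
    const size_t head = residual % 8;
    float acc[8] = {0};

    // Masked head: only the first `head` lanes contribute, the remaining lanes stay zero.
    for (size_t lane = 0; lane < head; ++lane)
        acc[lane] = static_cast<float>(q[lane]) * y[lane];
    q += head;
    y += head;

    // One extra 8-lane step when the residual covers a full half-chunk.
    if (residual >= 8)
        step8(q, y, acc);

    // Main loop: two 8-lane steps (16 lanes) per iteration.
    do {
        step8(q, y, acc);
        step8(q, y, acc);
    } while (q < end);

    // Horizontal reduction of the accumulator lanes.
    float sum = 0.0f;
    for (float lane : acc)
        sum += lane;
    return sum;
}

// Applies the SQ8 identity: min * y_sum + delta * sum(q_i * y_i).
float sq8_ip(float min_val, float delta, float y_sum, float quantized_dot) {
    return min_val * y_sum + delta * quantized_dot;
}
```

Under these assumptions, `chunked_dot` should agree with a plain element-by-element loop for any `dim >= 16`, which gives a quick sanity check on the residual handling without needing AVX2 or F16C hardware.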
Co-Authored-By: Claude Opus 4.7 (1M context) --- src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h | 95 +++++++++++++++++++++ src/VecSim/spaces/IP_space.cpp | 18 ++++ src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h | 35 ++++++++ src/VecSim/spaces/L2_space.cpp | 9 ++ src/VecSim/spaces/functions/AVX2_FMA.cpp | 23 +++++ src/VecSim/spaces/functions/AVX2_FMA.h | 6 ++ tests/unit/test_spaces.cpp | 39 +++++++++ 7 files changed, 225 insertions(+) create mode 100644 src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h new file mode 100644 index 000000000..1d6b4e676 --- /dev/null +++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/AVX_utils.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +// 8-wide AVX2+FMA step: 8 SQ8 lanes + 8 FP16 lanes -> 8 FP32 fused-multiply-add. 
+static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1, + const float16 *&pVect2, + __m256 &sum256) { + __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast(pVect1)); + pVect1 += 8; + __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); + __m256 v1_f = _mm256_cvtepi32_ps(v1_256); + + __m128i v2_128 = _mm_loadu_si128(reinterpret_cast(pVect2)); + __m256 v2_f = _mm256_cvtph_ps(v2_128); + pVect2 += 8; + + sum256 = _mm256_fmadd_ps(v1_f, v2_f, sum256); +} + +template // 0..15 +float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) { + const uint8_t *pVec1 = static_cast(pVec1v); + const float16 *pVec2 = static_cast(pVec2v); + const uint8_t *pEnd1 = pVec1 + dimension; + + __m256 sum256 = _mm256_setzero_ps(); + + if constexpr (residual % 8) { + constexpr int mask = (1 << (residual % 8)) - 1; + + // SQ8 side: load 8 bytes regardless of residual; unused lanes are zeroed by the blend on + // the widened FP16 query. + __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast(pVec1)); + pVec1 += residual % 8; + __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); + __m256 v1_f = _mm256_cvtepi32_ps(v1_256); + + // FP16 side: load full 16-byte block (safe — dim >= 16 and metadata follows).
+ __m128i v2_128 = _mm_loadu_si128(reinterpret_cast(pVec2)); + __m256 v2_f = _mm256_cvtph_ps(v2_128); + v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask); + pVec2 += residual % 8; + + sum256 = _mm256_mul_ps(v1_f, v2_f); + } + + if constexpr (residual >= 8) { + SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256); + } + + do { + SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256); + SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256); + } while (pVec1 < pEnd1); + + float quantized_dot = my_mm256_reduce_add_ps(sum256); + + const uint8_t *pVec1Base = static_cast(pVec1v); + const uint8_t *params_bytes = pVec1Base + dimension; + const float min_val = load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); + + const float16 *pVec2Base = static_cast(pVec2v); + const auto *query_meta_bytes = reinterpret_cast(pVec2Base + dimension); + const float y_sum = + load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template // 0..15 +float SQ8_FP16_InnerProductSIMD16_AVX2_FMA(const void *pVec1v, const void *pVec2v, + size_t dimension) { + return 1.0f - SQ8_FP16_InnerProductImp_AVX2_FMA(pVec1v, pVec2v, dimension); +} + +template // 0..15 +float SQ8_FP16_CosineSIMD16_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) { + return SQ8_FP16_InnerProductSIMD16_AVX2_FMA(pVec1v, pVec2v, dimension); +} diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index a99241180..1af5d2c35 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -194,6 +194,15 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim); } #endif +#ifdef OPT_AVX2_FMA +#ifdef OPT_F16C + if (features.avx2 && features.fma3 && features.f16c) { + if (dim % 8 == 0) // SQ8 chunk = 8 bytes + *alignment = 
8 * sizeof(uint8_t); + return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } @@ -220,6 +229,15 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm return Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim); } #endif +#ifdef OPT_AVX2_FMA +#ifdef OPT_F16C + if (features.avx2 && features.fma3 && features.f16c) { + if (dim % 8 == 0) + *alignment = 8 * sizeof(uint8_t); + return Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } diff --git a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h new file mode 100644 index 000000000..5f9ad0db6 --- /dev/null +++ b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h @@ -0,0 +1,35 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). 
+ */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/AVX_utils.h" +#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +template // 0..15 +float SQ8_FP16_L2SqrSIMD16_AVX2_FMA(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = SQ8_FP16_InnerProductImp_AVX2_FMA(pVect1v, pVect2v, dimension); + + const uint8_t *pVect1 = static_cast(pVect1v); + const uint8_t *params_bytes = pVect1 + dimension; + const float x_sum_sq = + load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + + const float16 *pVect2 = static_cast(pVect2v); + const auto *query_meta_bytes = reinterpret_cast(pVect2 + dimension); + const float y_sum_sq = + load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index eaf383443..79f6c21ae 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -126,6 +126,15 @@ dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, return Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim); } #endif +#ifdef OPT_AVX2_FMA +#ifdef OPT_F16C + if (features.avx2 && features.fma3 && features.f16c) { + if (dim % 8 == 0) + *alignment = 8 * sizeof(uint8_t); + return Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } diff --git a/src/VecSim/spaces/functions/AVX2_FMA.cpp b/src/VecSim/spaces/functions/AVX2_FMA.cpp index c859128b2..5745a4ddf 100644 --- a/src/VecSim/spaces/functions/AVX2_FMA.cpp +++ b/src/VecSim/spaces/functions/AVX2_FMA.cpp @@ -10,6 +10,11 @@ #include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP32.h" #include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP32.h" +#ifdef OPT_F16C +#include 
"VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h" +#endif + namespace spaces { #include "implementation_chooser.h" @@ -31,6 +36,24 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim) { return ret_dist_func; } +#ifdef OPT_F16C +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2_FMA); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2_FMA); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2_FMA); + return ret_dist_func; +} +#endif + #include "implementation_chooser_cleanup.h" } // namespace spaces diff --git a/src/VecSim/spaces/functions/AVX2_FMA.h b/src/VecSim/spaces/functions/AVX2_FMA.h index b20b1a588..413f55081 100644 --- a/src/VecSim/spaces/functions/AVX2_FMA.h +++ b/src/VecSim/spaces/functions/AVX2_FMA.h @@ -16,4 +16,10 @@ dist_func_t Choose_SQ8_FP32_IP_implementation_AVX2_FMA(size_t dim); dist_func_t Choose_SQ8_FP32_Cosine_implementation_AVX2_FMA(size_t dim); dist_func_t Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim); +#ifdef OPT_F16C +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim); +#endif + } // namespace spaces diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 117886dba..32f5e6991 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3105,6 +3105,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { << "AVX512 with dim " << dim; 
optimization.avx512f = 0; } +#endif +#ifdef OPT_AVX2_FMA +#ifdef OPT_F16C + if (optimization.avx2 && optimization.fma3 && optimization.f16c) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX2+FMA with dim " << dim; + optimization.fma3 = 0; + } +#endif #endif // Scalar fallback. @@ -3145,6 +3158,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { << "AVX512 with dim " << dim; optimization.avx512f = 0; } +#endif +#ifdef OPT_AVX2_FMA +#ifdef OPT_F16C + if (optimization.avx2 && optimization.fma3 && optimization.f16c) { + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX2+FMA with dim " << dim; + optimization.fma3 = 0; + } +#endif #endif unsigned char alignment = 0; @@ -3184,6 +3210,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { << "AVX512 with dim " << dim; optimization.avx512f = 0; } +#endif +#ifdef OPT_AVX2_FMA +#ifdef OPT_F16C + if (optimization.avx2 && optimization.fma3 && optimization.f16c) { + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX2+FMA with dim " << dim; + optimization.fma3 = 0; + } +#endif #endif unsigned char alignment = 0; From 
415c2ed64397656126c04e8811a0d5deef0acf11 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 11:37:19 +0300 Subject: [PATCH 06/24] =?UTF-8?q?Add=20AVX2=20(no=20FMA)=20SQ8=E2=86=94FP1?= =?UTF-8?q?6=20SIMD=20distance=20kernels=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps instead of _mm256_fmadd_ps so it can run on AVX2 targets that do not report FMA3 (uncommon, since mainstream AVX2 CPUs from Haswell onward also ship FMA3, but it matches the existing SQ8_FP32 tiering). Dispatcher gate requires features.avx2 && features.f16c and runs between the AVX2+FMA and SSE4 tiers. Co-Authored-By: Claude Opus 4.7 (1M context) --- src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h | 91 +++++++++++++++++++++++++ src/VecSim/spaces/IP_space.cpp | 18 +++++ src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h | 35 ++++++++++ src/VecSim/spaces/L2_space.cpp | 9 +++ src/VecSim/spaces/functions/AVX2.cpp | 23 +++++++ src/VecSim/spaces/functions/AVX2.h | 6 ++ tests/unit/test_spaces.cpp | 39 +++++++++ 7 files changed, 221 insertions(+) create mode 100644 src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h new file mode 100644 index 000000000..e68e5fa11 --- /dev/null +++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h @@ -0,0 +1,91 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3).
+ */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/AVX_utils.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +// 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum. +static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, + const float16 *&pVect2, + __m256 &sum256) { + __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast(pVect1)); + pVect1 += 8; + __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); + __m256 v1_f = _mm256_cvtepi32_ps(v1_256); + + __m128i v2_128 = _mm_loadu_si128(reinterpret_cast(pVect2)); + __m256 v2_f = _mm256_cvtph_ps(v2_128); + pVect2 += 8; + + sum256 = _mm256_add_ps(sum256, _mm256_mul_ps(v1_f, v2_f)); +} + +template // 0..15 +float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) { + const uint8_t *pVec1 = static_cast(pVec1v); + const float16 *pVec2 = static_cast(pVec2v); + const uint8_t *pEnd1 = pVec1 + dimension; + + __m256 sum256 = _mm256_setzero_ps(); + + if constexpr (residual % 8) { + constexpr int mask = (1 << (residual % 8)) - 1; + + __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast(pVec1)); + pVec1 += residual % 8; + __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); + __m256 v1_f = _mm256_cvtepi32_ps(v1_256); + + __m128i v2_128 = _mm_loadu_si128(reinterpret_cast(pVec2)); + __m256 v2_f = _mm256_cvtph_ps(v2_128); + v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask); + pVec2 += residual % 8; + + sum256 = _mm256_mul_ps(v1_f, v2_f); + } + + if constexpr (residual >= 8) { + SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256); + } + + do { + SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256); + SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256); + } while (pVec1 < pEnd1); + + float quantized_dot = my_mm256_reduce_add_ps(sum256); + + const uint8_t *pVec1Base = static_cast(pVec1v); + const uint8_t 
*params_bytes = pVec1Base + dimension; + const float min_val = load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); + + const float16 *pVec2Base = static_cast(pVec2v); + const auto *query_meta_bytes = reinterpret_cast(pVec2Base + dimension); + const float y_sum = + load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template // 0..15 +float SQ8_FP16_InnerProductSIMD16_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) { + return 1.0f - SQ8_FP16_InnerProductImp_AVX2(pVec1v, pVec2v, dimension); +} + +template // 0..15 +float SQ8_FP16_CosineSIMD16_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) { + return SQ8_FP16_InnerProductSIMD16_AVX2(pVec1v, pVec2v, dimension); +} diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index 1af5d2c35..68308d0b0 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -203,6 +203,15 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, } #endif #endif +#ifdef OPT_AVX2 +#ifdef OPT_F16C + if (features.avx2 && features.f16c) { + if (dim % 8 == 0) + *alignment = 8 * sizeof(uint8_t); + return Choose_SQ8_FP16_IP_implementation_AVX2(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } @@ -238,6 +247,15 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm } #endif #endif +#ifdef OPT_AVX2 +#ifdef OPT_F16C + if (features.avx2 && features.f16c) { + if (dim % 8 == 0) + *alignment = 8 * sizeof(uint8_t); + return Choose_SQ8_FP16_Cosine_implementation_AVX2(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } diff --git a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h new file mode 100644 index 000000000..86ec4b66e --- /dev/null +++ b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h @@ -0,0 +1,35 @@ +/* + * 
Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/AVX_utils.h" +#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +template // 0..15 +float SQ8_FP16_L2SqrSIMD16_AVX2(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = SQ8_FP16_InnerProductImp_AVX2(pVect1v, pVect2v, dimension); + + const uint8_t *pVect1 = static_cast(pVect1v); + const uint8_t *params_bytes = pVect1 + dimension; + const float x_sum_sq = + load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + + const float16 *pVect2 = static_cast(pVect2v); + const auto *query_meta_bytes = reinterpret_cast(pVect2 + dimension); + const float y_sum_sq = + load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index 79f6c21ae..2b6a31166 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -135,6 +135,15 @@ dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, } #endif #endif +#ifdef OPT_AVX2 +#ifdef OPT_F16C + if (features.avx2 && features.f16c) { + if (dim % 8 == 0) + *alignment = 8 * sizeof(uint8_t); + return Choose_SQ8_FP16_L2_implementation_AVX2(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } diff --git a/src/VecSim/spaces/functions/AVX2.cpp b/src/VecSim/spaces/functions/AVX2.cpp index 322ed0aec..7e229b003 100644 --- a/src/VecSim/spaces/functions/AVX2.cpp +++ 
b/src/VecSim/spaces/functions/AVX2.cpp @@ -13,6 +13,11 @@ #include "VecSim/spaces/IP/IP_AVX2_SQ8_FP32.h" #include "VecSim/spaces/L2/L2_AVX2_SQ8_FP32.h" +#ifdef OPT_F16C +#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h" +#endif + namespace spaces { #include "implementation_chooser.h" @@ -47,6 +52,24 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim) { return ret_dist_func; } +#ifdef OPT_F16C +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2); + return ret_dist_func; +} +#endif + #include "implementation_chooser_cleanup.h" } // namespace spaces diff --git a/src/VecSim/spaces/functions/AVX2.h b/src/VecSim/spaces/functions/AVX2.h index 081c42a4e..45fa2c951 100644 --- a/src/VecSim/spaces/functions/AVX2.h +++ b/src/VecSim/spaces/functions/AVX2.h @@ -19,4 +19,10 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim); dist_func_t Choose_BF16_IP_implementation_AVX2(size_t dim); dist_func_t Choose_BF16_L2_implementation_AVX2(size_t dim); +#ifdef OPT_F16C +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim); +#endif + } // namespace spaces diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 32f5e6991..968294eac 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3118,6 +3118,19 @@ 
TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { optimization.fma3 = 0; } #endif +#endif +#ifdef OPT_AVX2 +#ifdef OPT_F16C + if (optimization.avx2 && optimization.f16c) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX2(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX2 with dim " << dim; + optimization.avx2 = 0; + } +#endif #endif // Scalar fallback. @@ -3171,6 +3184,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { optimization.fma3 = 0; } #endif +#endif +#ifdef OPT_AVX2 +#ifdef OPT_F16C + if (optimization.avx2 && optimization.f16c) { + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX2(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX2 with dim " << dim; + optimization.avx2 = 0; + } +#endif #endif unsigned char alignment = 0; @@ -3223,6 +3249,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { optimization.fma3 = 0; } #endif +#endif +#ifdef OPT_AVX2 +#ifdef OPT_F16C + if (optimization.avx2 && optimization.f16c) { + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX2(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "AVX2 with dim " << dim; + optimization.avx2 = 0; + } +#endif #endif unsigned char alignment = 0; From 25c5a96d6cfe1deac8cd3275633f247bb0c39e52 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 
2026 11:41:10 +0300 Subject: [PATCH 07/24] =?UTF-8?q?Add=20SSE4+F16C=20SQ8=E2=86=94FP16=20SIMD?= =?UTF-8?q?=20distance=20kernels=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires features.sse4_1 && features.f16c && features.avx since F16C is VEX-encoded — matches the existing F16C/FP16 dispatcher gate. Co-Authored-By: Claude Opus 4.7 (1M context) --- src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h | 109 ++++++++++++++++++++++++ src/VecSim/spaces/IP_space.cpp | 19 +++++ src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h | 35 ++++++++ src/VecSim/spaces/L2_space.cpp | 9 ++ src/VecSim/spaces/functions/SSE4.cpp | 23 +++++ src/VecSim/spaces/functions/SSE4.h | 6 ++ tests/unit/test_spaces.cpp | 39 +++++++++ 7 files changed, 240 insertions(+) create mode 100644 src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h new file mode 100644 index 000000000..8fd0e56c1 --- /dev/null +++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h @@ -0,0 +1,109 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). 
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+// 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum.
+static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1,
+                                                  const float16 *&pVect2,
+                                                  __m128 &sum) {
+    __m128i v1_i =
+        _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*reinterpret_cast<const int32_t *>(pVect1)));
+    pVect1 += 4;
+    __m128 v1_f = _mm_cvtepi32_ps(v1_i);
+
+    __m128i v2_8 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect2));
+    __m128 v2_f = _mm_cvtph_ps(v2_8);
+    pVect2 += 4;
+
+    sum = _mm_add_ps(sum, _mm_mul_ps(v1_f, v2_f));
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2v,
+                                           size_t dimension) {
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
+    const uint8_t *pEnd1 = pVec1 + dimension;
+
+    __m128 sum = _mm_setzero_ps();
+
+    if constexpr (residual % 4) {
+        __m128 v1_f;
+        __m128 v2_f;
+
+        if constexpr (residual % 4 == 3) {
+            v1_f = _mm_set_ps(0.0f, static_cast<float>(pVec1[2]),
+                              static_cast<float>(pVec1[1]),
+                              static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, vecsim_types::FP16_to_FP32(pVec2[2]),
+                              vecsim_types::FP16_to_FP32(pVec2[1]),
+                              vecsim_types::FP16_to_FP32(pVec2[0]));
+        } else if constexpr (residual % 4 == 2) {
+            v1_f = _mm_set_ps(0.0f, 0.0f, static_cast<float>(pVec1[1]),
+                              static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, 0.0f, vecsim_types::FP16_to_FP32(pVec2[1]),
+                              vecsim_types::FP16_to_FP32(pVec2[0]));
+        } else if constexpr (residual % 4 == 1) {
+            v1_f = _mm_set_ps(0.0f, 0.0f, 0.0f, static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, 0.0f, 0.0f, vecsim_types::FP16_to_FP32(pVec2[0]));
+        }
+
+        pVec1 += residual % 4;
+        pVec2 += residual % 4;
+
+        sum = _mm_mul_ps(v1_f, v2_f);
+    }
+
+    if constexpr (residual >= 4) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+    }
+    if constexpr (residual >= 8) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+    }
+    if constexpr (residual >= 12) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+    }
+
+    do {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum);
+    } while (pVec1 < pEnd1);
+
+    float PORTABLE_ALIGN16 TmpRes[4];
+    _mm_store_ps(TmpRes, sum);
+    float quantized_dot = TmpRes[0] + TmpRes[1] + TmpRes[2] + TmpRes[3];
+
+    const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
+    const uint8_t *params_bytes = pVec1Base + dimension;
+    const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float));
+    const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float));
+
+    const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
+    const float y_sum =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+
+    return min_val * y_sum + delta * quantized_dot;
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_SSE4(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return 1.0f - SQ8_FP16_InnerProductSIMD16_SSE4_IMP<residual>(pVec1v, pVec2v, dimension);
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_CosineSIMD16_SSE4(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return SQ8_FP16_InnerProductSIMD16_SSE4<residual>(pVec1v, pVec2v, dimension);
+}
diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index 68308d0b0..37ffc9ed4 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -212,6 +212,16 @@ dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
     }
 #endif
 #endif
+#ifdef OPT_SSE4
+#ifdef OPT_F16C
+    // F16C is VEX-encoded — require AVX as well, matching the existing F16C/FP16 dispatcher.
+ if (features.sse4_1 && features.f16c && features.avx) { + if (dim % 4 == 0) + *alignment = 4 * sizeof(uint8_t); + return Choose_SQ8_FP16_IP_implementation_SSE4(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } @@ -256,6 +266,15 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm } #endif #endif +#ifdef OPT_SSE4 +#ifdef OPT_F16C + if (features.sse4_1 && features.f16c && features.avx) { + if (dim % 4 == 0) + *alignment = 4 * sizeof(uint8_t); + return Choose_SQ8_FP16_Cosine_implementation_SSE4(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } diff --git a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h new file mode 100644 index 000000000..b43492858 --- /dev/null +++ b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h @@ -0,0 +1,35 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). 
+ */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +template // 0..15 +float SQ8_FP16_L2SqrSIMD16_SSE4(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = + SQ8_FP16_InnerProductSIMD16_SSE4_IMP(pVect1v, pVect2v, dimension); + + const uint8_t *pVect1 = static_cast(pVect1v); + const uint8_t *params_bytes = pVect1 + dimension; + const float x_sum_sq = + load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + + const float16 *pVect2 = static_cast(pVect2v); + const auto *query_meta_bytes = reinterpret_cast(pVect2 + dimension); + const float y_sum_sq = + load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index 2b6a31166..ab5188800 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -144,6 +144,15 @@ dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, } #endif #endif +#ifdef OPT_SSE4 +#ifdef OPT_F16C + if (features.sse4_1 && features.f16c && features.avx) { + if (dim % 4 == 0) + *alignment = 4 * sizeof(uint8_t); + return Choose_SQ8_FP16_L2_implementation_SSE4(dim); + } +#endif +#endif #endif // x86_64 return ret_dist_func; } diff --git a/src/VecSim/spaces/functions/SSE4.cpp b/src/VecSim/spaces/functions/SSE4.cpp index 5f5bbc1ba..e41762955 100644 --- a/src/VecSim/spaces/functions/SSE4.cpp +++ b/src/VecSim/spaces/functions/SSE4.cpp @@ -10,6 +10,11 @@ #include "VecSim/spaces/IP/IP_SSE4_SQ8_FP32.h" #include "VecSim/spaces/L2/L2_SSE4_SQ8_FP32.h" +#ifdef OPT_F16C +#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h" +#endif + namespace spaces { #include 
"implementation_chooser.h" @@ -32,6 +37,24 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim) { return ret_dist_func; } +#ifdef OPT_F16C +dist_func_t Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_SSE4); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_SSE4); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_SSE4); + return ret_dist_func; +} +#endif + #include "implementation_chooser_cleanup.h" } // namespace spaces diff --git a/src/VecSim/spaces/functions/SSE4.h b/src/VecSim/spaces/functions/SSE4.h index e47948137..c33187983 100644 --- a/src/VecSim/spaces/functions/SSE4.h +++ b/src/VecSim/spaces/functions/SSE4.h @@ -16,4 +16,10 @@ dist_func_t Choose_SQ8_FP32_IP_implementation_SSE4(size_t dim); dist_func_t Choose_SQ8_FP32_Cosine_implementation_SSE4(size_t dim); dist_func_t Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim); +#ifdef OPT_F16C +dist_func_t Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim); +#endif + } // namespace spaces diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 968294eac..61d3ce4af 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3131,6 +3131,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { optimization.avx2 = 0; } #endif +#endif +#ifdef OPT_SSE4 +#ifdef OPT_F16C + if (optimization.sse4_1 && optimization.f16c && optimization.avx) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, 
&optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SSE4(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SSE4 with dim " << dim; + optimization.sse4_1 = 0; + } +#endif #endif // Scalar fallback. @@ -3197,6 +3210,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { optimization.avx2 = 0; } #endif +#endif +#ifdef OPT_SSE4 +#ifdef OPT_F16C + if (optimization.sse4_1 && optimization.f16c && optimization.avx) { + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_SSE4(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SSE4 with dim " << dim; + optimization.sse4_1 = 0; + } +#endif #endif unsigned char alignment = 0; @@ -3262,6 +3288,19 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { optimization.avx2 = 0; } #endif +#endif +#ifdef OPT_SSE4 +#ifdef OPT_F16C + if (optimization.sse4_1 && optimization.f16c && optimization.avx) { + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_SSE4(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SSE4 with dim " << dim; + optimization.sse4_1 = 0; + } +#endif #endif unsigned char alignment = 0; From 4b7f3eb537c2c2c9e18aca79230d660b83a27639 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 11:42:52 +0300 Subject: [PATCH 08/24] Update SQ8_FP16 dispatcher assertions to walk SIMD tiers [MOD-14954] The SQ8_FP16 GetDistFunc dispatcher now returns AVX-512 / AVX2+FMA / AVX2 / SSE4 SIMD kernels when the 
corresponding feature flags are set (only scalar previously). Updates the GetDistFunc_*_SQ8_FP16 asserts to compute the expected function for the host's highest supported tier. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/unit/test_spaces.cpp | 54 +++++++++++++++++++++++++++++++++----- 1 file changed, 48 insertions(+), 6 deletions(-) diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 61d3ce4af..53c3a011b 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -560,19 +560,61 @@ TEST_F(SpacesTest, GetDistFuncSQ8Asymmetric) { } TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) { - // SQ8 storage with FP16 query (asymmetric) - should return scalar SQ8_FP16 functions. - // SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now the - // dispatcher returns the scalar implementations regardless of dim or arch. + // SQ8 storage with FP16 query (asymmetric). The dispatcher now returns the highest SIMD + // tier available at runtime; assert that and fall back to scalar only if no tier matches. 
size_t dim = 128; auto l2_func = spaces::GetDistFunc(VecSimMetric_L2, dim, nullptr); auto ip_func = spaces::GetDistFunc(VecSimMetric_IP, dim, nullptr); auto cosine_func = spaces::GetDistFunc(VecSimMetric_Cosine, dim, nullptr); + + auto optimization = getCpuOptimizationFeatures(); + dist_func_t expected_l2 = SQ8_FP16_L2Sqr; + dist_func_t expected_ip = SQ8_FP16_InnerProduct; + dist_func_t expected_cos = SQ8_FP16_Cosine; + +#ifdef CPU_FEATURES_ARCH_X86_64 + if (dim >= 16) { +#ifdef OPT_AVX512_F_BW_VL_VNNI + if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && + optimization.avx512vnni) { + expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim); + expected_ip = Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim); + expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim); + } else +#endif +#if defined(OPT_AVX2_FMA) && defined(OPT_F16C) + if (optimization.avx2 && optimization.fma3 && optimization.f16c) { + expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim); + expected_ip = Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim); + expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim); + } else +#endif +#if defined(OPT_AVX2) && defined(OPT_F16C) + if (optimization.avx2 && optimization.f16c) { + expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX2(dim); + expected_ip = Choose_SQ8_FP16_IP_implementation_AVX2(dim); + expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX2(dim); + } else +#endif +#if defined(OPT_SSE4) && defined(OPT_F16C) + if (optimization.sse4_1 && optimization.f16c && optimization.avx) { + expected_l2 = Choose_SQ8_FP16_L2_implementation_SSE4(dim); + expected_ip = Choose_SQ8_FP16_IP_implementation_SSE4(dim); + expected_cos = Choose_SQ8_FP16_Cosine_implementation_SSE4(dim); + } else +#endif + { + // Falls through to scalar. 
+ } + } +#endif // x86_64 + ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr)); ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr)); ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr)); - ASSERT_EQ(l2_func, SQ8_FP16_L2Sqr); - ASSERT_EQ(ip_func, SQ8_FP16_InnerProduct); - ASSERT_EQ(cosine_func, SQ8_FP16_Cosine); + ASSERT_EQ(l2_func, expected_l2); + ASSERT_EQ(ip_func, expected_ip); + ASSERT_EQ(cosine_func, expected_cos); } #ifdef CPU_FEATURES_ARCH_X86_64 From e21cb3b1566bc418870090b66ff823d5a4e68885 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 11:44:10 +0300 Subject: [PATCH 09/24] =?UTF-8?q?Register=20per-ISA=20SQ8=E2=86=94FP16=20m?= =?UTF-8?q?icrobenchmarks=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier on the corresponding OPT_* defines plus the runtime feature checks that mirror the dispatcher in IP_space.cpp / L2_space.cpp. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 38 ++++++++++++++++++- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp index 2133a047e..75ede0eb8 100644 --- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp +++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp @@ -50,8 +50,42 @@ class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture { } }; -// Naive (scalar) algorithms. SIMD chooser slots will be added by P1b (MOD-15152) and -// P1c (MOD-15153), following the SQ8_FP32 layout in bm_spaces_sq8_fp32.cpp. +#ifdef CPU_FEATURES_ARCH_X86_64 +cpu_features::X86Features opt = cpu_features::GetX86Info().features; + +// AVX-512 F+BW+VL+VNNI (no F16C requirement — _mm512_cvtph_ps is part of AVX512F). 
+#ifdef OPT_AVX512_F_BW_VL_VNNI +bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, + avx512_f_bw_vl_vnni_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, + avx512_f_bw_vl_vnni_supported); +#endif + +#ifdef OPT_F16C +#ifdef OPT_AVX2_FMA +bool avx2_fma3_f16c_supported = opt.avx2 && opt.fma3 && opt.f16c; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, + avx2_fma3_f16c_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, + avx2_fma3_f16c_supported); +#endif + +#ifdef OPT_AVX2 +bool avx2_f16c_supported = opt.avx2 && opt.f16c; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported); +#endif + +#ifdef OPT_SSE4 +bool sse4_f16c_supported = opt.sse4_1 && opt.f16c && opt.avx; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported); +#endif +#endif // OPT_F16C +#endif // x86_64 + +// Naive (scalar) baseline — always registered as the comparison anchor. 
INITIALIZE_NAIVE_BM(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, InnerProduct, 16); INITIALIZE_NAIVE_BM(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, Cosine, 16); From 4c8828e8d7d76b6b388288d4b10805c1488e6b9b Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 26 May 2026 14:07:31 +0300 Subject: [PATCH 10/24] =?UTF-8?q?Reformat=20SQ8=E2=86=94FP16=20SIMD=20kern?= =?UTF-8?q?els=20for=20consistent=20line=20breaks?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h | 6 ++---- src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h | 6 ++---- .../spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h | 6 ++---- src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h | 16 ++++++---------- src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h | 3 +-- src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h | 3 +-- .../spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h | 3 +-- src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h | 6 ++---- 8 files changed, 17 insertions(+), 32 deletions(-) diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h index 1d6b4e676..130fe4eb0 100644 --- a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h @@ -18,8 +18,7 @@ using float16 = vecsim_types::float16; // 8-wide AVX2+FMA step: 8 SQ8 lanes + 8 FP16 lanes -> 8 FP32 fused-multiply-add. 
static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1, - const float16 *&pVect2, - __m256 &sum256) { + const float16 *&pVect2, __m256 &sum256) { __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast(pVect1)); pVect1 += 8; __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); @@ -77,8 +76,7 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, const float16 *pVec2Base = static_cast(pVec2v); const auto *query_meta_bytes = reinterpret_cast(pVec2Base + dimension); - const float y_sum = - load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + const float y_sum = load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); return min_val * y_sum + delta * quantized_dot; } diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h index e68e5fa11..1e29fe63d 100644 --- a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h @@ -17,8 +17,7 @@ using sq8 = vecsim_types::sq8; using float16 = vecsim_types::float16; // 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum. 
-static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, - const float16 *&pVect2, +static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const float16 *&pVect2, __m256 &sum256) { __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast(pVect1)); pVect1 += 8; @@ -74,8 +73,7 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size const float16 *pVec2Base = static_cast(pVec2v); const auto *query_meta_bytes = reinterpret_cast(pVec2Base + dimension); - const float y_sum = - load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + const float y_sum = load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); return min_val * y_sum + delta * quantized_dot; } diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h index 55d63d711..62532c56c 100644 --- a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h @@ -17,8 +17,7 @@ using sq8 = vecsim_types::sq8; using float16 = vecsim_types::float16; // Helper: load 16 SQ8 + 16 FP16 lanes, widen both to FP32, fused-multiply-add into sum. -static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, - const float16 *&pVec2, +static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const float16 *&pVec2, __m512 &sum) { // 16 uint8 -> 16 fp32 __m128i v1_128 = _mm_loadu_si128(reinterpret_cast(pVec1)); @@ -81,8 +80,7 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si // 2-byte aligned only. 
const float16 *pVec2Base = static_cast(pVec2v); const auto *query_meta_bytes = reinterpret_cast(pVec2Base + dimension); - const float y_sum = - load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + const float y_sum = load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); return min_val * y_sum + delta * quantized_dot; } diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h index 8fd0e56c1..43b61fd25 100644 --- a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h @@ -16,11 +16,9 @@ using sq8 = vecsim_types::sq8; using float16 = vecsim_types::float16; // 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum. -static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, - const float16 *&pVect2, +static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const float16 *&pVect2, __m128 &sum) { - __m128i v1_i = - _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*reinterpret_cast(pVect1))); + __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*reinterpret_cast(pVect1))); pVect1 += 4; __m128 v1_f = _mm_cvtepi32_ps(v1_i); @@ -45,15 +43,14 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2 __m128 v2_f; if constexpr (residual % 4 == 3) { - v1_f = _mm_set_ps(0.0f, static_cast(pVec1[2]), - static_cast(pVec1[1]), + v1_f = _mm_set_ps(0.0f, static_cast(pVec1[2]), static_cast(pVec1[1]), static_cast(pVec1[0])); v2_f = _mm_set_ps(0.0f, vecsim_types::FP16_to_FP32(pVec2[2]), vecsim_types::FP16_to_FP32(pVec2[1]), vecsim_types::FP16_to_FP32(pVec2[0])); } else if constexpr (residual % 4 == 2) { - v1_f = _mm_set_ps(0.0f, 0.0f, static_cast(pVec1[1]), - static_cast(pVec1[0])); + v1_f = + _mm_set_ps(0.0f, 0.0f, static_cast(pVec1[1]), static_cast(pVec1[0])); v2_f = _mm_set_ps(0.0f, 0.0f, vecsim_types::FP16_to_FP32(pVec2[1]), vecsim_types::FP16_to_FP32(pVec2[0])); } else if constexpr (residual % 4 == 1) { @@ -92,8 +89,7 @@ 
float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2 const float16 *pVec2Base = static_cast(pVec2v); const auto *query_meta_bytes = reinterpret_cast(pVec2Base + dimension); - const float y_sum = - load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + const float y_sum = load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); return min_val * y_sum + delta * quantized_dot; } diff --git a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h index 5f9ad0db6..38809e9c2 100644 --- a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h @@ -23,8 +23,7 @@ float SQ8_FP16_L2SqrSIMD16_AVX2_FMA(const void *pVect1v, const void *pVect2v, si const uint8_t *pVect1 = static_cast(pVect1v); const uint8_t *params_bytes = pVect1 + dimension; - const float x_sum_sq = - load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const float x_sum_sq = load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); const float16 *pVect2 = static_cast(pVect2v); const auto *query_meta_bytes = reinterpret_cast(pVect2 + dimension); diff --git a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h index 86ec4b66e..98bb29c05 100644 --- a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h @@ -23,8 +23,7 @@ float SQ8_FP16_L2SqrSIMD16_AVX2(const void *pVect1v, const void *pVect2v, size_t const uint8_t *pVect1 = static_cast(pVect1v); const uint8_t *params_bytes = pVect1 + dimension; - const float x_sum_sq = - load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const float x_sum_sq = load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); const float16 *pVect2 = static_cast(pVect2v); const auto *query_meta_bytes = reinterpret_cast(pVect2 + dimension); diff --git a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h 
index 101bf285e..635f30904 100644 --- a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h @@ -24,8 +24,7 @@ float SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI(const void *pVect1v, const void *p const uint8_t *pVect1 = static_cast(pVect1v); const uint8_t *params_bytes = pVect1 + dimension; - const float x_sum_sq = - load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const float x_sum_sq = load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); const float16 *pVect2 = static_cast(pVect2v); const auto *query_meta_bytes = reinterpret_cast(pVect2 + dimension); diff --git a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h index b43492858..75bbd46f8 100644 --- a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h @@ -18,13 +18,11 @@ using float16 = vecsim_types::float16; template // 0..15 float SQ8_FP16_L2SqrSIMD16_SSE4(const void *pVect1v, const void *pVect2v, size_t dimension) { - const float ip = - SQ8_FP16_InnerProductSIMD16_SSE4_IMP(pVect1v, pVect2v, dimension); + const float ip = SQ8_FP16_InnerProductSIMD16_SSE4_IMP(pVect1v, pVect2v, dimension); const uint8_t *pVect1 = static_cast(pVect1v); const uint8_t *params_bytes = pVect1 + dimension; - const float x_sum_sq = - load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const float x_sum_sq = load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); const float16 *pVect2 = static_cast(pVect2v); const auto *query_meta_bytes = reinterpret_cast(pVect2 + dimension); From fdc5c1cd04603cd6d3d007b48b52126347736cf0 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 10:56:09 +0300 Subject: [PATCH 11/24] =?UTF-8?q?Address=20PR=20review=20findings=20for=20?= =?UTF-8?q?SQ8=E2=86=94FP16=20x86=20kernels=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - CMake: gate `-mf16c` on CXX_F16C AND 
  CXX_FMA AND CXX_AVX (matches OPT_F16C macro) and append `-mavx` to the
  SSE4 dispatcher when adding `-mf16c`, since F16C is VEX-encoded and
  requires AVX state. Mirrors the existing F16C.cpp recipe and prevents
  miscompiles on toolchains with F16C but without AVX.
- IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast(pVect1)` with
  `load_unaligned(pVect1)` to remove strict-aliasing UB on the uint8_t
  SQ8 lane load.
- IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell
  out the asymmetric-mask reasoning (SQ8 unmasked is safe because the
  FP16 query blend forces those FP32 query lanes to 0 → garbage·0=0).
- IP_{AVX512F_BW_VL_VNNI,AVX2,AVX2_FMA,SSE4}_SQ8_FP16.h: add the
  `IP = min·y_sum + delta·Σ(q·y)` algebraic-identity comment header that
  AVX-512 already carried, plus a precondition note that callers must
  enforce dim >= 16 (matches the established SQ8_FP32 convention; no
  runtime assert because sibling SQ8_FP32 SIMD kernels also rely on the
  dispatcher gate).
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery,
  ConstantStorage, MixedSignQuery) through
  {IP,Cosine,L2}_SQ8_FP16_GetDistFunc so the runtime-selected SIMD tier
  is actually exercised on those inputs, not just the scalar reference.
- test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims
  {64, 128, 256, 512, 1024} so multi-iteration do-while loop bugs would
  fire (the existing [16, 32] range covers at most two AVX-512 chunk
  iterations).
- test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised,
  a single test that emits per-tier coverage to stderr and GTEST_SKIPs
  when no SIMD tier is available, so CI runners without AVX-512 do not
  silently report zero tier-1 coverage.
- test_spaces.cpp: scalar-fallback `alignment` checks now seed the value
  with 0xFF and assert it remains 0xFF, verifying the dispatcher contract
  ("scalar leaves caller's value untouched") instead of just measuring
  that the variable's pre-zeroed init survived.
- test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired. - bm_spaces_sq8_fp16.cpp: drop the matching stale comment. Out of scope (separate ticket): two-accumulator FMA refactor (also affects SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity. Co-Authored-By: Claude Opus 4.7 (1M context) --- src/VecSim/spaces/CMakeLists.txt | 19 ++- src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h | 19 ++- src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h | 18 +++ .../IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h | 3 + src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h | 16 ++- .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 5 +- tests/unit/test_spaces.cpp | 128 ++++++++++++++---- 7 files changed, 173 insertions(+), 35 deletions(-) diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt index 8babf844b..a580916d2 100644 --- a/src/VecSim/spaces/CMakeLists.txt +++ b/src/VecSim/spaces/CMakeLists.txt @@ -50,9 +50,16 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") list(APPEND OPTIMIZATIONS functions/AVX512F_BW_VL_VNNI.cpp) endif() + # F16C is VEX-encoded and requires AVX state, so it is only meaningful when the toolchain + # can also emit AVX/FMA. Mirrors the OPT_F16C macro condition in x86_64InstructionFlags.cmake. 
+ set(_has_full_f16c FALSE) + if(CXX_F16C AND CXX_FMA AND CXX_AVX) + set(_has_full_f16c TRUE) + endif() + if(CXX_AVX2) set(_avx2_flags "-mavx2") - if(CXX_F16C) + if(_has_full_f16c) message("Building functions/AVX2.cpp with AVX2 and F16C") set(_avx2_flags "${_avx2_flags} -mf16c") else() @@ -64,7 +71,7 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") if(CXX_AVX2 AND CXX_FMA) set(_avx2_fma_flags "-mavx2 -mfma") - if(CXX_F16C) + if(_has_full_f16c) message("Building functions/AVX2_FMA.cpp with AVX2, FMA, and F16C") set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c") else() @@ -94,9 +101,11 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") if(CXX_SSE4) set(_sse4_flags "-msse4.1") - if(CXX_F16C) - message("Building functions/SSE4.cpp with SSE4.1 and F16C") - set(_sse4_flags "${_sse4_flags} -mf16c") + if(_has_full_f16c) + # F16C is VEX-encoded → must compile with -mavx alongside -mf16c, matching the + # F16C.cpp recipe above. + message("Building functions/SSE4.cpp with SSE4.1, AVX, and F16C") + set(_sse4_flags "${_sse4_flags} -mavx -mf16c") else() message("Building functions/SSE4.cpp with SSE4.1 (no F16C)") endif() diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h index 130fe4eb0..eda8b393e 100644 --- a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h @@ -16,6 +16,17 @@ using sq8 = vecsim_types::sq8; using float16 = vecsim_types::float16; +/* + * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity: + * IP(x, y) = Σ(x_i * y_i) + * ≈ Σ((min + delta * q_i) * y_i) + * = min * Σy_i + delta * Σ(q_i * y_i) + * = min * y_sum + delta * quantized_dot_product + * + * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C); + * inner-loop arithmetic runs in FP32 with _mm256_fmadd_ps. + */ + // 8-wide AVX2+FMA step: 8 SQ8 lanes + 8 FP16 lanes -> 8 FP32 fused-multiply-add. 
static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1, const float16 *&pVect2, __m256 &sum256) { @@ -31,6 +42,9 @@ static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1, sum256 = _mm256_fmadd_ps(v1_f, v2_f, sum256); } +// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp. +// The residual block reads 8 SQ8 bytes and 16 FP16 bytes unconditionally; shorter blobs would +// under-read. template // 0..15 float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) { const uint8_t *pVec1 = static_cast(pVec1v); @@ -42,8 +56,9 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, if constexpr (residual % 8) { constexpr int mask = (1 << (residual % 8)) - 1; - // SQ8 side: load 8 bytes regardless of residual; unused lanes are zeroed by the blend on - // the FP32 query. + // Single-side mask is sufficient: SQ8 lanes beyond `residual` may hold garbage, but the + // FP16 query blend below forces those FP32 query lanes to 0, so garbage·0=0 contributes + // nothing to the dot product. SQ8 load is intentionally unmasked. 
__m128i v1_128 = _mm_loadl_epi64(reinterpret_cast(pVec1)); pVec1 += residual % 8; __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h index 1e29fe63d..028d7d3e0 100644 --- a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h @@ -16,6 +16,18 @@ using sq8 = vecsim_types::sq8; using float16 = vecsim_types::float16; +/* + * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity: + * IP(x, y) = Σ(x_i * y_i) + * ≈ Σ((min + delta * q_i) * y_i) + * = min * Σy_i + delta * Σ(q_i * y_i) + * = min * y_sum + delta * quantized_dot_product + * + * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C); + * inner-loop arithmetic runs in FP32 with separate _mm256_mul_ps + _mm256_add_ps + * (no-FMA tier: for hosts that expose AVX2 but not FMA). + */ + // 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum. static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const float16 *&pVect2, __m256 &sum256) { @@ -31,6 +43,9 @@ static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const sum256 = _mm256_add_ps(sum256, _mm256_mul_ps(v1_f, v2_f)); } +// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp. +// The residual block reads 8 SQ8 bytes and 16 FP16 bytes unconditionally; shorter blobs would +// under-read.
template // 0..15 float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) { const uint8_t *pVec1 = static_cast(pVec1v); @@ -42,6 +57,9 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size if constexpr (residual % 8) { constexpr int mask = (1 << (residual % 8)) - 1; + // Single-side mask is sufficient: SQ8 lanes beyond `residual` may hold garbage, but the + // FP16 query blend below forces those FP32 query lanes to 0, so garbage·0=0 contributes + // nothing to the dot product. SQ8 load is intentionally unmasked. __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast(pVec1)); pVec1 += residual % 8; __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h index 62532c56c..07f5d3456 100644 --- a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h @@ -36,6 +36,9 @@ static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const // Raw inner product Σ((min + delta * q_i) * y_i). Used by both InnerProduct/Cosine wrappers // and by the L2 kernel. +// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp, which gates +// this. The residual block reads 16 SQ8 bytes and 32 FP16 bytes unconditionally; shorter blobs +// would under-read. 
template // 0..15 float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, size_t dimension) { const uint8_t *pVec1 = static_cast(pVec1v); // SQ8 storage diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h index 43b61fd25..e5ca51860 100644 --- a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h @@ -15,10 +15,22 @@ using sq8 = vecsim_types::sq8; using float16 = vecsim_types::float16; +/* + * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity: + * IP(x, y) = Σ(x_i * y_i) + * ≈ Σ((min + delta * q_i) * y_i) + * = min * Σy_i + delta * Σ(q_i * y_i) + * = min * y_sum + delta * quantized_dot_product + * + * FP16 query lanes are widened to FP32 per 4-lane chunk via _mm_cvtph_ps (F16C); + * inner-loop arithmetic runs in FP32 with separate _mm_mul_ps + _mm_add_ps (SSE4 has no FMA). + */ + // 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum. static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const float16 *&pVect2, __m128 &sum) { - __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*reinterpret_cast(pVect1))); + // Alignment-safe 4-byte load of SQ8 lanes via load_unaligned (no strict-aliasing UB). + __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(load_unaligned(pVect1))); pVect1 += 4; __m128 v1_f = _mm_cvtepi32_ps(v1_i); @@ -29,6 +41,8 @@ static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const sum = _mm_add_ps(sum, _mm_mul_ps(v1_f, v2_f)); } +// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp. +// Shorter blobs would underflow the residual ladder + final do-while loop. 
template // 0..15 float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2v, size_t dimension) { diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp index 75ede0eb8..f81a9d89d 100644 --- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp +++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp @@ -15,8 +15,9 @@ using float16 = vecsim_types::float16; /** * SQ8-to-FP16 benchmarks: SQ8 quantized storage with FP16 query. - * Only naive (scalar) benchmarks are registered for now; SIMD chooser symbols are added - * by P1b (MOD-15152, x86) and P1c (MOD-15153, ARM). + * Registers the naive (scalar) baseline plus per-ISA SIMD variants (x86: AVX-512 / AVX2+FMA / + * AVX2 / SSE4 — gated on the matching OPT_* defines and runtime CPU features). ARM kernels + * land via MOD-14972. */ class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture { protected: diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 53c3a011b..2cccd1183 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3042,8 +3042,9 @@ TEST(SQ8_FP32_EdgeCases, CosineExtremeValuesTest) { // Parameterized tests that verify the scalar SQ8_FP16 kernels against the not-optimized // baseline across multiple dimensions, including odd dimensions and SIMD-boundary residues. -// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); the dispatcher always -// returns the scalar implementation for now. +// The SIMD-tier dispatcher coverage lives in SQ8_FP16_SpacesOptimizationTest below; this +// suite intentionally exercises the scalar reference directly to keep it as a fixed baseline +// the SIMD tiers are compared against. 
class SQ8_FP16_NoOptimizationSpacesTest : public testing::TestWithParam {}; TEST_P(SQ8_FP16_NoOptimizationSpacesTest, SQ8_FP16_L2SqrTest) { @@ -3188,14 +3189,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { #endif #endif - // Scalar fallback. - unsigned char alignment = 0; + // Scalar fallback. Init alignment to a sentinel (0xFF) so the assert below actually verifies + // that the dispatcher LEAVES THE VALUE UNTOUCHED on the scalar path — initialising to 0 then + // asserting `== 0` would pass even if the dispatcher were a no-op. + unsigned char alignment = 0xFF; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, SQ8_FP16_L2Sqr) << "Unexpected scalar fallback function for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "Scalar fallback with dim " << dim; - ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim; + ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched " + "(dim " + << dim << ")"; } TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { @@ -3267,13 +3272,16 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { #endif #endif - unsigned char alignment = 0; + // Scalar fallback — see L2 test for the 0xFF sentinel rationale. 
+ unsigned char alignment = 0xFF; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, SQ8_FP16_InnerProduct) << "Unexpected scalar fallback function for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "Scalar fallback with dim " << dim; - ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim; + ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched " + "(dim " + << dim << ")"; } TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { @@ -3345,22 +3353,80 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { #endif #endif - unsigned char alignment = 0; + // Scalar fallback — see L2 test for the 0xFF sentinel rationale. + unsigned char alignment = 0xFF; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, SQ8_FP16_Cosine) << "Unexpected scalar fallback function for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "Scalar fallback with dim " << dim; - ASSERT_EQ(alignment, 0) << "Scalar fallback should set no alignment hint for dim " << dim; + ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched " + "(dim " + << dim << ")"; } +// Dim range [16, 32] covers every residual class for the 16-element chunk used by every tier. INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest, testing::Range(16UL, 16 * 2UL + 1)); +// Higher dimensions surface multi-iteration loop bugs (pointer stride, do-while termination +// off-by-one) that the [16, 32] range does not exercise because the AVX-512 inner loop runs at +// most twice in that range. 
+INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD_HighDim, SQ8_FP16_SpacesOptimizationTest, + testing::Values(64UL, 128UL, 256UL, 512UL, 1024UL)); + +// Surfaces which SIMD tiers were actually exercised on the current host. Without this, a CI +// runner that lacks AVX-512 silently passes with zero tier-1 coverage. Logs per-tier presence +// to stderr and GTEST_SKIPs only when no SIMD tier is available at all. +TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) { + auto opt = getCpuOptimizationFeatures(); + bool any_simd = false; + +#ifdef CPU_FEATURES_ARCH_X86_64 +#ifdef OPT_AVX512_F_BW_VL_VNNI + if (opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni) { + std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier exercised\n"; + any_simd = true; + } else { + std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier NOT exercised on this host\n"; + } +#endif +#if defined(OPT_AVX2_FMA) && defined(OPT_F16C) + if (opt.avx2 && opt.fma3 && opt.f16c) { + std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier exercised\n"; + any_simd = true; + } else { + std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier NOT exercised on this host\n"; + } +#endif +#if defined(OPT_AVX2) && defined(OPT_F16C) + if (opt.avx2 && opt.f16c) { + std::cerr << "[SQ8_FP16] AVX2+F16C tier exercised\n"; + any_simd = true; + } else { + std::cerr << "[SQ8_FP16] AVX2+F16C tier NOT exercised on this host\n"; + } +#endif +#if defined(OPT_SSE4) && defined(OPT_F16C) + if (opt.sse4_1 && opt.f16c && opt.avx) { + std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier exercised\n"; + any_simd = true; + } else { + std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier NOT exercised on this host\n"; + } +#endif +#endif // x86_64 + + if (!any_simd) { + GTEST_SKIP() << "No SQ8_FP16 SIMD tier available on this host — scalar fallback only."; + } +} + /* ======================== Tests SQ8_FP16 (edge cases) ========================= */ // Zero FP16 query against a non-zero SQ8 storage. IP must be exactly 1.0 (1 - 0), -// L2² must equal Σ dequantized². 
+// L2² must equal Σ dequantized². Routes through the dispatcher so the runtime-selected +// SIMD tier (AVX-512 / AVX2+FMA / AVX2 / SSE4 / scalar) is exercised, not just scalar. TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) { size_t dim = 64; @@ -3375,20 +3441,24 @@ TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) { test_utils::populate_float_vec_to_sq8_with_metadata(v_nonzero_quantized.data(), dim, false, 1234); + auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr); + auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr); + float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_nonzero_quantized.data(), v_zero_query.data(), dim); - float ip = SQ8_FP16_InnerProduct(v_nonzero_quantized.data(), v_zero_query.data(), dim); - ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Zero-query SQ8_FP16_InnerProduct mismatch"; + float ip = ip_func(v_nonzero_quantized.data(), v_zero_query.data(), dim); + ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Zero-query SQ8_FP16 IP mismatch"; ASSERT_NEAR(ip, 1.0f, 0.01f) << "Zero-query IP must equal 1.0 (1 - 0)"; float l2_baseline = test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_nonzero_quantized.data(), v_zero_query.data(), dim); - float l2 = SQ8_FP16_L2Sqr(v_nonzero_quantized.data(), v_zero_query.data(), dim); - ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Zero-query SQ8_FP16_L2Sqr mismatch"; + float l2 = l2_func(v_nonzero_quantized.data(), v_zero_query.data(), dim); + ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Zero-query SQ8_FP16 L2 mismatch"; } // Constant SQ8 storage (all values identical => delta = 0). Storage quantizer sets delta to 1.0 -// to avoid div-by-zero, so verify the kernels still match the dequantization baseline. +// to avoid div-by-zero, so verify the kernels still match the dequantization baseline. Routes +// through the dispatcher so the runtime-selected SIMD tier sees the edge input. 
TEST(SQ8_FP16_EdgeCases, ConstantStorageTest) { size_t dim = 64; @@ -3404,19 +3474,23 @@ TEST(SQ8_FP16_EdgeCases, ConstantStorageTest) { test_utils::quantize_float_vec_to_sq8_with_metadata(v_const.data(), dim, v_const_quantized.data()); + auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr); + auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr); + float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_const_quantized.data(), v_query.data(), dim); - float ip = SQ8_FP16_InnerProduct(v_const_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Constant-storage SQ8_FP16_InnerProduct mismatch"; + float ip = ip_func(v_const_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Constant-storage SQ8_FP16 IP mismatch"; float l2_baseline = test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_const_quantized.data(), v_query.data(), dim); - float l2 = SQ8_FP16_L2Sqr(v_const_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Constant-storage SQ8_FP16_L2Sqr mismatch"; + float l2 = l2_func(v_const_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Constant-storage SQ8_FP16 L2 mismatch"; } // Mixed-sign FP16 query (alternating positive/negative values) verifies sign handling // in the FP16->FP32 widening path and in the algebraic identity used by the kernels. +// Routes through the dispatcher so the runtime-selected SIMD tier sees the edge input. 
TEST(SQ8_FP16_EdgeCases, MixedSignQueryTest) { size_t dim = 64; @@ -3436,20 +3510,24 @@ TEST(SQ8_FP16_EdgeCases, MixedSignQueryTest) { std::vector v_quantized(quantized_size); test_utils::populate_float_vec_to_sq8_with_metadata(v_quantized.data(), dim, false, 9876); + auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr); + auto cos_func = Cosine_SQ8_FP16_GetDistFunc(dim, nullptr); + auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr); + float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_quantized.data(), v_query.data(), dim); - float ip = SQ8_FP16_InnerProduct(v_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Mixed-sign SQ8_FP16_InnerProduct mismatch"; + float ip = ip_func(v_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Mixed-sign SQ8_FP16 IP mismatch"; float cos_baseline = test_utils::SQ8_FP16_NotOptimized_Cosine(v_quantized.data(), v_query.data(), dim); - float cos = SQ8_FP16_Cosine(v_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(cos, cos_baseline, 0.01f) << "Mixed-sign SQ8_FP16_Cosine mismatch"; + float cos = cos_func(v_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(cos, cos_baseline, 0.01f) << "Mixed-sign SQ8_FP16 Cosine mismatch"; float l2_baseline = test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_quantized.data(), v_query.data(), dim); - float l2 = SQ8_FP16_L2Sqr(v_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Mixed-sign SQ8_FP16_L2Sqr mismatch"; + float l2 = l2_func(v_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Mixed-sign SQ8_FP16 L2 mismatch"; } /* ======================== Tests SQ8_SQ8 ========================= */ From ce16f6be01abe39e59bc27aa41517e536f53e9d2 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 11:24:43 +0300 Subject: [PATCH 12/24] =?UTF-8?q?Add=20multi-accumulator=20ILP=20to=20SQ8?= =?UTF-8?q?=E2=86=94FP16=20x86=20kernels=20[MOD-14954]?= MIME-Version: 1.0 
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels
by widening the inner loop to use multiple independent accumulators. L2
kernels inherit the change through their `…InnerProductImp_…` call.

- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main
  loop (64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks.
- IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the
  existing 2-step unrolled body now routes each step to an independent
  accumulator. The `residual >= 8` half-chunk feeds the second
  accumulator so the prologue also breaks the dependency chain.
- IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps
  per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate
  between sum_a and sum_b for prologue ILP.

Correctness invariant: the residual block consumes exactly `residual`
lanes (0..15), so the lane count left for the main loop is always a
multiple of 16. The AVX2 loops (16 lanes/iter) and the SSE4 loop
(8 lanes/iter) therefore terminate exactly, and the AVX-512 unroll-4 loop
(64 lanes/iter) leaves 0..3 whole 16-lane chunks for its tail.

Verified by 131 SQ8_FP16 unit tests + 115 under ASan.
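
The accumulator layout described in this commit message can be sketched in portable scalar C++. This is only an illustration under stated assumptions, not the kernels' actual code: `step`, `residual_prologue`, `dot_two_acc`, `dot_four_acc` and `dot_naive` are hypothetical names, plain `float` stands in for the widened FP16 query, and scalar loops stand in for the SIMD steps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for one SIMD step: multiply-accumulate `Lanes`
// elements into `acc`, then advance both pointers (as the Step helpers do).
template <std::size_t Lanes>
inline void step(const std::uint8_t *&a, const float *&b, float &acc) {
    for (std::size_t i = 0; i < Lanes; ++i) {
        acc += static_cast<float>(a[i]) * b[i];
    }
    a += Lanes;
    b += Lanes;
}

// Consume dim % 16 elements first, so a multiple of 16 elements remains.
inline float residual_prologue(const std::uint8_t *&a, const float *&b, std::size_t dim) {
    const std::size_t residual = dim % 16;
    float acc = 0.0f;
    for (std::size_t i = 0; i < residual; ++i) {
        acc += static_cast<float>(a[i]) * b[i];
    }
    a += residual;
    b += residual;
    return acc;
}

// AVX2-like shape: two accumulators, 2 x 8 lanes per iteration.
// The remaining count is a non-zero multiple of 16, so the do-while ends exactly.
float dot_two_acc(const std::uint8_t *a, const float *b, std::size_t dim) {
    assert(dim >= 16); // same precondition the dispatcher enforces
    const std::uint8_t *end = a + dim;
    float sum_a = residual_prologue(a, b, dim);
    float sum_b = 0.0f;
    do {
        step<8>(a, b, sum_a);
        step<8>(a, b, sum_b);
    } while (a < end);
    return sum_a + sum_b;
}

// AVX-512-like shape: four accumulators, 4 x 16 lanes per iteration,
// followed by a tail of at most three 16-lane chunks.
float dot_four_acc(const std::uint8_t *a, const float *b, std::size_t dim) {
    assert(dim >= 16);
    const std::uint8_t *end = a + dim;
    float s0 = residual_prologue(a, b, dim);
    float s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    // Compare the remaining distance instead of computing `a + 64`.
    while (static_cast<std::size_t>(end - a) >= 64) {
        step<16>(a, b, s0);
        step<16>(a, b, s1);
        step<16>(a, b, s2);
        step<16>(a, b, s3);
    }
    float sum = (s0 + s1) + (s2 + s3);
    while (a < end) {
        step<16>(a, b, sum);
    }
    return sum;
}

// Single-accumulator reference.
float dot_naive(const std::uint8_t *a, const float *b, std::size_t dim) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        acc += static_cast<float>(a[i]) * b[i];
    }
    return acc;
}
```

With integer-valued inputs small enough to be exact in FP32, both variants return the same value as `dot_naive` for any `dim >= 16`. With general inputs the results can differ in the last bits, because splitting the sum across accumulators changes the order of the floating-point additions.
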
--- src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h | 15 +++++++---- src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h | 15 +++++++---- .../IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h | 27 ++++++++++++++++--- src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h | 19 ++++++++----- 4 files changed, 56 insertions(+), 20 deletions(-) diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h index eda8b393e..a4c1612ea 100644 --- a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h @@ -51,7 +51,10 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, const float16 *pVec2 = static_cast(pVec2v); const uint8_t *pEnd1 = pVec1 + dimension; - __m256 sum256 = _mm256_setzero_ps(); + // Two independent accumulators break the FMA dependency chain so consecutive iterations + // can issue in parallel through both FMA ports. + __m256 sum_a = _mm256_setzero_ps(); + __m256 sum_b = _mm256_setzero_ps(); if constexpr (residual % 8) { constexpr int mask = (1 << (residual % 8)) - 1; @@ -70,18 +73,20 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask); pVec2 += residual % 8; - sum256 = _mm256_mul_ps(v1_f, v2_f); + sum_a = _mm256_mul_ps(v1_f, v2_f); } if constexpr (residual >= 8) { - SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256); + // Route the half-residual chunk to the second accumulator for ILP. 
+ SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_b); } do { - SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256); - SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum256); + SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_a); + SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_b); } while (pVec1 < pEnd1); + __m256 sum256 = _mm256_add_ps(sum_a, sum_b); float quantized_dot = my_mm256_reduce_add_ps(sum256); const uint8_t *pVec1Base = static_cast(pVec1v); diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h index 028d7d3e0..3a01d80f2 100644 --- a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h @@ -52,7 +52,10 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size const float16 *pVec2 = static_cast(pVec2v); const uint8_t *pEnd1 = pVec1 + dimension; - __m256 sum256 = _mm256_setzero_ps(); + // Two independent accumulators shorten the serial add dependency chain on AVX2 hosts + // without FMA, where each add must wait for the previous add into the same accumulator. + __m256 sum_a = _mm256_setzero_ps(); + __m256 sum_b = _mm256_setzero_ps(); if constexpr (residual % 8) { constexpr int mask = (1 << (residual % 8)) - 1; @@ -70,18 +73,20 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask); pVec2 += residual % 8; - sum256 = _mm256_mul_ps(v1_f, v2_f); + sum_a = _mm256_mul_ps(v1_f, v2_f); } if constexpr (residual >= 8) { - SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256); + // Route the half-residual chunk to the second accumulator for ILP.
+ SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_b); } do { - SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256); - SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum256); + SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_a); + SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_b); } while (pVec1 < pEnd1); + __m256 sum256 = _mm256_add_ps(sum_a, sum_b); float quantized_dot = my_mm256_reduce_add_ps(sum256); const uint8_t *pVec1Base = static_cast(pVec1v); diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h index 07f5d3456..fa0d508b4 100644 --- a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h @@ -45,7 +45,12 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si const float16 *pVec2 = static_cast(pVec2v); // FP16 query const uint8_t *pEnd1 = pVec1 + dimension; - __m512 sum = _mm512_setzero_ps(); + // Four independent accumulators break the FMA dependency chain so the inner loop can + // saturate both FMA ports on Sapphire Rapids / Zen 4. + __m512 sum0 = _mm512_setzero_ps(); + __m512 sum1 = _mm512_setzero_ps(); + __m512 sum2 = _mm512_setzero_ps(); + __m512 sum3 = _mm512_setzero_ps(); if constexpr (residual > 0) { __mmask16 mask = (1U << residual) - 1; @@ -60,15 +65,29 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si __m512 v2_f = _mm512_cvtph_ps(v2_16); // Mask out unused lanes by folding the mask into the multiply. - sum = _mm512_maskz_mul_ps(mask, v1_f, v2_f); + sum0 = _mm512_maskz_mul_ps(mask, v1_f, v2_f); pVec1 += residual; pVec2 += residual; } - do { + // Main unrolled loop: 4 chunks of 16 lanes per iteration, one chunk per accumulator. + // Residual leaves `dim - residual` lanes remaining (a multiple of 16), so the + // pointer comparison stays exact. 
+ while (pVec1 + 64 <= pEnd1) { + SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0); + SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1); + SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum2); + SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum3); + } + + // Reduce the four accumulators into one. + __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3)); + + // Tail: at most three remaining 16-lane chunks. + while (pVec1 < pEnd1) { SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum); - } while (pVec1 < pEnd1); + } float quantized_dot = _mm512_reduce_add_ps(sum); diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h index e5ca51860..871a189dc 100644 --- a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h @@ -50,7 +50,9 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2 const float16 *pVec2 = static_cast(pVec2v); const uint8_t *pEnd1 = pVec1 + dimension; - __m128 sum = _mm_setzero_ps(); + // Two independent accumulators break the mul→add dependency chain (SSE4 lacks FMA). + __m128 sum_a = _mm_setzero_ps(); + __m128 sum_b = _mm_setzero_ps(); if constexpr (residual % 4) { __m128 v1_f; @@ -75,23 +77,28 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2 pVec1 += residual % 4; pVec2 += residual % 4; - sum = _mm_mul_ps(v1_f, v2_f); + sum_a = _mm_mul_ps(v1_f, v2_f); } + // Alternate the residual-ladder steps across the two accumulators for ILP. 
if constexpr (residual >= 4) { - SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum); + SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b); } if constexpr (residual >= 8) { - SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum); + SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_a); } if constexpr (residual >= 12) { - SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum); + SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b); } + // Remaining lanes after the residual block are a multiple of 16, hence a multiple of 8, + // so two 4-lane steps per iteration consume the tail exactly. do { - SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum); + SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_a); + SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b); } while (pVec1 < pEnd1); + __m128 sum = _mm_add_ps(sum_a, sum_b); float PORTABLE_ALIGN16 TmpRes[4]; _mm_store_ps(TmpRes, sum); float quantized_dot = TmpRes[0] + TmpRes[1] + TmpRes[2] + TmpRes[3]; From 658c485b9e3d601fa702ff21b13ee7e3c4eb48cb Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 13:47:29 +0300 Subject: [PATCH 13/24] =?UTF-8?q?Drop=20misleading=20VNNI=20suffix=20from?= =?UTF-8?q?=20SQ8=E2=86=94FP16=20AVX-512=20kernel=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction — the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16 (`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded query, which is a different kernel entirely. The file/function names are renamed to match what the kernel actually uses (AVX-512F). The dispatcher .cpp/.h files stay named after the runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 / INT8 / UINT8 kernels — the gate is a CPU-feature gate, not an ISA claim. 
The same misnomer exists for SQ8↔FP32; tracked separately so the rename
there can ship as its own commit.

Also: fix a pointer-arithmetic UB introduced by the AVX-512 unroll-4
loop. `while (pVec1 + 64 <= pEnd1)` can form a pointer beyond
one-past-the-end of the SQ8 storage object when fewer than 64 bytes
remain, which is UB in C++ regardless of dereference. Switched to pointer
subtraction (`static_cast(pEnd1 - pVec1) >= 64`).

Renames:
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h
- L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h
- SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F
- Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F

Verified: 131 SQ8_FP16 unit tests + 115 under ASan.
---
 ..._VNNI_SQ8_FP16.h => IP_AVX512F_SQ8_FP16.h} | 12 +++++++-----
 src/VecSim/spaces/IP_space.cpp | 4 ++--
 ..._VNNI_SQ8_FP16.h => L2_AVX512F_SQ8_FP16.h} | 5 ++---
 src/VecSim/spaces/L2_space.cpp | 2 +-
 .../spaces/functions/AVX512F_BW_VL_VNNI.cpp | 19 ++++++++++---------
 .../spaces/functions/AVX512F_BW_VL_VNNI.h | 8 +++++---
 .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 6 ++++--
 tests/unit/test_spaces.cpp | 12 ++++++------
 8 files changed, 37 insertions(+), 31 deletions(-)
 rename src/VecSim/spaces/IP/{IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h => IP_AVX512F_SQ8_FP16.h} (90%)
 rename src/VecSim/spaces/L2/{L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h => L2_AVX512F_SQ8_FP16.h} (85%)

diff --git a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
similarity index 90%
rename from src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
rename to src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
index fa0d508b4..955f431f6 100644
--- a/src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
@@ -73,8 +73,10 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
 // Main unrolled loop: 4 chunks of 16 lanes per iteration, one chunk per
accumulator. // Residual leaves `dim - residual` lanes remaining (a multiple of 16), so the - // pointer comparison stays exact. - while (pVec1 + 64 <= pEnd1) { + // pointer comparison stays exact. Compare via pointer subtraction (not + // `pVec1 + 64 <= pEnd1`) so we never form a pointer past one-past-the-end, + // which would be UB in C++ regardless of dereference. + while (static_cast(pEnd1 - pVec1) >= 64) { SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0); SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1); SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum2); @@ -108,15 +110,15 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si } template // 0..15 -float SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI(const void *pVec1v, const void *pVec2v, +float SQ8_FP16_InnerProductSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, size_t dimension) { return 1.0f - SQ8_FP16_InnerProductImp_AVX512(pVec1v, pVec2v, dimension); } template // 0..15 -float SQ8_FP16_CosineSIMD16_AVX512F_BW_VL_VNNI(const void *pVec1v, const void *pVec2v, +float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, size_t dimension) { // Cosine distance = 1 - IP for pre-normalised vectors. Aliases InnerProduct, matching the // SQ8_FP32 pattern. 
- return SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI(pVec1v, pVec2v, dimension); + return SQ8_FP16_InnerProductSIMD16_AVX512F(pVec1v, pVec2v, dimension); } diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index 37ffc9ed4..1fd7381b7 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -191,7 +191,7 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { if (dim % 16 == 0) // SQ8 chunk = 16 bytes *alignment = 16 * sizeof(uint8_t); - return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim); + return Choose_SQ8_FP16_IP_implementation_AVX512F(dim); } #endif #ifdef OPT_AVX2_FMA @@ -245,7 +245,7 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { if (dim % 16 == 0) *alignment = 16 * sizeof(uint8_t); - return Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim); + return Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim); } #endif #ifdef OPT_AVX2_FMA diff --git a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h similarity index 85% rename from src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h rename to src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h index 635f30904..384870b21 100644 --- a/src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h @@ -8,7 +8,7 @@ */ #pragma once #include "VecSim/spaces/space_includes.h" -#include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h" +#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h" #include "VecSim/types/sq8.h" #include "VecSim/types/float16.h" #include "VecSim/utils/alignment.h" @@ -18,8 +18,7 @@ using float16 = vecsim_types::float16; // L2² = x_sum_squares + y_sum_squares - 2 * IP(x, y), computed via the AVX-512 IP impl above. 
template // 0..15 -float SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI(const void *pVect1v, const void *pVect2v, - size_t dimension) { +float SQ8_FP16_L2SqrSIMD16_AVX512F(const void *pVect1v, const void *pVect2v, size_t dimension) { const float ip = SQ8_FP16_InnerProductImp_AVX512(pVect1v, pVect2v, dimension); const uint8_t *pVect1 = static_cast(pVect1v); diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index ab5188800..0ada05f76 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -123,7 +123,7 @@ dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { if (dim % 16 == 0) *alignment = 16 * sizeof(uint8_t); - return Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim); + return Choose_SQ8_FP16_L2_implementation_AVX512F(dim); } #endif #ifdef OPT_AVX2_FMA diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp index e5e8bb1c2..145300f24 100644 --- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp +++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp @@ -17,8 +17,8 @@ #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP32.h" #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP32.h" -#include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h" -#include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h" +#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h" #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_SQ8.h" #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_SQ8.h" @@ -79,20 +79,21 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d return ret_dist_func; } -// SQ8-to-FP16 distance functions -dist_func_t Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) { +// SQ8-to-FP16 distance functions. 
The kernels themselves only use AVX-512F (cvtph_ps + FMA); +// they register under the VNNI tier solely for CPU-feature dispatch. +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim) { dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F_BW_VL_VNNI); + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F); return ret_dist_func; } -dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim) { +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim) { dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F_BW_VL_VNNI); + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F); return ret_dist_func; } -dist_func_t Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim) { +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim) { dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F_BW_VL_VNNI); + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F); return ret_dist_func; } diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h index b68bfd0a4..13dd9e8a8 100644 --- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h +++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h @@ -24,9 +24,11 @@ dist_func_t Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(size_t d dist_func_t Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim); dist_func_t Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim); -dist_func_t Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim); -dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim); -dist_func_t Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim); +// SQ8-to-FP16 kernels only use AVX-512F instructions; they are 
declared here because +// they register under the VNNI tier for CPU-feature dispatch. +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim); // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum) dist_func_t Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim); diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp index f81a9d89d..04cb13eea 100644 --- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp +++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp @@ -57,9 +57,11 @@ cpu_features::X86Features opt = cpu_features::GetX86Info().features; // AVX-512 F+BW+VL+VNNI (no F16C requirement — _mm512_cvtph_ps is part of AVX512F). #ifdef OPT_AVX512_F_BW_VL_VNNI bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni; -INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, +// Kernel itself only needs AVX-512F (cvtph_ps + FMA); the VNNI feature check keeps it on the +// same dispatch tier as the rest of this file. 
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512_f_bw_vl_vnni_supported); -INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512_f_bw_vl_vnni_supported); #endif diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 2cccd1183..04618672a 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -577,9 +577,9 @@ TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) { #ifdef OPT_AVX512_F_BW_VL_VNNI if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && optimization.avx512vnni) { - expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim); - expected_ip = Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim); - expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim); + expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX512F(dim); + expected_ip = Choose_SQ8_FP16_IP_implementation_AVX512F(dim); + expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim); } else #endif #if defined(OPT_AVX2_FMA) && defined(OPT_F16C) @@ -3142,7 +3142,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { optimization.avx512vnni) { unsigned char alignment = 0; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); - ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F_BW_VL_VNNI(dim)) + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F(dim)) << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "AVX512 with dim " << dim; @@ -3225,7 +3225,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { optimization.avx512vnni) { unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); - ASSERT_EQ(arch_opt_func, 
Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim)) + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX512F(dim)) << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "AVX512 with dim " << dim; @@ -3306,7 +3306,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { optimization.avx512vnni) { unsigned char alignment = 0; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); - ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F_BW_VL_VNNI(dim)) + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim)) << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "AVX512 with dim " << dim; From fe69f8588eb07fded243674ac4fc470fe50f6dfa Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 13:57:22 +0300 Subject: [PATCH 14/24] =?UTF-8?q?Remove=20SQ8=E2=86=94FP16=20design=20doc?= =?UTF-8?q?=20from=20PR=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Design doc was added in ad941b8f for planning; not appropriate as a long-lived in-repo artifact. Keep externally (Confluence / scratch) rather than ship with the kernel commit. 
--- .../2026-05-26-sq8-fp16-x86-kernels-design.md | 385 ------------------ 1 file changed, 385 deletions(-) delete mode 100644 docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md diff --git a/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md b/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md deleted file mode 100644 index 1ef7a787a..000000000 --- a/docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md +++ /dev/null @@ -1,385 +0,0 @@ -# SQ8↔FP16 SIMD distance kernels — Intel x86 (MOD-14954) - -## Goal - -Add asymmetric SQ8 (storage) ↔ FP16 (query) distance kernels for Inner -Product, Cosine, and L2² on Intel x86 across four ISA tiers: - -- AVX-512 (F + BW + VL + VNNI bundle already used for SQ8_FP32) -- AVX2 + FMA -- AVX2 (no FMA) -- SSE4.1 - -Each kernel converts FP16 query lanes to FP32 per SIMD chunk; the inner -multiply-accumulate runs in FP32. SQ8 metadata and FP32 query metadata -(precomputed sums) stay scalar and are read with the same algebraic -identity used by the SQ8_FP32 kernels: - -```text -IP(x, y) = min · y_sum + delta · Σ(q_i · y_i) -L2²(x, y) = x_sum_squares + y_sum_squares − 2 · IP(x, y) -``` - -Wire the new kernels into the dispatcher tables so -`{IP,Cosine,L2}_SQ8_FP16_GetDistFunc` returns the best SIMD path -available at runtime instead of the scalar fallback delivered by -MOD-15141. - -## Non-goals - -- No new metric (only IP / Cosine / L2²). -- No change to scalar `SQ8_FP16_*` reference; existing tests against - `SQ8_FP16_NotOptimized_*` remain the correctness baseline. -- No ARM kernels (MOD-14972 covers ARM). -- No SQ8↔FP32 changes; existing kernels untouched. - -## Scope and constraints - -- FP16 query layout is `[float16 values (dim)] [y_sum (float)] - [y_sum_squares (float, L2 only)]`. Trailing metadata is FP32 and may - sit at an offset that is not a multiple of 4 when `dim` is odd; use - `load_unaligned` to read it (mirrors scalar `SQ8_FP16_Impl`). 
-- All four ISA tiers need a way to widen FP16 → FP32. The 512-bit - variant (`_mm512_cvtph_ps`) is in AVX512F. The 256-bit and 128-bit - variants (`_mm256_cvtph_ps`, `_mm_cvtph_ps`) require the F16C - extension. F16C is its own ISA flag; AVX2/SSE4.1 do not imply it. -- Existing dispatcher source files (`AVX2_FMA.cpp`, `AVX2.cpp`, - `SSE4.cpp`) are compiled without `-mf16c`. We add `-mf16c` to those - files in CMake (conditional on `CXX_F16C`), guard the new SQ8_FP16 - symbols behind `#ifdef OPT_F16C`, and add `features.f16c &&` to the - dispatch gates for the AVX2/SSE4 tiers. The AVX-512 tier needs no - F16C gate. -- `dim` must be ≥ 16 for the AVX-512/AVX2 SIMD paths and ≥ 16 for SSE4 - (matches existing SQ8_FP32 contract). -- SQ8 storage is read as `uint8_t`; alignment hint returned by - `*_GetDistFunc` continues to refer to the SQ8 (first) operand. Hints: - 16 / 8 / 8 / 4 bytes for AVX-512 / AVX2+FMA / AVX2 / SSE4 when - `dim % chunk == 0`, else 0. - -## File-level design - -### New SIMD headers (8 files) - -Per ISA tier × {IP, L2}: - -```text -src/VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h -src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h -src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h -src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h -src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h -src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h -``` - -Each IP header exposes: - -- `template float SQ8_FP16_InnerProductImp_(const void*, const void*, size_t)` — raw inner product (no `1 -`), used by both InnerProduct/Cosine wrappers and the L2 kernel. -- `template float SQ8_FP16_InnerProductSIMD16_(...)` — returns `1.0f - Imp`. -- `template float SQ8_FP16_CosineSIMD16_(...)` — aliases InnerProduct (vectors are pre-normalised, mirrors SQ8_FP32 pattern). 
- -Each L2 header `#include`s the matching IP header and exposes: - -- `template float SQ8_FP16_L2SqrSIMD16_(...)` — computes `x_sum_sq + y_sum_sq − 2·Imp(...)`. - -`` strings: - -- `AVX512F_BW_VL_VNNI` -- `AVX2_FMA` -- `AVX2` -- `SSE4` - -All four headers' inner loops: - -1. Load 16 SQ8 bytes (one chunk) and widen to 16×FP32. -2. Load 16 FP16 query lanes and widen to 16×FP32 (`_mm512_cvtph_ps`, - two `_mm256_cvtph_ps` calls, two `_mm256_cvtph_ps` for plain AVX2, - or four `_mm_cvtph_ps` for SSE4 — chunk granularity matches the - existing SQ8_FP32 layout for that tier). -3. Fuse-multiply-add (or mul + add for SSE4 and plain AVX2) into the - FP32 accumulator(s). -4. After the loop, horizontal-reduce and apply - `min_val · y_sum + delta · quantized_dot`. - -L2 kernels additionally read `x_sum_squares` from SQ8 metadata and -`y_sum_squares` from query metadata, return -`x_sum_sq + y_sum_sq − 2·ip`. **Both** the SQ8 storage metadata -(`min_val`, `delta`, `x_sum_squares`) and the FP16 query metadata -(`y_sum`, `y_sum_squares`) are read with `load_unaligned`. SQ8 -metadata starts at byte offset `dim` after the quantised lanes — for -odd `dim` that offset is not 4-byte aligned. FP16 query metadata -starts at byte offset `2*dim` after the FP16 lanes — odd `dim` leaves -it 2-byte aligned. Mirrors the scalar `SQ8_FP16_InnerProduct_Impl` -pattern in `src/VecSim/spaces/IP/IP.cpp`. - -Residual handling: - -- **AVX-512** (residual 0..15): load the full 256-bit FP16 chunk - (`_mm256_loadu_si256` over 32 bytes; the chunk is always within the - query blob since `dim >= 16` and the FP16 metadata follows), convert with - `_mm512_cvtph_ps`, then mask away unused lanes via - `_mm512_maskz_mov_ps(mask, v2_f)` (or fold the mask into the - FP32 multiply with `_mm512_maskz_mul_ps`). The SQ8 side uses - `_mm_loadu_si128` + `_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps` - and is also masked. 
-- **AVX2+FMA / AVX2** (residual 0..15, split into a 0..7 head plus a - conditional 8-wide pre-step): for the 0..7 head, load the full - 128-bit FP16 block (`_mm_loadu_si128`), convert with - `_mm256_cvtph_ps`, then zero out unused lanes via - `_mm256_blend_ps(_mm256_setzero_ps(), v2_f, residuals_mask)` — - mirroring the existing F16C `FP16_InnerProductSIMD32_F16C` blend - pattern. The SQ8 side uses `_mm_loadl_epi64` (8 bytes) + - `_mm256_cvtepu8_epi32` + `_mm256_cvtepi32_ps`. When residual ≥ 8, - one extra full 8-wide step runs before the do-while loop, matching - the SQ8_FP32 AVX2[+FMA] residual layout. -- **SSE4** (residual 0..15, split into 4-wide pre-steps): for the - 0..3 head, materialise the FP32 lanes via `_mm_set_ps(0, ..., 0, - FP16_to_FP32(pVec2[k]), ...)` paired with `_mm_set_ps` on the SQ8 - side — mirrors the existing SSE4 SQ8_FP32 `_mm_set_ps` residual - path. For residual ≥ 4 / ≥ 8 / ≥ 12, run 1 / 2 / 3 extra 4-wide - steps before the do-while loop. Each 4-wide step loads 8 bytes of - FP16 (`_mm_loadl_epi64`), converts with `_mm_cvtph_ps`, and loads - 4 SQ8 bytes via `_mm_cvtsi32_si128` + `_mm_cvtepu8_epi32` + - `_mm_cvtepi32_ps`. - -### Dispatcher edits - -Per existing ISA dispatcher (no new dispatcher files): - -| File | Add declarations / definitions | -| --- | --- | -| `src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI` | -| `src/VecSim/spaces/functions/AVX2_FMA.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX2_FMA`, guarded by `#ifdef OPT_F16C` | -| `src/VecSim/spaces/functions/AVX2.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX2`, guarded by `#ifdef OPT_F16C` | -| `src/VecSim/spaces/functions/SSE4.{h,cpp}` | `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_SSE4`, guarded by `#ifdef OPT_F16C` | - -Each `Choose_*` uses the existing `CHOOSE_IMPLEMENTATION(out, dim, 16, -func)` macro (16-element residual table — matches SQ8_FP32 contract). 
- -`src/VecSim/spaces/IP_space.cpp` — extend `IP_SQ8_FP16_GetDistFunc` and -`Cosine_SQ8_FP16_GetDistFunc`. `L2_space.cpp` — extend -`L2_SQ8_FP16_GetDistFunc`. New body shape (IP shown; L2/Cosine -identical): - -```cpp -dist_func_t ret_dist_func = SQ8_FP16_InnerProduct; -[[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt); - -#ifdef CPU_FEATURES_ARCH_X86_64 -if (dim < 16) { - return ret_dist_func; -} -#ifdef OPT_AVX512_F_BW_VL_VNNI -if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { - if (dim % 16 == 0) *alignment = 16 * sizeof(uint8_t); - return Choose_SQ8_FP16_IP_implementation_AVX512F_BW_VL_VNNI(dim); -} -#endif -#ifdef OPT_AVX2_FMA -#ifdef OPT_F16C -if (features.avx2 && features.fma3 && features.f16c) { - if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); - return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim); -} -#endif -#endif -#ifdef OPT_AVX2 -#ifdef OPT_F16C -if (features.avx2 && features.f16c) { - if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); - return Choose_SQ8_FP16_IP_implementation_AVX2(dim); -} -#endif -#endif -#ifdef OPT_SSE4 -#ifdef OPT_F16C -// F16C instructions are VEX-encoded — require AVX as well, matching the -// existing FP16/F16C dispatcher gate in IP_space.cpp. -if (features.sse4_1 && features.f16c && features.avx) { - if (dim % 4 == 0) *alignment = 4 * sizeof(uint8_t); - return Choose_SQ8_FP16_IP_implementation_SSE4(dim); -} -#endif -#endif -#endif // x86_64 -return ret_dist_func; -``` - -ARM block (`OPT_SVE2` / `OPT_SVE` / `OPT_NEON`) is left as-is — the -SQ8_FP16 ARM kernels arrive via MOD-14972. 
- -### CMake change - -`src/VecSim/spaces/CMakeLists.txt` — when both `CXX_F16C` and the -parent ISA flag are present, add `-mf16c` to the dispatcher file: - -```cmake -if(CXX_AVX2 AND CXX_FMA) - set(_avx2_fma_flags "-mavx2 -mfma") - if(CXX_F16C) - set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c") - endif() - set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "${_avx2_fma_flags}") - list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp) -endif() - -if(CXX_AVX2) - set(_avx2_flags "-mavx2") - if(CXX_F16C) - set(_avx2_flags "${_avx2_flags} -mf16c") - endif() - set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "${_avx2_flags}") - list(APPEND OPTIMIZATIONS functions/AVX2.cpp) -endif() - -if(CXX_SSE4) - set(_sse4_flags "-msse4.1") - if(CXX_F16C) - set(_sse4_flags "${_sse4_flags} -mf16c") - endif() - set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "${_sse4_flags}") - list(APPEND OPTIMIZATIONS functions/SSE4.cpp) -endif() -``` - -AVX-512 dispatcher (`AVX512F_BW_VL_VNNI.cpp`) needs no flag change — -`-mavx512f` already enables `_mm512_cvtph_ps`. - -`-mf16c` does not alter the emitted code for the existing SQ8_FP32 -sources, since those sources contain no F16C intrinsics. - -### Tests (`tests/unit/test_spaces.cpp`) - -1. New parameterised class `SQ8_FP16_SpacesOptimizationTest` mirroring - `SQ8_FP32_SpacesOptimizationTest`. Three test bodies for L2 / IP / - Cosine, each comparing the chosen optimised function against the - scalar `SQ8_FP16_*` baseline (`ASSERT_NEAR ... 0.01`). Walks down - AVX512 → AVX2_FMA → AVX2 → SSE4 → scalar by zeroing feature flags - between assertions, exactly like `SQ8_FP32_SpacesOptimizationTest`. - `INSTANTIATE_TEST_SUITE_P` with `testing::Range(16UL, 16 * 2UL + 1)`. - -2. Update existing `SpacesTest.GetDistFunc_*_SQ8_FP16` assertions at - lines ~563–575: when running on x86, the dispatcher now returns the - SIMD `Choose_*` symbol instead of the scalar. 
AVX-512 selection - depends on `avx512f && avx512bw && avx512vl && avx512vnni` only - (no F16C requirement — 512-bit `_mm512_cvtph_ps` is part of - AVX512F). AVX2+FMA, AVX2, and SSE4 selection additionally requires - `features.f16c` (and `features.avx` for the SSE4 gate). The tests - should call `getCpuOptimizationFeatures()` and assert the expected - `Choose_*` for the host's highest supported tier (same shape used - by `SQ8_FP32_SpacesOptimizationTest`). - -3. Reuse existing helpers: `populate_sq8_fp16_query`, - `populate_float_vec_to_sq8_with_metadata`, - `SQ8_FP16_NotOptimized_{InnerProduct,Cosine,L2Sqr}`. - -### Benchmarks (`tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp`) - -Add per-ISA benches mirroring `bm_spaces_sq8_fp32.cpp`: - -```cpp -#ifdef CPU_FEATURES_ARCH_X86_64 -cpu_features::X86Features opt = cpu_features::GetX86Info().features; - -#ifdef OPT_AVX512_F_BW_VL_VNNI -bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni; -INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, avx512_f_bw_vl_vnni_supported); -INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F_BW_VL_VNNI, 16, avx512_f_bw_vl_vnni_supported); -#endif - -#ifdef OPT_F16C -#ifdef OPT_AVX2_FMA -bool avx2_fma3_f16c_supported = opt.avx2 && opt.fma3 && opt.f16c; -INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, avx2_fma3_f16c_supported); -INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16, avx2_fma3_f16c_supported); -#endif - -#ifdef OPT_AVX2 -bool avx2_f16c_supported = opt.avx2 && opt.f16c; -INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported); -INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported); -#endif - -#ifdef OPT_SSE4 -bool sse4_f16c_supported = opt.sse4_1 && opt.f16c && opt.avx; 
-INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported); -INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported); -#endif -#endif // OPT_F16C -#endif // x86_64 -``` - -Naive bench lines stay (covers the scalar fallback case). - -## Validation strategy - -1. Unit tests (`SQ8_FP16_SpacesOptimizationTest`) assert numerical - parity against the scalar baseline for all dims in `[16, 32]` - (covers every residual class for the 16-wide chunk). Existing - `SQ8_FP16_NoOpt` parameterised suite continues to exercise small - and odd dims for the scalar reference; combined with the new - optimisation tests this covers each SIMD residual class plus the - scalar fallback. -2. Existing edge-case tests (`SQ8_FP16_EdgeCases.ZeroQueryTest`, - `SQ8_FP16_l2sqr_odd_dim_unaligned_metadata_test`) keep running - against the scalar implementation directly — they exercise - alignment-sensitive paths that are deliberately scalar-only. -3. Microbenchmarks compare per-ISA SQ8_FP16 throughput to the matching - SQ8_FP32 throughput on the same machine. Acceptance: SQ8_FP16 - should be within ~1.0–1.5× of SQ8_FP32 (one extra widening per - chunk, no extra memory pressure since the FP16 query is half the - size of FP32). -4. CI: x86 jobs already exist; verifies the CMake change keeps - building. No new toolchain requirement (binutils 2.34+ already - covers F16C, no AVX-512 FP16 dependency). - -## Risk register - -| Risk | Likelihood | Mitigation | -| --- | --- | --- | -| Adding `-mf16c` to AVX2_FMA.cpp / AVX2.cpp / SSE4.cpp accidentally enables F16C codegen elsewhere | Low | Those sources contain only SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 code; no F16C intrinsics — compiler cannot synthesise F16C without an explicit intrinsic. | -| Older toolchain without F16C support | Low | `CXX_F16C` already detected; `-mf16c` only appended when present. 
Dispatcher symbols guarded by `#ifdef OPT_F16C`; missing → falls through to scalar. | -| Backport branches diverge in dispatcher | Medium | Change is additive (new headers, new symbols, new gates). No SQ8_FP32 path touched. CMake change is conditional. Backport just cherry-picks the commit. | -| Pre-Ivy Bridge SSE4-only CPUs lose a SIMD tier (no F16C) | Negligible | Fall through to scalar SQ8_FP16. Such CPUs are out of practical support anyway. | -| Numerical drift between FP16→FP32 widening and the scalar `FP16_to_FP32` software path | Low | `vcvtph2ps` follows IEEE 754 half→single conversion exactly; the scalar `FP16_to_FP32` in `float16.h` is bit-faithful for finite values. Tests use `ASSERT_NEAR ... 0.01` slack. | - -## Out-of-scope follow-ups - -- AVX512FP16-native kernels (would use `__m512h` and `vfmadd*ph` - directly on 32 FP16 lanes per 512-bit register, skipping the - widen-to-FP32 step). Deferred for four concrete reasons, not just - "lower priority": - 1. **Deployment baseline.** AVX512FP16 is Sapphire Rapids and - newer (Intel server 2023+) plus very recent AMD parts. Most - production hosts running this library do not have it. The - AVX-512F path delivered here is the right default for the - widely-deployed AVX-512 tier, and a Sapphire-Rapids-only - variant would land underneath the same gating tree, not as a - replacement. - 2. **Numerical fit is awkward for SQ8↔FP16.** The kernel computes - `Σ(q_i · y_i)` where `q_i ∈ [0,255]` (uint8) and `y_i` is - FP16. Each lane product can be as large as - `255 · 65504 ≈ 1.67e7`, which is well above the FP16 finite - range (`±65504`). A pure FP16 accumulator would overflow on - realistic data; the only safe path is to accumulate in FP32 - after a per-chunk `vcvtph2ps`-equivalent — which is exactly - what the AVX-512F path already does. AVX512FP16 mainly buys - FP16-native multiply-add, which we cannot safely use here. - 3. 
**Marginal speedup over the AVX-512F path proposed here.** - The widening cost is one `_mm512_cvtph_ps` per 16-element - chunk against a kernel that is already memory-bandwidth-bound - (16 bytes of SQ8 storage + 32 bytes of FP16 query per chunk). - Eliminating that one conversion saves a few cycles per chunk - on a path that is gated on memory, not arithmetic throughput. - 4. **Ticket scope.** MOD-14954 enumerates AVX-512, AVX2+FMA, and - SSE4; the plain-AVX2 tier was added during brainstorming as - free coverage. An AVX512FP16 variant is its own ISA tier with - its own gating column in the dispatcher and its own residual - table, and warrants a separate design / benchmarking pass - once the deployment baseline justifies the maintenance cost. - Pure FP16↔FP16 (no SQ8 involved) already has an AVX512FP16_VL path - at `src/VecSim/spaces/functions/AVX512FP16_VL.cpp`; that file is the - natural home should we revisit this later. -- ARM SQ8_FP16 (MOD-14972). -- Reranking flow integration tests under HNSW (separate ticket). From 2a4ef92bc91a1e17e5cc590908a1c10c5aa12127 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 14:12:19 +0300 Subject: [PATCH 15/24] =?UTF-8?q?Simplify=20SQ8=E2=86=94FP16=20tests=20to?= =?UTF-8?q?=20match=20sister=20conventions=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two trims, both restoring pre-existing patterns elsewhere in the file: 1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`. Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8 sister tests at lines 540-548 and 551-559. 2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`, `MixedSignQueryTest`) were routed through `{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD dispatch on adversarial inputs. 
Reverted to direct scalar calls (`SQ8_FP16_InnerProduct`, etc.) — the original pre-fdc5c1cd shape. Coverage rationale: the SIMD kernels are branchless on input values (verified by grep — no value-dependent `if` in any tier). Every code path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s random inputs at multiple dims. The edge-case tests verify the *algebraic identity* (IP of zero query = 1.0, constant storage matches dequant baseline, mixed-sign handling) — scalar correctness on these inputs is what was actually being checked, and the SIMD path matches scalar via the SpacesOptimizationTest tier walk. Net: 77 lines removed from the test file, matches sister conventions, no coverage gap. --- tests/unit/test_spaces.cpp | 97 ++++++++------------------------------ 1 file changed, 20 insertions(+), 77 deletions(-) diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 04618672a..a4da4abb8 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -560,61 +560,15 @@ TEST_F(SpacesTest, GetDistFuncSQ8Asymmetric) { } TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) { - // SQ8 storage with FP16 query (asymmetric). The dispatcher now returns the highest SIMD - // tier available at runtime; assert that and fall back to scalar only if no tier matches. + // SQ8 storage with FP16 query (asymmetric) - should return SQ8_FP16 functions. + // Per-ISA dispatcher walk coverage lives in the SQ8_FP16 SpacesOptimizationTest below. 
size_t dim = 128; auto l2_func = spaces::GetDistFunc(VecSimMetric_L2, dim, nullptr); auto ip_func = spaces::GetDistFunc(VecSimMetric_IP, dim, nullptr); auto cosine_func = spaces::GetDistFunc(VecSimMetric_Cosine, dim, nullptr); - - auto optimization = getCpuOptimizationFeatures(); - dist_func_t expected_l2 = SQ8_FP16_L2Sqr; - dist_func_t expected_ip = SQ8_FP16_InnerProduct; - dist_func_t expected_cos = SQ8_FP16_Cosine; - -#ifdef CPU_FEATURES_ARCH_X86_64 - if (dim >= 16) { -#ifdef OPT_AVX512_F_BW_VL_VNNI - if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && - optimization.avx512vnni) { - expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX512F(dim); - expected_ip = Choose_SQ8_FP16_IP_implementation_AVX512F(dim); - expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim); - } else -#endif -#if defined(OPT_AVX2_FMA) && defined(OPT_F16C) - if (optimization.avx2 && optimization.fma3 && optimization.f16c) { - expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim); - expected_ip = Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim); - expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim); - } else -#endif -#if defined(OPT_AVX2) && defined(OPT_F16C) - if (optimization.avx2 && optimization.f16c) { - expected_l2 = Choose_SQ8_FP16_L2_implementation_AVX2(dim); - expected_ip = Choose_SQ8_FP16_IP_implementation_AVX2(dim); - expected_cos = Choose_SQ8_FP16_Cosine_implementation_AVX2(dim); - } else -#endif -#if defined(OPT_SSE4) && defined(OPT_F16C) - if (optimization.sse4_1 && optimization.f16c && optimization.avx) { - expected_l2 = Choose_SQ8_FP16_L2_implementation_SSE4(dim); - expected_ip = Choose_SQ8_FP16_IP_implementation_SSE4(dim); - expected_cos = Choose_SQ8_FP16_Cosine_implementation_SSE4(dim); - } else -#endif - { - // Falls through to scalar. 
- } - } -#endif // x86_64 - ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr)); ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr)); ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr)); - ASSERT_EQ(l2_func, expected_l2); - ASSERT_EQ(ip_func, expected_ip); - ASSERT_EQ(cosine_func, expected_cos); } #ifdef CPU_FEATURES_ARCH_X86_64 @@ -3425,8 +3379,9 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) { /* ======================== Tests SQ8_FP16 (edge cases) ========================= */ // Zero FP16 query against a non-zero SQ8 storage. IP must be exactly 1.0 (1 - 0), -// L2² must equal Σ dequantized². Routes through the dispatcher so the runtime-selected -// SIMD tier (AVX-512 / AVX2+FMA / AVX2 / SSE4 / scalar) is exercised, not just scalar. +// L2² must equal Σ dequantized². Math correctness on adversarial inputs is verified +// against the scalar reference; SIMD tier coverage with branchless kernels is provided +// separately by SQ8_FP16_SpacesOptimizationTest. 
TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) { size_t dim = 64; @@ -3441,24 +3396,20 @@ TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) { test_utils::populate_float_vec_to_sq8_with_metadata(v_nonzero_quantized.data(), dim, false, 1234); - auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr); - auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr); - float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_nonzero_quantized.data(), v_zero_query.data(), dim); - float ip = ip_func(v_nonzero_quantized.data(), v_zero_query.data(), dim); - ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Zero-query SQ8_FP16 IP mismatch"; + float ip = SQ8_FP16_InnerProduct(v_nonzero_quantized.data(), v_zero_query.data(), dim); + ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Zero-query SQ8_FP16_InnerProduct mismatch"; ASSERT_NEAR(ip, 1.0f, 0.01f) << "Zero-query IP must equal 1.0 (1 - 0)"; float l2_baseline = test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_nonzero_quantized.data(), v_zero_query.data(), dim); - float l2 = l2_func(v_nonzero_quantized.data(), v_zero_query.data(), dim); - ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Zero-query SQ8_FP16 L2 mismatch"; + float l2 = SQ8_FP16_L2Sqr(v_nonzero_quantized.data(), v_zero_query.data(), dim); + ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Zero-query SQ8_FP16_L2Sqr mismatch"; } // Constant SQ8 storage (all values identical => delta = 0). Storage quantizer sets delta to 1.0 -// to avoid div-by-zero, so verify the kernels still match the dequantization baseline. Routes -// through the dispatcher so the runtime-selected SIMD tier sees the edge input. +// to avoid div-by-zero, so verify the kernels still match the dequantization baseline. 
TEST(SQ8_FP16_EdgeCases, ConstantStorageTest) { size_t dim = 64; @@ -3474,23 +3425,19 @@ TEST(SQ8_FP16_EdgeCases, ConstantStorageTest) { test_utils::quantize_float_vec_to_sq8_with_metadata(v_const.data(), dim, v_const_quantized.data()); - auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr); - auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr); - float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_const_quantized.data(), v_query.data(), dim); - float ip = ip_func(v_const_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Constant-storage SQ8_FP16 IP mismatch"; + float ip = SQ8_FP16_InnerProduct(v_const_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Constant-storage SQ8_FP16_InnerProduct mismatch"; float l2_baseline = test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_const_quantized.data(), v_query.data(), dim); - float l2 = l2_func(v_const_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Constant-storage SQ8_FP16 L2 mismatch"; + float l2 = SQ8_FP16_L2Sqr(v_const_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Constant-storage SQ8_FP16_L2Sqr mismatch"; } // Mixed-sign FP16 query (alternating positive/negative values) verifies sign handling // in the FP16->FP32 widening path and in the algebraic identity used by the kernels. -// Routes through the dispatcher so the runtime-selected SIMD tier sees the edge input. 
TEST(SQ8_FP16_EdgeCases, MixedSignQueryTest) { size_t dim = 64; @@ -3510,24 +3457,20 @@ TEST(SQ8_FP16_EdgeCases, MixedSignQueryTest) { std::vector v_quantized(quantized_size); test_utils::populate_float_vec_to_sq8_with_metadata(v_quantized.data(), dim, false, 9876); - auto ip_func = IP_SQ8_FP16_GetDistFunc(dim, nullptr); - auto cos_func = Cosine_SQ8_FP16_GetDistFunc(dim, nullptr); - auto l2_func = L2_SQ8_FP16_GetDistFunc(dim, nullptr); - float ip_baseline = test_utils::SQ8_FP16_NotOptimized_InnerProduct(v_quantized.data(), v_query.data(), dim); - float ip = ip_func(v_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Mixed-sign SQ8_FP16 IP mismatch"; + float ip = SQ8_FP16_InnerProduct(v_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(ip, ip_baseline, 0.01f) << "Mixed-sign SQ8_FP16_InnerProduct mismatch"; float cos_baseline = test_utils::SQ8_FP16_NotOptimized_Cosine(v_quantized.data(), v_query.data(), dim); - float cos = cos_func(v_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(cos, cos_baseline, 0.01f) << "Mixed-sign SQ8_FP16 Cosine mismatch"; + float cos = SQ8_FP16_Cosine(v_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(cos, cos_baseline, 0.01f) << "Mixed-sign SQ8_FP16_Cosine mismatch"; float l2_baseline = test_utils::SQ8_FP16_NotOptimized_L2Sqr(v_quantized.data(), v_query.data(), dim); - float l2 = l2_func(v_quantized.data(), v_query.data(), dim); - ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Mixed-sign SQ8_FP16 L2 mismatch"; + float l2 = SQ8_FP16_L2Sqr(v_quantized.data(), v_query.data(), dim); + ASSERT_NEAR(l2, l2_baseline, 0.01f) << "Mixed-sign SQ8_FP16_L2Sqr mismatch"; } /* ======================== Tests SQ8_SQ8 ========================= */ From 929f694cdfd106eaba30c94cee87d8ef3564b18d Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 14:26:38 +0300 Subject: [PATCH 16/24] =?UTF-8?q?Split=20SQ8=E2=86=94FP16=20F16C=20kernels?= =?UTF-8?q?=20into=20sibling=20TUs=20[MOD-14954]?= MIME-Version: 1.0 
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on
F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in
those dispatcher TUs is F16C-clean. The previous arrangement mixed both
under `#ifdef OPT_F16C` blocks inside the base dispatcher .cpp/.h files.

Split each tier's F16C-dependent kernels off into a sibling TU:

  functions/SSE4.cpp           → SSE4 + SQ8_FP32          (no F16C)
  functions/SSE4_F16C.cpp      → SQ8_FP16 only            (requires -mavx -mf16c)
  functions/AVX2.cpp           → AVX2 + BF16 + SQ8_FP32   (no F16C)
  functions/AVX2_F16C.cpp      → SQ8_FP16 only            (requires -mf16c)
  functions/AVX2_FMA.cpp       → SQ8_FP32                 (no F16C)
  functions/AVX2_FMA_F16C.cpp  → SQ8_FP16 only            (requires -mf16c)

The AVX-512 tier is unaffected: its SQ8_FP16 kernel uses
`_mm512_cvtph_ps`, which is part of AVX-512F and not F16C.

CMake now compiles each sibling TU conditionally on `_has_full_f16c`
and applies the F16C flags only there. Base TUs no longer carry
`-mf16c`, since they no longer reference F16C intrinsics.

Result:
- No `#ifdef OPT_F16C` directives in `functions/*.cpp` or
  `functions/*.h`.
- Compile-time isolation: an F16C intrinsic accidentally added outside
  a `_F16C` sibling will fail to build, not silently miscompile.
- Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`,
  `bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new
  sibling .h includes are unconditional, since declarations alone don't
  link-error and the calls remain guarded.

Verified: 131 SQ8_FP16 unit tests + 115 ASan + 1166 full test_spaces
suite (covers other ISA tiers SQ8_FP32 / BF16 / INT8 / UINT8 to confirm
no regression from the dispatcher restructure).
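
The "unconditional declaration, guarded call" arrangement described above can be sketched in a single file. This is only a minimal illustration under stated assumptions: `DEMO_OPT_F16C`, `CpuFeatures`, `scalar_ip`, `choose_ip_avx2_f16c`, and `get_ip_func` are hypothetical stand-ins, not the library's real macro, types, or functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration only.
using dist_func_t = float (*)(const void *, const void *, std::size_t);

struct CpuFeatures {
    bool avx2;
    bool f16c;
};

// Scalar fallback that is always compiled.
float scalar_ip(const void *, const void *, std::size_t) { return 0.0f; }

// Always-visible declaration, mirroring a sibling header with no
// preprocessor guard. A declaration that is never called needs no definition.
dist_func_t choose_ip_avx2_f16c(std::size_t dim);

#ifdef DEMO_OPT_F16C
// Stands in for the sibling TU, which the build system would compile only
// when the toolchain supports F16C.
float avx2_f16c_ip(const void *, const void *, std::size_t) { return 1.0f; }
dist_func_t choose_ip_avx2_f16c(std::size_t) { return avx2_f16c_ip; }
#endif

dist_func_t get_ip_func(std::size_t dim, const CpuFeatures &features) {
    assert(dim > 0);
    dist_func_t ret = scalar_ip;
    if (dim < 16) {
        return ret; // too small for the SIMD paths
    }
#ifdef DEMO_OPT_F16C
    // The call is guarded, so the symbol is referenced only when it exists.
    if (features.avx2 && features.f16c) {
        return choose_ip_avx2_f16c(dim);
    }
#else
    (void)features; // nothing to dispatch on in this configuration
#endif
    return ret;
}
```

Built without `DEMO_OPT_F16C`, the file still links even though `choose_ip_avx2_f16c` has no definition, because no call site references it; `get_ip_func` then returns the scalar fallback regardless of the runtime flags.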
--- src/VecSim/spaces/CMakeLists.txt | 55 ++++++++++--------- src/VecSim/spaces/IP_space.cpp | 3 + src/VecSim/spaces/L2_space.cpp | 3 + src/VecSim/spaces/functions/AVX2.cpp | 22 -------- src/VecSim/spaces/functions/AVX2.h | 5 -- src/VecSim/spaces/functions/AVX2_F16C.cpp | 35 ++++++++++++ src/VecSim/spaces/functions/AVX2_F16C.h | 23 ++++++++ src/VecSim/spaces/functions/AVX2_FMA.cpp | 22 -------- src/VecSim/spaces/functions/AVX2_FMA.h | 5 -- src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp | 35 ++++++++++++ src/VecSim/spaces/functions/AVX2_FMA_F16C.h | 23 ++++++++ src/VecSim/spaces/functions/SSE4.cpp | 22 -------- src/VecSim/spaces/functions/SSE4.h | 5 -- src/VecSim/spaces/functions/SSE4_F16C.cpp | 35 ++++++++++++ src/VecSim/spaces/functions/SSE4_F16C.h | 23 ++++++++ tests/benchmark/spaces_benchmarks/bm_spaces.h | 3 + tests/unit/test_spaces.cpp | 3 + 17 files changed, 215 insertions(+), 107 deletions(-) create mode 100644 src/VecSim/spaces/functions/AVX2_F16C.cpp create mode 100644 src/VecSim/spaces/functions/AVX2_F16C.h create mode 100644 src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp create mode 100644 src/VecSim/spaces/functions/AVX2_FMA_F16C.h create mode 100644 src/VecSim/spaces/functions/SSE4_F16C.cpp create mode 100644 src/VecSim/spaces/functions/SSE4_F16C.h diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt index a580916d2..9b7477837 100644 --- a/src/VecSim/spaces/CMakeLists.txt +++ b/src/VecSim/spaces/CMakeLists.txt @@ -57,30 +57,33 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") set(_has_full_f16c TRUE) endif() + # Base AVX2 / AVX2+FMA dispatcher TUs hold only kernels with no F16C dependency. + # SQ8↔FP16 kernels (which require F16C) live in sibling TUs functions/AVX2_F16C.cpp and + # functions/AVX2_FMA_F16C.cpp, compiled only when _has_full_f16c is true. 
if(CXX_AVX2) - set(_avx2_flags "-mavx2") - if(_has_full_f16c) - message("Building functions/AVX2.cpp with AVX2 and F16C") - set(_avx2_flags "${_avx2_flags} -mf16c") - else() - message("Building functions/AVX2.cpp with AVX2 (no F16C)") - endif() - set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "${_avx2_flags}") + message("Building functions/AVX2.cpp with AVX2") + set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "-mavx2") list(APPEND OPTIMIZATIONS functions/AVX2.cpp) endif() + if(CXX_AVX2 AND _has_full_f16c) + message("Building functions/AVX2_F16C.cpp with AVX2 and F16C") + set_source_files_properties(functions/AVX2_F16C.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mf16c") + list(APPEND OPTIMIZATIONS functions/AVX2_F16C.cpp) + endif() + if(CXX_AVX2 AND CXX_FMA) - set(_avx2_fma_flags "-mavx2 -mfma") - if(_has_full_f16c) - message("Building functions/AVX2_FMA.cpp with AVX2, FMA, and F16C") - set(_avx2_fma_flags "${_avx2_fma_flags} -mf16c") - else() - message("Building functions/AVX2_FMA.cpp with AVX2 and FMA (no F16C)") - endif() - set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "${_avx2_fma_flags}") + message("Building functions/AVX2_FMA.cpp with AVX2 and FMA") + set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma") list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp) endif() + if(CXX_AVX2 AND CXX_FMA AND _has_full_f16c) + message("Building functions/AVX2_FMA_F16C.cpp with AVX2, FMA, and F16C") + set_source_files_properties(functions/AVX2_FMA_F16C.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma -mf16c") + list(APPEND OPTIMIZATIONS functions/AVX2_FMA_F16C.cpp) + endif() + if(CXX_F16C AND CXX_FMA AND CXX_AVX) message("Building with CXX_F16C") set_source_files_properties(functions/F16C.cpp PROPERTIES COMPILE_FLAGS "-mf16c -mfma -mavx") @@ -100,19 +103,19 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") endif() if(CXX_SSE4) - set(_sse4_flags 
"-msse4.1") - if(_has_full_f16c) - # F16C is VEX-encoded → must compile with -mavx alongside -mf16c, matching the - # F16C.cpp recipe above. - message("Building functions/SSE4.cpp with SSE4.1, AVX, and F16C") - set(_sse4_flags "${_sse4_flags} -mavx -mf16c") - else() - message("Building functions/SSE4.cpp with SSE4.1 (no F16C)") - endif() - set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "${_sse4_flags}") + message("Building functions/SSE4.cpp with SSE4.1") + set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "-msse4.1") list(APPEND OPTIMIZATIONS functions/SSE4.cpp) endif() + # SSE4 SQ8↔FP16 kernels need F16C, which is VEX-encoded → require -mavx alongside -mf16c + # (mirrors the F16C.cpp recipe above). + if(CXX_SSE4 AND _has_full_f16c) + message("Building functions/SSE4_F16C.cpp with SSE4.1, AVX, and F16C") + set_source_files_properties(functions/SSE4_F16C.cpp PROPERTIES COMPILE_FLAGS "-msse4.1 -mavx -mf16c") + list(APPEND OPTIMIZATIONS functions/SSE4_F16C.cpp) + endif() + if(CXX_SSE) message("Building with SSE") set_source_files_properties(functions/SSE.cpp PROPERTIES COMPILE_FLAGS -msse) diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index 1fd7381b7..cdc086683 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -20,9 +20,12 @@ #include "VecSim/spaces/functions/AVX512BF16_VL.h" #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h" #include "VecSim/spaces/functions/AVX2.h" +#include "VecSim/spaces/functions/AVX2_F16C.h" #include "VecSim/spaces/functions/AVX2_FMA.h" +#include "VecSim/spaces/functions/AVX2_FMA_F16C.h" #include "VecSim/spaces/functions/SSE3.h" #include "VecSim/spaces/functions/SSE4.h" +#include "VecSim/spaces/functions/SSE4_F16C.h" #include "VecSim/spaces/functions/NEON.h" #include "VecSim/spaces/functions/NEON_DOTPROD.h" #include "VecSim/spaces/functions/NEON_HP.h" diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp 
index 0ada05f76..dcf8b2376 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -19,9 +19,12 @@ #include "VecSim/spaces/functions/AVX512FP16_VL.h" #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h" #include "VecSim/spaces/functions/AVX2.h" +#include "VecSim/spaces/functions/AVX2_F16C.h" #include "VecSim/spaces/functions/AVX2_FMA.h" +#include "VecSim/spaces/functions/AVX2_FMA_F16C.h" #include "VecSim/spaces/functions/SSE3.h" #include "VecSim/spaces/functions/SSE4.h" +#include "VecSim/spaces/functions/SSE4_F16C.h" #include "VecSim/spaces/functions/NEON.h" #include "VecSim/spaces/functions/NEON_DOTPROD.h" #include "VecSim/spaces/functions/NEON_HP.h" diff --git a/src/VecSim/spaces/functions/AVX2.cpp b/src/VecSim/spaces/functions/AVX2.cpp index 7e229b003..0e6737f30 100644 --- a/src/VecSim/spaces/functions/AVX2.cpp +++ b/src/VecSim/spaces/functions/AVX2.cpp @@ -13,10 +13,6 @@ #include "VecSim/spaces/IP/IP_AVX2_SQ8_FP32.h" #include "VecSim/spaces/L2/L2_AVX2_SQ8_FP32.h" -#ifdef OPT_F16C -#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h" -#include "VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h" -#endif namespace spaces { @@ -52,24 +48,6 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim) { return ret_dist_func; } -#ifdef OPT_F16C -dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2); - return ret_dist_func; -} -dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2); - return ret_dist_func; -} -dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2); - return ret_dist_func; -} -#endif - #include "implementation_chooser_cleanup.h" } // namespace spaces diff --git 
a/src/VecSim/spaces/functions/AVX2.h b/src/VecSim/spaces/functions/AVX2.h index 45fa2c951..283f6b95e 100644 --- a/src/VecSim/spaces/functions/AVX2.h +++ b/src/VecSim/spaces/functions/AVX2.h @@ -19,10 +19,5 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim); dist_func_t Choose_BF16_IP_implementation_AVX2(size_t dim); dist_func_t Choose_BF16_L2_implementation_AVX2(size_t dim); -#ifdef OPT_F16C -dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim); -dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim); -dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim); -#endif } // namespace spaces diff --git a/src/VecSim/spaces/functions/AVX2_F16C.cpp b/src/VecSim/spaces/functions/AVX2_F16C.cpp new file mode 100644 index 000000000..3d298e81b --- /dev/null +++ b/src/VecSim/spaces/functions/AVX2_F16C.cpp @@ -0,0 +1,35 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). 
+ */ +#include "AVX2_F16C.h" +#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h" + +namespace spaces { + +#include "implementation_chooser.h" + +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2); + return ret_dist_func; +} + +#include "implementation_chooser_cleanup.h" + +} // namespace spaces diff --git a/src/VecSim/spaces/functions/AVX2_F16C.h b/src/VecSim/spaces/functions/AVX2_F16C.h new file mode 100644 index 000000000..95a171199 --- /dev/null +++ b/src/VecSim/spaces/functions/AVX2_F16C.h @@ -0,0 +1,23 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once + +#include "VecSim/spaces/spaces.h" + +// SQ8↔FP16 kernels for the AVX2 (no FMA) tier. Live in a sibling TU compiled only when the +// toolchain supports F16C (via `-mf16c`), so this header has no preprocessor guard. Callers +// still gate the calls themselves with `#ifdef OPT_F16C`. 
+ +namespace spaces { + +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim); + +} // namespace spaces diff --git a/src/VecSim/spaces/functions/AVX2_FMA.cpp b/src/VecSim/spaces/functions/AVX2_FMA.cpp index 5745a4ddf..288b8c6cb 100644 --- a/src/VecSim/spaces/functions/AVX2_FMA.cpp +++ b/src/VecSim/spaces/functions/AVX2_FMA.cpp @@ -10,10 +10,6 @@ #include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP32.h" #include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP32.h" -#ifdef OPT_F16C -#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h" -#include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h" -#endif namespace spaces { @@ -36,24 +32,6 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim) { return ret_dist_func; } -#ifdef OPT_F16C -dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2_FMA); - return ret_dist_func; -} -dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2_FMA); - return ret_dist_func; -} -dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2_FMA); - return ret_dist_func; -} -#endif - #include "implementation_chooser_cleanup.h" } // namespace spaces diff --git a/src/VecSim/spaces/functions/AVX2_FMA.h b/src/VecSim/spaces/functions/AVX2_FMA.h index 413f55081..21a364177 100644 --- a/src/VecSim/spaces/functions/AVX2_FMA.h +++ b/src/VecSim/spaces/functions/AVX2_FMA.h @@ -16,10 +16,5 @@ dist_func_t Choose_SQ8_FP32_IP_implementation_AVX2_FMA(size_t dim); dist_func_t Choose_SQ8_FP32_Cosine_implementation_AVX2_FMA(size_t dim); dist_func_t 
Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim); -#ifdef OPT_F16C -dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim); -dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim); -dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim); -#endif } // namespace spaces diff --git a/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp b/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp new file mode 100644 index 000000000..4e9dd8131 --- /dev/null +++ b/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp @@ -0,0 +1,35 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#include "AVX2_FMA_F16C.h" +#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h" + +namespace spaces { + +#include "implementation_chooser.h" + +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2_FMA); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2_FMA); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2_FMA); + return ret_dist_func; +} + +#include "implementation_chooser_cleanup.h" + +} // namespace spaces diff --git a/src/VecSim/spaces/functions/AVX2_FMA_F16C.h b/src/VecSim/spaces/functions/AVX2_FMA_F16C.h new file mode 100644 index 000000000..7943ff4eb --- /dev/null +++ b/src/VecSim/spaces/functions/AVX2_FMA_F16C.h @@ -0,0 +1,23 @@ +/* + * Copyright (c) 2006-Present, 
Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once + +#include "VecSim/spaces/spaces.h" + +// SQ8↔FP16 kernels for the AVX2+FMA tier. Live in a sibling TU compiled only when the +// toolchain supports F16C (via `-mf16c`), so this header has no preprocessor guard. Callers +// still gate the calls themselves with `#ifdef OPT_F16C`. + +namespace spaces { + +dist_func_t Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim); + +} // namespace spaces diff --git a/src/VecSim/spaces/functions/SSE4.cpp b/src/VecSim/spaces/functions/SSE4.cpp index e41762955..1a21d0000 100644 --- a/src/VecSim/spaces/functions/SSE4.cpp +++ b/src/VecSim/spaces/functions/SSE4.cpp @@ -10,10 +10,6 @@ #include "VecSim/spaces/IP/IP_SSE4_SQ8_FP32.h" #include "VecSim/spaces/L2/L2_SSE4_SQ8_FP32.h" -#ifdef OPT_F16C -#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h" -#include "VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h" -#endif namespace spaces { @@ -37,24 +33,6 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim) { return ret_dist_func; } -#ifdef OPT_F16C -dist_func_t Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_SSE4); - return ret_dist_func; -} -dist_func_t Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_SSE4); - return ret_dist_func; -} -dist_func_t Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_SSE4); - return ret_dist_func; -} 
-#endif - #include "implementation_chooser_cleanup.h" } // namespace spaces diff --git a/src/VecSim/spaces/functions/SSE4.h b/src/VecSim/spaces/functions/SSE4.h index c33187983..b1d49c32a 100644 --- a/src/VecSim/spaces/functions/SSE4.h +++ b/src/VecSim/spaces/functions/SSE4.h @@ -16,10 +16,5 @@ dist_func_t Choose_SQ8_FP32_IP_implementation_SSE4(size_t dim); dist_func_t Choose_SQ8_FP32_Cosine_implementation_SSE4(size_t dim); dist_func_t Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim); -#ifdef OPT_F16C -dist_func_t Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim); -dist_func_t Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim); -dist_func_t Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim); -#endif } // namespace spaces diff --git a/src/VecSim/spaces/functions/SSE4_F16C.cpp b/src/VecSim/spaces/functions/SSE4_F16C.cpp new file mode 100644 index 000000000..91a11885f --- /dev/null +++ b/src/VecSim/spaces/functions/SSE4_F16C.cpp @@ -0,0 +1,35 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). 
+ */ +#include "SSE4_F16C.h" +#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h" + +namespace spaces { + +#include "implementation_chooser.h" + +dist_func_t Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_SSE4); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_SSE4); + return ret_dist_func; +} +dist_func_t Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_SSE4); + return ret_dist_func; +} + +#include "implementation_chooser_cleanup.h" + +} // namespace spaces diff --git a/src/VecSim/spaces/functions/SSE4_F16C.h b/src/VecSim/spaces/functions/SSE4_F16C.h new file mode 100644 index 000000000..2459c216c --- /dev/null +++ b/src/VecSim/spaces/functions/SSE4_F16C.h @@ -0,0 +1,23 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once + +#include "VecSim/spaces/spaces.h" + +// SQ8↔FP16 kernels for the SSE4 tier. Live in a sibling TU compiled only when the toolchain +// supports F16C (via `-mf16c -mavx`), so this header has no preprocessor guard. Callers +// still gate the calls themselves with `#ifdef OPT_F16C`. 
+ +namespace spaces { + +dist_func_t Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim); + +} // namespace spaces diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces.h b/tests/benchmark/spaces_benchmarks/bm_spaces.h index d99bcc4ca..2303eac0a 100644 --- a/tests/benchmark/spaces_benchmarks/bm_spaces.h +++ b/tests/benchmark/spaces_benchmarks/bm_spaces.h @@ -24,9 +24,12 @@ #include "VecSim/spaces/functions/AVX512BF16_VL.h" #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h" #include "VecSim/spaces/functions/AVX2.h" +#include "VecSim/spaces/functions/AVX2_F16C.h" #include "VecSim/spaces/functions/AVX2_FMA.h" +#include "VecSim/spaces/functions/AVX2_FMA_F16C.h" #include "VecSim/spaces/functions/F16C.h" #include "VecSim/spaces/functions/SSE4.h" +#include "VecSim/spaces/functions/SSE4_F16C.h" #include "VecSim/spaces/functions/SSE3.h" #include "VecSim/spaces/functions/SSE.h" #include "VecSim/spaces/functions/NEON.h" diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index a4da4abb8..9d082a315 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -32,9 +32,12 @@ #include "VecSim/spaces/functions/AVX512FP16_VL.h" #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h" #include "VecSim/spaces/functions/AVX2.h" +#include "VecSim/spaces/functions/AVX2_F16C.h" #include "VecSim/spaces/functions/AVX2_FMA.h" +#include "VecSim/spaces/functions/AVX2_FMA_F16C.h" #include "VecSim/spaces/functions/SSE3.h" #include "VecSim/spaces/functions/SSE4.h" +#include "VecSim/spaces/functions/SSE4_F16C.h" #include "VecSim/spaces/functions/F16C.h" #include "VecSim/spaces/functions/NEON.h" #include "VecSim/spaces/functions/NEON_DOTPROD.h" From b689840f946d43f70023b7eb7f3cc0536ca721ea Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 14:38:12 +0300 Subject: [PATCH 17/24] 
 =?UTF-8?q?Move=20SQ8=E2=86=94FP16=20AVX-512=20dispa?=
 =?UTF-8?q?tch=20to=20AVX512F=20tier=20+=20flatten=20F16C=20guards=20[MOD-?=
 =?UTF-8?q?14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two related cleanups in the SQ8↔FP16 dispatch path:

1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions
   (`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) but was registered
   under the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` + check of
   avx512f/bw/vl/vnni). That meant CPUs with AVX-512F but no VNNI
   (Skylake-X, Skylake-SP, etc.) would fall through to AVX2_FMA even
   though they can run the AVX-512 kernel.

   Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F`
   definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to
   `functions/AVX512F.cpp`, with matching header reshuffle. Dispatch
   sites now gate on `OPT_AVX512F` + `features.avx512f`.

2. F16C is a cross-cutting requirement of the non-AVX-512 SQ8↔FP16
   tiers (SSE4, AVX2, AVX2+FMA): every one of them widens FP16 query
   lanes via `vcvtph2ps`. The per-tier nested `#ifdef OPT_F16C` guards
   were hoisted into a single outer block around the three ISA branches
   in `IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and
   `L2_SQ8_FP16_GetDistFunc`.

Verified: 131 SQ8_FP16 release + 115 ASan + 1166 full test_spaces suite.
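
The effect of the gating change can be sketched as a small tier-selection model. This is an illustration under stated assumptions, not the dispatcher itself: `CpuFeatures`, `Tier`, `pick_tier_old`, `pick_tier_new`, and `skylake_x_like` are hypothetical names, and the compile-time `#ifdef` guards are modelled here as plain runtime checks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the runtime CPU feature flags used by the dispatcher.
struct CpuFeatures {
    bool avx512f, avx512bw, avx512vl, avx512vnni;
    bool avx2, fma3, f16c, avx, sse4_1;
};

enum class Tier { Scalar, SSE4, AVX2, AVX2_FMA, AVX512F };

// Shared selection for the F16C-dependent tiers (SSE4 / AVX2 / AVX2+FMA).
static Tier pick_f16c_tier(const CpuFeatures &f) {
    if (!f.f16c) return Tier::Scalar;            // models the hoisted outer guard
    if (f.avx2 && f.fma3) return Tier::AVX2_FMA;
    if (f.avx2) return Tier::AVX2;
    if (f.sse4_1 && f.avx) return Tier::SSE4;    // F16C is VEX-encoded, so AVX too
    return Tier::Scalar;
}

// Previous gating: the AVX-512 kernel required the full F+BW+VL+VNNI bundle.
Tier pick_tier_old(std::size_t dim, const CpuFeatures &f) {
    assert(dim > 0);
    if (dim < 16) return Tier::Scalar;
    if (f.avx512f && f.avx512bw && f.avx512vl && f.avx512vnni) return Tier::AVX512F;
    return pick_f16c_tier(f);
}

// New gating: AVX-512F alone is enough for the AVX-512 kernel.
Tier pick_tier_new(std::size_t dim, const CpuFeatures &f) {
    assert(dim > 0);
    if (dim < 16) return Tier::Scalar;
    if (f.avx512f) return Tier::AVX512F;
    return pick_f16c_tier(f);
}

// A CPU profile with AVX-512F/BW/VL but no VNNI (Skylake-X-like).
CpuFeatures skylake_x_like() {
    return CpuFeatures{true, true, true, false, true, true, true, true, true};
}
```

For the Skylake-X-like profile at `dim = 128`, the old gating in this model lands on `Tier::AVX2_FMA` while the new gating selects `Tier::AVX512F`, which is the behaviour change item 1 describes.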
--- src/VecSim/spaces/IP_space.cpp | 27 ++++++++----------- src/VecSim/spaces/L2_space.cpp | 15 +++++------ src/VecSim/spaces/functions/AVX2.cpp | 1 - src/VecSim/spaces/functions/AVX2.h | 1 - src/VecSim/spaces/functions/AVX512F.cpp | 21 +++++++++++++++ src/VecSim/spaces/functions/AVX512F.h | 5 ++++ .../spaces/functions/AVX512F_BW_VL_VNNI.cpp | 21 +-------------- .../spaces/functions/AVX512F_BW_VL_VNNI.h | 6 +---- src/VecSim/spaces/functions/SSE4.h | 1 - .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 14 ++++------ tests/unit/test_spaces.cpp | 15 +++++------ 11 files changed, 57 insertions(+), 70 deletions(-) diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index cdc086683..b57971b60 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -190,33 +190,32 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, return ret_dist_func; } // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract. -#ifdef OPT_AVX512_F_BW_VL_VNNI - if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { + // AVX-512 tier only needs AVX-512F (cvtph_ps is part of AVX-512F, no VNNI/BW/VL required). +#ifdef OPT_AVX512F + if (features.avx512f) { if (dim % 16 == 0) // SQ8 chunk = 16 bytes *alignment = 16 * sizeof(uint8_t); return Choose_SQ8_FP16_IP_implementation_AVX512F(dim); } #endif -#ifdef OPT_AVX2_FMA + // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted + // around all three. 
#ifdef OPT_F16C +#ifdef OPT_AVX2_FMA if (features.avx2 && features.fma3 && features.f16c) { if (dim % 8 == 0) // SQ8 chunk = 8 bytes *alignment = 8 * sizeof(uint8_t); return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim); } #endif -#endif #ifdef OPT_AVX2 -#ifdef OPT_F16C if (features.avx2 && features.f16c) { if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); return Choose_SQ8_FP16_IP_implementation_AVX2(dim); } #endif -#endif #ifdef OPT_SSE4 -#ifdef OPT_F16C // F16C is VEX-encoded — require AVX as well, matching the existing F16C/FP16 dispatcher. if (features.sse4_1 && features.f16c && features.avx) { if (dim % 4 == 0) @@ -224,7 +223,7 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, return Choose_SQ8_FP16_IP_implementation_SSE4(dim); } #endif -#endif +#endif // OPT_F16C #endif // x86_64 return ret_dist_func; } @@ -244,40 +243,36 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm if (dim < 16) { return ret_dist_func; } -#ifdef OPT_AVX512_F_BW_VL_VNNI - if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { +#ifdef OPT_AVX512F + if (features.avx512f) { if (dim % 16 == 0) *alignment = 16 * sizeof(uint8_t); return Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim); } #endif -#ifdef OPT_AVX2_FMA #ifdef OPT_F16C +#ifdef OPT_AVX2_FMA if (features.avx2 && features.fma3 && features.f16c) { if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); return Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim); } #endif -#endif #ifdef OPT_AVX2 -#ifdef OPT_F16C if (features.avx2 && features.f16c) { if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); return Choose_SQ8_FP16_Cosine_implementation_AVX2(dim); } #endif -#endif #ifdef OPT_SSE4 -#ifdef OPT_F16C if (features.sse4_1 && features.f16c && features.avx) { if (dim % 4 == 0) *alignment = 4 * sizeof(uint8_t); return Choose_SQ8_FP16_Cosine_implementation_SSE4(dim); } #endif -#endif +#endif // OPT_F16C #endif // x86_64 return ret_dist_func; } diff 
--git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index dcf8b2376..43020399f 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -122,40 +122,39 @@ dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, return ret_dist_func; } // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract. -#ifdef OPT_AVX512_F_BW_VL_VNNI - if (features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni) { + // AVX-512 tier only needs AVX-512F (cvtph_ps is part of AVX-512F, no VNNI/BW/VL required). +#ifdef OPT_AVX512F + if (features.avx512f) { if (dim % 16 == 0) *alignment = 16 * sizeof(uint8_t); return Choose_SQ8_FP16_L2_implementation_AVX512F(dim); } #endif -#ifdef OPT_AVX2_FMA + // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted + // around all three. #ifdef OPT_F16C +#ifdef OPT_AVX2_FMA if (features.avx2 && features.fma3 && features.f16c) { if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); return Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim); } #endif -#endif #ifdef OPT_AVX2 -#ifdef OPT_F16C if (features.avx2 && features.f16c) { if (dim % 8 == 0) *alignment = 8 * sizeof(uint8_t); return Choose_SQ8_FP16_L2_implementation_AVX2(dim); } #endif -#endif #ifdef OPT_SSE4 -#ifdef OPT_F16C if (features.sse4_1 && features.f16c && features.avx) { if (dim % 4 == 0) *alignment = 4 * sizeof(uint8_t); return Choose_SQ8_FP16_L2_implementation_SSE4(dim); } #endif -#endif +#endif // OPT_F16C #endif // x86_64 return ret_dist_func; } diff --git a/src/VecSim/spaces/functions/AVX2.cpp b/src/VecSim/spaces/functions/AVX2.cpp index 0e6737f30..322ed0aec 100644 --- a/src/VecSim/spaces/functions/AVX2.cpp +++ b/src/VecSim/spaces/functions/AVX2.cpp @@ -13,7 +13,6 @@ #include "VecSim/spaces/IP/IP_AVX2_SQ8_FP32.h" #include "VecSim/spaces/L2/L2_AVX2_SQ8_FP32.h" - namespace spaces { #include "implementation_chooser.h" diff --git 
a/src/VecSim/spaces/functions/AVX2.h b/src/VecSim/spaces/functions/AVX2.h
index 283f6b95e..081c42a4e 100644
--- a/src/VecSim/spaces/functions/AVX2.h
+++ b/src/VecSim/spaces/functions/AVX2.h
@@ -19,5 +19,4 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2(size_t dim);
 dist_func_t<float> Choose_BF16_IP_implementation_AVX2(size_t dim);
 dist_func_t<float> Choose_BF16_L2_implementation_AVX2(size_t dim);
 
-
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX512F.cpp b/src/VecSim/spaces/functions/AVX512F.cpp
index e765f4c8b..feb261fb4 100644
--- a/src/VecSim/spaces/functions/AVX512F.cpp
+++ b/src/VecSim/spaces/functions/AVX512F.cpp
@@ -11,10 +11,12 @@
 #include "VecSim/spaces/L2/L2_AVX512F_FP16.h"
 #include "VecSim/spaces/L2/L2_AVX512F_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX512F_FP64.h"
+#include "VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h"
 
 #include "VecSim/spaces/IP/IP_AVX512F_FP16.h"
 #include "VecSim/spaces/IP/IP_AVX512F_FP32.h"
 #include "VecSim/spaces/IP/IP_AVX512F_FP64.h"
+#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
 
 namespace spaces {
 
@@ -56,6 +58,25 @@ dist_func_t<float> Choose_FP16_L2_implementation_AVX512F(size_t dim) {
     return ret_dist_func;
 }
 
+// SQ8↔FP16 kernels only use AVX-512F (cvtph_ps + FMA), so they register here rather than under
+// the VNNI tier; CPUs with AVX-512F but no VNNI (Skylake-X, for example) can use
+// these kernels.
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F);
+    return ret_dist_func;
+}
+
 #include "implementation_chooser_cleanup.h"
 
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX512F.h b/src/VecSim/spaces/functions/AVX512F.h
index fd36f312f..8d600f961 100644
--- a/src/VecSim/spaces/functions/AVX512F.h
+++ b/src/VecSim/spaces/functions/AVX512F.h
@@ -20,4 +20,9 @@ dist_func_t<float> Choose_FP16_L2_implementation_AVX512F(size_t dim);
 dist_func_t<float> Choose_FP32_L2_implementation_AVX512F(size_t dim);
 dist_func_t<double> Choose_FP64_L2_implementation_AVX512F(size_t dim);
 
+// SQ8↔FP16 kernels — only need AVX-512F, not VNNI/BW/VL.
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim);
+
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
index 145300f24..090b192bf 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
@@ -17,9 +17,6 @@
 #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_FP32.h"
 
-#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
-#include "VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h"
-
 #include "VecSim/spaces/IP/IP_AVX512F_BW_VL_VNNI_SQ8_SQ8.h"
 #include "VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8_SQ8.h"
 
@@ -79,23 +76,7 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d
     return ret_dist_func;
 }
 
-// SQ8-to-FP16 distance functions. The kernels themselves only use AVX-512F (cvtph_ps + FMA);
-// they register under the VNNI tier solely for CPU-feature dispatch.
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F);
-    return ret_dist_func;
-}
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim) {
-    dist_func_t<float> ret_dist_func;
-    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F);
-    return ret_dist_func;
-}
+// SQ8-to-FP16 dispatch lives in functions/AVX512F.cpp — the kernel only needs AVX-512F.
 
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
index 13dd9e8a8..13cf06264 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
@@ -24,11 +24,7 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(size_t d
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 
-// SQ8-to-FP16 kernels only use AVX-512F instructions; they are declared here because
-// they register under the VNNI tier for CPU-feature dispatch.
-dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim);
-dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim);
+// SQ8-to-FP16 dispatch declared in AVX512F.h — kernel only needs AVX-512F.
 
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim);
diff --git a/src/VecSim/spaces/functions/SSE4.h b/src/VecSim/spaces/functions/SSE4.h
index b1d49c32a..e47948137 100644
--- a/src/VecSim/spaces/functions/SSE4.h
+++ b/src/VecSim/spaces/functions/SSE4.h
@@ -16,5 +16,4 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_SSE4(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_SSE4(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_SSE4(size_t dim);
 
-
 } // namespace spaces
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
index 04cb13eea..d6e49a180 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
@@ -54,15 +54,11 @@ class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture {
 
 #ifdef CPU_FEATURES_ARCH_X86_64
 cpu_features::X86Features opt = cpu_features::GetX86Info().features;
-// AVX-512 F+BW+VL+VNNI (no F16C requirement — _mm512_cvtph_ps is part of AVX512F).
-#ifdef OPT_AVX512_F_BW_VL_VNNI
-bool avx512_f_bw_vl_vnni_supported = opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni;
-// Kernel itself only needs AVX-512F (cvtph_ps + FMA); the VNNI feature check keeps it on the
-// same dispatch tier as the rest of this file.
-INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
-                                avx512_f_bw_vl_vnni_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
-                                 avx512_f_bw_vl_vnni_supported);
+// AVX-512F is sufficient — _mm512_cvtph_ps is part of AVX-512F, no F16C/VNNI/BW/VL needed.
+#ifdef OPT_AVX512F +bool avx512f_supported = opt.avx512f; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported); #endif #ifdef OPT_F16C diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 9d082a315..b880b6f13 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3094,9 +3094,8 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { dist_func_t arch_opt_func; float baseline = SQ8_FP16_L2Sqr(v2_compressed.data(), v1_query.data(), dim); -#ifdef OPT_AVX512_F_BW_VL_VNNI - if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && - optimization.avx512vnni) { +#ifdef OPT_AVX512F + if (optimization.avx512f) { unsigned char alignment = 0; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F(dim)) @@ -3177,9 +3176,8 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { dist_func_t arch_opt_func; float baseline = SQ8_FP16_InnerProduct(v2_compressed.data(), v1_query.data(), dim); -#ifdef OPT_AVX512_F_BW_VL_VNNI - if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && - optimization.avx512vnni) { +#ifdef OPT_AVX512F + if (optimization.avx512f) { unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX512F(dim)) @@ -3258,9 +3256,8 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { dist_func_t arch_opt_func; float baseline = SQ8_FP16_Cosine(v2_compressed.data(), v1_query.data(), dim); -#ifdef OPT_AVX512_F_BW_VL_VNNI - if (optimization.avx512f && optimization.avx512bw && optimization.avx512vl && - optimization.avx512vnni) { +#ifdef OPT_AVX512F + if (optimization.avx512f) { unsigned char alignment = 0; 
arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim)) From 839fe3c669f4ed9371b1e0663a0e6da6a73e6321 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 14:47:32 +0300 Subject: [PATCH 18/24] Clean up whitespace and formatting inconsistencies Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix indentation in AVX512F SQ8_FP16 function signatures, and reformat benchmark macro invocation to fit line length conventions. --- src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h | 5 ++--- src/VecSim/spaces/functions/AVX2_FMA.cpp | 1 - src/VecSim/spaces/functions/AVX2_FMA.h | 1 - src/VecSim/spaces/functions/SSE4.cpp | 1 - tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 3 ++- 5 files changed, 4 insertions(+), 7 deletions(-) diff --git a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h index 955f431f6..7ba9c0412 100644 --- a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h @@ -111,13 +111,12 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si template // 0..15 float SQ8_FP16_InnerProductSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, - size_t dimension) { + size_t dimension) { return 1.0f - SQ8_FP16_InnerProductImp_AVX512(pVec1v, pVec2v, dimension); } template // 0..15 -float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, - size_t dimension) { +float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, size_t dimension) { // Cosine distance = 1 - IP for pre-normalised vectors. Aliases InnerProduct, matching the // SQ8_FP32 pattern. 
     return SQ8_FP16_InnerProductSIMD16_AVX512F<residual>(pVec1v, pVec2v, dimension);
diff --git a/src/VecSim/spaces/functions/AVX2_FMA.cpp b/src/VecSim/spaces/functions/AVX2_FMA.cpp
index 288b8c6cb..c859128b2 100644
--- a/src/VecSim/spaces/functions/AVX2_FMA.cpp
+++ b/src/VecSim/spaces/functions/AVX2_FMA.cpp
@@ -10,7 +10,6 @@
 #include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP32.h"
 #include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP32.h"
 
-
 namespace spaces {
 
 #include "implementation_chooser.h"
diff --git a/src/VecSim/spaces/functions/AVX2_FMA.h b/src/VecSim/spaces/functions/AVX2_FMA.h
index 21a364177..b20b1a588 100644
--- a/src/VecSim/spaces/functions/AVX2_FMA.h
+++ b/src/VecSim/spaces/functions/AVX2_FMA.h
@@ -16,5 +16,4 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX2_FMA(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX2_FMA(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX2_FMA(size_t dim);
 
-
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4.cpp b/src/VecSim/spaces/functions/SSE4.cpp
index 1a21d0000..5f5bbc1ba 100644
--- a/src/VecSim/spaces/functions/SSE4.cpp
+++ b/src/VecSim/spaces/functions/SSE4.cpp
@@ -10,7 +10,6 @@
 #include "VecSim/spaces/IP/IP_SSE4_SQ8_FP32.h"
 #include "VecSim/spaces/L2/L2_SSE4_SQ8_FP32.h"
 
-
 namespace spaces {
 
 #include "implementation_chooser.h"
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
index d6e49a180..ba3030064 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
@@ -58,7 +58,8 @@ cpu_features::X86Features opt = cpu_features::GetX86Info().features;
 #ifdef OPT_AVX512F
 bool avx512f_supported = opt.avx512f;
 INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported);
-INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
+                                 avx512f_supported);
 #endif
 
 #ifdef OPT_F16C
From 3565985ef9849e51f567707dbdb71936dccce984 Mon Sep 17 00:00:00 2001
From: Dor Forer
Date: Thu, 28 May 2026 15:41:43 +0300
Subject: [PATCH 19/24] Remove obsolete SQ8-to-FP16 dispatch comments

The comments referencing SQ8-to-FP16 dispatch location are no longer
accurate after the recent refactoring that moved the dispatch logic.
Clean up these stale comments from the AVX512F_BW_VL_VNNI files.
---
 src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp | 2 --
 src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h   | 2 --
 2 files changed, 4 deletions(-)

diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
index 090b192bf..712bdda4e 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
@@ -76,8 +76,6 @@ dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d
     return ret_dist_func;
 }
 
-// SQ8-to-FP16 dispatch lives in functions/AVX512F.cpp — the kernel only needs AVX-512F.
-
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
     dist_func_t<float> ret_dist_func;
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
index 13cf06264..fe1583491 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h
@@ -24,8 +24,6 @@ dist_func_t<float> Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(size_t d
 dist_func_t<float> Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 
-// SQ8-to-FP16 dispatch declared in AVX512F.h — kernel only needs AVX-512F.
-
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t<float> Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim);
 dist_func_t<float> Choose_SQ8_SQ8_Cosine_implementation_AVX512F_BW_VL_VNNI(size_t dim);
From 771bb39e5754c715e3cc5c2fba767c7bd183581f Mon Sep 17 00:00:00 2001
From: Dor Forer
Date: Sun, 31 May 2026 11:58:54 +0300
Subject: [PATCH 20/24]
 =?UTF-8?q?Hoist=20OPT=5FF16C=20guard=20around=20low?=
 =?UTF-8?q?er=20SIMD=20tiers=20in=20SQ8=E2=86=94FP16=20tests=20[MOD-14954]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mirrors the dispatcher layout in IP_space.cpp / L2_space.cpp where a
single OPT_F16C guard wraps the AVX2+FMA, AVX2, and SSE4 branches. Each
test body (L2/IP/Cosine) and the TierCoverage report now use the same
single-guard shape.

Also retargets the TierCoverage AVX-512 check from
OPT_AVX512_F_BW_VL_VNNI to OPT_AVX512F, matching the dispatcher's new
AVX-512F-only gate.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 tests/unit/test_spaces.cpp | 47 ++++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index b880b6f13..71573ed03 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3105,8 +3105,10 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
         optimization.avx512f = 0;
     }
 #endif
-#ifdef OPT_AVX2_FMA
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three — matches the dispatcher layout in L2_space.cpp.
#ifdef OPT_F16C +#ifdef OPT_AVX2_FMA if (optimization.avx2 && optimization.fma3 && optimization.f16c) { unsigned char alignment = 0; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3117,9 +3119,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { optimization.fma3 = 0; } #endif -#endif #ifdef OPT_AVX2 -#ifdef OPT_F16C if (optimization.avx2 && optimization.f16c) { unsigned char alignment = 0; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3130,9 +3130,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { optimization.avx2 = 0; } #endif -#endif #ifdef OPT_SSE4 -#ifdef OPT_F16C if (optimization.sse4_1 && optimization.f16c && optimization.avx) { unsigned char alignment = 0; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3143,7 +3141,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { optimization.sse4_1 = 0; } #endif -#endif +#endif // OPT_F16C // Scalar fallback. Init alignment to a sentinel (0xFF) so the assert below actually verifies // that the dispatcher LEAVES THE VALUE UNTOUCHED on the scalar path — initialising to 0 then @@ -3187,8 +3185,10 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { optimization.avx512f = 0; } #endif -#ifdef OPT_AVX2_FMA + // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted + // around all three — matches the dispatcher layout in IP_space.cpp. 
#ifdef OPT_F16C +#ifdef OPT_AVX2_FMA if (optimization.avx2 && optimization.fma3 && optimization.f16c) { unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3199,9 +3199,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { optimization.fma3 = 0; } #endif -#endif #ifdef OPT_AVX2 -#ifdef OPT_F16C if (optimization.avx2 && optimization.f16c) { unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3212,9 +3210,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { optimization.avx2 = 0; } #endif -#endif #ifdef OPT_SSE4 -#ifdef OPT_F16C if (optimization.sse4_1 && optimization.f16c && optimization.avx) { unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3225,7 +3221,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { optimization.sse4_1 = 0; } #endif -#endif +#endif // OPT_F16C // Scalar fallback — see L2 test for the 0xFF sentinel rationale. unsigned char alignment = 0xFF; @@ -3267,8 +3263,10 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { optimization.avx512f = 0; } #endif -#ifdef OPT_AVX2_FMA + // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted + // around all three — matches the dispatcher layout in IP_space.cpp. 
#ifdef OPT_F16C +#ifdef OPT_AVX2_FMA if (optimization.avx2 && optimization.fma3 && optimization.f16c) { unsigned char alignment = 0; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3279,9 +3277,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { optimization.fma3 = 0; } #endif -#endif #ifdef OPT_AVX2 -#ifdef OPT_F16C if (optimization.avx2 && optimization.f16c) { unsigned char alignment = 0; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3292,9 +3288,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { optimization.avx2 = 0; } #endif -#endif #ifdef OPT_SSE4 -#ifdef OPT_F16C if (optimization.sse4_1 && optimization.f16c && optimization.avx) { unsigned char alignment = 0; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3305,7 +3299,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { optimization.sse4_1 = 0; } #endif -#endif +#endif // OPT_F16C // Scalar fallback — see L2 test for the 0xFF sentinel rationale. unsigned char alignment = 0xFF; @@ -3337,15 +3331,17 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) { bool any_simd = false; #ifdef CPU_FEATURES_ARCH_X86_64 -#ifdef OPT_AVX512_F_BW_VL_VNNI - if (opt.avx512f && opt.avx512bw && opt.avx512vl && opt.avx512vnni) { - std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier exercised\n"; +#ifdef OPT_AVX512F + if (opt.avx512f) { + std::cerr << "[SQ8_FP16] AVX-512F tier exercised\n"; any_simd = true; } else { - std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier NOT exercised on this host\n"; + std::cerr << "[SQ8_FP16] AVX-512F tier NOT exercised on this host\n"; } #endif -#if defined(OPT_AVX2_FMA) && defined(OPT_F16C) + // F16C guards all non-AVX-512 SQ8↔FP16 tiers — matches the dispatcher layout. 
+#ifdef OPT_F16C +#ifdef OPT_AVX2_FMA if (opt.avx2 && opt.fma3 && opt.f16c) { std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier exercised\n"; any_simd = true; @@ -3353,7 +3349,7 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) { std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier NOT exercised on this host\n"; } #endif -#if defined(OPT_AVX2) && defined(OPT_F16C) +#ifdef OPT_AVX2 if (opt.avx2 && opt.f16c) { std::cerr << "[SQ8_FP16] AVX2+F16C tier exercised\n"; any_simd = true; @@ -3361,7 +3357,7 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) { std::cerr << "[SQ8_FP16] AVX2+F16C tier NOT exercised on this host\n"; } #endif -#if defined(OPT_SSE4) && defined(OPT_F16C) +#ifdef OPT_SSE4 if (opt.sse4_1 && opt.f16c && opt.avx) { std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier exercised\n"; any_simd = true; @@ -3369,6 +3365,7 @@ TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) { std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier NOT exercised on this host\n"; } #endif +#endif // OPT_F16C #endif // x86_64 if (!any_simd) { From 8fe3d7431fd0c3f54bd000461be4623048b2aa79 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Sun, 31 May 2026 12:24:56 +0300 Subject: [PATCH 21/24] =?UTF-8?q?Drop=20non-idiomatic=20SQ8=E2=86=94FP16?= =?UTF-8?q?=20tier-coverage=20reporter=20test=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised was an outlier — no other data type has a std::cerr-based coverage reporter. Per-tier coverage is already provided by SQ8_FP16_SpacesOptimizationTest (which walks AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar by clearing feature flags), and ISA-lane presence is handled by the CI matrix, matching the convention used by every other type's SpacesOptimizationTest. 
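
For reference, the flag-clearing walk mentioned above can be sketched as
follows. This is a simplified model rather than the test itself:
`CpuFeatures`, `select_tier`, and `walk_tiers` are hypothetical names, tiers
are plain strings, and the flag cleared after each tier follows the pattern
in the test bodies (avx512f, then fma3, then avx2, then sse4_1).

```cpp
#include <string>
#include <vector>

// Hypothetical, reduced feature set; the real tests mutate a copy of
// cpu_features::X86Features.
struct CpuFeatures {
    bool avx512f = false, avx2 = false, fma3 = false;
    bool f16c = false, avx = false, sse4_1 = false;
};

// Simplified tier choice in dispatcher order (illustrative only).
inline std::string select_tier(const CpuFeatures &f) {
    if (f.avx512f) return "AVX512F";
    if (f.avx2 && f.fma3 && f.f16c) return "AVX2_FMA";
    if (f.avx2 && f.f16c) return "AVX2";
    if (f.sse4_1 && f.f16c && f.avx) return "SSE4";
    return "scalar";
}

// Visit every tier the host supports: record the selected tier, clear one
// flag that gates it, and fall through to the next tier. The scalar
// fallback is always visited last.
inline std::vector<std::string> walk_tiers(CpuFeatures f) {
    std::vector<std::string> visited;
    if (f.avx512f) {
        visited.push_back(select_tier(f));
        f.avx512f = false;
    }
    if (f.avx2 && f.fma3 && f.f16c) {
        visited.push_back(select_tier(f));
        f.fma3 = false;
    }
    if (f.avx2 && f.f16c) {
        visited.push_back(select_tier(f));
        f.avx2 = false;
    }
    if (f.sse4_1 && f.f16c && f.avx) {
        visited.push_back(select_tier(f));
        f.sse4_1 = false;
    }
    visited.push_back(select_tier(f));
    return visited;
}
```

A host without F16C skips the three middle tiers in this model, which is why
a single parameterised test can cover whatever tiers the runner has while the
CI matrix decides which ISA lanes exist.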
Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/unit/test_spaces.cpp | 50 -------------------------------------- 1 file changed, 50 deletions(-) diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 71573ed03..ae54e931e 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3323,56 +3323,6 @@ INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest, INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD_HighDim, SQ8_FP16_SpacesOptimizationTest, testing::Values(64UL, 128UL, 256UL, 512UL, 1024UL)); -// Surfaces which SIMD tiers were actually exercised on the current host. Without this, a CI -// runner that lacks AVX-512 silently passes with zero tier-1 coverage. Logs per-tier presence -// to stderr and GTEST_SKIPs only when no SIMD tier is available at all. -TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised) { - auto opt = getCpuOptimizationFeatures(); - bool any_simd = false; - -#ifdef CPU_FEATURES_ARCH_X86_64 -#ifdef OPT_AVX512F - if (opt.avx512f) { - std::cerr << "[SQ8_FP16] AVX-512F tier exercised\n"; - any_simd = true; - } else { - std::cerr << "[SQ8_FP16] AVX-512F tier NOT exercised on this host\n"; - } -#endif - // F16C guards all non-AVX-512 SQ8↔FP16 tiers — matches the dispatcher layout. 
-#ifdef OPT_F16C -#ifdef OPT_AVX2_FMA - if (opt.avx2 && opt.fma3 && opt.f16c) { - std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier exercised\n"; - any_simd = true; - } else { - std::cerr << "[SQ8_FP16] AVX2+FMA+F16C tier NOT exercised on this host\n"; - } -#endif -#ifdef OPT_AVX2 - if (opt.avx2 && opt.f16c) { - std::cerr << "[SQ8_FP16] AVX2+F16C tier exercised\n"; - any_simd = true; - } else { - std::cerr << "[SQ8_FP16] AVX2+F16C tier NOT exercised on this host\n"; - } -#endif -#ifdef OPT_SSE4 - if (opt.sse4_1 && opt.f16c && opt.avx) { - std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier exercised\n"; - any_simd = true; - } else { - std::cerr << "[SQ8_FP16] SSE4+F16C+AVX tier NOT exercised on this host\n"; - } -#endif -#endif // OPT_F16C -#endif // x86_64 - - if (!any_simd) { - GTEST_SKIP() << "No SQ8_FP16 SIMD tier available on this host — scalar fallback only."; - } -} - /* ======================== Tests SQ8_FP16 (edge cases) ========================= */ // Zero FP16 query against a non-zero SQ8 storage. IP must be exactly 1.0 (1 - 0), From 999580f1c780b3d518201d332e4125474357a918 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Sun, 31 May 2026 14:26:19 +0300 Subject: [PATCH 22/24] =?UTF-8?q?Simplify=20SQ8=E2=86=94FP16=20kernels=20a?= =?UTF-8?q?nd=20trim=20PR=20churn=20[MOD-14954]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - AVX512F IP: keep the <=3 tail chunks on distinct accumulators (sum0/sum1/sum2) instead of serializing into one, preserving ILP when the main 64-lane loop runs few or zero times. - Condense kernel header comments; drop redundant float16.h/alignment.h includes (pulled in transitively) and the direct include (provided via space_includes.h, matching the other AVX512F headers). - test_spaces: align the SQ8_FP16 scalar-fallback alignment assertion with the convention used by the other SpacesOptimizationTest suites. 
- Revert unrelated CMake message/quote churn on the base AVX2/SSE4 TUs and the stray blank line in AVX512F_BW_VL_VNNI.cpp, leaving only the additive F16C build blocks in this PR. Co-Authored-By: Claude Opus 4.8 (1M context) --- src/VecSim/spaces/CMakeLists.txt | 10 ++-- src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h | 19 ++---- src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h | 21 ++----- src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h | 58 ++++++++----------- src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h | 18 ++---- src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h | 2 - src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h | 2 - src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h | 2 - src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h | 2 - .../spaces/functions/AVX512F_BW_VL_VNNI.cpp | 1 - tests/unit/test_spaces.cpp | 23 ++------ 11 files changed, 49 insertions(+), 109 deletions(-) diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt index 9b7477837..309d3f3a4 100644 --- a/src/VecSim/spaces/CMakeLists.txt +++ b/src/VecSim/spaces/CMakeLists.txt @@ -61,8 +61,8 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") # SQ8↔FP16 kernels (which require F16C) live in sibling TUs functions/AVX2_F16C.cpp and # functions/AVX2_FMA_F16C.cpp, compiled only when _has_full_f16c is true. 
     if(CXX_AVX2)
-        message("Building functions/AVX2.cpp with AVX2")
-        set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS "-mavx2")
+        message("Building with AVX2")
+        set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS -mavx2)
         list(APPEND OPTIMIZATIONS functions/AVX2.cpp)
     endif()

@@ -73,7 +73,7 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
     endif()

     if(CXX_AVX2 AND CXX_FMA)
-        message("Building functions/AVX2_FMA.cpp with AVX2 and FMA")
+        message("Building with AVX2 and FMA")
         set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma")
         list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp)
     endif()
@@ -103,8 +103,8 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)")
     endif()

     if(CXX_SSE4)
-        message("Building functions/SSE4.cpp with SSE4.1")
-        set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS "-msse4.1")
+        message("Building with SSE4")
+        set_source_files_properties(functions/SSE4.cpp PROPERTIES COMPILE_FLAGS -msse4.1)
         list(APPEND OPTIMIZATIONS functions/SSE4.cpp)
     endif()

diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
index a4c1612ea..3800f1e8a 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h
@@ -17,11 +17,8 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;

 /*
- * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
- *   IP(x, y) = Σ(x_i * y_i)
- *            ≈ Σ((min + delta * q_i) * y_i)
- *            = min * Σy_i + delta * Σ(q_i * y_i)
- *            = min * y_sum + delta * quantized_dot_product
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
  *
  * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C);
  * inner-loop arithmetic runs in FP32 with _mm256_fmadd_ps.
@@ -42,32 +39,25 @@ static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1,
     sum256 = _mm256_fmadd_ps(v1_f, v2_f, sum256);
 }

-// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
-// The residual block reads 8 SQ8 bytes and 16 FP16 bytes unconditionally; shorter blobs would
-// under-read.
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) {
     const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;

-    // Two independent accumulators break the FMA dependency chain so consecutive iterations
-    // can issue in parallel through both FMA ports.
+    // Two accumulators break the FMA dependency chain across consecutive iterations.
     __m256 sum_a = _mm256_setzero_ps();
     __m256 sum_b = _mm256_setzero_ps();

     if constexpr (residual % 8) {
         constexpr int mask = (1 << (residual % 8)) - 1;
-        // Single-side mask is sufficient: SQ8 lanes beyond `residual` may hold garbage, but the
-        // FP16 query blend below forces those FP32 query lanes to 0, so garbage·0=0 contributes
-        // nothing to the dot product. SQ8 load is intentionally unmasked.
         __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1));
         pVec1 += residual % 8;
         __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
         __m256 v1_f = _mm256_cvtepi32_ps(v1_256);

-        // FP16 side: load full 16-byte block (safe — dim >= 16 and metadata follows).
         __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec2));
         __m256 v2_f = _mm256_cvtph_ps(v2_128);
         v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask);
@@ -77,7 +67,6 @@ float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v,
     }

     if constexpr (residual >= 8) {
-        // Route the half-residual chunk to the second accumulator for ILP.
         SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_b);
     }

diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
index 3a01d80f2..acec6102c 100644
--- a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h
@@ -17,15 +17,11 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;

 /*
- * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
- *   IP(x, y) = Σ(x_i * y_i)
- *            ≈ Σ((min + delta * q_i) * y_i)
- *            = min * Σy_i + delta * Σ(q_i * y_i)
- *            = min * y_sum + delta * quantized_dot_product
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
  *
  * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C);
- * inner-loop arithmetic runs in FP32 with separate _mm256_mul_ps + _mm256_add_ps
- * (no FMA tier — Haswell-era AVX2 without FMA support).
+ * inner-loop arithmetic runs in FP32 with separate _mm256_mul_ps + _mm256_add_ps (no FMA).
  */

 // 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum.
@@ -43,26 +39,20 @@ static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const
     sum256 = _mm256_add_ps(sum256, _mm256_mul_ps(v1_f, v2_f));
 }

-// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
-// The residual block reads 8 SQ8 bytes and 16 FP16 bytes unconditionally; shorter blobs would
-// under-read.
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) {
     const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;

-    // Two independent accumulators break the mul→add dependency chain on Haswell-class CPUs
-    // without FMA, where the add cannot retire before the prior mul.
+    // Two accumulators break the mul->add dependency chain (no FMA on this tier).
     __m256 sum_a = _mm256_setzero_ps();
     __m256 sum_b = _mm256_setzero_ps();

     if constexpr (residual % 8) {
         constexpr int mask = (1 << (residual % 8)) - 1;
-        // Single-side mask is sufficient: SQ8 lanes beyond `residual` may hold garbage, but the
-        // FP16 query blend below forces those FP32 query lanes to 0, so garbage·0=0 contributes
-        // nothing to the dot product. SQ8 load is intentionally unmasked.
         __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1));
         pVec1 += residual % 8;
         __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128);
@@ -77,7 +67,6 @@ float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size
     }

     if constexpr (residual >= 8) {
-        // Route the half-residual chunk to the second accumulator for ILP.
         SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_b);
     }

diff --git a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
index 7ba9c0412..60d0ba719 100644
--- a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h
@@ -11,20 +11,25 @@
 #include "VecSim/types/sq8.h"
 #include "VecSim/types/float16.h"
 #include "VecSim/utils/alignment.h"
-#include

 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;

-// Helper: load 16 SQ8 + 16 FP16 lanes, widen both to FP32, fused-multiply-add into sum.
+/*
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
+ *
+ * FP16 query lanes are widened to FP32 per 16-lane chunk via _mm512_cvtph_ps (AVX512F);
+ * inner-loop arithmetic runs in FP32 with _mm512_fmadd_ps.
+ */
+
+// 16-wide AVX512F step: 16 SQ8 lanes + 16 FP16 lanes -> 16 FP32 fused-multiply-add.
 static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const float16 *&pVec2,
                                                     __m512 &sum) {
-    // 16 uint8 -> 16 fp32
     __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1));
     __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128);
     __m512 v1_f = _mm512_cvtepi32_ps(v1_512);

-    // 16 fp16 -> 16 fp32. _mm512_cvtph_ps is part of AVX512F.
     __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2));
     __m512 v2_f = _mm512_cvtph_ps(v2_16);

@@ -34,19 +39,14 @@ static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const
     pVec2 += 16;
 }

-// Raw inner product Σ((min + delta * q_i) * y_i). Used by both InnerProduct/Cosine wrappers
-// and by the L2 kernel.
-// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp, which gates
-// this. The residual block reads 16 SQ8 bytes and 32 FP16 bytes unconditionally; shorter blobs
-// would under-read.
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, size_t dimension) {
-    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v); // SQ8 storage
-    const float16 *pVec2 = static_cast<const float16 *>(pVec2v); // FP16 query
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;

-    // Four independent accumulators break the FMA dependency chain so the inner loop can
-    // saturate both FMA ports on Sapphire Rapids / Zen 4.
+    // Four accumulators break the FMA dependency chain to saturate both FMA ports.
     __m512 sum0 = _mm512_setzero_ps();
     __m512 sum1 = _mm512_setzero_ps();
     __m512 sum2 = _mm512_setzero_ps();
@@ -59,23 +59,16 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
         __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128);
         __m512 v1_f = _mm512_cvtepi32_ps(v1_512);

-        // Safe to read the full 32-byte FP16 chunk: dim >= 16 and the FP16 metadata follows
-        // the lanes, so the load stays within the query blob.
         __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2));
         __m512 v2_f = _mm512_cvtph_ps(v2_16);

-        // Mask out unused lanes by folding the mask into the multiply.
         sum0 = _mm512_maskz_mul_ps(mask, v1_f, v2_f);

         pVec1 += residual;
         pVec2 += residual;
     }

-    // Main unrolled loop: 4 chunks of 16 lanes per iteration, one chunk per accumulator.
-    // Residual leaves `dim - residual` lanes remaining (a multiple of 16), so the
-    // pointer comparison stays exact. Compare via pointer subtraction (not
-    // `pVec1 + 64 <= pEnd1`) so we never form a pointer past one-past-the-end,
-    // which would be UB in C++ regardless of dereference.
+    // Main loop: 4 chunks of 16 lanes per iteration, one chunk per accumulator.
     while (static_cast(pEnd1 - pVec1) >= 64) {
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0);
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1);
@@ -83,25 +76,24 @@ float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, si
         SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum3);
     }

-    // Reduce the four accumulators into one.
-    __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3));
-
-    // Tail: at most three remaining 16-lane chunks.
-    while (pVec1 < pEnd1) {
-        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum);
-    }
+    // Tail: at most three remaining 16-lane chunks (post-residual remainder is a multiple of 16).
+    // Keep chunks on distinct accumulators to preserve ILP when the main loop did not run.
+    const size_t remaining = pEnd1 - pVec1;
+    if (remaining >= 16)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0);
+    if (remaining >= 32)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1);
+    if (remaining >= 48)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum2);
+    __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3));

     float quantized_dot = _mm512_reduce_add_ps(sum);

-    // SQ8 metadata starts at byte offset `dimension`; for odd `dimension` it is not
-    // 4-byte aligned, so use load_unaligned. Mirrors the scalar SQ8_FP16_Impl pattern.
     const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
     const uint8_t *params_bytes = pVec1Base + dimension;
     const float min_val = load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float));
     const float delta = load_unaligned(params_bytes + sq8::DELTA * sizeof(float));

-    // FP16 query metadata sits at byte offset 2*dimension; for odd `dimension` it is
-    // 2-byte aligned only.
     const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
     const auto *query_meta_bytes = reinterpret_cast(pVec2Base + dimension);
     const float y_sum = load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
@@ -117,7 +109,5 @@ float SQ8_FP16_InnerProductSIMD16_AVX512F(const void *pVec1v, const void *pVec2v

 template <unsigned char residual> // 0..15
 float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, size_t dimension) {
-    // Cosine distance = 1 - IP for pre-normalised vectors. Aliases InnerProduct, matching the
-    // SQ8_FP32 pattern.
     return SQ8_FP16_InnerProductSIMD16_AVX512F<residual>(pVec1v, pVec2v, dimension);
 }
diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
index 871a189dc..1cc3cb153 100644
--- a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
@@ -16,20 +16,16 @@ using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;

 /*
- * Asymmetric SQ8 (storage) ↔ FP16 (query) inner product using algebraic identity:
- *   IP(x, y) = Σ(x_i * y_i)
- *            ≈ Σ((min + delta * q_i) * y_i)
- *            = min * Σy_i + delta * Σ(q_i * y_i)
- *            = min * y_sum + delta * quantized_dot_product
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
  *
  * FP16 query lanes are widened to FP32 per 4-lane chunk via _mm_cvtph_ps (F16C);
- * inner-loop arithmetic runs in FP32 with separate _mm_mul_ps + _mm_add_ps (SSE4 has no FMA).
+ * inner-loop arithmetic runs in FP32 with separate _mm_mul_ps + _mm_add_ps (no FMA).
  */

 // 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum.
 static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const float16 *&pVect2,
                                                   __m128 &sum) {
-    // Alignment-safe 4-byte load of SQ8 lanes via load_unaligned (no strict-aliasing UB).
     __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(load_unaligned(pVect1)));
     pVect1 += 4;
     __m128 v1_f = _mm_cvtepi32_ps(v1_i);
@@ -41,8 +37,7 @@ static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const
     sum = _mm_add_ps(sum, _mm_mul_ps(v1_f, v2_f));
 }

-// Precondition: dim >= 16. Caller is the dispatcher in IP_space.cpp / L2_space.cpp.
-// Shorter blobs would underflow the residual ladder + final do-while loop.
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
 template <unsigned char residual> // 0..15
 float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2v,
                                            size_t dimension) {
@@ -50,7 +45,7 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
     const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
     const uint8_t *pEnd1 = pVec1 + dimension;

-    // Two independent accumulators break the mul→add dependency chain (SSE4 lacks FMA).
+    // Two accumulators break the mul->add dependency chain (no FMA on this tier).
     __m128 sum_a = _mm_setzero_ps();
     __m128 sum_b = _mm_setzero_ps();

@@ -80,7 +75,6 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
         sum_a = _mm_mul_ps(v1_f, v2_f);
     }

-    // Alternate the residual-ladder steps across the two accumulators for ILP.
     if constexpr (residual >= 4) {
         SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
     }
@@ -91,8 +85,6 @@ float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2
         SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
     }

-    // Remaining lanes after the residual block are a multiple of 16, hence a multiple of 8,
-    // so two 4-lane steps per iteration consume the tail exactly.
     do {
         SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_a);
         SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
diff --git a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
index 38809e9c2..c855b62ca 100644
--- a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
@@ -11,8 +11,6 @@
 #include "VecSim/spaces/AVX_utils.h"
 #include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
-#include "VecSim/types/float16.h"
-#include "VecSim/utils/alignment.h"

 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
diff --git a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
index 98bb29c05..7c2cbfcd8 100644
--- a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
@@ -11,8 +11,6 @@
 #include "VecSim/spaces/AVX_utils.h"
 #include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
-#include "VecSim/types/float16.h"
-#include "VecSim/utils/alignment.h"

 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
diff --git a/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
index 384870b21..9d7b1569f 100644
--- a/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
@@ -10,8 +10,6 @@
 #include "VecSim/spaces/space_includes.h"
 #include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
-#include "VecSim/types/float16.h"
-#include "VecSim/utils/alignment.h"

 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
diff --git a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
index 75bbd46f8..d0a0fea06 100644
--- a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
+++ b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
@@ -10,8 +10,6 @@
 #include "VecSim/spaces/space_includes.h"
 #include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h"
 #include "VecSim/types/sq8.h"
-#include "VecSim/types/float16.h"
-#include "VecSim/utils/alignment.h"

 using sq8 = vecsim_types::sq8;
 using float16 = vecsim_types::float16;
diff --git a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
index 712bdda4e..3b8813b89 100644
--- a/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
+++ b/src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp
@@ -75,7 +75,6 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(size_t d
     CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP32_L2SqrSIMD16_AVX512F_BW_VL_VNNI);
     return ret_dist_func;
 }
-
 // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum)
 dist_func_t Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(size_t dim) {
     dist_func_t ret_dist_func;
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index ae54e931e..5a3b94556 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -3143,18 +3143,13 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
 #endif
 #endif // OPT_F16C

-    // Scalar fallback. Init alignment to a sentinel (0xFF) so the assert below actually verifies
-    // that the dispatcher LEAVES THE VALUE UNTOUCHED on the scalar path — initialising to 0 then
-    // asserting `== 0` would pass even if the dispatcher were a no-op.
-    unsigned char alignment = 0xFF;
+    unsigned char alignment = 0;
     arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_L2Sqr)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
-                                  "(dim "
-                               << dim << ")";
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
 }

 TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
@@ -3223,16 +3218,13 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) {
 #endif
 #endif // OPT_F16C

-    // Scalar fallback — see L2 test for the 0xFF sentinel rationale.
-    unsigned char alignment = 0xFF;
+    unsigned char alignment = 0;
     arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_InnerProduct)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
-                                  "(dim "
-                               << dim << ")";
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
 }

 TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
@@ -3301,16 +3293,13 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
 #endif
 #endif // OPT_F16C

-    // Scalar fallback — see L2 test for the 0xFF sentinel rationale.
-    unsigned char alignment = 0xFF;
+    unsigned char alignment = 0;
     arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
     ASSERT_EQ(arch_opt_func, SQ8_FP16_Cosine)
         << "Unexpected scalar fallback function for dim " << dim;
     ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
         << "Scalar fallback with dim " << dim;
-    ASSERT_EQ(alignment, 0xFF) << "Scalar fallback must leave caller's alignment value untouched "
-                                  "(dim "
-                               << dim << ")";
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
 }

 // Dim range [16, 32] covers every residual class for the 16-element chunk used by every tier.

From 91c14e5afb50009ad9bde0027f202b84b847fc5e Mon Sep 17 00:00:00 2001
From: Dor Forer
Date: Sun, 31 May 2026 14:37:11 +0300
Subject: [PATCH 23/24] Document why OPT_F16C differs from the other OPT_* macros [MOD-14954]

Explain at the definition site that OPT_F16C is a cross-cutting
capability gate (not a 1:1 dispatch tier), why it is a compound
CXX_F16C/FMA/AVX guard (F16C is VEX-encoded and needs AVX state), and
why the AVX-512 SQ8<->FP16 path stays outside it (_mm512_cvtph_ps is
part of AVX512F).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 cmake/x86_64InstructionFlags.cmake | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/cmake/x86_64InstructionFlags.cmake b/cmake/x86_64InstructionFlags.cmake
index f19ef7662..ff0e43e97 100644
--- a/cmake/x86_64InstructionFlags.cmake
+++ b/cmake/x86_64InstructionFlags.cmake
@@ -73,6 +73,24 @@ if(CXX_AVX512F AND CXX_AVX512BW AND CXX_AVX512VL AND CXX_AVX512VNNI)
     add_compile_definitions(OPT_AVX512_F_BW_VL_VNNI)
 endif()

+# OPT_F16C is unusual compared to the other OPT_* macros above:
+#
+# 1. It is a *capability* gate, not a dispatch tier. Every other OPT_* maps 1:1 to a single
+#    ISA tier that owns its own translation unit (OPT_AVX2 -> AVX2.cpp, OPT_SSE4 -> SSE4.cpp).
+#    F16C owns no tier of its own; it only enables the vcvtph2ps (FP16->FP32) conversion that
+#    several tiers need. So it is hoisted *around* multiple tiers (AVX2_FMA / AVX2 / SSE4 for
+#    the SQ8<->FP16 kernels) rather than selecting one.
+#
+# 2. It is a compound guard (CXX_F16C AND CXX_FMA AND CXX_AVX), not a single flag. F16C is
+#    VEX-encoded, so vcvtph2ps requires AVX state to execute -- emitting it without AVX is
+#    invalid. Defining OPT_F16C therefore implies AVX is present, and the F16C kernels must be
+#    compiled with -mf16c added *on top of* -mavx (see functions/*_F16C.cpp in
+#    src/VecSim/spaces/CMakeLists.txt). The base AVX2.cpp / SSE4.cpp objects stay F16C-free so
+#    they still run on CPUs without F16C.
+#
+# 3. The AVX-512 tier deliberately does NOT use this gate: _mm512_cvtph_ps is part of AVX512F
+#    itself, so the AVX-512 SQ8<->FP16 path needs only OPT_AVX512F and lives outside any
+#    OPT_F16C guard.
 if(CXX_F16C AND CXX_FMA AND CXX_AVX)
     add_compile_definitions(OPT_F16C)
 endif()

From f5926c230b57e028523688d2f49f36db80204378 Mon Sep 17 00:00:00 2001
From: Dor Forer
Date: Sun, 31 May 2026 17:12:04 +0300
Subject: [PATCH 24/24] Cover AVX512 three-chunk tail and dim<16 dispatcher guard in SQ8_FP16 tests [MOD-14954]

Codecov flagged 4 uncovered lines on PR #970:

- The AVX512F `remaining >= 48` third tail step in
  IP_AVX512F_SQ8_FP16.h was never executed: the test dims never
  satisfied (dim / 16) % 4 == 3. Add 48 (zero main-loop iterations) and
  112 (one main-loop iteration) to exercise it.
- The `dim < 16` scalar early-return in the IP/Cosine/L2 SQ8_FP16
  dispatchers was never taken. Assert the three dispatchers return the
  scalar funcs at dim 8.

Test-only change. Local release + ASan: SQ8_FP16 137/137, ASan clean.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 tests/unit/test_spaces.cpp | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index 5a3b94556..474ac5c75 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -572,6 +572,12 @@ TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) {
     ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr));
+    // dim < 16 takes the scalar early-return in every SQ8_FP16 dispatcher (no SIMD tier).
+    size_t small_dim = 8;
+    ASSERT_EQ(L2_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_L2Sqr);
+    ASSERT_EQ(IP_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_InnerProduct);
+    ASSERT_EQ(Cosine_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_Cosine);
 }

 #ifdef CPU_FEATURES_ARCH_X86_64
@@ -3308,9 +3314,10 @@ INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest,

 // Higher dimensions surface multi-iteration loop bugs (pointer stride, do-while termination
 // off-by-one) that the [16, 32] range does not exercise because the AVX-512 inner loop runs at
-// most twice in that range.
+// most twice in that range. 48 and 112 specifically hit the AVX-512 three-chunk tail
+// (remaining == 48, i.e. (dim / 16) % 4 == 3): 48 with zero main-loop iterations, 112 with one.
 INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD_HighDim, SQ8_FP16_SpacesOptimizationTest,
-                         testing::Values(64UL, 128UL, 256UL, 512UL, 1024UL));
+                         testing::Values(48UL, 64UL, 112UL, 128UL, 256UL, 512UL, 1024UL));

 /* ======================== Tests SQ8_FP16 (edge cases) ========================= */
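
Note on the dimension choice in the last hunk: as the AVX-512 IP kernel in this series reads, `dim % 16` lanes go through one masked residual step, the unrolled loop then consumes 64 lanes per iteration, and at most three 16-lane tail steps remain. The sketch below models only that split in scalar C++, so the claim that 48 and 112 reach the third tail step can be checked by hand. `ChunkPlan` and `plan_chunks` are hypothetical names for illustration; they are not part of VecSim and the code uses no SIMD.

```cpp
#include <cstddef>

// Hypothetical helper (not part of VecSim): models how the AVX-512 SQ8<->FP16
// inner-product kernel appears to split `dim` lanes into three phases.
struct ChunkPlan {
    std::size_t residual_lanes;  // dim % 16, handled by the masked first step
    std::size_t main_iterations; // iterations of the 64-lane unrolled loop
    std::size_t tail_chunks;     // leftover 16-lane steps, always 0..3
};

constexpr ChunkPlan plan_chunks(std::size_t dim) {
    const std::size_t residual = dim % 16;
    const std::size_t full = dim - residual; // a multiple of 16
    return ChunkPlan{residual, full / 64, (full % 64) / 16};
}

// The two dims added by the patch are the ones that reach the third tail step.
static_assert(plan_chunks(48).main_iterations == 0, "48: main loop does not run");
static_assert(plan_chunks(48).tail_chunks == 3, "48: three tail chunks");
static_assert(plan_chunks(112).main_iterations == 1, "112: one main-loop iteration");
static_assert(plan_chunks(112).tail_chunks == 3, "112: three tail chunks");

// The earlier high-dim values are multiples of 64, so they leave no tail at all.
static_assert(plan_chunks(64).tail_chunks == 0, "64: no tail");
static_assert(plan_chunks(1024).tail_chunks == 0, "1024: no tail");
```

Under this model no dim in [16, 32] reaches three tail chunks either, since (dim / 16) is at most 2 there, which is consistent with the coverage gap the commit message describes.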