Add SQ8↔FP16 x86 SIMD distance kernels [MOD-14954] by dor-forer · Pull Request #970 · RedisAI/VectorSimilarity

dor-forer · 2026-05-26T10:32:57Z

Summary

Implement AVX-512 (F+BW+VL+VNNI) / AVX2+FMA / AVX2 / SSE4 kernels for asymmetric SQ8↔FP16 distance (IP / Cosine / L2²) on Intel x86.
Wire the new kernels into the dispatcher (IP_space.cpp / L2_space.cpp) with F16C gating on the AVX2/SSE4 tiers; AVX-512 path uses _mm512_cvtph_ps which is part of AVX512F (no F16C requirement).
Append -mf16c conditionally to AVX2_FMA / AVX2 / SSE4 dispatcher source files in CMake (purely additive — no SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 codegen change since those sources contain no F16C intrinsics).
Extend SQ8_FP16_SpacesOptimizationTest to walk all four ISA tiers against the scalar baseline across dim ∈ [16, 32]; update existing GetDistFunc_*_SQ8_FP16 assertions accordingly.
Register per-ISA microbenchmarks in bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout.

Spec: docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md
Plan: docs/superpowers/plans/2026-05-26-sq8-fp16-x86-kernels.md

Test plan

`make clean ALL=1 && make build && make unit_test` — 2271 tests pass (release).
`make asan` — 2271 tests pass under AddressSanitizer.
Benchmark binary runs on hosts that lack AVX-512 (AVX-512 tier gracefully reports "AVX512F_BW_VL_VNNI not available"; AVX2+FMA / AVX2 / SSE4 / scalar tiers all execute and report timings).

🤖 Generated with Claude Code

Note

Medium Risk
Changes core VecSim distance selection and numeric SIMD paths for asymmetric quantization; correctness is heavily tested but wrong dispatch or residual handling could affect search results on x86.

Overview
Adds x86 SIMD for asymmetric SQ8 (index) ↔ FP16 (query) distances: inner product, cosine, and L2², replacing the previous always-scalar path for dim ≥ 16 when CPU features match.

Dispatch in IP_space.cpp and L2_space.cpp now picks tiered implementations: AVX-512F (no VNNI; uses _mm512_cvtph_ps), then AVX2+FMA+F16C, AVX2+F16C, and SSE4.1+AVX+F16C, with optional SQ8 alignment hints when dimension is divisible by the chunk size; smaller dims still use scalar.

Build: F16C-dependent kernels live in separate translation units (AVX2_F16C, AVX2_FMA_F16C, SSE4_F16C) compiled with -mf16c only when the toolchain reports full F16C support, keeping existing AVX2/SSE4 objects unchanged.

Tests & benchmarks: SQ8_FP16_SpacesOptimizationTest walks each tier against the scalar baseline; bm_spaces_sq8_fp16.cpp registers per-ISA benchmarks alongside the naive baseline.

Reviewed by Cursor Bugbot for commit 8fe3d74. Bugbot is set up for automated code reviews on this repo.

Captures the architecture, file-level plan, CMake F16C gating, and risk register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them into the existing dispatcher tables and SQ8_FP16 unit/benchmark scaffolding from MOD-15141. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jit-ci · 2026-05-26T10:34:21Z

🛡️ Jit Security Scan Results

✅ No security findings were detected in this PR

Security scan by Jit

Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes to FP32. The flag is appended only when CXX_F16C is detected; existing SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics so emitted code for those kernels is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest; currently asserts only the scalar fallback path. Per-tier SIMD assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside the kernel implementations in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product, Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8 storage and FP16 query metadata reads use load_unaligned to tolerate odd dimensions. Dispatcher branches in IP_space.cpp / L2_space.cpp select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI when features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni; otherwise behaviour is unchanged from MOD-15141. A parameterised gtest fixture exercises every residual class in [16, 32] against the scalar baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
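A later commit in this PR documents the identity `IP = min·y_sum + delta·Σ(q·y)` that these kernels rely on. The scalar sketch below illustrates why the SIMD loop only needs to accumulate `code * query`. It assumes each stored lane dequantizes as `min + delta * code`, and it represents the query as `float` (a real kernel would widen FP16 to FP32 first). The function names and the parameter layout are hypothetical and are not VecSim's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference: dequantize every lane, then multiply.
// Assumption: lane i dequantizes to min_val + delta * codes[i].
float sq8_ip_reference(const std::vector<std::uint8_t> &codes, float min_val, float delta,
                       const std::vector<float> &query) {
    assert(codes.size() == query.size());
    float ip = 0.0f;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        ip += (min_val + delta * static_cast<float>(codes[i])) * query[i];
    }
    return ip;
}

// Same quantity via the identity: the loop accumulates only code * query,
// and the min/delta correction is applied once after the loop.
float sq8_ip_factored(const std::vector<std::uint8_t> &codes, float min_val, float delta,
                      const std::vector<float> &query, float query_sum) {
    assert(codes.size() == query.size());
    float dot = 0.0f;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        dot += static_cast<float>(codes[i]) * query[i];
    }
    return min_val * query_sum + delta * dot;
}
```

With a precomputed query sum, the factored form keeps the per-lane work to one widen and one multiply-add, which is the shape the per-tier descriptions in this PR give for the inner loop.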

8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a 256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte FP16 block, convert, then blend zero across unused lanes — mirroring the existing F16C FP16 kernel pattern. Dispatcher branch in {IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 && features.fma3 && features.f16c. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps instead of _mm256_fmadd_ps so it can run on Haswell-era AVX2 hardware without FMA support (uncommon but matches the existing SQ8_FP32 tiering). Dispatcher gate requires features.avx2 && features.f16c and runs between the AVX2+FMA and SSE4 tiers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires features.sse4_1 && features.f16c && features.avx since F16C is VEX-encoded — matches the existing F16C/FP16 dispatcher gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The SQ8_FP16 GetDistFunc dispatcher now returns AVX-512 / AVX2+FMA / AVX2 / SSE4 SIMD kernels when the corresponding feature flags are set (only scalar previously). Updates the GetDistFunc_*_SQ8_FP16 asserts to compute the expected function for the host's highest supported tier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier on the corresponding OPT_* defines plus the runtime feature checks that mirror the dispatcher in IP_space.cpp / L2_space.cpp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-26T10:50:57Z

Codecov Report

❌ Patch coverage is 99.04762% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.05%. Comparing base (bbe9dfd) to head (8fe3d74).

Files with missing lines	Patch %	Lines
src/VecSim/spaces/IP_space.cpp	95.23%	2 Missing ⚠️
src/VecSim/spaces/L2_space.cpp	95.23%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #970      +/-   ##
==========================================
+ Coverage   96.99%   97.05%   +0.05%     
==========================================
  Files         130      141      +11     
  Lines        7793     8105     +312     
==========================================
+ Hits         7559     7866     +307     
- Misses        234      239       +5

- CMake: gate `-mf16c` on CXX_F16C AND CXX_FMA AND CXX_AVX (matches OPT_F16C macro) and append `-mavx` to the SSE4 dispatcher when adding -mf16c, since F16C is VEX-encoded and requires AVX state. Mirrors the existing F16C.cpp recipe and prevents miscompiles on toolchains with F16C but without AVX.
- IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast<const int32_t *>(pVect1)` with `load_unaligned<int32_t>(pVect1)` to remove strict-aliasing UB on the uint8_t SQ8 lane load.
- IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell out the asymmetric-mask reasoning (SQ8 unmasked is safe because the FP16 query blend forces those FP32 query lanes to 0 → garbage·0=0).
- IP_AVX{512,2,2_FMA,SSE4}_SQ8_FP16.h: add the `IP = min·y_sum + delta·Σ(q·y)` algebraic-identity comment header that AVX-512 already carried, plus a precondition note that callers must enforce dim >= 16 (matches the established SQ8_FP32 convention; no runtime assert because sibling SQ8_FP32 SIMD kernels also rely on the dispatcher gate).
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery, ConstantStorage, MixedSignQuery) through {IP,Cosine,L2}_SQ8_FP16_GetDistFunc so the runtime-selected SIMD tier is actually exercised on those inputs, not just the scalar reference.
- test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims {64, 128, 256, 512, 1024} so multi-iteration do-while loop bugs would fire (the existing [16, 32] range covers at most two AVX-512 chunk iterations).
- test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised, a single test that emits per-tier coverage to stderr and GTEST_SKIPs when no SIMD tier is available, so CI runners without AVX-512 do not silently report zero tier-1 coverage.
- test_spaces.cpp: scalar-fallback `alignment` checks now seed the value with 0xFF and assert it remains 0xFF, verifying the dispatcher contract ("scalar leaves caller's value untouched") instead of just measuring that the variable's pre-zeroed init survived.
- test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired.
- bm_spaces_sq8_fp16.cpp: drop the matching stale comment.

Out of scope (separate ticket): two-accumulator FMA refactor (also affects SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
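The strict-aliasing fix listed for IP_SSE4_SQ8_FP16.h can be illustrated with a `memcpy`-based load. This is only a sketch: the project's actual `load_unaligned<T>` helper is not shown in this PR text, so `load_unaligned_sketch` and `load_four_sq8_lanes` below are hypothetical stand-ins that show one common way to write such a helper.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for load_unaligned<T>; the real helper may differ.
template <typename T>
T load_unaligned_sketch(const void *src) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    T value;
    // memcpy is well-defined for any source alignment and any source object type,
    // unlike dereferencing a reinterpret_cast'ed pointer.
    std::memcpy(&value, src, sizeof(T));
    return value;
}

// Packs four consecutive SQ8 lanes into one 32-bit value, as a 4-wide kernel
// might do before widening the bytes to 32-bit integers.
std::int32_t load_four_sq8_lanes(const std::uint8_t *lanes) {
    assert(lanes != nullptr);
    return load_unaligned_sketch<std::int32_t>(lanes);
}
```

Compilers typically lower a fixed-size `memcpy` like this to a single load, so the defined-behaviour version usually costs nothing compared with the cast.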

Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels by widening the inner loop to use multiple independent accumulators. L2 kernels inherit the change through their `…InnerProductImp_…` call.

- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main loop (64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks.
- IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the existing 2-step unrolled body now routes each step to an independent accumulator. The `residual >= 8` half-chunk feeds the second accumulator so the prologue also breaks the dependency chain.
- IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate between sum_a and sum_b for prologue ILP.

Correctness invariant: residual block consumes exactly `residual` lanes (0..15) → remaining tail is always a multiple of 16, so the unrolled loops (multiples of 8 / 16 / 64) terminate exactly.

Verified by 131 SQ8_FP16 unit tests + 115 under ASan.
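The multiple-accumulator idea from this commit can be shown with a scalar analogue. The sketch below is not the PR's kernel: it uses plain `float` accumulators instead of SIMD registers, a hypothetical function name, and it assumes the lane count is already a multiple of 4 (mirroring the commit's invariant that the residual is consumed before the unrolled loop).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative scalar analogue of an unroll-4 loop with independent accumulators.
// Precondition (assumed): n is a multiple of 4.
float dot_four_accumulators(const std::uint8_t *codes, const float *query, std::size_t n) {
    assert(n % 4 == 0);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (std::size_t i = 0; i < n; i += 4) {
        // Each accumulator depends only on its own previous value, so the four
        // multiply-adds in one iteration do not wait on each other.
        acc0 += static_cast<float>(codes[i + 0]) * query[i + 0];
        acc1 += static_cast<float>(codes[i + 1]) * query[i + 1];
        acc2 += static_cast<float>(codes[i + 2]) * query[i + 2];
        acc3 += static_cast<float>(codes[i + 3]) * query[i + 3];
    }
    // Combine the partial sums once, after the loop.
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the partial sums are combined in a different order than a single-accumulator loop would use, results can differ from the one-accumulator version by floating-point rounding, which is one reason to compare against a scalar baseline with a tolerance.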

The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction: the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16 (`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded query, which is a different kernel entirely. The file/function names are renamed to match what the kernel actually uses (AVX-512F).

The dispatcher .cpp/.h files stay named after the runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 / INT8 / UINT8 kernels; the gate is a CPU-feature gate, not an ISA claim. The same misnomer exists for SQ8↔FP32; tracked separately so the rename there can ship as its own commit.

Also: fix a pointer-arithmetic UB introduced by the AVX-512 unroll-4 loop. `while (pVec1 + 64 <= pEnd1)` forms a pointer beyond one-past-the-end of the SQ8 storage object when fewer than 64 lane bytes remain, which is UB in C++ even if the pointer is never dereferenced. Switched to pointer subtraction (`static_cast<size_t>(pEnd1 - pVec1) >= 64`).

Renames:
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h
- L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h
- SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F
- Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F

Verified: 131 SQ8_FP16 unit tests + 115 under ASan.
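The loop-bound fix described in this commit can be sketched in isolation. The helper below is hypothetical (it only counts chunks rather than computing a distance), but it uses the same pointer-subtraction test the commit message quotes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Counts how many full 64-byte chunks lie in [begin, end).
// Both pointers are assumed to point into (or one past) the same array.
std::size_t count_full_chunks(const std::uint8_t *begin, const std::uint8_t *end) {
    assert(begin <= end);
    std::size_t chunks = 0;
    const std::uint8_t *p = begin;
    // Avoid `p + 64 <= end`: merely forming p + 64 is undefined when fewer
    // than 64 bytes remain. Subtracting pointers into the same array is defined.
    while (static_cast<std::size_t>(end - p) >= 64) {
        p += 64; // safe: at least 64 bytes remain, so p stays within [begin, end]
        ++chunks;
    }
    return chunks;
}
```

For example, a 150-byte buffer yields two full chunks and leaves 22 bytes for a tail loop, and the check never constructs an out-of-range pointer.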

Design doc was added in ad941b8 for planning; not appropriate as a long-lived in-repo artifact. Keep externally (Confluence / scratch) rather than ship with the kernel commit.

Two trims, both restoring pre-existing patterns elsewhere in the file:

1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`. Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8 sister tests at lines 540-548 and 551-559.
2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`, `MixedSignQueryTest`) were routed through `{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD dispatch on adversarial inputs. Reverted to direct scalar calls (`SQ8_FP16_InnerProduct`, etc.), the original pre-fdc5c1cd shape.

Coverage rationale: the SIMD kernels are branchless on input values (verified by grep: no value-dependent `if` in any tier). Every code path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s random inputs at multiple dims. The edge-case tests verify the *algebraic identity* (IP of zero query = 1.0, constant storage matches dequant baseline, mixed-sign handling); scalar correctness on these inputs is what was actually being checked, and the SIMD path matches scalar via the SpacesOptimizationTest tier walk.

Net: 77 lines removed from the test file, matches sister conventions, no coverage gap.

The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in those dispatcher TUs is F16C-clean. The previous arrangement mixed both under `#ifdef OPT_F16C` blocks inside the base dispatcher .cpp/.h files. Split each tier's F16C-dependent kernels off into a sibling TU:

- functions/SSE4.cpp → SSE4 + SQ8_FP32 (no F16C)
- functions/SSE4_F16C.cpp → SQ8_FP16 only (requires -mavx -mf16c)
- functions/AVX2.cpp → AVX2 + BF16 + SQ8_FP32 (no F16C)
- functions/AVX2_F16C.cpp → SQ8_FP16 only (requires -mf16c)
- functions/AVX2_FMA.cpp → SQ8_FP32 (no F16C)
- functions/AVX2_FMA_F16C.cpp → SQ8_FP16 only (requires -mf16c)

The AVX-512 tier is unaffected: its SQ8_FP16 kernel uses `_mm512_cvtph_ps`, which is part of AVX-512F and not F16C. CMake now compiles each sibling TU conditionally on `_has_full_f16c` and applies the F16C flags only there. Base TUs no longer carry `-mf16c`, since they no longer reference F16C intrinsics.

Result:
- No `#ifdef OPT_F16C` directives in `functions/*.cpp` or `functions/*.h`.
- Compile-time isolation: an F16C intrinsic accidentally added outside a `_F16C` sibling will fail to build, not silently miscompile.
- Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`, `bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new sibling .h includes are unconditional, since declarations alone don't link-error and the calls remain guarded.

Verified: 131 SQ8_FP16 unit tests + 115 ASan + 1166 full test_spaces suite (covers other ISA tiers SQ8_FP32 / BF16 / INT8 / UINT8 to confirm no regression from the dispatcher restructure).

…[MOD-14954] Two related cleanups in the SQ8↔FP16 dispatch path:

1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions (`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) but was registered under the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` + check of avx512f/bw/vl/vnni). That meant CPUs with AVX-512F but no VNNI (Skylake-X, some Cascade Lake variants, etc.) would fall through to AVX2_FMA even though they can run the AVX-512 kernel. Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F` definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to `functions/AVX512F.cpp`, with matching header reshuffle. Dispatch sites now gate on `OPT_AVX512F` + `features.avx512f`.
2. F16C is a transversal requirement across the non-AVX-512 SQ8↔FP16 tiers (SSE4, AVX2, AVX2+FMA): every one of them widens FP16 query lanes via `vcvtph2ps`. Per-tier nested `#ifdef OPT_F16C` was hoisted into a single outer block around the three ISA branches in `IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and `L2_SQ8_FP16_GetDistFunc`.

Verified: 131 SQ8_FP16 release + 115 ASan + 1166 full test_spaces suite.
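The resulting tier order can be summarised with a small runtime-only sketch. It is an illustration under stated assumptions, not the dispatcher in IP_space.cpp / L2_space.cpp: the `CpuFeaturesSketch` struct, the `Tier` enum, and `choose_sq8_fp16_tier` are hypothetical names, and the compile-time `OPT_*` guards and alignment hints that the real code uses are left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical runtime feature flags; field names echo the gates quoted in this PR.
struct CpuFeaturesSketch {
    bool avx512f = false;
    bool avx2 = false;
    bool fma3 = false;
    bool f16c = false;
    bool avx = false;
    bool sse4_1 = false;
};

enum class Tier { Scalar, SSE4_F16C, AVX2_F16C, AVX2_FMA_F16C, AVX512F };

// Picks the highest tier whose feature gate passes; dims below 16 stay scalar.
Tier choose_sq8_fp16_tier(std::size_t dim, const CpuFeaturesSketch &f) {
    assert(dim > 0);
    if (dim < 16) return Tier::Scalar;
    if (f.avx512f) return Tier::AVX512F; // AVX-512F only: no VNNI or F16C required
    if (f.f16c) {                        // single outer F16C gate for the remaining tiers
        if (f.avx2 && f.fma3) return Tier::AVX2_FMA_F16C;
        if (f.avx2) return Tier::AVX2_F16C;
        if (f.sse4_1 && f.avx) return Tier::SSE4_F16C;
    }
    return Tier::Scalar;
}
```

Under these assumptions, a CPU that reports AVX-512F without VNNI selects the AVX-512F tier instead of falling through to AVX2+FMA, and a CPU with AVX2 but no F16C stays on the scalar path.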

Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix indentation in AVX512F SQ8_FP16 function signatures, and reformat benchmark macro invocation to fit line length conventions.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 839fe3c.

The comments referencing SQ8-to-FP16 dispatch location are no longer accurate after the recent refactoring that moved the dispatch logic. Clean up these stale comments from the AVX512F_BW_VL_VNNI files.

…4954] Mirrors the dispatcher layout in IP_space.cpp / L2_space.cpp where a single OPT_F16C guard wraps the AVX2+FMA, AVX2, and SSE4 branches. Each test body (L2/IP/Cosine) and the TierCoverage report now use the same single-guard shape. Also retargets the TierCoverage AVX-512 check from OPT_AVX512_F_BW_VL_VNNI to OPT_AVX512F, matching the dispatcher's new AVX-512F-only gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised was an outlier — no other data type has a std::cerr-based coverage reporter. Per-tier coverage is already provided by SQ8_FP16_SpacesOptimizationTest (which walks AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar by clearing feature flags), and ISA-lane presence is handled by the CI matrix, matching the convention used by every other type's SpacesOptimizationTest. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing ARM TU compile flags. Scalar fallback already on main serves as reference. Bakes in PR #970 review lessons (assert(dim>=16), 4-accumulator ILP, formula anchor, load_unaligned<float> metadata, dispatcher-routed tier-walk tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md. Each task ends in a commit; assistant runs tests/ASan/benchmarks after the user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds x86 SIMD implementations and dispatch wiring for asymmetric SQ8 storage ↔ FP16 query distance kernels, extending existing scalar SQ8_FP16 support with AVX-512F, AVX2+FMA+F16C, AVX2+F16C, and SSE4+F16C paths.

Changes:

Adds SIMD IP/Cosine/L2² SQ8_FP16 kernels and per-ISA chooser entry points.
Updates IP/L2 dispatchers to select SIMD implementations with runtime CPU feature gating.
Extends SQ8_FP16 unit tests and benchmarks to cover scalar and x86 SIMD tiers.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.

File	Description
`src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h`	Adds AVX-512F SQ8↔FP16 IP/Cosine kernels.
`src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h`	Adds AVX2+FMA+F16C SQ8↔FP16 IP/Cosine kernels.
`src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h`	Adds AVX2+F16C SQ8↔FP16 IP/Cosine kernels.
`src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h`	Adds SSE4+F16C SQ8↔FP16 IP/Cosine kernels.
`src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h`	Adds AVX-512F L2² wrapper using raw IP.
`src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h`	Adds AVX2+FMA+F16C L2² wrapper.
`src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h`	Adds AVX2+F16C L2² wrapper.
`src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h`	Adds SSE4+F16C L2² wrapper.
`src/VecSim/spaces/IP_space.cpp`	Wires SQ8_FP16 IP/Cosine dispatch to new x86 SIMD tiers.
`src/VecSim/spaces/L2_space.cpp`	Wires SQ8_FP16 L2 dispatch to new x86 SIMD tiers.
`src/VecSim/spaces/functions/AVX512F.cpp`	Registers AVX-512F SQ8_FP16 chooser functions.
`src/VecSim/spaces/functions/AVX512F.h`	Declares AVX-512F SQ8_FP16 chooser APIs.
`src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp`	Adds AVX2+FMA+F16C chooser implementations.
`src/VecSim/spaces/functions/AVX2_FMA_F16C.h`	Declares AVX2+FMA+F16C SQ8_FP16 chooser APIs.
`src/VecSim/spaces/functions/AVX2_F16C.cpp`	Adds AVX2+F16C chooser implementations.
`src/VecSim/spaces/functions/AVX2_F16C.h`	Declares AVX2+F16C SQ8_FP16 chooser APIs.
`src/VecSim/spaces/functions/SSE4_F16C.cpp`	Adds SSE4+F16C chooser implementations.
`src/VecSim/spaces/functions/SSE4_F16C.h`	Declares SSE4+F16C SQ8_FP16 chooser APIs.
`src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp`	Formatting-only spacing change.
`src/VecSim/spaces/CMakeLists.txt`	Adds F16C-specific translation units and compile flags.
`tests/unit/test_spaces.cpp`	Adds SQ8_FP16 SIMD dispatcher/correctness tests.
`tests/benchmark/spaces_benchmarks/bm_spaces.h`	Includes new F16C chooser headers for benchmarks.
`tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp`	Registers per-ISA SQ8_FP16 benchmarks.

dor-forer added the benchmarks-all label May 26, 2026

dor-forer and others added 8 commits May 26, 2026 13:39

dor-forer force-pushed the dor-forer-sq8-fp16-x86-kernels-mod-14954 branch from a43eac4 to e21cb3b Compare May 26, 2026 10:42

cursor Bot reviewed May 26, 2026

Comment thread src/VecSim/spaces/CMakeLists.txt Outdated

dor-forer and others added 9 commits May 26, 2026 14:07

Reformat SQ8↔FP16 SIMD kernels for consistent line breaks

4c8828e

Remove SQ8↔FP16 design doc from PR [MOD-14954]

fe69f85

Clean up whitespace and formatting inconsistencies

839fe3c

cursor Bot reviewed May 28, 2026

Comment thread tests/unit/test_spaces.cpp Outdated

dor-forer and others added 3 commits May 28, 2026 15:41

Remove obsolete SQ8-to-FP16 dispatch comments

3565985

dor-forer requested a review from Copilot May 31, 2026 10:36

Copilot started reviewing on behalf of dor-forer May 31, 2026 10:36

Copilot AI reviewed May 31, 2026

dor-forer wants to merge 21 commits into main from dor-forer-sq8-fp16-x86-kernels-mod-14954

2 participants