Add SQ8↔FP16 x86 SIMD distance kernels [MOD-14954]#970
Conversation
Captures the architecture, file-level plan, CMake F16C gating, and risk register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them into the existing dispatcher tables and SQ8_FP16 unit/benchmark scaffolding from MOD-15141. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher
translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes
to FP32. The flag is appended only when CXX_F16C is detected; existing
SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics so
emitted code for those kernels is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest; currently asserts only the scalar fallback path. Per-tier SIMD assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside the kernel implementations in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product, Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8 storage and FP16 query metadata reads use load_unaligned to tolerate odd dimensions. Dispatcher branches in IP_space.cpp / L2_space.cpp select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI when features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni; otherwise behaviour is unchanged from MOD-15141. A parameterised gtest fixture exercises every residual class in [16, 32] against the scalar baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 +
cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a
256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte
FP16 block, convert, then blend zero across unused lanes — mirroring
the existing F16C FP16 kernel pattern. Dispatcher branch in
{IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new
Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 &&
features.fma3 && features.f16c.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps instead of _mm256_fmadd_ps so it can run on Haswell-era AVX2 hardware without FMA support (uncommon but matches the existing SQ8_FP32 tiering). Dispatcher gate requires features.avx2 && features.f16c and runs between the AVX2+FMA and SSE4 tiers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires features.sse4_1 && features.f16c && features.avx since F16C is VEX-encoded — matches the existing F16C/FP16 dispatcher gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SQ8_FP16 GetDistFunc dispatcher now returns AVX-512 / AVX2+FMA / AVX2 / SSE4 SIMD kernels when the corresponding feature flags are set (only scalar previously). Updates the GetDistFunc_*_SQ8_FP16 asserts to compute the expected function for the host's highest supported tier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier on the corresponding OPT_* defines plus the runtime feature checks that mirror the dispatcher in IP_space.cpp / L2_space.cpp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a43eac4 to
e21cb3b
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #970 +/- ##
==========================================
+ Coverage 96.99% 97.05% +0.05%
==========================================
Files 130 141 +11
Lines 7793 8105 +312
==========================================
+ Hits 7559 7866 +307
- Misses 234 239 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- CMake: gate `-mf16c` on CXX_F16C AND CXX_FMA AND CXX_AVX (matches OPT_F16C
macro) and append `-mavx` to the SSE4 dispatcher when adding -mf16c, since
F16C is VEX-encoded and requires AVX state. Mirrors the existing F16C.cpp
recipe and prevents miscompiles on toolchains with F16C but without AVX.
- IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast<const int32_t *>(pVect1)`
with `load_unaligned<int32_t>(pVect1)` to remove strict-aliasing UB on
the uint8_t SQ8 lane load.
- IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell out
the asymmetric-mask reasoning (SQ8 unmasked is safe because the FP16
query blend forces those FP32 query lanes to 0 → garbage·0=0).
- IP_AVX{512,2,2_FMA,SSE4}_SQ8_FP16.h: add the `IP = min·y_sum + delta·Σ(q·y)`
algebraic-identity comment header that AVX-512 already carried, plus a
precondition note that callers must enforce dim >= 16 (matches the
established SQ8_FP32 convention; no runtime assert because sibling
SQ8_FP32 SIMD kernels also rely on the dispatcher gate).
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery,
ConstantStorage, MixedSignQuery) through {IP,Cosine,L2}_SQ8_FP16_GetDistFunc
so the runtime-selected SIMD tier is actually exercised on those inputs,
not just the scalar reference.
- test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims {64, 128, 256,
512, 1024} so multi-iteration do-while loop bugs would fire (the existing
[16, 32] range covers at most two AVX-512 chunk iterations).
- test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised — a
single test that emits per-tier coverage to stderr and GTEST_SKIPs when
no SIMD tier is available, so CI runners without AVX-512 do not silently
report zero tier-1 coverage.
- test_spaces.cpp: scalar-fallback `alignment` checks now seed the value
with 0xFF and assert it remains 0xFF, verifying the dispatcher contract
("scalar leaves caller's value untouched") instead of just measuring
that the variable's pre-zeroed init survived.
- test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment
on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired.
- bm_spaces_sq8_fp16.cpp: drop the matching stale comment.
Out of scope (separate ticket): two-accumulator FMA refactor (also affects
SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels by widening the inner loop to use multiple independent accumulators. L2 kernels inherit the change through their `…InnerProductImp_…` call. - IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main loop (64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks. - IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the existing 2-step unrolled body now routes each step to an independent accumulator. The `residual >= 8` half-chunk feeds the second accumulator so the prologue also breaks the dependency chain. - IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate between sum_a and sum_b for prologue ILP. Correctness invariant: residual block consumes exactly `residual` lanes (0..15) → remaining tail is always a multiple of 16, so the unrolled loops (multiples of 8 / 16 / 64) terminate exactly. Verified by 131 SQ8_FP16 unit tests + 115 under ASan.
The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction
— the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from
SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16
(`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded
query, which is a different kernel entirely.
The file/function names are renamed to match what the kernel actually
uses (AVX-512F). The dispatcher .cpp/.h files stay named after the
runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still
registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 /
INT8 / UINT8 kernels — the gate is a CPU-feature gate, not an ISA claim.
The same misnomer exists for SQ8↔FP32; tracked separately so the rename
there can ship as its own commit.
Also: fix a strict-aliasing-class UB introduced by the AVX-512 unroll-4
loop. `while (pVec1 + 64 <= pEnd1)` forms a pointer past one-past-end of
the SQ8 storage object when fewer than 64 lane bytes remain, which is UB
in C++ regardless of dereference. Switched to pointer subtraction
(`static_cast<size_t>(pEnd1 - pVec1) >= 64`).
Renames:
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h
- L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h
- SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F
- Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F
Verified: 131 SQ8_FP16 unit tests + 115 under ASan.
Design doc was added in ad941b8 for planning; not appropriate as a long-lived in-repo artifact. Keep externally (Confluence / scratch) rather than ship with the kernel commit.
Two trims, both restoring pre-existing patterns elsewhere in the file:
1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that
duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`.
Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8
sister tests at lines 540-548 and 551-559.
2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`,
`MixedSignQueryTest`) were routed through
`{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD
dispatch on adversarial inputs. Reverted to direct scalar calls
(`SQ8_FP16_InnerProduct`, etc.) — the original pre-fdc5c1cd shape.
Coverage rationale: the SIMD kernels are branchless on input values
(verified by grep — no value-dependent `if` in any tier). Every code
path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s
random inputs at multiple dims. The edge-case tests verify the
*algebraic identity* (IP of zero query = 1.0, constant storage matches
dequant baseline, mixed-sign handling) — scalar correctness on these
inputs is what was actually being checked, and the SIMD path matches
scalar via the SpacesOptimizationTest tier walk.
Net: 77 lines removed from the test file, matches sister conventions,
no coverage gap.
The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in those dispatcher TUs is F16C-clean. The previous arrangement mixed both under `#ifdef OPT_F16C` blocks inside the base dispatcher .cpp/.h files. Split each tier's F16C-dependent kernels off into a sibling TU: functions/SSE4.cpp → SSE4 + SQ8_FP32 (no F16C) functions/SSE4_F16C.cpp → SQ8_FP16 only (requires -mavx -mf16c) functions/AVX2.cpp → AVX2 + BF16 + SQ8_FP32 (no F16C) functions/AVX2_F16C.cpp → SQ8_FP16 only (requires -mf16c) functions/AVX2_FMA.cpp → SQ8_FP32 (no F16C) functions/AVX2_FMA_F16C.cpp → SQ8_FP16 only (requires -mf16c) The AVX-512 tier is unaffected — its SQ8_FP16 kernel uses `_mm512_cvtph_ps`, which is part of AVX-512F and not F16C. CMake now compiles each sibling TU conditionally on `_has_full_f16c` and applies the F16C flags only there. Base TUs no longer carry `-mf16c`, since they no longer reference F16C intrinsics. Result: - No `#ifdef OPT_F16C` directives in `functions/*.cpp` or `functions/*.h`. - Compile-time isolation: an F16C intrinsic accidentally added outside a `_F16C` sibling will fail to build, not silently miscompile. - Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`, `bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new sibling .h includes are unconditional, since declarations alone don't link-error and the calls remain guarded. Verified: 131 SQ8_FP16 unit tests + 115 ASan + 1166 full test_spaces suite (covers other ISA tiers SQ8_FP32 / BF16 / INT8 / UINT8 to confirm no regression from the dispatcher restructure).
…[MOD-14954]
Two related cleanups in the SQ8↔FP16 dispatch path:
1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions
(`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) but was registered under
the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` + check of avx512f/bw/vl/vnni).
That meant CPUs with AVX-512F but no VNNI (Skylake-X, some Cascade Lake
variants, etc.) would fall through to AVX2_FMA even though they can
run the AVX-512 kernel.
Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F`
definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to
`functions/AVX512F.cpp`, with matching header reshuffle. Dispatch
sites now gate on `OPT_AVX512F` + `features.avx512f`.
2. F16C is a transversal requirement across the non-AVX-512 SQ8↔FP16
tiers (SSE4, AVX2, AVX2+FMA) — every one of them widens FP16 query
lanes via `vcvtph2ps`. Per-tier nested `#ifdef OPT_F16C` was hoisted
into a single outer block around the three ISA branches in
`IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and
`L2_SQ8_FP16_GetDistFunc`.
Verified: 131 SQ8_FP16 release + 115 ASan + 1166 full test_spaces suite.
Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix indentation in AVX512F SQ8_FP16 function signatures, and reformat benchmark macro invocation to fit line length conventions.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 839fe3c. Configure here.
The comments referencing SQ8-to-FP16 dispatch location are no longer accurate after the recent refactoring that moved the dispatch logic. Clean up these stale comments from the AVX512F_BW_VL_VNNI files.
…4954] Mirrors the dispatcher layout in IP_space.cpp / L2_space.cpp where a single OPT_F16C guard wraps the AVX2+FMA, AVX2, and SSE4 branches. Each test body (L2/IP/Cosine) and the TierCoverage report now use the same single-guard shape. Also retargets the TierCoverage AVX-512 check from OPT_AVX512_F_BW_VL_VNNI to OPT_AVX512F, matching the dispatcher's new AVX-512F-only gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised was an outlier — no other data type has a std::cerr-based coverage reporter. Per-tier coverage is already provided by SQ8_FP16_SpacesOptimizationTest (which walks AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar by clearing feature flags), and ISA-lane presence is handled by the CI matrix, matching the convention used by every other type's SpacesOptimizationTest. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing ARM TU compile flags. Scalar fallback already on main serves as reference. Bakes in PR #970 review lessons (assert(dim>=16), 4-accumulator ILP, formula anchor, load_unaligned<float> metadata, dispatcher-routed tier-walk tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md. Each task ends in a commit; assistant runs tests/ASan/benchmarks after the user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds x86 SIMD implementations and dispatch wiring for asymmetric SQ8 storage ↔ FP16 query distance kernels, extending existing scalar SQ8_FP16 support with AVX-512F, AVX2+FMA+F16C, AVX2+F16C, and SSE4+F16C paths.
Changes:
- Adds SIMD IP/Cosine/L2² SQ8_FP16 kernels and per-ISA chooser entry points.
- Updates IP/L2 dispatchers to select SIMD implementations with runtime CPU feature gating.
- Extends SQ8_FP16 unit tests and benchmarks to cover scalar and x86 SIMD tiers.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h |
Adds AVX-512F SQ8↔FP16 IP/Cosine kernels. |
src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h |
Adds AVX2+FMA+F16C SQ8↔FP16 IP/Cosine kernels. |
src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h |
Adds AVX2+F16C SQ8↔FP16 IP/Cosine kernels. |
src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h |
Adds SSE4+F16C SQ8↔FP16 IP/Cosine kernels. |
src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h |
Adds AVX-512F L2² wrapper using raw IP. |
src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h |
Adds AVX2+FMA+F16C L2² wrapper. |
src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h |
Adds AVX2+F16C L2² wrapper. |
src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h |
Adds SSE4+F16C L2² wrapper. |
src/VecSim/spaces/IP_space.cpp |
Wires SQ8_FP16 IP/Cosine dispatch to new x86 SIMD tiers. |
src/VecSim/spaces/L2_space.cpp |
Wires SQ8_FP16 L2 dispatch to new x86 SIMD tiers. |
src/VecSim/spaces/functions/AVX512F.cpp |
Registers AVX-512F SQ8_FP16 chooser functions. |
src/VecSim/spaces/functions/AVX512F.h |
Declares AVX-512F SQ8_FP16 chooser APIs. |
src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp |
Adds AVX2+FMA+F16C chooser implementations. |
src/VecSim/spaces/functions/AVX2_FMA_F16C.h |
Declares AVX2+FMA+F16C SQ8_FP16 chooser APIs. |
src/VecSim/spaces/functions/AVX2_F16C.cpp |
Adds AVX2+F16C chooser implementations. |
src/VecSim/spaces/functions/AVX2_F16C.h |
Declares AVX2+F16C SQ8_FP16 chooser APIs. |
src/VecSim/spaces/functions/SSE4_F16C.cpp |
Adds SSE4+F16C chooser implementations. |
src/VecSim/spaces/functions/SSE4_F16C.h |
Declares SSE4+F16C SQ8_FP16 chooser APIs. |
src/VecSim/spaces/functions/AVX512F_BW_VL_VNNI.cpp |
Formatting-only spacing change. |
src/VecSim/spaces/CMakeLists.txt |
Adds F16C-specific translation units and compile flags. |
tests/unit/test_spaces.cpp |
Adds SQ8_FP16 SIMD dispatcher/correctness tests. |
tests/benchmark/spaces_benchmarks/bm_spaces.h |
Includes new F16C chooser headers for benchmarks. |
tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp |
Registers per-ISA SQ8_FP16 benchmarks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Summary
IP_space.cpp/L2_space.cpp) with F16C gating on the AVX2/SSE4 tiers; AVX-512 path uses_mm512_cvtph_pswhich is part of AVX512F (no F16C requirement).-mf16cconditionally to AVX2_FMA / AVX2 / SSE4 dispatcher source files in CMake (purely additive — no SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 codegen change since those sources contain no F16C intrinsics).SQ8_FP16_SpacesOptimizationTestto walk all four ISA tiers against the scalar baseline across dim ∈ [16, 32]; update existingGetDistFunc_*_SQ8_FP16assertions accordingly.bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout.Spec:
docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.mdPlan:
docs/superpowers/plans/2026-05-26-sq8-fp16-x86-kernels.mdTest plan
🤖 Generated with Claude Code
Note
Medium Risk
Changes core VecSim distance selection and numeric SIMD paths for asymmetric quantization; correctness is heavily tested but wrong dispatch or residual handling could affect search results on x86.
Overview
Adds x86 SIMD for asymmetric SQ8 (index) ↔ FP16 (query) distances: inner product, cosine, and L2², replacing the previous always-scalar path for
dim ≥ 16when CPU features match.Dispatch in
IP_space.cppandL2_space.cppnow picks tiered implementations: AVX-512F (no VNNI; uses_mm512_cvtph_ps), then AVX2+FMA+F16C, AVX2+F16C, and SSE4.1+AVX+F16C, with optional SQ8 alignment hints when dimension is divisible by the chunk size; smaller dims still use scalar.Build: F16C-dependent kernels live in separate translation units (
AVX2_F16C,AVX2_FMA_F16C,SSE4_F16C) compiled with-mf16conly when the toolchain reports full F16C support, keeping existing AVX2/SSE4 objects unchanged.Tests & benchmarks:
SQ8_FP16_SpacesOptimizationTestwalks each tier against the scalar baseline;bm_spaces_sq8_fp16.cppregisters per-ISA benchmarks alongside the naive baseline.Reviewed by Cursor Bugbot for commit 8fe3d74. Bugbot is set up for automated code reviews on this repo. Configure here.