
Initial AVX-512 support#201

Open
Shnatsel wants to merge 25 commits into linebender:main from Shnatsel:avx512
Conversation

Contributor

@Shnatsel Shnatsel commented Feb 21, 2026

This adds the AVX-512 level with Ice Lake as the baseline. This is a deliberate choice: Skylake technically has AVX-512 but its implementation is impractically slow, and targeting it would both reduce performance on Skylake itself and reduce performance on CPUs with actually competent AVX-512, by restricting the usable instruction set to what Skylake supports.

512-bit operations are now backed by native AVX-512 instructions where available.

Future work

These changes only cover 512-bit vectors. AVX-512 also allows more efficient implementations of many smaller-width operations, but this PR does not tackle those yet.

This keeps the mask representation intact and instead uses extra instructions to cast from masks to full-width vectors and back. This is inefficient but lets us keep the existing API and avoid dealing with #179 for now.

Shnatsel and others added 25 commits February 21, 2026 15:16
Change AVX-512 native_width from 256 to 512, enabling true 512-bit
SIMD operations instead of delegating everything to AVX2.

Operations with native 512-bit implementations:
- splat, split, combine
- zip_low, zip_high, unzip_even, unzip_odd
- cvt_* (f32<->i32/u32 using native AVX-512 unsigned intrinsics)
- all_true, any_true, none_true, any_false (using mask registers)
- slide (across blocks)
- load_interleaved, store_interleaved
- Array operations (from_array, as_array, store_array)

Operations using split/combine fallback (to be implemented later):
- Binary ops (add, sub, mul, etc.)
- Unary ops (abs, neg, sqrt, etc.)
- Comparisons (simd_eq, simd_lt, etc.) - requires mask register handling
- select, shift, reinterpret, widen_narrow
- slide (within blocks)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable native 512-bit AVX-512 implementations for unary, binary, and
ternary operations instead of using the split/combine fallback.

Changes:
- mk_x86.rs: Update force_generic_op to allow native implementations
  for Binary, Unary, and Ternary operations on 512-bit vectors
- arch/x86.rs: Fix rounding operations for AVX-512 which uses
  _mm512_roundscale_ps/pd instead of _mm512_round_ps/pd
- arch/x86.rs: Fix min_precise/max_precise for AVX-512 using mask-based
  comparisons (_mm512_cmp_ps_mask) and blending (_mm512_mask_blend_ps)
  instead of vector-based blendv operations

Operations now using native 512-bit AVX-512 intrinsics:
- Binary: add, sub, mul, div, and, or, xor, min, max, copysign,
  min_precise, max_precise
- Unary: abs, neg, sqrt, floor, ceil, round_ties_even, trunc, fract, not
- Ternary: mul_add, mul_sub (FMA)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…en/narrow

Enable native 512-bit AVX-512 implementations for additional operations
instead of using the split/combine fallback.

Changes to mk_x86.rs:
- handle_select: Use _mm512_movepi*_mask to convert vector mask to __mmask*,
  then use _mm512_mask_blend_* for native AVX-512 select
- handle_shift: For 8-bit shifts on 512-bit vectors, use _mm512_cvtepi8_epi16
  and _mm512_cvtepu8_epi16 for sign/zero extension instead of cmpgt (which
  returns masks in AVX-512)
- handle_widen_narrow: Use native AVX-512 intrinsics for widen (256->512)
  via _mm512_cvtep* and narrow (512->256) via _mm512_cvtepi*_epi*

Operations now using native 512-bit AVX-512 intrinsics:
- Shift: shl, shr (including 8-bit with proper AVX-512 extension)
- Select: using mask registers and mask_blend
- Reinterpret: using cast intrinsics
- Widen: using _mm512_cvtepu8_epi16, etc.
- Narrow: using _mm512_cvtepi16_epi8, _mm512_cvtepi32_epi16, etc.

The only operation still using split/combine fallback is Compare, which
returns mask registers and needs different handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable native 512-bit AVX-512 implementations for both WithinBlocks and
AcrossBlocks slide operations instead of using split/combine fallback
for WithinBlocks.

Changes:
- Update force_generic_op to allow native slide implementations for
  AVX-512 512-bit vectors (both WithinBlocks and AcrossBlocks)
- WithinBlocks now uses dyn_alignr_512 which calls _mm512_alignr_epi8
- AcrossBlocks continues to use cross_block_alignr_512

The only operation still using split/combine fallback is Compare, which
returns mask registers and needs different handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable native 512-bit AVX-512 implementations for all comparison
operations, eliminating the need for force_generic_op special cases.

Changes:
- Add handle_compare_avx512 method that uses native AVX-512 comparison
  intrinsics (_mm512_cmpeq_epi*_mask, _mm512_cmpgt_epi*_mask, etc.)
- Convert mask register results back to vector masks using
  _mm512_movm_epi* intrinsics to maintain the public API
- Float comparisons use _mm512_cmp_ps_mask/_mm512_cmp_pd_mask with
  predicate constants
- Integer comparisons use direct comparison intrinsics with proper
  signed/unsigned variants

AVX-512 comparison intrinsics used:
- Equality: _mm512_cmpeq_epi*_mask
- Less than: _mm512_cmplt_epi*_mask / _mm512_cmplt_epu*_mask
- Less equal: _mm512_cmple_epi*_mask / _mm512_cmple_epu*_mask
- Greater than: _mm512_cmpgt_epi*_mask / _mm512_cmpgt_epu*_mask
- Greater equal: _mm512_cmpge_epi*_mask / _mm512_cmpge_epu*_mask

All 512-bit AVX-512 operations now use native intrinsics. The
force_generic_op function returns None for all AVX-512 512-bit
operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Since all AVX-512 512-bit operations now use native intrinsics, the
force_generic_op override is no longer needed. The default trait
implementation (which returns None) is now sufficient.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use .cast_signed() instead of `as i64` for u64 constants to make the
intentional wrapping explicit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix 8-bit shift operations (shl/shr for i8x64 and u8x64): Replace
  _mm512_packs_epi16/_mm512_packus_epi16 with _mm512_cvtepi16_epi8
  which properly truncates without lane interleaving

- Fix precise float-to-integer conversions (cvt_i32_precise_f32x16
  and cvt_u32_precise_f32x16): Add proper handling for out-of-range
  values and NaN using AVX-512 native mask comparison and blend
  instructions

- Fix zip_low/zip_high for 512-bit vectors: Redesign algorithm to
  extract appropriate 256-bit halves and perform full 256-bit zip
  operations, avoiding the limitation of _mm512_shuffle_i64x2 which
  cannot freely interleave lanes from both operands
Replace the previous multi-step zip implementation with a single
permutex2var instruction. This is more efficient and cleaner:

- zip_low: interleaves elements 0..n/2 from both vectors
- zip_high: interleaves elements n/2..n from both vectors

Uses _mm512_permutex2var_{epi8,epi16,epi32,epi64,ps,pd} depending
on element type. The index vector is computed at compile time.
This method was never overridden by any Level implementation and always
returned None, so the logic can be simplified to directly call
should_use_generic_op.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the previous cross_block_alignr_512 implementation that split
512-bit vectors into 256-bit halves with a single _mm512_permutex2var_epi8
instruction. This reduces the operation from multiple extracts, permutes,
and inserts to a single VBMI shuffle instruction.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace _mm512_cvtepi16_epi8 + _mm512_inserti64x4 with a single
_mm512_permutex2var_epi8 call for truncating 16-bit shifted values
back to 8-bit. This is more efficient because vpmovwb is 2 uops
producing only 256-bit output, while vpermt2b produces a full
512-bit result in 1-2 uops.

See: llvm/llvm-project#34219

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split avx2_slide_helpers() into common helpers (used by both AVX2 and
AVX-512) and the full version (AVX2 only). The cross_block_alignr_256x2
function is only needed for AVX2's 2x256-bit operations, not AVX-512
which uses cross_block_alignr_512 instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add underscores to literal suffixes (0usize -> 0_usize, etc.)
- Use try_into().unwrap() instead of `as` casts for index conversions,
  ensuring the generator will panic if assumptions are violated
- Add backticks to function names in doc comments

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ppy --tests` without a reported location, I've failed to isolate it to a specific crate and suppress it there
@Shnatsel Shnatsel marked this pull request as ready for review February 22, 2026 15:03
@Shnatsel
Contributor Author

This came together better than I expected. After several rounds of my own review and cleanup this is now ready for review by the maintainers.

There are still two big work items, changing the mask representation and using AVX-512 instructions for sizes smaller than 512 bits, but I thought this was a good milestone to start the conversation before the amount of changes becomes overwhelming.

@Shnatsel Shnatsel changed the title PoC: AVX-512 Initial AVX-512 support Feb 22, 2026
