
Initial AVX-512 support#201

Open
Shnatsel wants to merge 25 commits into linebender:main from Shnatsel:avx512
Conversation

Contributor

@Shnatsel Shnatsel commented Feb 21, 2026

This adds the AVX-512 level with Ice Lake as the baseline. This is a deliberate choice: Skylake technically has AVX-512 but its implementation is impractically slow, and targeting it would both reduce performance on Skylake itself and reduce performance on CPUs with actually competent AVX-512, by restricting the usable instruction set to what Skylake supports.

512-bit operations are now backed by native AVX-512 instructions where available.

Future work

These changes only cover 512-bit vectors. AVX-512 also allows more efficient implementations of many smaller-width operations, but this PR does not tackle those yet.

This keeps the mask representation intact and instead uses extra instructions to cast from masks to full-width vectors and back. This is inefficient but lets us keep the existing API and avoid dealing with #179 for now.

Shnatsel and others added 25 commits February 21, 2026 15:16
Change AVX-512 native_width from 256 to 512, enabling true 512-bit
SIMD operations instead of delegating everything to AVX2.

Operations with native 512-bit implementations:
- splat, split, combine
- zip_low, zip_high, unzip_even, unzip_odd
- cvt_* (f32<->i32/u32 using native AVX-512 unsigned intrinsics)
- all_true, any_true, none_true, any_false (using mask registers)
- slide (across blocks)
- load_interleaved, store_interleaved
- Array operations (from_array, as_array, store_array)

Operations using split/combine fallback (to be implemented later):
- Binary ops (add, sub, mul, etc.)
- Unary ops (abs, neg, sqrt, etc.)
- Comparisons (simd_eq, simd_lt, etc.) - requires mask register handling
- select, shift, reinterpret, widen_narrow
- slide (within blocks)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable native 512-bit AVX-512 implementations for unary, binary, and
ternary operations instead of using the split/combine fallback.

Changes:
- mk_x86.rs: Update force_generic_op to allow native implementations
  for Binary, Unary, and Ternary operations on 512-bit vectors
- arch/x86.rs: Fix rounding operations for AVX-512 which uses
  _mm512_roundscale_ps/pd instead of _mm512_round_ps/pd
- arch/x86.rs: Fix min_precise/max_precise for AVX-512 using mask-based
  comparisons (_mm512_cmp_ps_mask) and blending (_mm512_mask_blend_ps)
  instead of vector-based blendv operations

Operations now using native 512-bit AVX-512 intrinsics:
- Binary: add, sub, mul, div, and, or, xor, min, max, copysign,
  min_precise, max_precise
- Unary: abs, neg, sqrt, floor, ceil, round_ties_even, trunc, fract, not
- Ternary: mul_add, mul_sub (FMA)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…en/narrow

Enable native 512-bit AVX-512 implementations for additional operations
instead of using the split/combine fallback.

Changes to mk_x86.rs:
- handle_select: Use _mm512_movepi*_mask to convert vector mask to __mmask*,
  then use _mm512_mask_blend_* for native AVX-512 select
- handle_shift: For 8-bit shifts on 512-bit vectors, use _mm512_cvtepi8_epi16
  and _mm512_cvtepu8_epi16 for sign/zero extension instead of cmpgt (which
  returns masks in AVX-512)
- handle_widen_narrow: Use native AVX-512 intrinsics for widen (256->512)
  via _mm512_cvtep* and narrow (512->256) via _mm512_cvtepi*_epi*

Operations now using native 512-bit AVX-512 intrinsics:
- Shift: shl, shr (including 8-bit with proper AVX-512 extension)
- Select: using mask registers and mask_blend
- Reinterpret: using cast intrinsics
- Widen: using _mm512_cvtepu8_epi16, etc.
- Narrow: using _mm512_cvtepi16_epi8, _mm512_cvtepi32_epi16, etc.

The only operation still using split/combine fallback is Compare, which
returns mask registers and needs different handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable native 512-bit AVX-512 implementations for both WithinBlocks and
AcrossBlocks slide operations instead of using split/combine fallback
for WithinBlocks.

Changes:
- Update force_generic_op to allow native slide implementations for
  AVX-512 512-bit vectors (both WithinBlocks and AcrossBlocks)
- WithinBlocks now uses dyn_alignr_512 which calls _mm512_alignr_epi8
- AcrossBlocks continues to use cross_block_alignr_512

The only operation still using split/combine fallback is Compare, which
returns mask registers and needs different handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable native 512-bit AVX-512 implementations for all comparison
operations, eliminating the need for force_generic_op special cases.

Changes:
- Add handle_compare_avx512 method that uses native AVX-512 comparison
  intrinsics (_mm512_cmpeq_epi*_mask, _mm512_cmpgt_epi*_mask, etc.)
- Convert mask register results back to vector masks using
  _mm512_movm_epi* intrinsics to maintain the public API
- Float comparisons use _mm512_cmp_ps_mask/_mm512_cmp_pd_mask with
  predicate constants
- Integer comparisons use direct comparison intrinsics with proper
  signed/unsigned variants

AVX-512 comparison intrinsics used:
- Equality: _mm512_cmpeq_epi*_mask
- Less than: _mm512_cmplt_epi*_mask / _mm512_cmplt_epu*_mask
- Less equal: _mm512_cmple_epi*_mask / _mm512_cmple_epu*_mask
- Greater than: _mm512_cmpgt_epi*_mask / _mm512_cmpgt_epu*_mask
- Greater equal: _mm512_cmpge_epi*_mask / _mm512_cmpge_epu*_mask

All 512-bit AVX-512 operations now use native intrinsics. The
force_generic_op function returns None for all AVX-512 512-bit
operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Since all AVX-512 512-bit operations now use native intrinsics, the
force_generic_op override is no longer needed. The default trait
implementation (which returns None) is now sufficient.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use .cast_signed() instead of `as i64` for u64 constants to make the
intentional wrapping explicit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix 8-bit shift operations (shl/shr for i8x64 and u8x64): Replace
  _mm512_packs_epi16/_mm512_packus_epi16 with _mm512_cvtepi16_epi8
  which properly truncates without lane interleaving

- Fix precise float-to-integer conversions (cvt_i32_precise_f32x16
  and cvt_u32_precise_f32x16): Add proper handling for out-of-range
  values and NaN using AVX-512 native mask comparison and blend
  instructions

- Fix zip_low/zip_high for 512-bit vectors: Redesign algorithm to
  extract appropriate 256-bit halves and perform full 256-bit zip
  operations, avoiding the limitation of _mm512_shuffle_i64x2 which
  cannot freely interleave lanes from both operands
Replace the previous multi-step zip implementation with a single
permutex2var instruction. This is more efficient and cleaner:

- zip_low: interleaves elements 0..n/2 from both vectors
- zip_high: interleaves elements n/2..n from both vectors

Uses _mm512_permutex2var_{epi8,epi16,epi32,epi64,ps,pd} depending
on element type. The index vector is computed at compile time.
This method was never overridden by any Level implementation and always
returned None, so the logic can be simplified to directly call
should_use_generic_op.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the previous cross_block_alignr_512 implementation that split
512-bit vectors into 256-bit halves with a single _mm512_permutex2var_epi8
instruction. This reduces the operation from multiple extracts, permutes,
and inserts to a single VBMI shuffle instruction.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace _mm512_cvtepi16_epi8 + _mm512_inserti64x4 with a single
_mm512_permutex2var_epi8 call for truncating 16-bit shifted values
back to 8-bit. This is more efficient because vpmovwb is 2 uops
producing only 256-bit output, while vpermt2b produces a full
512-bit result in 1-2 uops.

See: llvm/llvm-project#34219

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split avx2_slide_helpers() into common helpers (used by both AVX2 and
AVX-512) and the full version (AVX2 only). The cross_block_alignr_256x2
function is only needed for AVX2's 2x256-bit operations, not AVX-512
which uses cross_block_alignr_512 instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add underscores to literal suffixes (0usize -> 0_usize, etc.)
- Use try_into().unwrap() instead of `as` casts for index conversions,
  ensuring the generator will panic if assumptions are violated
- Add backticks to function names in doc comments

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ppy --tests` without a reported location, I've failed to isolate it to a specific crate and suppress it there
@Shnatsel Shnatsel marked this pull request as ready for review February 22, 2026 15:03
@Shnatsel
Contributor Author

This came together better than I expected. After several rounds of my own review and cleanup this is now ready for review by the maintainers.

There are still two big work items, changing the mask representation and using AVX-512 instructions for sizes smaller than 512 bits, but I thought this was a good milestone to start the conversation before the amount of changes becomes overwhelming.

@Shnatsel Shnatsel changed the title PoC: AVX-512 Initial AVX-512 support Feb 22, 2026
