Change AVX-512 native_width from 256 to 512, enabling true 512-bit SIMD operations instead of delegating everything to AVX2.

Operations with native 512-bit implementations:
- splat, split, combine
- zip_low, zip_high, unzip_even, unzip_odd
- cvt_* (f32<->i32/u32 using native AVX-512 unsigned intrinsics)
- all_true, any_true, none_true, any_false (using mask registers)
- slide (across blocks)
- load_interleaved, store_interleaved
- Array operations (from_array, as_array, store_array)

Operations using split/combine fallback (to be implemented later):
- Binary ops (add, sub, mul, etc.)
- Unary ops (abs, neg, sqrt, etc.)
- Comparisons (simd_eq, simd_lt, etc.) - requires mask register handling
- select, shift, reinterpret, widen_narrow
- slide (within blocks)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
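The split/combine fallback mentioned above can be pictured with a scalar model: a 512-bit operation is emulated by splitting each operand into two 256-bit halves, running the 256-bit (AVX2) operation on each half, and recombining. This is a sketch, not the crate's actual code; the function names and the i32x16 element type are illustrative.

```rust
// Illustrative scalar model of the split/combine fallback for a
// 512-bit lanewise add (16 x i32). Names are hypothetical.
fn split(v: [i32; 16]) -> ([i32; 8], [i32; 8]) {
    let mut lo = [0i32; 8];
    let mut hi = [0i32; 8];
    lo.copy_from_slice(&v[..8]);
    hi.copy_from_slice(&v[8..]);
    (lo, hi)
}

fn combine(lo: [i32; 8], hi: [i32; 8]) -> [i32; 16] {
    let mut out = [0i32; 16];
    out[..8].copy_from_slice(&lo);
    out[8..].copy_from_slice(&hi);
    out
}

// Stand-in for the 256-bit (AVX2) add.
fn add_256(a: [i32; 8], b: [i32; 8]) -> [i32; 8] {
    let mut out = [0i32; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

// Fallback: a 512-bit add built from two 256-bit adds.
fn add_512_fallback(a: [i32; 16], b: [i32; 16]) -> [i32; 16] {
    let (a_lo, a_hi) = split(a);
    let (b_lo, b_hi) = split(b);
    combine(add_256(a_lo, b_lo), add_256(a_hi, b_hi))
}

fn main() {
    assert_eq!(add_512_fallback([1; 16], [2; 16]), [3; 16]);
}
```

The cost of this pattern (extra extracts/inserts around every operation) is what the native 512-bit implementations in the following commits remove.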
Enable native 512-bit AVX-512 implementations for unary, binary, and ternary operations instead of using the split/combine fallback.

Changes:
- mk_x86.rs: Update force_generic_op to allow native implementations for Binary, Unary, and Ternary operations on 512-bit vectors
- arch/x86.rs: Fix rounding operations for AVX-512, which uses _mm512_roundscale_ps/pd instead of _mm512_round_ps/pd
- arch/x86.rs: Fix min_precise/max_precise for AVX-512 using mask-based comparisons (_mm512_cmp_ps_mask) and blending (_mm512_mask_blend_ps) instead of vector-based blendv operations

Operations now using native 512-bit AVX-512 intrinsics:
- Binary: add, sub, mul, div, and, or, xor, min, max, copysign, min_precise, max_precise
- Unary: abs, neg, sqrt, floor, ceil, round_ties_even, trunc, fract, not
- Ternary: mul_add, mul_sub (FMA)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
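The mask-based compare-and-blend pattern can be sketched in scalar form: an AVX-512 compare produces one mask *bit* per lane, and the blend then selects per lane from that bit. This is an illustrative model, not the crate's code; the predicate and the NaN policy shown are assumptions (the library defines min_precise's actual semantics), but the operand convention follows Intel's `_mm512_mask_blend_ps` (lane taken from the second vector where the bit is set).

```rust
// Scalar model of _mm512_cmp_ps_mask + _mm512_mask_blend_ps for 16 x f32.
fn cmp_lt_mask(a: &[f32; 16], b: &[f32; 16]) -> u16 {
    let mut k = 0u16;
    for i in 0..16 {
        if a[i] < b[i] {
            k |= 1u16 << i; // one bit per lane, like a __mmask16
        }
    }
    k
}

fn mask_blend(k: u16, a: &[f32; 16], b: &[f32; 16]) -> [f32; 16] {
    // Intel convention: lane from b where the mask bit is set, else from a.
    let mut out = [0.0f32; 16];
    for i in 0..16 {
        out[i] = if k & (1u16 << i) != 0 { b[i] } else { a[i] };
    }
    out
}

fn min_sketch(a: &[f32; 16], b: &[f32; 16]) -> [f32; 16] {
    // Where a < b take a, otherwise take b (NaN compares are false,
    // so NaN lanes fall through to b here; the real policy may differ).
    let k = cmp_lt_mask(a, b);
    mask_blend(k, b, a)
}

fn main() {
    let a = [1.0f32; 16];
    let mut b = [2.0f32; 16];
    b[0] = 0.5;
    let m = min_sketch(&a, &b);
    assert_eq!(m[0], 0.5);
    assert_eq!(m[1], 1.0);
}
```

The key difference from AVX2's blendv is that the selector is a compact bitmask in a mask register, not a full-width vector.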
…en/narrow

Enable native 512-bit AVX-512 implementations for additional operations instead of using the split/combine fallback.

Changes to mk_x86.rs:
- handle_select: Use _mm512_movepi*_mask to convert vector mask to __mmask*, then use _mm512_mask_blend_* for native AVX-512 select
- handle_shift: For 8-bit shifts on 512-bit vectors, use _mm512_cvtepi8_epi16 and _mm512_cvtepu8_epi16 for sign/zero extension instead of cmpgt (which returns masks in AVX-512)
- handle_widen_narrow: Use native AVX-512 intrinsics for widen (256->512) via _mm512_cvtep* and narrow (512->256) via _mm512_cvtepi*_epi*

Operations now using native 512-bit AVX-512 intrinsics:
- Shift: shl, shr (including 8-bit with proper AVX-512 extension)
- Select: using mask registers and mask_blend
- Reinterpret: using cast intrinsics
- Widen: using _mm512_cvtepu8_epi16, etc.
- Narrow: using _mm512_cvtepi16_epi8, _mm512_cvtepi32_epi16, etc.

The only operation still using the split/combine fallback is Compare, which returns mask registers and needs different handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
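The select path above can be modeled in scalar code: `_mm512_movepi32_mask` takes the top (sign) bit of each lane to form a bitmask, and `_mm512_mask_blend_epi32` then picks per lane from that bit. The function names below are illustrative stand-ins, not the generator's actual output.

```rust
// Scalar model of the native AVX-512 select for 16 x i32 lanes.
fn movepi32_mask(v: &[i32; 16]) -> u16 {
    let mut k = 0u16;
    for i in 0..16 {
        if v[i] < 0 {
            // Top bit of the lane set (e.g. an all-ones mask lane).
            k |= 1u16 << i;
        }
    }
    k
}

fn mask_blend_epi32(k: u16, a: &[i32; 16], b: &[i32; 16]) -> [i32; 16] {
    // Intel convention: lane from b where the mask bit is set, else a.
    let mut out = [0i32; 16];
    for i in 0..16 {
        out[i] = if k & (1u16 << i) != 0 { b[i] } else { a[i] };
    }
    out
}

fn select(mask: &[i32; 16], if_true: &[i32; 16], if_false: &[i32; 16]) -> [i32; 16] {
    let k = movepi32_mask(mask);
    mask_blend_epi32(k, if_false, if_true)
}

fn main() {
    let mut mask = [0i32; 16];
    mask[0] = -1; // all-ones lane selects if_true
    let picked = select(&mask, &[10; 16], &[20; 16]);
    assert_eq!(picked[0], 10);
    assert_eq!(picked[1], 20);
}
```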
Enable native 512-bit AVX-512 implementations for both WithinBlocks and AcrossBlocks slide operations instead of using the split/combine fallback for WithinBlocks.

Changes:
- Update force_generic_op to allow native slide implementations for AVX-512 512-bit vectors (both WithinBlocks and AcrossBlocks)
- WithinBlocks now uses dyn_alignr_512, which calls _mm512_alignr_epi8
- AcrossBlocks continues to use cross_block_alignr_512

The only operation still using the split/combine fallback is Compare, which returns mask registers and needs different handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable native 512-bit AVX-512 implementations for all comparison operations, eliminating the need for force_generic_op special cases.

Changes:
- Add handle_compare_avx512 method that uses native AVX-512 comparison intrinsics (_mm512_cmpeq_epi*_mask, _mm512_cmpgt_epi*_mask, etc.)
- Convert mask register results back to vector masks using _mm512_movm_epi* intrinsics to maintain the public API
- Float comparisons use _mm512_cmp_ps_mask/_mm512_cmp_pd_mask with predicate constants
- Integer comparisons use direct comparison intrinsics with proper signed/unsigned variants

AVX-512 comparison intrinsics used:
- Equality: _mm512_cmpeq_epi*_mask
- Less than: _mm512_cmplt_epi*_mask / _mm512_cmplt_epu*_mask
- Less equal: _mm512_cmple_epi*_mask / _mm512_cmple_epu*_mask
- Greater than: _mm512_cmpgt_epi*_mask / _mm512_cmpgt_epu*_mask
- Greater equal: _mm512_cmpge_epi*_mask / _mm512_cmpge_epu*_mask

All 512-bit AVX-512 operations now use native intrinsics. The force_generic_op function returns None for all AVX-512 512-bit operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
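The compare path is the mirror image of the select path: the AVX-512 comparison produces a bitmask, and `_mm512_movm_epi*` expands each bit back into an all-ones / all-zeros lane so the public vector-mask API is preserved. A scalar sketch (illustrative names, 16 x i32 lanes):

```rust
// Scalar model of _mm512_cmpeq_epi32_mask followed by _mm512_movm_epi32.
fn cmpeq_epi32_mask(a: &[i32; 16], b: &[i32; 16]) -> u16 {
    let mut k = 0u16;
    for i in 0..16 {
        if a[i] == b[i] {
            k |= 1u16 << i; // one result bit per lane
        }
    }
    k
}

fn movm_epi32(k: u16) -> [i32; 16] {
    // Expand each mask bit back to a full-width vector-mask lane.
    let mut out = [0i32; 16];
    for i in 0..16 {
        out[i] = if k & (1u16 << i) != 0 { -1 } else { 0 };
    }
    out
}

fn simd_eq(a: &[i32; 16], b: &[i32; 16]) -> [i32; 16] {
    movm_epi32(cmpeq_epi32_mask(a, b))
}

fn main() {
    let a = [7i32; 16];
    let mut b = [7i32; 16];
    b[3] = 0;
    let m = simd_eq(&a, &b);
    assert_eq!(m[0], -1);
    assert_eq!(m[3], 0);
}
```

The movm expansion is the extra cost of keeping the vector-mask representation; it is what the future mask-representation work (see the PR description) would remove.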
Since all AVX-512 512-bit operations now use native intrinsics, the force_generic_op override is no longer needed. The default trait implementation (which returns None) is now sufficient. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use `.cast_signed()` instead of `as i64` for u64 constants to make the intentional wrapping explicit. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix 8-bit shift operations (shl/shr for i8x64 and u8x64): Replace _mm512_packs_epi16/_mm512_packus_epi16 with _mm512_cvtepi16_epi8, which properly truncates without lane interleaving
- Fix precise float-to-integer conversions (cvt_i32_precise_f32x16 and cvt_u32_precise_f32x16): Add proper handling for out-of-range values and NaN using AVX-512 native mask comparison and blend instructions
- Fix zip_low/zip_high for 512-bit vectors: Redesign the algorithm to extract the appropriate 256-bit halves and perform full 256-bit zip operations, avoiding the limitation of _mm512_shuffle_i64x2, which cannot freely interleave lanes from both operands
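The lane-interleaving pitfall in the first fix can be shown with a scalar model: `_mm512_packs_epi16` packs independently within each 128-bit lane, so the two sources end up interleaved in 8-byte groups, while `_mm512_cvtepi16_epi8` truncates in element order. The element layout below follows the Intel pseudocode; the function names are illustrative.

```rust
// Scalar model of _mm512_packs_epi16 (per-128-bit-lane saturating pack)
// vs _mm512_cvtepi16_epi8 (in-order truncation).
fn packs_epi16_512(a: &[i16; 32], b: &[i16; 32]) -> [i8; 64] {
    let sat = |x: i16| x.clamp(i8::MIN as i16, i8::MAX as i16) as i8;
    let mut out = [0i8; 64];
    for lane in 0..4 {
        // Each 128-bit output lane: 8 bytes from a's lane, then 8 from b's.
        for i in 0..8 {
            out[lane * 16 + i] = sat(a[lane * 8 + i]);
            out[lane * 16 + 8 + i] = sat(b[lane * 8 + i]);
        }
    }
    out
}

fn cvtepi16_epi8(a: &[i16; 32]) -> [i8; 32] {
    let mut out = [0i8; 32];
    for i in 0..32 {
        out[i] = a[i] as i8; // truncate, keeping element order
    }
    out
}

fn main() {
    let mut a = [0i16; 32];
    let mut b = [0i16; 32];
    for i in 0..32 {
        a[i] = i as i16;
        b[i] = 64 + i as i16;
    }
    let packed = packs_epi16_512(&a, &b);
    // Byte 8 is b[0] = 64, not a[8] = 8: the sources interleave per lane.
    assert_eq!(packed[8], 64);
    // The in-order truncation keeps a[8] at index 8.
    assert_eq!(cvtepi16_epi8(&a)[8], 8);
}
```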
Replace the previous multi-step zip implementation with a single
permutex2var instruction. This is more efficient and cleaner:
- zip_low: interleaves elements 0..n/2 from both vectors
- zip_high: interleaves elements n/2..n from both vectors
Uses _mm512_permutex2var_{epi8,epi16,epi32,epi64,ps,pd} depending
on element type. The index vector is computed at compile time.
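The index construction described above can be sketched in scalar Rust: indices 0..N select from the first vector and N..2N from the second, matching the permutex2var semantics, and the index table is a `const` so it is computed at compile time. This is a model of the shuffle, not the generated intrinsic code; lane count and names are illustrative.

```rust
// Scalar model of zip_low/zip_high via a permutex2var-style shuffle
// over N = 16 lanes (e.g. 16 x i32 in a 512-bit vector).
const N: usize = 16;

const fn zip_indices(high: bool) -> [usize; N] {
    let mut idx = [0usize; N];
    let base = if high { N / 2 } else { 0 };
    let mut i = 0;
    while i < N / 2 {
        idx[2 * i] = base + i;         // element from the first vector
        idx[2 * i + 1] = N + base + i; // element from the second vector
        i += 1;
    }
    idx
}

// Indices 0..N pick from a, N..2N pick from b (as in vpermt2d).
fn permutex2var(a: &[i32; N], b: &[i32; N], idx: &[usize; N]) -> [i32; N] {
    let mut out = [0i32; N];
    for i in 0..N {
        out[i] = if idx[i] < N { a[idx[i]] } else { b[idx[i] - N] };
    }
    out
}

fn main() {
    const ZIP_LO: [usize; N] = zip_indices(false); // compile-time table
    let mut a = [0i32; N];
    let mut b = [0i32; N];
    for i in 0..N {
        a[i] = i as i32;
        b[i] = 100 + i as i32;
    }
    let lo = permutex2var(&a, &b, &ZIP_LO);
    assert_eq!(&lo[..4], &[0, 100, 1, 101]);
}
```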
This method was never overridden by any Level implementation and always returned None, so the logic can be simplified to directly call should_use_generic_op. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ng code and improving readability
Replace the previous cross_block_alignr_512 implementation that split 512-bit vectors into 256-bit halves with a single _mm512_permutex2var_epi8 instruction. This reduces the operation from multiple extracts, permutes, and inserts to a single VBMI shuffle instruction. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
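A scalar model makes the single-shuffle formulation concrete: sliding by `shift` bytes across the full 64-byte vector is just selecting bytes `shift..shift+64` of the 128-byte concatenation of the two inputs, which is exactly what one vpermt2b-style byte shuffle can express. The operand order here is illustrative; the real kernel's convention may differ.

```rust
// Scalar model of a cross-block 64-byte alignr done as one
// _mm512_permutex2var_epi8-style shuffle.
fn cross_block_alignr_64(a: &[u8; 64], b: &[u8; 64], shift: usize) -> [u8; 64] {
    // The index vector that would be fed to the shuffle: byte positions
    // shift..shift+64 of the 128-byte concatenation a || b.
    let mut idx = [0usize; 64];
    for i in 0..64 {
        idx[i] = (i + shift) % 128;
    }
    // permutex2var model: indices 0..64 select from a, 64..128 from b.
    let mut out = [0u8; 64];
    for i in 0..64 {
        out[i] = if idx[i] < 64 { a[idx[i]] } else { b[idx[i] - 64] };
    }
    out
}

fn main() {
    let mut a = [0u8; 64];
    let mut b = [0u8; 64];
    for i in 0..64 {
        a[i] = i as u8;
        b[i] = 100 + i as u8;
    }
    let out = cross_block_alignr_64(&a, &b, 4);
    assert_eq!(out[0], 4);    // a[4]
    assert_eq!(out[60], 100); // first byte of b
}
```

Note that `_mm512_permutex2var_epi8` requires AVX512_VBMI, which the Ice Lake baseline provides.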
Replace _mm512_cvtepi16_epi8 + _mm512_inserti64x4 with a single _mm512_permutex2var_epi8 call for truncating 16-bit shifted values back to 8-bit. This is more efficient because vpmovwb is 2 uops producing only 256-bit output, while vpermt2b produces a full 512-bit result in 1-2 uops. See: llvm/llvm-project#34219 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
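The single-shuffle truncation works because, on little-endian x86, the low byte of 16-bit word i is byte 2*i; so an index vector of the even byte positions over the 128-byte concatenation of both inputs narrows everything in one vpermt2b. A scalar sketch (illustrative names, not the generated code):

```rust
// Scalar model: truncate two vectors of 32 x u16 down to one vector of
// 64 x u8 by selecting the even bytes of the concatenation, as a single
// _mm512_permutex2var_epi8-style shuffle would.
fn truncate_words_to_bytes(a: &[u16; 32], b: &[u16; 32]) -> [u8; 64] {
    // Little-endian byte view of a || b.
    let mut bytes = [0u8; 128];
    for i in 0..32 {
        bytes[2 * i] = (a[i] & 0xff) as u8;
        bytes[2 * i + 1] = (a[i] >> 8) as u8;
        bytes[64 + 2 * i] = (b[i] & 0xff) as u8;
        bytes[64 + 2 * i + 1] = (b[i] >> 8) as u8;
    }
    // Index vector 0, 2, 4, ..., 126: the low byte of each word.
    let mut out = [0u8; 64];
    for i in 0..64 {
        out[i] = bytes[2 * i];
    }
    out
}

fn main() {
    let mut a = [0u16; 32];
    let mut b = [0u16; 32];
    for i in 0..32 {
        a[i] = (i as u16) | 0xab00;        // high byte is discarded
        b[i] = (0x40 + i as u16) | 0xcd00; // high byte is discarded
    }
    let out = truncate_words_to_bytes(&a, &b);
    assert_eq!(out[0], 0);
    assert_eq!(out[32], 0x40);
}
```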
Split avx2_slide_helpers() into common helpers (used by both AVX2 and AVX-512) and the full version (AVX2 only). The cross_block_alignr_256x2 function is only needed for AVX2's 2x256-bit operations, not AVX-512 which uses cross_block_alignr_512 instead. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add underscores to literal suffixes (0usize -> 0_usize, etc.)
- Use try_into().unwrap() instead of `as` casts for index conversions, ensuring the generator will panic if assumptions are violated
- Add backticks to function names in doc comments

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ppy --tests` without a reported location, I've failed to isolate it to a specific crate and suppress it there
This came together better than I expected. After several rounds of my own review and cleanup, this is now ready for review by the maintainers. There are still two big work items, changing the mask representation and using AVX-512 instructions for sizes smaller than 512 bits, but I thought this was a good milestone to start the conversation before the amount of change becomes overwhelming.
This adds the AVX-512 level with Ice Lake as the baseline. This is a deliberate choice: Skylake technically has AVX-512, but its implementation is impractically slow, and supporting it would both reduce performance on Skylake itself and restrict the instruction set available to other CPUs with actually competent AVX-512.
512-bit operations are now backed by native AVX-512 instructions where available.
Future work
These changes only cover 512-bit vectors. AVX-512 also allows more efficient implementations of many operations at smaller sizes; this PR does not tackle those yet.
This keeps the mask representation intact and instead uses extra instructions to cast from masks to full-width vectors and back. This is inefficient but lets us keep the existing API and avoid dealing with #179 for now.