Eliminate use_hint 32/88 intrinsics #940

Open
willieyz wants to merge 3 commits into main from eliminate-use_hint_32_88-intrinsics

Conversation

@willieyz
Contributor

@willieyz willieyz commented Feb 3, 2026

We also tried unrolling the loops mld_poly_use_hint_88_avx2_loop and mld_poly_use_hint_32_avx2_loop in both files. The benchmarks showed no performance benefit from unrolling, so we decided to keep the current, non-unrolled version.

  • bench components
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| mld_poly_caddq (avg) | AVX2 intrinsics | no-opt | 821 | 781 | 789 | |
| | x86_64 asm | no-opt | 847 | 786 | 787 | |
| | Δ (%) | no-opt | +3.17% | +0.64% | -0.25% | |
| mld_poly_caddq (avg) | AVX2 intrinsics | opt | 210 | 147 | 143 | |
| | x86_64 asm | opt | 220 | 153 | 155 | |
| | x86_64 asm (unroll) | opt | 273 | 154 | 156 | unroll by 4 |
| | Δ (%) | opt | +4.76% | +4.08% | +8.39% | |
| | Δ (%) (unroll) | opt | +30.00% | +4.76% | +9.09% | unroll by 4 |

  • bench
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| keypair cycles (avg) | AVX2 intrinsics | no-opt | 127436 | 218610 | 360739 | baseline (main) |
| | x86_64 asm | no-opt | 127459 | 217604 | 367118 | |
| | Δ (%) | no-opt | +0.02% | -0.46% | +1.77% | |
| | AVX2 intrinsics | opt | 56955 | 98362 | 157869 | baseline (main) |
| | x86_64 asm | opt | 59747 | 102961 | 165706 | |
| | x86_64 asm (unroll) | opt | 59483 | 104732 | 166654 | |
| | Δ (%) | opt | +4.90% | +4.68% | +4.96% | |
| | Δ (%) (unroll) | opt | +4.44% | +6.48% | +5.56% | unroll by 4 |
| sign cycles (avg) | AVX2 intrinsics | no-opt | 451922 | 756003 | 958151 | baseline (main) |
| | x86_64 asm | no-opt | 452833 | 752512 | 974497 | |
| | Δ (%) | no-opt | +0.20% | -0.46% | +1.71% | |
| | AVX2 intrinsics | opt | 170370 | 281545 | 347924 | baseline (main) |
| | x86_64 asm | opt | 178564 | 294843 | 362677 | |
| | x86_64 asm (unroll) | opt | 177251 | 300667 | 366158 | |
| | Δ (%) | opt | +4.81% | +4.72% | +4.24% | |
| | Δ (%) (unroll) | opt | +4.04% | +6.79% | +5.24% | unroll by 4 |
| verify cycles (avg) | AVX2 intrinsics | no-opt | 134113 | 220671 | 363234 | baseline (main) |
| | x86_64 asm | no-opt | 134633 | 220015 | 369763 | |
| | Δ (%) | no-opt | +0.39% | -0.30% | +1.80% | |
| | AVX2 intrinsics | opt | 60234 | 98904 | 156281 | baseline (main) |
| | x86_64 asm | opt | 63140 | 103682 | 164376 | |
| | x86_64 asm (unroll) | opt | 62822 | 105719 | 164028 | |
| | Δ (%) | opt | +4.82% | +4.83% | +5.18% | |
| | Δ (%) (unroll) | opt | +4.30% | +6.89% | +4.96% | unroll by 4 |
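
For reference, the Δ (%) rows above can be reproduced with a few lines of Python. This is only a sketch: the helper name `delta_percent` is made up for illustration and is not part of the benchmark tooling, and the cycle counts are the ML-DSA-44 keypair (opt) figures copied from the table.

```python
def delta_percent(asm_cycles: float, avx2_cycles: float) -> float:
    """Relative change of the asm build against the AVX2 intrinsics baseline, in percent.

    Implements the formula quoted above: (asm - AVX2) / AVX2 * 100.
    The function name is hypothetical and chosen only for this example.
    """
    return (asm_cycles - avx2_cycles) / avx2_cycles * 100.0


# ML-DSA-44 keypair cycles (avg), opt build, taken from the table above.
avx2_baseline = 56955
asm = 59747
asm_unrolled = 59483

print(f"asm:          {delta_percent(asm, avx2_baseline):+.2f}%")           # +4.90%
print(f"asm (unroll): {delta_percent(asm_unrolled, avx2_baseline):+.2f}%")  # +4.44%
```

A positive value means the asm version needs more cycles than the intrinsics baseline, so lower (or negative) is better.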

@oqs-bot
Contributor

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-87)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2433s 2484s -2.1%
sign_verify_internal 301s 313s -4%
mld_attempt_signature_generation 210s 218s -4%
polyvecl_pointwise_acc_montgomery_c 174s 183s -5%
polyvec_matrix_expand 158s 165s -4%
rej_uniform_native 137s 144s -5%
poly_pointwise_montgomery_c 128s 133s -4%
mld_invntt_layer 119s 122s -2%
polyvec_matrix_expand_serial 108s 108s +0%
mld_ct_memcmp 77s 81s -5%
sign_signature_internal 76s 73s +4%
mld_ntt_layer 44s 44s +0%
keccak_squeezeblocks_x4 42s 43s -2%
mld_compute_t0_t1_tr_from_sk_components 24s 25s -4%
polymat_permute_bitrev_to_custom 24s 27s -11%
rej_uniform 20s 20s +0%
fqmul 18s 18s +0%
poly_chknorm_c 16s 18s -11%
mld_ntt_butterfly_block 15s 13s +15%
poly_uniform_4x 15s 16s -6%
rej_uniform_c 15s 18s -17%
mld_check_pct 14s 11s +27%
poly_uniform_eta_4x 14s 14s +0%
polyeta_unpack 14s 12s +17%
polyveck_add 14s 15s -7%
polyveck_power2round 14s 13s +8%
keccakf1600x4_permute_native 13s 15s -13%
polyt0_unpack 13s 15s -13%
polyvec_matrix_pointwise_montgomery 13s 13s +0%
polyveck_use_hint 11s 10s +10%
polyveck_reduce 10s 9s +11%
keccak_absorb_once_x4 9s 11s -18%
poly_invntt_tomont_c 9s 7s +29%
polyveck_caddq 9s 8s +12%
polyvecl_ntt 9s 8s +12%
keccakf1600_permute_native 8s 9s -11%
mld_compute_pack_z 8s 11s -27%
mld_polyvecl_permute_bitrev_to_custom_native 8s 8s +0%
poly_decompose_c 8s 8s +0%
polyveck_chknorm 8s 11s -27%
polyveck_invntt_tomont 8s 7s +14%
polyveck_ntt 8s 6s +33%
polyveck_pointwise_poly_montgomery 8s 8s +0%
sign 8s 7s +14%
unpack_sk 8s 8s +0%
keccakf1600_permute 7s 5s +40%
mld_sample_s1_s2 7s 8s -12%
mld_sample_s1_s2_serial 7s 8s -12%
polyveck_sub 7s 7s +0%
polyz_unpack_native 7s 3s +133%
sign_pk_from_sk 7s 10s -30%
unpack_hints 7s 5s +40%
keccak_absorb 6s 6s +0%
polyveck_decompose 6s 8s -25%
polyveck_make_hint 6s 4s +50%
polyveck_shiftl 6s 6s +0%
polyvecl_uniform_gamma1_serial 6s 3s +100%
sign_signature 6s 6s +0%
unpack_pk 6s 5s +20%
unpack_sig 6s 5s +20%
mld_ct_get_optblocker_u32 5s 3s +67%
poly_caddq_c 5s 3s +67%
poly_caddq_native 5s 5s +0%
poly_ntt_native 5s 3s +67%
poly_shiftl 5s 5s +0%
poly_uniform_gamma1_4x 5s 2s +150%
poly_use_hint_c 5s 2s +150%
polyt0_pack 5s 4s +25%
polyvecl_chknorm 5s 4s +25%
rej_eta_native 5s 3s +67%
shake256x4_squeezeblocks 5s 4s +25%
sign_keypair 5s 4s +25%
sign_keypair_internal 5s 6s -17%
sign_signature_extmu 5s 6s -17%
sign_verify 5s 2s +150%
fqscale 4s 2s +100%
intt_native_x86_64 4s 3s +33%
make_hint 4s 3s +33%
mld_h 4s 4s +0%
mld_keccakf1600_extract_bytes 4s 2s +100%
mld_value_barrier_u32 4s 3s +33%
poly_add 4s 3s +33%
poly_chknorm_native 4s 3s +33%
poly_invntt_tomont_native 4s 2s +100%
poly_sub 4s 4s +0%
poly_use_hint_native 4s 3s +33%
polyveck_pack_t0 4s 3s +33%
polyveck_pack_w1 4s 3s +33%
polyveck_unpack_eta 4s 5s -20%
polyvecl_pointwise_acc_montgomery 4s 3s +33%
polyvecl_pointwise_acc_montgomery_native 4s 4s +0%
polyvecl_unpack_eta 4s 5s -20%
polyz_unpack 4s 3s +33%
polyz_unpack_c 4s 5s -20%
power2round 4s 2s +100%
rej_eta_c 4s 4s +0%
shake128_absorb 4s 4s +0%
shake128_squeeze 4s 3s +33%
shake128x4_squeezeblocks 4s 2s +100%
sign_open 4s 3s +33%
sign_verify_extmu 4s 4s +0%
sign_verify_pre_hash_shake256 4s 3s +33%
caddq 3s 4s -25%
keccak_finalize 3s 3s +0%
keccak_init 3s 2s +50%
keccak_squeeze 3s 4s -25%
keccakf1600_extract_bytes (big endian) 3s 3s +0%
keccakf1600_xor_bytes (big endian) 3s 2s +50%
keccakf1600x4_extract_bytes 3s 2s +50%
mld_ct_abs_i32 3s 3s +0%
mld_ct_cmask_neg_i32 3s 3s +0%
mld_ct_sel_int32 3s 1s +200%
mld_prepare_domain_separation_prefix 3s 5s -40%
montgomery_reduce 3s 4s -25%
pack_pk 3s 2s +50%
pack_sig_z 3s 2s +50%
pack_sk 3s 3s +0%
poly_caddq 3s 4s -25%
poly_challenge 3s 4s -25%
poly_chknorm 3s 2s +50%
poly_decompose 3s 3s +0%
poly_ntt 3s 2s +50%
poly_ntt_c 3s 3s +0%
poly_pointwise_montgomery_native 3s 3s +0%
poly_power2round 3s 1s +200%
poly_reduce 3s 2s +50%
poly_uniform_gamma1 3s 4s -25%
poly_use_hint 3s 4s -25%
polyeta_pack 3s 2s +50%
polyt1_unpack 3s 3s +0%
polyveck_pack_eta 3s 3s +0%
polyveck_unpack_t0 3s 4s -25%
polyvecl_pack_eta 3s 3s +0%
polyvecl_permute_bitrev_to_custom 3s 2s +50%
polyvecl_unpack_z 3s 6s -50%
polyw1_pack 3s 4s -25%
rej_eta 3s 2s +50%
shake128x4_absorb_once 3s 3s +0%
shake256_absorb 3s 2s +50%
shake256_finalize 3s 2s +50%
shake256_release 3s 2s +50%
shake256_squeeze 3s 4s -25%
sign_signature_pre_hash_internal 3s 6s -50%
sign_signature_pre_hash_shake256 3s 6s -50%
sys_check_capability 3s 2s +50%
decompose 2s 3s -33%
keccakf1600_xor_bytes 2s 3s -33%
keccakf1600x4_permute 2s 2s +0%
mld_ct_cmask_nonzero_u32 2s 3s -33%
mld_ct_cmask_nonzero_u8 2s 3s -33%
mld_ct_get_optblocker_i64 2s 2s +0%
mld_ct_get_optblocker_u8 2s 2s +0%
mld_value_barrier_i64 2s 2s +0%
ntt_native_x86_64 2s 5s -60%
pack_sig_c_h 2s 7s -71%
poly_decompose_native 2s 3s -33%
poly_invntt_tomont 2s 4s -50%
poly_make_hint 2s 2s +0%
poly_pointwise_montgomery 2s 2s +0%
poly_uniform 2s 2s +0%
poly_uniform_eta 2s 5s -60%
polyt1_pack 2s 4s -50%
polyvecl_uniform_gamma1 2s 3s -33%
polyz_pack 2s 2s +0%
shake128_finalize 2s 2s +0%
shake128_init 2s 2s +0%
shake128_release 2s 3s -33%
shake256 2s 3s -33%
shake256x4_absorb_once 2s 4s -50%
sign_verify_pre_hash_internal 2s 3s -33%
use_hint 2s 3s -33%
keccakf1600x4_xor_bytes 1s 4s -75%
mld_value_barrier_u8 1s 2s -50%
poly_caddq_native_aarch64 1s 4s -75%
reduce32 1s 2s -50%
shake256_init 1s 3s -67%

@oqs-bot
Contributor

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-44)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2061s 2185s -5.7%
mld_attempt_signature_generation 304s 330s -8%
polyvecl_pointwise_acc_montgomery_c 200s 217s -8%
sign_verify_internal 190s 192s -1%
rej_uniform_native 142s 151s -6%
poly_pointwise_montgomery_c 131s 148s -11%
mld_invntt_layer 115s 124s -7%
mld_ct_memcmp 81s 88s -8%
keccak_squeezeblocks_x4 46s 43s +7%
mld_ntt_layer 42s 49s -14%
sign_signature_internal 22s 22s +0%
rej_uniform 20s 22s -9%
fqmul 18s 20s -10%
poly_uniform_eta_4x 18s 17s +6%
rej_uniform_c 16s 20s -20%
mld_compute_t0_t1_tr_from_sk_components 15s 15s +0%
mld_polyvecl_permute_bitrev_to_custom_native 15s 15s +0%
polymat_permute_bitrev_to_custom 15s 16s -6%
polyt0_unpack 15s 18s -17%
polyvec_matrix_expand 14s 16s -12%
keccakf1600x4_permute_native 13s 13s +0%
poly_uniform_4x 13s 13s +0%
polyeta_unpack 13s 15s -13%
polyz_unpack_c 13s 14s -7%
keccak_absorb_once_x4 12s 14s -14%
poly_chknorm_c 12s 14s -14%
mld_ntt_butterfly_block 11s 13s -15%
keccakf1600_permute 8s 7s +14%
keccakf1600_permute_native 8s 9s -11%
poly_invntt_tomont_c 8s 8s +0%
polyvec_matrix_pointwise_montgomery 8s 6s +33%
mld_check_pct 7s 8s -12%
polyveck_add 7s 9s -22%
polyveck_decompose 7s 5s +40%
sign_verify 7s 7s +0%
sign_verify_pre_hash_internal 7s 3s +133%
keccak_absorb 6s 6s +0%
mld_compute_pack_z 6s 7s -14%
mld_prepare_domain_separation_prefix 6s 3s +100%
mld_sample_s1_s2 6s 4s +50%
poly_decompose_native 6s 4s +50%
poly_uniform_eta 6s 4s +50%
polyveck_ntt 6s 7s -14%
polyveck_pointwise_poly_montgomery 6s 7s -14%
polyveck_power2round 6s 5s +20%
polyvecl_ntt 6s 7s -14%
polyvecl_uniform_gamma1 6s 4s +50%
sign_keypair_internal 6s 5s +20%
sign_open 6s 4s +50%
sign_pk_from_sk 6s 7s -14%
sign_signature_pre_hash_shake256 6s 6s +0%
make_hint 5s 3s +67%
mld_h 5s 7s -29%
mld_sample_s1_s2_serial 5s 4s +25%
montgomery_reduce 5s 3s +67%
ntt_native_x86_64 5s 4s +25%
poly_uniform_gamma1 5s 4s +25%
poly_uniform_gamma1_4x 5s 4s +25%
poly_use_hint_c 5s 5s +0%
polyt0_pack 5s 3s +67%
polyvec_matrix_expand_serial 5s 5s +0%
polyveck_caddq 5s 5s +0%
polyveck_reduce 5s 5s +0%
polyveck_sub 5s 5s +0%
polyveck_use_hint 5s 6s -17%
polyvecl_pack_eta 5s 3s +67%
rej_eta_c 5s 4s +25%
shake128_init 5s 2s +150%
sign 5s 9s -44%
sign_signature 5s 6s -17%
unpack_hints 5s 6s -17%
keccakf1600_extract_bytes (big endian) 4s 2s +100%
keccakf1600_xor_bytes (big endian) 4s 2s +100%
mld_ct_cmask_nonzero_u8 4s 3s +33%
poly_add 4s 6s -33%
poly_challenge 4s 4s +0%
poly_chknorm_native 4s 2s +100%
poly_decompose_c 4s 3s +33%
poly_pointwise_montgomery 4s 1s +300%
poly_power2round 4s 4s +0%
poly_uniform 4s 5s -20%
poly_use_hint_native 4s 4s +0%
polyeta_pack 4s 2s +100%
polyveck_invntt_tomont 4s 5s -20%
polyveck_pack_t0 4s 3s +33%
polyveck_shiftl 4s 5s -20%
polyvecl_chknorm 4s 5s -20%
polyvecl_pointwise_acc_montgomery_native 4s 3s +33%
polyw1_pack 4s 4s +0%
power2round 4s 2s +100%
rej_eta 4s 4s +0%
shake128_squeeze 4s 2s +100%
shake256x4_absorb_once 4s 3s +33%
sign_keypair 4s 5s -20%
sign_signature_extmu 4s 4s +0%
sign_signature_pre_hash_internal 4s 3s +33%
unpack_sk 4s 3s +33%
caddq 3s 3s +0%
intt_native_x86_64 3s 4s -25%
keccak_finalize 3s 5s -40%
keccak_init 3s 3s +0%
keccakf1600x4_extract_bytes 3s 2s +50%
keccakf1600x4_xor_bytes 3s 3s +0%
mld_ct_abs_i32 3s 2s +50%
mld_ct_cmask_nonzero_u32 3s 3s +0%
mld_ct_get_optblocker_u8 3s 2s +50%
mld_value_barrier_u32 3s 2s +50%
mld_value_barrier_u8 3s 3s +0%
pack_pk 3s 5s -40%
pack_sig_z 3s 4s -25%
pack_sk 3s 3s +0%
poly_caddq_c 3s 3s +0%
poly_caddq_native_aarch64 3s 4s -25%
poly_invntt_tomont 3s 5s -40%
poly_invntt_tomont_native 3s 2s +50%
poly_ntt 3s 3s +0%
poly_ntt_c 3s 1s +200%
poly_pointwise_montgomery_native 3s 4s -25%
poly_shiftl 3s 3s +0%
polyt1_pack 3s 3s +0%
polyt1_unpack 3s 4s -25%
polyveck_chknorm 3s 3s +0%
polyveck_make_hint 3s 5s -40%
polyveck_pack_eta 3s 5s -40%
polyveck_unpack_eta 3s 3s +0%
polyvecl_permute_bitrev_to_custom 3s 4s -25%
polyvecl_pointwise_acc_montgomery 3s 5s -40%
polyvecl_uniform_gamma1_serial 3s 4s -25%
polyvecl_unpack_eta 3s 2s +50%
polyvecl_unpack_z 3s 3s +0%
polyz_unpack 3s 4s -25%
polyz_unpack_native 3s 3s +0%
reduce32 3s 2s +50%
shake256_squeeze 3s 2s +50%
sign_verify_extmu 3s 3s +0%
sign_verify_pre_hash_shake256 3s 3s +0%
use_hint 3s 2s +50%
decompose 2s 3s -33%
fqscale 2s 4s -50%
keccakf1600_xor_bytes 2s 3s -33%
keccakf1600x4_permute 2s 2s +0%
mld_ct_cmask_neg_i32 2s 3s -33%
mld_ct_get_optblocker_i64 2s 4s -50%
mld_ct_get_optblocker_u32 2s 1s +100%
mld_ct_sel_int32 2s 2s +0%
pack_sig_c_h 2s 3s -33%
poly_caddq 2s 3s -33%
poly_caddq_native 2s 4s -50%
poly_decompose 2s 2s +0%
poly_make_hint 2s 3s -33%
poly_ntt_native 2s 3s -33%
poly_reduce 2s 2s +0%
poly_sub 2s 1s +100%
poly_use_hint 2s 5s -60%
polyveck_pack_w1 2s 2s +0%
polyveck_unpack_t0 2s 3s -33%
polyz_pack 2s 4s -50%
rej_eta_native 2s 4s -50%
shake128_absorb 2s 3s -33%
shake128_finalize 2s 2s +0%
shake128_release 2s 1s +100%
shake128x4_absorb_once 2s 3s -33%
shake128x4_squeezeblocks 2s 3s -33%
shake256 2s 3s -33%
shake256_absorb 2s 8s -75%
shake256_finalize 2s 2s +0%
shake256_init 2s 2s +0%
shake256_release 2s 3s -33%
shake256x4_squeezeblocks 2s 2s +0%
sys_check_capability 2s 3s -33%
unpack_pk 2s 3s -33%
unpack_sig 2s 2s +0%
keccak_squeeze 1s 2s -50%
mld_keccakf1600_extract_bytes 1s 3s -67%
mld_value_barrier_i64 1s 3s -67%
poly_chknorm 1s 4s -75%

@oqs-bot
Contributor

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-65)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2402s 2317s +3.7%
mld_attempt_signature_generation 248s 231s +7%
polyvecl_pointwise_acc_montgomery_c 243s 221s +10%
sign_verify_internal 212s 200s +6%
poly_pointwise_montgomery_c 150s 140s +7%
rej_uniform_native 148s 140s +6%
mld_invntt_layer 125s 119s +5%
polyvec_matrix_expand 120s 114s +5%
mld_ct_memcmp 84s 81s +4%
polyvec_matrix_expand_serial 65s 66s -2%
mld_ntt_layer 46s 45s +2%
keccak_squeezeblocks_x4 45s 44s +2%
sign_signature_internal 43s 43s +0%
mld_compute_t0_t1_tr_from_sk_components 27s 25s +8%
fqmul 20s 20s +0%
polymat_permute_bitrev_to_custom 20s 19s +5%
poly_chknorm_c 19s 16s +19%
poly_uniform_eta_4x 19s 16s +19%
rej_uniform 19s 20s -5%
polyvec_matrix_pointwise_montgomery 18s 13s +38%
polyveck_decompose 18s 17s +6%
rej_uniform_c 18s 17s +6%
poly_uniform_4x 16s 14s +14%
mld_ntt_butterfly_block 15s 12s +25%
mld_polyvecl_permute_bitrev_to_custom_native 15s 13s +15%
polyt0_unpack 15s 17s -12%
polyveck_use_hint 14s 11s +27%
keccakf1600x4_permute_native 12s 12s +0%
keccak_absorb_once_x4 11s 12s -8%
keccakf1600_permute_native 10s 9s +11%
polyveck_power2round 10s 8s +25%
sign 10s 10s +0%
keccak_absorb 9s 6s +50%
polyveck_add 9s 11s -18%
polyveck_reduce 9s 8s +12%
keccakf1600_permute 8s 10s -20%
mld_check_pct 8s 8s +0%
mld_h 8s 5s +60%
poly_decompose_c 8s 7s +14%
polyeta_unpack 8s 6s +33%
polyveck_caddq 8s 6s +33%
polyveck_invntt_tomont 8s 9s -11%
polyveck_sub 8s 8s +0%
mld_compute_pack_z 7s 5s +40%
mld_sample_s1_s2_serial 7s 6s +17%
poly_invntt_tomont_c 7s 7s +0%
polyveck_ntt 7s 8s -12%
polyveck_shiftl 7s 6s +17%
polyvecl_unpack_z 7s 4s +75%
sign_pk_from_sk 7s 5s +40%
poly_caddq_native 6s 2s +200%
polyveck_pack_t0 6s 7s -14%
polyveck_pointwise_poly_montgomery 6s 6s +0%
polyvecl_ntt 6s 8s -25%
polyz_unpack_c 6s 7s -14%
sign_verify_pre_hash_shake256 6s 4s +50%
unpack_hints 6s 5s +20%
unpack_pk 6s 2s +200%
keccakf1600_extract_bytes (big endian) 5s 3s +67%
keccakf1600x4_xor_bytes 5s 3s +67%
mld_prepare_domain_separation_prefix 5s 2s +150%
mld_sample_s1_s2 5s 5s +0%
poly_make_hint 5s 4s +25%
poly_pointwise_montgomery 5s 6s -17%
poly_power2round 5s 6s -17%
poly_use_hint_c 5s 6s -17%
polyveck_chknorm 5s 5s +0%
polyveck_make_hint 5s 5s +0%
rej_eta_c 5s 5s +0%
rej_eta_native 5s 4s +25%
shake256_absorb 5s 3s +67%
sign_keypair_internal 5s 8s -38%
keccak_finalize 4s 2s +100%
keccak_squeeze 4s 5s -20%
mld_ct_cmask_nonzero_u32 4s 3s +33%
mld_ct_get_optblocker_u32 4s 2s +100%
mld_value_barrier_i64 4s 3s +33%
mld_value_barrier_u8 4s 1s +300%
montgomery_reduce 4s 2s +100%
pack_sk 4s 3s +33%
poly_caddq_c 4s 4s +0%
poly_challenge 4s 5s -20%
poly_invntt_tomont 4s 2s +100%
poly_pointwise_montgomery_native 4s 3s +33%
poly_reduce 4s 3s +33%
poly_uniform 4s 2s +100%
poly_uniform_eta 4s 6s -33%
poly_uniform_gamma1 4s 5s -20%
poly_uniform_gamma1_4x 4s 3s +33%
polyeta_pack 4s 4s +0%
polyt0_pack 4s 5s -20%
polyt1_pack 4s 3s +33%
polyveck_unpack_t0 4s 4s +0%
polyvecl_chknorm 4s 5s -20%
polyvecl_pointwise_acc_montgomery_native 4s 4s +0%
polyvecl_uniform_gamma1 4s 3s +33%
polyvecl_uniform_gamma1_serial 4s 3s +33%
polyvecl_unpack_eta 4s 3s +33%
polyz_unpack_native 4s 4s +0%
shake256 4s 4s +0%
sign_keypair 4s 3s +33%
sign_signature_extmu 4s 3s +33%
sign_signature_pre_hash_internal 4s 4s +0%
sign_signature_pre_hash_shake256 4s 3s +33%
sign_verify 4s 2s +100%
sys_check_capability 4s 2s +100%
unpack_sig 4s 3s +33%
unpack_sk 4s 6s -33%
use_hint 4s 1s +300%
fqscale 3s 2s +50%
keccak_init 3s 2s +50%
keccakf1600_xor_bytes (big endian) 3s 2s +50%
keccakf1600x4_permute 3s 3s +0%
mld_ct_abs_i32 3s 2s +50%
mld_ct_cmask_neg_i32 3s 2s +50%
mld_ct_cmask_nonzero_u8 3s 4s -25%
mld_keccakf1600_extract_bytes 3s 2s +50%
ntt_native_x86_64 3s 4s -25%
pack_sig_c_h 3s 4s -25%
poly_add 3s 4s -25%
poly_caddq 3s 4s -25%
poly_caddq_native_aarch64 3s 6s -50%
poly_chknorm 3s 5s -40%
poly_chknorm_native 3s 4s -25%
poly_ntt_c 3s 3s +0%
poly_shiftl 3s 5s -40%
poly_sub 3s 3s +0%
poly_use_hint 3s 6s -50%
poly_use_hint_native 3s 4s -25%
polyt1_unpack 3s 3s +0%
polyveck_pack_eta 3s 4s -25%
polyveck_unpack_eta 3s 4s -25%
polyvecl_permute_bitrev_to_custom 3s 4s -25%
polyw1_pack 3s 3s +0%
polyz_unpack 3s 6s -50%
rej_eta 3s 3s +0%
shake128_finalize 3s 4s -25%
shake128_squeeze 3s 2s +50%
shake256_init 3s 3s +0%
shake256x4_absorb_once 3s 4s -25%
sign_open 3s 4s -25%
sign_signature 3s 3s +0%
sign_verify_pre_hash_internal 3s 5s -40%
caddq 2s 2s +0%
decompose 2s 3s -33%
intt_native_x86_64 2s 4s -50%
keccakf1600x4_extract_bytes 2s 3s -33%
make_hint 2s 4s -50%
mld_ct_get_optblocker_i64 2s 3s -33%
mld_ct_get_optblocker_u8 2s 2s +0%
pack_pk 2s 3s -33%
pack_sig_z 2s 2s +0%
poly_decompose 2s 4s -50%
poly_decompose_native 2s 6s -67%
poly_invntt_tomont_native 2s 3s -33%
poly_ntt 2s 5s -60%
poly_ntt_native 2s 4s -50%
polyveck_pack_w1 2s 4s -50%
polyvecl_pack_eta 2s 5s -60%
polyvecl_pointwise_acc_montgomery 2s 6s -67%
polyz_pack 2s 3s -33%
power2round 2s 3s -33%
reduce32 2s 2s +0%
shake128_init 2s 4s -50%
shake128x4_absorb_once 2s 3s -33%
shake128x4_squeezeblocks 2s 3s -33%
shake256_release 2s 3s -33%
shake256_squeeze 2s 2s +0%
sign_verify_extmu 2s 6s -67%
keccakf1600_xor_bytes 1s 4s -75%
mld_ct_sel_int32 1s 4s -75%
mld_value_barrier_u32 1s 4s -75%
shake128_absorb 1s 3s -67%
shake128_release 1s 2s -50%
shake256_finalize 1s 4s -75%
shake256x4_squeezeblocks 1s 2s -50%

@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 9 times, most recently from 1ea9d5f to 8a19e9a February 5, 2026 06:05
@willieyz willieyz marked this pull request as ready for review February 5, 2026 06:39
@willieyz willieyz requested a review from a team as a code owner February 5, 2026 06:39
@willieyz willieyz marked this pull request as draft February 5, 2026 07:19
@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 46205 cycles 46203 cycles 1.00
ML-DSA-44 sign 131278 cycles 131278 cycles 1
ML-DSA-44 verify 47765 cycles 47768 cycles 1.00
ML-DSA-65 keypair 81014 cycles 81024 cycles 1.00
ML-DSA-65 sign 215785 cycles 215787 cycles 1.00
ML-DSA-65 verify 80057 cycles 80052 cycles 1.00
ML-DSA-87 keypair 132158 cycles 132151 cycles 1.00
ML-DSA-87 sign 276862 cycles 276816 cycles 1.00
ML-DSA-87 verify 130418 cycles 130384 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 114213 cycles 114155 cycles 1.00
ML-DSA-44 sign 418158 cycles 417994 cycles 1.00
ML-DSA-44 verify 122319 cycles 122262 cycles 1.00
ML-DSA-65 keypair 195508 cycles 195499 cycles 1.00
ML-DSA-65 sign 682497 cycles 682470 cycles 1.00
ML-DSA-65 verify 197760 cycles 197741 cycles 1.00
ML-DSA-87 keypair 322642 cycles 322656 cycles 1.00
ML-DSA-87 sign 864585 cycles 864584 cycles 1.00
ML-DSA-87 verify 328628 cycles 328653 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 34677 cycles 34696 cycles 1.00
ML-DSA-44 sign 120151 cycles 120195 cycles 1.00
ML-DSA-44 verify 38151 cycles 38145 cycles 1.00
ML-DSA-65 keypair 61275 cycles 60582 cycles 1.01
ML-DSA-65 sign 202094 cycles 200476 cycles 1.01
ML-DSA-65 verify 62940 cycles 62563 cycles 1.01
ML-DSA-87 keypair 93525 cycles 94602 cycles 0.99
ML-DSA-87 sign 236210 cycles 240494 cycles 0.98
ML-DSA-87 verify 95587 cycles 95761 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 93726 cycles 93889 cycles 1.00
ML-DSA-44 sign 333512 cycles 333450 cycles 1.00
ML-DSA-44 verify 99955 cycles 99851 cycles 1.00
ML-DSA-65 keypair 160065 cycles 160390 cycles 1.00
ML-DSA-65 sign 545794 cycles 545908 cycles 1.00
ML-DSA-65 verify 160881 cycles 160887 cycles 1.00
ML-DSA-87 keypair 267728 cycles 267405 cycles 1.00
ML-DSA-87 sign 707504 cycles 707235 cycles 1.00
ML-DSA-87 verify 270918 cycles 269967 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 276468 cycles 277102 cycles 1.00
ML-DSA-44 sign 818650 cycles 810656 cycles 1.01
ML-DSA-44 verify 276672 cycles 278882 cycles 0.99
ML-DSA-65 keypair 475323 cycles 478906 cycles 0.99
ML-DSA-65 sign 1367640 cycles 1360800 cycles 1.01
ML-DSA-65 verify 459822 cycles 466415 cycles 0.99
ML-DSA-87 keypair 825623 cycles 818822 cycles 1.01
ML-DSA-87 sign 1873209 cycles 1878770 cycles 1.00
ML-DSA-87 verify 800938 cycles 794467 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 69035 cycles 69134 cycles 1.00
ML-DSA-44 sign 187364 cycles 187688 cycles 1.00
ML-DSA-44 verify 69341 cycles 69282 cycles 1.00
ML-DSA-65 keypair 119503 cycles 119368 cycles 1.00
ML-DSA-65 sign 303527 cycles 300862 cycles 1.01
ML-DSA-65 verify 115926 cycles 115513 cycles 1.00
ML-DSA-87 keypair 203793 cycles 203546 cycles 1.00
ML-DSA-87 sign 394456 cycles 394636 cycles 1.00
ML-DSA-87 verify 195809 cycles 195483 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 57235 cycles 56751 cycles 1.01
ML-DSA-44 sign 181496 cycles 181670 cycles 1.00
ML-DSA-44 verify 61165 cycles 61146 cycles 1.00
ML-DSA-65 keypair 98680 cycles 98647 cycles 1.00
ML-DSA-65 sign 298309 cycles 298480 cycles 1.00
ML-DSA-65 verify 100528 cycles 100288 cycles 1.00
ML-DSA-87 keypair 152581 cycles 152587 cycles 1.00
ML-DSA-87 sign 355291 cycles 355235 cycles 1.00
ML-DSA-87 verify 153950 cycles 153556 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Graviton4

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 68156 cycles 68132 cycles 1.00
ML-DSA-44 sign 202004 cycles 201919 cycles 1.00
ML-DSA-44 verify 70775 cycles 70781 cycles 1.00
ML-DSA-65 keypair 120970 cycles 120914 cycles 1.00
ML-DSA-65 sign 331183 cycles 331101 cycles 1.00
ML-DSA-65 verify 117884 cycles 117908 cycles 1.00
ML-DSA-87 keypair 198649 cycles 198347 cycles 1.00
ML-DSA-87 sign 427544 cycles 427112 cycles 1.00
ML-DSA-87 verify 194417 cycles 194311 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 135070 cycles 134705 cycles 1.00
ML-DSA-44 sign 526006 cycles 524023 cycles 1.00
ML-DSA-44 verify 147853 cycles 147704 cycles 1.00
ML-DSA-65 keypair 226865 cycles 226528 cycles 1.00
ML-DSA-65 sign 860582 cycles 861852 cycles 1.00
ML-DSA-65 verify 235373 cycles 235761 cycles 1.00
ML-DSA-87 keypair 370367 cycles 371080 cycles 1.00
ML-DSA-87 sign 1079627 cycles 1079785 cycles 1.00
ML-DSA-87 verify 382615 cycles 383268 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 41639 cycles 42042 cycles 0.99
ML-DSA-44 sign 134495 cycles 135046 cycles 1.00
ML-DSA-44 verify 44953 cycles 45886 cycles 0.98
ML-DSA-65 keypair 72877 cycles 72408 cycles 1.01
ML-DSA-65 sign 214749 cycles 215490 cycles 1.00
ML-DSA-65 verify 73910 cycles 73252 cycles 1.01
ML-DSA-87 keypair 107778 cycles 107965 cycles 1.00
ML-DSA-87 sign 252308 cycles 254024 cycles 0.99
ML-DSA-87 verify 109196 cycles 111034 cycles 0.98

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 157593 cycles 157623 cycles 1.00
ML-DSA-44 sign 550359 cycles 549610 cycles 1.00
ML-DSA-44 verify 169225 cycles 169078 cycles 1.00
ML-DSA-65 keypair 267977 cycles 267943 cycles 1.00
ML-DSA-65 sign 903637 cycles 902493 cycles 1.00
ML-DSA-65 verify 274125 cycles 274108 cycles 1.00
ML-DSA-87 keypair 450990 cycles 447542 cycles 1.01
ML-DSA-87 sign 1162617 cycles 1156527 cycles 1.01
ML-DSA-87 verify 460584 cycles 457749 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Graviton3

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 72258 cycles 72244 cycles 1.00
ML-DSA-44 sign 211991 cycles 212021 cycles 1.00
ML-DSA-44 verify 75712 cycles 75740 cycles 1.00
ML-DSA-65 keypair 127432 cycles 127429 cycles 1.00
ML-DSA-65 sign 350175 cycles 350138 cycles 1.00
ML-DSA-65 verify 125364 cycles 125365 cycles 1.00
ML-DSA-87 keypair 208138 cycles 208164 cycles 1.00
ML-DSA-87 sign 448958 cycles 448891 cycles 1.00
ML-DSA-87 verify 205105 cycles 205092 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 128309 cycles 128287 cycles 1.00
ML-DSA-44 sign 447743 cycles 447655 cycles 1.00
ML-DSA-44 verify 138349 cycles 144617 cycles 0.96
ML-DSA-65 keypair 220300 cycles 220134 cycles 1.00
ML-DSA-65 sign 727626 cycles 727309 cycles 1.00
ML-DSA-65 verify 223200 cycles 223042 cycles 1.00
ML-DSA-87 keypair 365101 cycles 365095 cycles 1.00
ML-DSA-87 sign 926593 cycles 926085 cycles 1.00
ML-DSA-87 verify 372803 cycles 372794 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 120283 cycles 123215 cycles 0.98
ML-DSA-44 sign 447117 cycles 449447 cycles 0.99
ML-DSA-44 verify 131120 cycles 129997 cycles 1.01
ML-DSA-65 keypair 205159 cycles 204042 cycles 1.01
ML-DSA-65 sign 729240 cycles 726667 cycles 1.00
ML-DSA-65 verify 210548 cycles 209895 cycles 1.00
ML-DSA-87 keypair 336772 cycles 336983 cycles 1.00
ML-DSA-87 sign 923968 cycles 923345 cycles 1.00
ML-DSA-87 verify 346738 cycles 346079 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 212612 cycles 212810 cycles 1.00
ML-DSA-44 sign 759997 cycles 759720 cycles 1.00
ML-DSA-44 verify 228854 cycles 229136 cycles 1.00
ML-DSA-65 keypair 380708 cycles 380820 cycles 1.00
ML-DSA-65 sign 1252502 cycles 1251840 cycles 1.00
ML-DSA-65 verify 371854 cycles 372231 cycles 1.00
ML-DSA-87 keypair 605059 cycles 605579 cycles 1.00
ML-DSA-87 sign 1593982 cycles 1591706 cycles 1.00
ML-DSA-87 verify 618815 cycles 617581 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 828493 cycles 828629 cycles 1.00
ML-DSA-44 sign 3237874 cycles 3236899 cycles 1.00
ML-DSA-44 verify 920036 cycles 920218 cycles 1.00
ML-DSA-65 keypair 1414978 cycles 1413016 cycles 1.00
ML-DSA-65 sign 5366078 cycles 5357541 cycles 1.00
ML-DSA-65 verify 1482925 cycles 1480164 cycles 1.00
ML-DSA-87 keypair 2312703 cycles 2311040 cycles 1.00
ML-DSA-87 sign 6669160 cycles 6668340 cycles 1.00
ML-DSA-87 verify 2416765 cycles 2415856 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

| Benchmark suite | Current: 8a19e9a | Previous: 41da557 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 222746 cycles | 227029 cycles | 0.98 |
| ML-DSA-44 sign | 609985 cycles | 617875 cycles | 0.99 |
| ML-DSA-44 verify | 223898 cycles | 224701 cycles | 1.00 |
| ML-DSA-65 keypair | 396984 cycles | 412531 cycles | 0.96 |
| ML-DSA-65 sign | 1037227 cycles | 1061715 cycles | 0.98 |
| ML-DSA-65 verify | 375316 cycles | 387814 cycles | 0.97 |
| ML-DSA-87 keypair | 658105 cycles | 666611 cycles | 0.99 |
| ML-DSA-87 sign | 1352975 cycles | 1398456 cycles | 0.97 |
| ML-DSA-87 verify | 638484 cycles | 667131 cycles | 0.96 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

| Benchmark suite | Current: 8a19e9a | Previous: 41da557 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 314316 cycles | 322374 cycles | 0.98 |
| ML-DSA-44 sign | 1219077 cycles | 1200283 cycles | 1.02 |
| ML-DSA-44 verify | 347864 cycles | 342633 cycles | 1.02 |
| ML-DSA-65 keypair | 605825 cycles | 566673 cycles | 1.07 |
| ML-DSA-65 sign | 2034909 cycles | 1937222 cycles | 1.05 |
| ML-DSA-65 verify | 568560 cycles | 546998 cycles | 1.04 |
| ML-DSA-87 keypair | 877363 cycles | 869944 cycles | 1.01 |
| ML-DSA-87 sign | 2465004 cycles | 2468357 cycles | 1.00 |
| ML-DSA-87 verify | 897477 cycles | 906874 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

A possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
The benchmark result of this commit is worse than the previous result by more than the threshold ratio of 1.03.

| Benchmark suite | Current: 8a19e9a | Previous: 41da557 | Ratio |
|---|---|---|---|
| ML-DSA-65 keypair | 605825 cycles | 566673 cycles | 1.07 |
| ML-DSA-65 sign | 2034909 cycles | 1937222 cycles | 1.05 |
| ML-DSA-65 verify | 568560 cycles | 546998 cycles | 1.04 |

This comment was automatically generated by workflow using github-action-benchmark.
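
The Ratio column and the alert can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming the ratio is simply current cycles divided by previous cycles and that a row is flagged when the ratio exceeds 1.03; the function names are hypothetical, and the actual comparison logic inside github-action-benchmark may differ in detail.

```python
# Cycle counts copied from the Cortex-A72 (no-opt) tables in this thread,
# stored as (current, previous).
RESULTS = {
    "ML-DSA-44 sign": (1219077, 1200283),
    "ML-DSA-65 keypair": (605825, 566673),
    "ML-DSA-65 sign": (2034909, 1937222),
    "ML-DSA-65 verify": (568560, 546998),
}

# Alert threshold quoted in the bot comment.
THRESHOLD = 1.03


def ratio(current, previous):
    """Ratio of cycle counts; values above 1.0 mean the current commit is slower."""
    return current / previous


def regressions(results, threshold=THRESHOLD):
    """Return the rows whose ratio exceeds the threshold, rounded to two decimals."""
    flagged = {}
    for name, (current, previous) in results.items():
        r = ratio(current, previous)
        if r > threshold:
            flagged[name] = round(r, 2)
    return flagged


if __name__ == "__main__":
    for name, r in regressions(RESULTS).items():
        print(f"{name}: {r:.2f}")
```

Under these assumptions the three ML-DSA-65 rows are flagged, while ML-DSA-44 sign (ratio about 1.02) stays below the threshold.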

@willieyz willieyz marked this pull request as ready for review February 5, 2026 07:42
Contributor

@mkannwischer mkannwischer left a comment

Please apply the same changes as requested in #905

@mkannwischer mkannwischer marked this pull request as draft February 10, 2026 02:46
@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 15 times, most recently from 18a3b4d to 32689ac on February 12, 2026 06:23
This commit adds poly_use_hint to bench --components for benchmarking
the performance impact of the changes to:
- poly_use_hint_32
- poly_use_hint_88

Signed-off-by: willieyz <willie.zhao@chelpis.com>
In this PR, we replace the AVX2 intrinsics implementation of
poly_use_hint_32 and poly_use_hint_88 with an x86_64 assembly version.
This is part of the effort to enable HOL-Light proofs.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
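
For readers unfamiliar with what poly_use_hint_32 and poly_use_hint_88 compute, here is a scalar reference sketch of the per-coefficient operation. It follows the Decompose and UseHint routines of FIPS 204 with γ2 = (q−1)/32 (ML-DSA-65 and ML-DSA-87) and γ2 = (q−1)/88 (ML-DSA-44). It is not the assembly from this PR: the Python function names, the argument order, and the list-based `poly_use_hint` wrapper are illustrative assumptions, and the real code processes coefficients in AVX2 vectors.

```python
Q = 8380417  # ML-DSA modulus q

GAMMA2_32 = (Q - 1) // 32  # gamma2 for ML-DSA-65 and ML-DSA-87
GAMMA2_88 = (Q - 1) // 88  # gamma2 for ML-DSA-44


def mod_pm(r, alpha):
    """Centered remainder of r modulo an even alpha, in (-alpha/2, alpha/2]."""
    r0 = r % alpha
    if r0 > alpha // 2:
        r0 -= alpha
    return r0


def decompose(r, gamma2):
    """Split r into (r1, r0) with r = r1 * 2*gamma2 + r0 (mod q)."""
    r_plus = r % Q
    r0 = mod_pm(r_plus, 2 * gamma2)
    if r_plus - r0 == Q - 1:
        # Corner case: the high part would be m, so wrap it to 0.
        return 0, r0 - 1
    return (r_plus - r0) // (2 * gamma2), r0


def use_hint(h, r, gamma2):
    """Return the high bits of r, adjusted by one step if the hint bit is set."""
    m = (Q - 1) // (2 * gamma2)  # 16 for gamma2 = (q-1)/32, 44 for (q-1)/88
    r1, r0 = decompose(r, gamma2)
    if h == 1 and r0 > 0:
        return (r1 + 1) % m
    if h == 1:
        return (r1 - 1) % m
    return r1


def poly_use_hint(hints, coeffs, gamma2):
    """Hypothetical polynomial-level wrapper: apply use_hint coefficient-wise."""
    return [use_hint(h, a, gamma2) for h, a in zip(hints, coeffs)]
```

For example, `use_hint(1, 0, GAMMA2_88)` returns 43 and `use_hint(1, 0, GAMMA2_32)` returns 15, which shows the wrap-around of the high part modulo 44 and 16 respectively. A model like this can serve as an oracle when comparing an intrinsics version against an assembly version.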
@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch from 32689ac to 81d5192 on February 12, 2026 06:26
This commit extracts decompose 32/88 and use_hint 32/88 into macros.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch from 81d5192 to 377fdc8 on February 12, 2026 06:26
@willieyz
Contributor Author

> Please apply the same changes as requested in #905

Hello @mkannwischer, I have applied the same changes requested in #905, including:

  • Remove all uses of # for comments; use // and /*...*/ instead
  • Remove all vzeroupper instructions
  • Use 32-bit constants instead of 64-bit ones
  • Extract the decompose32/88 and use_hint32/88 macros (referencing the AArch64 versions), and add brief comments in the same style.

Thank you for your help!

@willieyz willieyz marked this pull request as ready for review February 12, 2026 06:43
Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of poly_use_hint with assembly

3 participants