Armv8.1-M: Add native Keccak x4 XORBytes and ExtractBytes #1524

Merged: mkannwischer merged 3 commits into main from mve-keccakx4-bitinterleaving on Feb 24, 2026
@mkannwischer
Contributor

See individual commits

@mkannwischer mkannwischer force-pushed the mve-keccakx4-bitinterleaving branch from f568d85 to 54dafae on January 23, 2026 04:31
Comment thread mlkem/src/fips202/native/api.h Outdated
@oqs-bot
Contributor

oqs-bot commented Jan 23, 2026

CBMC Results (ML-KEM-512)

Full Results (153 proofs)
Proof Current Previous Change
**TOTAL** 1182s 1313s -10.0%
mlk_indcpa_keypair_derand 177s 199s -11%
mlk_indcpa_enc 148s 179s -17%
mlk_keccak_squeezeblocks_x4 145s 188s -23%
mlk_rej_uniform_c 74s 75s -1%
mlk_polyvec_basemul_acc_montgomery_cached_c 42s 60s -30%
mlk_poly_rej_uniform 38s 37s +3%
poly_ntt_native 29s 32s -9%
mlk_ntt_layer 28s 30s -7%
mlk_polyvec_add 23s 27s -15%
polyvec_basemul_acc_montgomery_cached_native 21s 23s -9%
keccakf1600x4_permute_native_x4 20s 18s +11%
mlk_poly_reduce_native 16s 14s +14%
mlk_indcpa_dec 11s 10s +10%
mlk_poly_sub 11s 13s -15%
mlk_keccak_absorb_once_x4 9s 10s -10%
mlk_poly_frombytes_native 8s 10s -20%
mlk_keccak_squeeze_once 7s 7s +0%
mlk_keccak_squeezeblocks 7s 8s -12%
mlk_poly_rej_uniform_x4 7s 9s -22%
keccakf1600_permute_native 6s 8s -25%
mlk_fqmul 6s 6s +0%
mlk_invntt_layer 6s 4s +50%
mlk_ntt_butterfly_block 6s 10s -40%
poly_frombytes_native_x86_64 6s 6s +0%
keccakf1600x4_xor_bytes_native 5s - new
kem_enc_derand 5s 5s +0%
mlk_gen_matrix 5s 2s +150%
mlk_keccak_absorb_once 5s 5s +0%
mlk_poly_cbd_eta1 5s 4s +25%
mlk_poly_frommsg 5s 7s -29%
mlk_rej_uniform 5s 2s +150%
mlk_shake256x4 5s 4s +25%
intt_native_aarch64 4s 3s +33%
keccakf1600x4_extract_bytes_native 4s - new
kem_check_pk 4s 5s -20%
kem_dec 4s 3s +33%
kem_keypair_derand 4s 3s +33%
mlk_keccakf1600_permute 4s 2s +100%
mlk_keccakf1600_xor_bytes 4s 4s +0%
mlk_montgomery_reduce 4s 4s +0%
mlk_poly_decompress_dv 4s 3s +33%
mlk_poly_getnoise_eta1122_4x 4s 3s +33%
mlk_poly_getnoise_eta1_4x_native 4s 1s +300%
mlk_poly_mulcache_compute 4s 1s +300%
mlk_poly_tomont_c 4s 2s +100%
mlk_polymat_permute_bitrev_to_custom 4s 7s -43%
mlk_scalar_compress_d1 4s 3s +33%
mlk_scalar_compress_d10 4s 3s +33%
mlk_scalar_decompress_d4 4s 2s +100%
ntt_native_aarch64 4s 2s +100%
poly_invntt_tomont_native 4s 3s +33%
poly_tomont_native_aarch64 4s 3s +33%
polyvec_basemul_acc_montgomery_cached_k4_native_x86_64 4s 2s +100%
keccak_f1600_x1_native_aarch64 3s 1s +200%
keccak_f1600_x1_native_aarch64_v84a 3s 1s +200%
kem_check_sk 3s 2s +50%
kem_enc 3s 3s +0%
kem_keypair 3s 2s +50%
mlk_poly_cbd_eta2 3s 2s +50%
mlk_poly_compress_du 3s 2s +50%
mlk_poly_compress_dv 3s 3s +0%
mlk_poly_frombytes 3s 2s +50%
mlk_poly_getnoise_eta2 3s 2s +50%
mlk_poly_invntt_tomont_c 3s 3s +0%
mlk_poly_mulcache_compute_c 3s 4s -25%
mlk_poly_tobytes 3s 3s +0%
mlk_poly_tobytes_c 3s 1s +200%
mlk_poly_tomsg 3s 3s +0%
mlk_polyvec_decompress_du 3s 3s +0%
mlk_polyvec_mulcache_compute 3s 4s -25%
mlk_polyvec_ntt 3s 1s +200%
mlk_polyvec_tobytes 3s 1s +200%
mlk_scalar_compress_d4 3s 3s +0%
mlk_scalar_decompress_d10 3s 2s +50%
mlk_scalar_signed_to_unsigned_q 3s 4s -25%
mlk_value_barrier_u8 3s 3s +0%
poly_getnoise_eta1122_4x_native 3s 3s +0%
poly_reduce_native_aarch64 3s 2s +50%
poly_tobytes_native_aarch64 3s 6s -50%
polyvec_basemul_acc_montgomery_cached_k2_native_x86_64 3s 1s +200%
polyvec_basemul_acc_montgomery_cached_k3_native_aarch64 3s 2s +50%
intt_native_x86_64 2s 2s +0%
keccak_f1600_x4_native_aarch64_v84a 2s 2s +0%
keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid 2s 3s -33%
keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid 2s 2s +0%
mlk_barrett_reduce 2s 2s +0%
mlk_check_pct 2s 1s +100%
mlk_ct_cmask_neg_i16 2s 1s +100%
mlk_ct_cmask_nonzero_u8 2s 2s +0%
mlk_ct_cmov_zero 2s 2s +0%
mlk_ct_sel_int16 2s 2s +0%
mlk_ct_sel_uint8 2s 1s +100%
mlk_gen_matrix_serial 2s 4s -50%
mlk_keccakf1600_extract_bytes 2s 3s -33%
mlk_keccakf1600_extract_bytes (big endian) 2s 3s -33%
mlk_keccakf1600_xor_bytes (big endian) 2s 4s -50%
mlk_keccakf1600x4_extract_bytes 2s 2s +0%
mlk_keccakf1600x4_permute 2s 2s +0%
mlk_keccakf1600x4_xor_bytes 2s 2s +0%
mlk_poly_decompress_du 2s 2s +0%
mlk_poly_frombytes_c 2s 3s -33%
mlk_poly_invntt_tomont 2s 4s -50%
mlk_poly_mulcache_compute_native 2s 4s -50%
mlk_poly_ntt 2s 2s +0%
mlk_poly_ntt_c 2s 1s +100%
mlk_poly_reduce 2s 3s -33%
mlk_poly_tobytes_native 2s 2s +0%
mlk_poly_tomont 2s 2s +0%
mlk_polyvec_basemul_acc_montgomery_cached 2s 2s +0%
mlk_polyvec_frombytes 2s 2s +0%
mlk_polyvec_permute_bitrev_to_custom 2s 3s -33%
mlk_polyvec_permute_bitrev_to_custom_native 2s 3s -33%
mlk_scalar_compress_d11 2s 3s -33%
mlk_scalar_decompress_d11 2s 4s -50%
mlk_sha3_256 2s 1s +100%
mlk_sha3_512 2s 1s +100%
mlk_shake128_absorb_once 2s 3s -33%
mlk_shake128_squeezeblocks 2s 4s -50%
mlk_shake128x4_absorb_once 2s 1s +100%
mlk_shake128x4_squeezeblocks 2s 2s +0%
mlk_value_barrier_i32 2s 2s +0%
mlk_value_barrier_u32 2s 4s -50%
ntt_native_x86_64 2s 5s -60%
poly_mulcache_compute_native_aarch64 2s 2s +0%
poly_reduce_native_x86_64 2s 6s -67%
poly_tobytes_native_x86_64 2s 3s -33%
poly_tomont_native_x86_64 2s 2s +0%
polyvec_basemul_acc_montgomery_cached_k2_native_aarch64 2s 2s +0%
polyvec_basemul_acc_montgomery_cached_k3_native_x86_64 2s 3s -33%
polyvec_basemul_acc_montgomery_cached_k4_native_aarch64 2s 1s +100%
rej_uniform_native_aarch64 2s 4s -50%
sys_check_capability 2s 2s +0%
mlk_ct_cmask_nonzero_u16 1s 1s +0%
mlk_ct_get_optblocker_i32 1s 1s +0%
mlk_ct_get_optblocker_u32 1s 2s -50%
mlk_ct_get_optblocker_u8 1s 2s -50%
mlk_ct_memcmp 1s 2s -50%
mlk_matvec_mul 1s 1s +0%
mlk_poly_add 1s 1s +0%
mlk_poly_getnoise_eta1_4x 1s 3s -67%
mlk_poly_reduce_c 1s 3s -67%
mlk_poly_tomont_native 1s 1s +0%
mlk_polyvec_compress_du 1s 5s -80%
mlk_polyvec_invntt_tomont 1s 3s -67%
mlk_polyvec_reduce 1s 2s -50%
mlk_polyvec_tomont 1s 2s -50%
mlk_scalar_compress_d5 1s 2s -50%
mlk_scalar_decompress_d5 1s 1s +0%
mlk_shake256 1s 3s -67%
nttunpack_native_x86_64 1s 3s -67%
poly_mulcache_compute_native_x86_64 1s 2s -50%
rej_uniform_native 1s 3s -67%
rej_uniform_native_x86_64 1s 2s -50%

@oqs-bot
Contributor

oqs-bot commented Jan 23, 2026

CBMC Results (ML-KEM-1024)

Full Results (153 proofs)
Proof Current Previous Change
**TOTAL** 2576s 2545s +1.2%
mlk_indcpa_enc 1323s 1294s +2%
mlk_indcpa_keypair_derand 227s 220s +3%
mlk_keccak_squeezeblocks_x4 139s 156s -11%
polyvec_basemul_acc_montgomery_cached_native 132s 120s +10%
mlk_rej_uniform_c 79s 77s +3%
mlk_polyvec_basemul_acc_montgomery_cached_c 57s 62s -8%
mlk_poly_rej_uniform 38s 36s +6%
mlk_poly_decompress_dv 26s 26s +0%
poly_ntt_native 25s 28s -11%
keccakf1600x4_permute_native_x4 21s 18s +17%
mlk_indcpa_dec 20s 17s +18%
mlk_ntt_layer 18s 21s -14%
mlk_poly_reduce_native 16s 14s +14%
mlk_polyvec_ntt 13s 14s -7%
mlk_keccak_absorb_once_x4 12s 10s +20%
mlk_poly_sub 12s 13s -8%
mlk_polyvec_add 11s 11s +0%
mlk_ntt_butterfly_block 10s 9s +11%
mlk_gen_matrix 9s 9s +0%
mlk_poly_rej_uniform_x4 9s 8s +12%
kem_dec 8s 5s +60%
mlk_fqmul 8s 6s +33%
mlk_poly_compress_du 8s 10s -20%
mlk_keccak_squeezeblocks 7s 7s +0%
mlk_poly_frombytes_native 7s 10s -30%
mlk_poly_frommsg 7s 8s -12%
keccakf1600_permute_native 6s 6s +0%
mlk_ct_cmask_nonzero_u8 6s 2s +200%
mlk_keccak_squeeze_once 6s 6s +0%
mlk_poly_getnoise_eta1_4x 6s 2s +200%
rej_uniform_native 6s 5s +20%
kem_check_pk 5s 4s +25%
mlk_gen_matrix_serial 5s 6s -17%
mlk_keccakf1600_permute 5s 3s +67%
mlk_poly_compress_dv 5s 1s +400%
mlk_poly_getnoise_eta1_4x_native 5s 3s +67%
mlk_polyvec_permute_bitrev_to_custom_native 5s 4s +25%
ntt_native_aarch64 5s 4s +25%
poly_invntt_tomont_native 5s 1s +400%
keccakf1600x4_xor_bytes_native 4s - new
mlk_ct_get_optblocker_i32 4s 3s +33%
mlk_ct_sel_uint8 4s 1s +300%
mlk_invntt_layer 4s 7s -43%
mlk_keccak_absorb_once 4s 4s +0%
mlk_keccakf1600_xor_bytes (big endian) 4s 3s +33%
mlk_poly_tomsg 4s 5s -20%
mlk_polyvec_decompress_du 4s 1s +300%
mlk_scalar_compress_d5 4s 2s +100%
mlk_shake256x4 4s 4s +0%
ntt_native_x86_64 4s 1s +300%
poly_frombytes_native_x86_64 4s 8s -50%
intt_native_x86_64 3s 2s +50%
keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid 3s 1s +200%
keccakf1600x4_extract_bytes_native 3s - new
kem_keypair 3s 2s +50%
mlk_check_pct 3s 2s +50%
mlk_keccakf1600_extract_bytes 3s 3s +0%
mlk_keccakf1600_extract_bytes (big endian) 3s 1s +200%
mlk_keccakf1600_xor_bytes 3s 1s +200%
mlk_matvec_mul 3s 3s +0%
mlk_poly_cbd_eta2 3s 2s +50%
mlk_poly_decompress_du 3s 1s +200%
mlk_poly_frombytes_c 3s 2s +50%
mlk_poly_invntt_tomont_c 3s 3s +0%
mlk_poly_tobytes 3s 1s +200%
mlk_poly_tomont_c 3s 3s +0%
mlk_polymat_permute_bitrev_to_custom 3s 4s -25%
mlk_polyvec_compress_du 3s 2s +50%
mlk_scalar_compress_d11 3s 5s -40%
mlk_scalar_decompress_d11 3s 2s +50%
mlk_scalar_signed_to_unsigned_q 3s 3s +0%
mlk_sha3_256 3s 2s +50%
mlk_shake128x4_absorb_once 3s 1s +200%
mlk_value_barrier_u32 3s 1s +200%
nttunpack_native_x86_64 3s 2s +50%
poly_reduce_native_aarch64 3s 3s +0%
poly_tobytes_native_aarch64 3s 1s +200%
polyvec_basemul_acc_montgomery_cached_k2_native_aarch64 3s 2s +50%
polyvec_basemul_acc_montgomery_cached_k2_native_x86_64 3s 4s -25%
polyvec_basemul_acc_montgomery_cached_k3_native_aarch64 3s 5s -40%
polyvec_basemul_acc_montgomery_cached_k3_native_x86_64 3s 2s +50%
polyvec_basemul_acc_montgomery_cached_k4_native_aarch64 3s 4s -25%
rej_uniform_native_aarch64 3s 5s -40%
keccak_f1600_x1_native_aarch64_v84a 2s 2s +0%
keccak_f1600_x4_native_aarch64_v84a 2s 2s +0%
keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid 2s 2s +0%
kem_check_sk 2s 3s -33%
kem_enc 2s 4s -50%
kem_enc_derand 2s 4s -50%
kem_keypair_derand 2s 2s +0%
mlk_barrett_reduce 2s 1s +100%
mlk_ct_cmov_zero 2s 3s -33%
mlk_ct_get_optblocker_u32 2s 3s -33%
mlk_ct_memcmp 2s 4s -50%
mlk_ct_sel_int16 2s 2s +0%
mlk_keccakf1600x4_extract_bytes 2s 4s -50%
mlk_keccakf1600x4_permute 2s 3s -33%
mlk_keccakf1600x4_xor_bytes 2s 4s -50%
mlk_montgomery_reduce 2s 3s -33%
mlk_poly_add 2s 4s -50%
mlk_poly_cbd_eta1 2s 1s +100%
mlk_poly_frombytes 2s 2s +0%
mlk_poly_getnoise_eta1122_4x 2s 3s -33%
mlk_poly_getnoise_eta2 2s 4s -50%
mlk_poly_invntt_tomont 2s 2s +0%
mlk_poly_mulcache_compute 2s 1s +100%
mlk_poly_mulcache_compute_c 2s 4s -50%
mlk_poly_mulcache_compute_native 2s 2s +0%
mlk_poly_ntt 2s 3s -33%
mlk_poly_ntt_c 2s 2s +0%
mlk_poly_reduce_c 2s 2s +0%
mlk_poly_tobytes_c 2s 3s -33%
mlk_poly_tobytes_native 2s 2s +0%
mlk_poly_tomont_native 2s 1s +100%
mlk_polyvec_basemul_acc_montgomery_cached 2s 3s -33%
mlk_polyvec_frombytes 2s 2s +0%
mlk_polyvec_invntt_tomont 2s 2s +0%
mlk_polyvec_tobytes 2s 2s +0%
mlk_polyvec_tomont 2s 2s +0%
mlk_rej_uniform 2s 2s +0%
mlk_scalar_compress_d1 2s 3s -33%
mlk_scalar_compress_d10 2s 1s +100%
mlk_scalar_compress_d4 2s 1s +100%
mlk_scalar_decompress_d10 2s 2s +0%
mlk_scalar_decompress_d4 2s 2s +0%
mlk_scalar_decompress_d5 2s 1s +100%
mlk_shake128x4_squeezeblocks 2s 2s +0%
mlk_value_barrier_u8 2s 2s +0%
poly_getnoise_eta1122_4x_native 2s 3s -33%
poly_mulcache_compute_native_aarch64 2s 5s -60%
poly_mulcache_compute_native_x86_64 2s 3s -33%
rej_uniform_native_x86_64 2s 4s -50%
sys_check_capability 2s 4s -50%
intt_native_aarch64 1s 1s +0%
keccak_f1600_x1_native_aarch64 1s 3s -67%
mlk_ct_cmask_neg_i16 1s 4s -75%
mlk_ct_cmask_nonzero_u16 1s 1s +0%
mlk_ct_get_optblocker_u8 1s 3s -67%
mlk_poly_reduce 1s 2s -50%
mlk_poly_tomont 1s 1s +0%
mlk_polyvec_mulcache_compute 1s 3s -67%
mlk_polyvec_permute_bitrev_to_custom 1s 2s -50%
mlk_polyvec_reduce 1s 4s -75%
mlk_sha3_512 1s 1s +0%
mlk_shake128_absorb_once 1s 1s +0%
mlk_shake128_squeezeblocks 1s 1s +0%
mlk_shake256 1s 3s -67%
mlk_value_barrier_i32 1s 4s -75%
poly_reduce_native_x86_64 1s 2s -50%
poly_tobytes_native_x86_64 1s 2s -50%
poly_tomont_native_aarch64 1s 2s -50%
poly_tomont_native_x86_64 1s 2s -50%
polyvec_basemul_acc_montgomery_cached_k4_native_x86_64 1s 5s -80%

@oqs-bot
Contributor

oqs-bot commented Jan 23, 2026

CBMC Results (ML-KEM-768)

Full Results (153 proofs)
Proof Current Previous Change
**TOTAL** 1376s 1782s -22.8%
mlk_indcpa_keypair_derand 229s 291s -21%
mlk_indcpa_enc 211s 262s -19%
mlk_keccak_squeezeblocks_x4 163s 232s -30%
mlk_rej_uniform_c 64s 119s -46%
polyvec_basemul_acc_montgomery_cached_native 62s 77s -19%
mlk_polyvec_basemul_acc_montgomery_cached_c 48s 84s -43%
mlk_poly_rej_uniform 29s 48s -40%
mlk_polyvec_add 28s 35s -20%
poly_ntt_native 27s 46s -41%
keccakf1600x4_permute_native_x4 19s 20s -5%
mlk_ntt_layer 17s 32s -47%
mlk_indcpa_dec 15s 21s -29%
mlk_poly_reduce_native 12s 20s -40%
mlk_poly_sub 10s 13s -23%
mlk_keccak_absorb_once_x4 9s 12s -25%
mlk_poly_rej_uniform_x4 8s 11s -27%
keccakf1600_permute_native 7s 6s +17%
mlk_ntt_butterfly_block 7s 12s -42%
mlk_poly_frommsg 7s 11s -36%
kem_dec 6s 7s -14%
mlk_fqmul 6s 8s -25%
mlk_invntt_layer 6s 7s -14%
mlk_keccak_squeeze_once 6s 10s -40%
mlk_keccak_squeezeblocks 6s 11s -45%
mlk_poly_frombytes_native 6s 12s -50%
mlk_polymat_permute_bitrev_to_custom 6s 8s -25%
mlk_polyvec_permute_bitrev_to_custom 6s 3s +100%
poly_mulcache_compute_native_x86_64 6s 3s +100%
polyvec_basemul_acc_montgomery_cached_k4_native_x86_64 6s 2s +200%
mlk_ct_get_optblocker_u32 5s 2s +150%
mlk_ct_get_optblocker_u8 5s 4s +25%
mlk_keccakf1600_permute 5s 7s -29%
mlk_shake256x4 5s 5s +0%
poly_tomont_native_x86_64 5s 2s +150%
polyvec_basemul_acc_montgomery_cached_k2_native_aarch64 5s 2s +150%
keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid 4s 2s +100%
kem_check_pk 4s 4s +0%
kem_check_sk 4s 3s +33%
kem_enc_derand 4s 3s +33%
mlk_ct_sel_uint8 4s 3s +33%
mlk_gen_matrix_serial 4s 6s -33%
mlk_keccak_absorb_once 4s 4s +0%
mlk_matvec_mul 4s 2s +100%
mlk_poly_cbd_eta2 4s 3s +33%
mlk_poly_compress_du 4s 4s +0%
mlk_poly_decompress_dv 4s 3s +33%
mlk_poly_getnoise_eta1_4x_native 4s 3s +33%
mlk_polyvec_invntt_tomont 4s 2s +100%
mlk_polyvec_reduce 4s 3s +33%
mlk_scalar_compress_d1 4s 1s +300%
mlk_scalar_compress_d4 4s 2s +100%
mlk_scalar_compress_d5 4s 4s +0%
mlk_scalar_decompress_d11 4s 3s +33%
mlk_shake128_squeezeblocks 4s 4s +0%
mlk_shake128x4_squeezeblocks 4s 3s +33%
mlk_value_barrier_u32 4s 3s +33%
ntt_native_x86_64 4s 3s +33%
nttunpack_native_x86_64 4s 2s +100%
poly_frombytes_native_x86_64 4s 7s -43%
poly_tobytes_native_x86_64 4s 3s +33%
intt_native_aarch64 3s 2s +50%
keccak_f1600_x4_native_aarch64_v84a 3s 1s +200%
keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid 3s 2s +50%
keccakf1600x4_xor_bytes_native 3s - new
kem_enc 3s 3s +0%
kem_keypair 3s 4s -25%
kem_keypair_derand 3s 3s +0%
mlk_ct_cmov_zero 3s 1s +200%
mlk_ct_get_optblocker_i32 3s 4s -25%
mlk_ct_memcmp 3s 3s +0%
mlk_gen_matrix 3s 3s +0%
mlk_keccakf1600_extract_bytes (big endian) 3s 2s +50%
mlk_keccakf1600x4_extract_bytes 3s 3s +0%
mlk_poly_frombytes 3s 3s +0%
mlk_poly_getnoise_eta1122_4x 3s 3s +0%
mlk_poly_invntt_tomont 3s 4s -25%
mlk_poly_invntt_tomont_c 3s 3s +0%
mlk_poly_mulcache_compute 3s 3s +0%
mlk_poly_mulcache_compute_c 3s 2s +50%
mlk_poly_reduce 3s 2s +50%
mlk_poly_reduce_c 3s 4s -25%
mlk_poly_tobytes 3s 6s -50%
mlk_poly_tobytes_native 3s 4s -25%
mlk_poly_tomont_native 3s 3s +0%
mlk_polyvec_frombytes 3s 3s +0%
mlk_polyvec_mulcache_compute 3s 5s -40%
mlk_polyvec_ntt 3s 3s +0%
mlk_polyvec_permute_bitrev_to_custom_native 3s 3s +0%
mlk_polyvec_tobytes 3s 3s +0%
mlk_polyvec_tomont 3s 2s +50%
mlk_scalar_compress_d10 3s 2s +50%
mlk_scalar_decompress_d4 3s 2s +50%
mlk_scalar_signed_to_unsigned_q 3s 4s -25%
mlk_sha3_512 3s 2s +50%
mlk_shake128_absorb_once 3s 3s +0%
mlk_shake256 3s 3s +0%
poly_getnoise_eta1122_4x_native 3s 4s -25%
poly_mulcache_compute_native_aarch64 3s 2s +50%
poly_reduce_native_aarch64 3s 3s +0%
polyvec_basemul_acc_montgomery_cached_k2_native_x86_64 3s 4s -25%
rej_uniform_native_aarch64 3s 2s +50%
rej_uniform_native_x86_64 3s 3s +0%
intt_native_x86_64 2s 3s -33%
keccak_f1600_x1_native_aarch64 2s 2s +0%
keccak_f1600_x1_native_aarch64_v84a 2s 4s -50%
mlk_barrett_reduce 2s 5s -60%
mlk_ct_cmask_neg_i16 2s 2s +0%
mlk_ct_sel_int16 2s 2s +0%
mlk_keccakf1600_xor_bytes (big endian) 2s 2s +0%
mlk_keccakf1600x4_permute 2s 4s -50%
mlk_montgomery_reduce 2s 2s +0%
mlk_poly_add 2s 2s +0%
mlk_poly_cbd_eta1 2s 3s -33%
mlk_poly_compress_dv 2s 7s -71%
mlk_poly_decompress_du 2s 4s -50%
mlk_poly_frombytes_c 2s 3s -33%
mlk_poly_getnoise_eta1_4x 2s 4s -50%
mlk_poly_getnoise_eta2 2s 4s -50%
mlk_poly_mulcache_compute_native 2s 3s -33%
mlk_poly_ntt 2s 5s -60%
mlk_poly_ntt_c 2s 2s +0%
mlk_poly_tobytes_c 2s 4s -50%
mlk_poly_tomont_c 2s 3s -33%
mlk_poly_tomsg 2s 3s -33%
mlk_polyvec_basemul_acc_montgomery_cached 2s 2s +0%
mlk_polyvec_decompress_du 2s 1s +100%
mlk_rej_uniform 2s 2s +0%
mlk_scalar_compress_d11 2s 3s -33%
mlk_scalar_decompress_d5 2s 2s +0%
mlk_sha3_256 2s 2s +0%
mlk_shake128x4_absorb_once 2s 3s -33%
mlk_value_barrier_i32 2s 1s +100%
mlk_value_barrier_u8 2s 2s +0%
poly_reduce_native_x86_64 2s 1s +100%
poly_tobytes_native_aarch64 2s 1s +100%
polyvec_basemul_acc_montgomery_cached_k3_native_aarch64 2s 2s +0%
polyvec_basemul_acc_montgomery_cached_k3_native_x86_64 2s 3s -33%
polyvec_basemul_acc_montgomery_cached_k4_native_aarch64 2s 5s -60%
sys_check_capability 2s 2s +0%
keccakf1600x4_extract_bytes_native 1s - new
mlk_check_pct 1s 4s -75%
mlk_ct_cmask_nonzero_u16 1s 3s -67%
mlk_ct_cmask_nonzero_u8 1s 2s -50%
mlk_keccakf1600_extract_bytes 1s 3s -67%
mlk_keccakf1600_xor_bytes 1s 2s -50%
mlk_keccakf1600x4_xor_bytes 1s 1s +0%
mlk_poly_tomont 1s 4s -75%
mlk_polyvec_compress_du 1s 1s +0%
mlk_scalar_decompress_d10 1s 2s -50%
ntt_native_aarch64 1s 1s +0%
poly_invntt_tomont_native 1s 1s +0%
poly_tomont_native_aarch64 1s 2s -50%
rej_uniform_native 1s 5s -80%

@mkannwischer mkannwischer force-pushed the mve-keccakx4-bitinterleaving branch 2 times, most recently from 3ca7030 to b35d774, on January 23, 2026 08:07
Comment thread mlkem/src/fips202/native/api.h
Comment thread dev/fips202/armv81m/src/state_xor_bytes_x4_mve.S Outdated
Comment thread dev/fips202/armv81m/src/state_xor_bytes_x4_mve.S Outdated
Comment thread mlkem/src/fips202/native/auto.h Outdated
Comment thread mlkem/src/fips202/native/armv81m/src/keccak_f1600_x4_mve.c Outdated
@mkannwischer
Contributor Author

I have opened #1531 to remind us that we still have to superoptimize this code.

Comment thread mlkem/src/fips202/keccakf1600.c Outdated
@mkannwischer mkannwischer force-pushed the mve-keccakx4-bitinterleaving branch 2 times, most recently from 5fce885 to a446149, on February 6, 2026 09:40
@mkannwischer mkannwischer marked this pull request as ready for review February 7, 2026 03:52
@mkannwischer mkannwischer requested a review from a team as a code owner February 7, 2026 03:52
Contributor

@hanno-becker hanno-becker left a comment

Following other backend functions, could we have a separate CBMC proof for foo_c, and then the foo and foo_native call foo_c by contract? AFAIU, we currently still inline foo_c into the proofs of foo and foo_native.

@mkannwischer
Contributor Author

> Following other backend functions, could we have a separate CBMC proof for foo_c, and then the foo and foo_native call foo_c by contract? AFAIU, we currently still inline foo_c into the proofs of foo and foo_native.

I don't see a separate proof for mlk_keccakf1600_permute_c. I followed the same approach.

Contributor

@hanno-becker hanno-becker left a comment

Thank you for all the work @mkannwischer @bremoran!

I must admit that I find neither explanation of the ASM satisfactory. Could you put yourself in the shoes of someone who hasn't worked on this for months, and explain the layout of the interleaved state in plain terms? What is a "stream"? What is a "bitplane"?

@hanno-becker
Contributor

hanno-becker commented Feb 7, 2026

> I don't see a separate proof for mlk_keccakf1600_permute_c. I followed the same approach.

OK, we can do this as a follow-up. This seems to be inconsistent between the arithmetic and FIPS202 backends.

Comment thread dev/fips202/armv81m/src/state_extract_bytes_x4_mve.S Outdated
Comment thread mlkem/src/fips202/native/armv81m/src/state_extract_bytes_x4_mve.S
Comment thread dev/fips202/armv81m/src/state_extract_bytes_x4_mve.S
Contributor

@hanno-becker hanno-becker left a comment

Apologies for the long delay, @bremoran @mkannwischer. I wanted to look at this properly and couldn't find time until now.

I have reviewed the assembly and am very happy with it. I applaud you for the clarity and documentation, which is great and will simplify any further work in the future.

As mentioned, I think some uniformity regarding the presence/absence of macro arguments would be an improvement, but I leave it to you whether you want to do this as a follow-up or not.

@mkannwischer mkannwischer force-pushed the mve-keccakx4-bitinterleaving branch from f5de3c9 to 07baf7d on February 24, 2026 08:06
mkannwischer and others added 3 commits February 24, 2026 16:07
Replace test_keccakf1600x4_permute with test_keccakf1600x4_xor_permute_extract
that tests the full x4 Keccak flow (xor_bytes, permute, extract_bytes) against
the x1 C reference implementation.

Testing through the public interface rather than comparing internal state
directly allows verifying backends that use custom state representations
(e.g., bit-interleaved) without requiring state conversion functions.

The test uses random offsets and lengths for both xor_bytes and extract_bytes,
and verifies each of the 4 lanes independently against the x1 reference.
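
The flow described above can be sketched in C. This is a minimal illustration under stated assumptions, not the repository's test: every name here (`ref_state`, `x4_state`, `check_x4_against_x1`, and so on) is hypothetical, the "permutation" is a toy mixing step rather than Keccak-f[1600], and the x4 backend uses a simple lane-striped byte layout as a stand-in for a bit-interleaved state. The point it shows is that comparing only extracted bytes never requires converting the internal state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATE_BYTES 200 /* size of one Keccak-f[1600] state in bytes */
#define WAYS 4          /* number of parallel states */

/* Toy mixing step; a stand-in for the permutation, NOT Keccak-f[1600]. */
static uint8_t mix(uint8_t x, uint8_t y)
{
  return (uint8_t)(((x << 1) | (x >> 7)) ^ y);
}

/* x1 reference: plain byte-array state. */
typedef struct { uint8_t s[STATE_BYTES]; } ref_state;

static void ref_xor_bytes(ref_state *st, const uint8_t *in, size_t off, size_t len)
{
  assert(off <= STATE_BYTES && len <= STATE_BYTES - off);
  for (size_t i = 0; i < len; i++) st->s[off + i] ^= in[i];
}

static void ref_permute(ref_state *st)
{
  uint8_t t[STATE_BYTES];
  for (size_t i = 0; i < STATE_BYTES; i++)
    t[i] = mix(st->s[i], st->s[(i + 1) % STATE_BYTES]);
  memcpy(st->s, t, sizeof t);
}

static void ref_extract_bytes(const ref_state *st, uint8_t *out, size_t off, size_t len)
{
  assert(off <= STATE_BYTES && len <= STATE_BYTES - off);
  for (size_t i = 0; i < len; i++) out[i] = st->s[off + i];
}

/* Hypothetical x4 backend with a custom layout: byte i of state j is stored
 * at s[WAYS * i + j]. Only the three x4_* functions know about this. */
typedef struct { uint8_t s[WAYS * STATE_BYTES]; } x4_state;

static void x4_xor_bytes(x4_state *st, uint8_t in[WAYS][STATE_BYTES], size_t off, size_t len)
{
  assert(off <= STATE_BYTES && len <= STATE_BYTES - off);
  for (size_t j = 0; j < WAYS; j++)
    for (size_t i = 0; i < len; i++) st->s[WAYS * (off + i) + j] ^= in[j][i];
}

static void x4_permute(x4_state *st)
{
  uint8_t t[WAYS * STATE_BYTES];
  for (size_t i = 0; i < STATE_BYTES; i++)
    for (size_t j = 0; j < WAYS; j++)
      t[WAYS * i + j] = mix(st->s[WAYS * i + j], st->s[WAYS * ((i + 1) % STATE_BYTES) + j]);
  memcpy(st->s, t, sizeof t);
}

static void x4_extract_bytes(const x4_state *st, uint8_t out[WAYS][STATE_BYTES], size_t off, size_t len)
{
  assert(off <= STATE_BYTES && len <= STATE_BYTES - off);
  for (size_t j = 0; j < WAYS; j++)
    for (size_t i = 0; i < len; i++) out[j][i] = st->s[WAYS * (off + i) + j];
}

/* Small deterministic generator so the sketch needs no external RNG. */
static uint32_t next_rand(uint32_t *s)
{
  *s = *s * 1664525u + 1013904223u;
  return *s >> 8;
}

/* Returns 0 if every x4 lane matches the x1 reference, -1 otherwise. */
int check_x4_against_x1(uint32_t seed)
{
  uint8_t in[WAYS][STATE_BYTES], out4[WAYS][STATE_BYTES], out1[STATE_BYTES];
  x4_state st4;
  memset(&st4, 0, sizeof st4);
  for (size_t j = 0; j < WAYS; j++)
    for (size_t i = 0; i < STATE_BYTES; i++) in[j][i] = (uint8_t)next_rand(&seed);

  /* Random offsets and lengths for both xor_bytes and extract_bytes. */
  size_t xoff = next_rand(&seed) % STATE_BYTES;
  size_t xlen = next_rand(&seed) % (STATE_BYTES - xoff + 1);
  size_t eoff = next_rand(&seed) % STATE_BYTES;
  size_t elen = next_rand(&seed) % (STATE_BYTES - eoff + 1);

  x4_xor_bytes(&st4, in, xoff, xlen);
  x4_permute(&st4);
  x4_extract_bytes(&st4, out4, eoff, elen);

  /* Check each lane independently against the x1 reference. */
  for (size_t j = 0; j < WAYS; j++)
  {
    ref_state st1;
    memset(&st1, 0, sizeof st1);
    ref_xor_bytes(&st1, in[j], xoff, xlen);
    ref_permute(&st1);
    ref_extract_bytes(&st1, out1, eoff, elen);
    if (memcmp(out1, out4[j], elen) != 0) return -1;
  }
  return 0;
}
```

Calling `check_x4_against_x1` with a handful of seeds exercises different offset and length combinations; the x4 state layout could be changed freely without touching the comparison logic.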

Also reduce functional test iterations for M55 baremetal platform.

Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>

Extend the FIPS202 native backend API to support implementing XORBytes and
ExtractBytes steps in native code.

This is essential for backends using custom state representations (e.g.,
bit-interleaved state), where these functions handle conversion to/from
the internal format on-the-fly. In such cases, they also account for a
significant amount of processing time.

New flags:
- MLK_USE_FIPS202_X4_XOR_BYTES_NATIVE: Backend provides native XOR bytes
- MLK_USE_FIPS202_X4_EXTRACT_BYTES_NATIVE: Backend provides native extract bytes

When set, backends provide native implementations for:
- mlk_keccakf1600_xor_bytes_x4_native: XOR input data into state
- mlk_keccakf1600_extract_bytes_x4_native: Extract output from state
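
As a rough sketch of how such a flag might gate the dispatch: the flag name `MLK_USE_FIPS202_X4_XOR_BYTES_NATIVE` and the function name `mlk_keccakf1600_xor_bytes_x4_native` come from this commit message, but the prototype shown for it, the `ex_`-prefixed wrapper names, and the layout of four consecutive plain states in the fallback are assumptions made for illustration only. The real declarations live in mlkem/src/fips202/native/api.h.

```c
#include <assert.h>
#include <stdint.h>

#define EX_LANES 25 /* 64-bit lanes per Keccak-f[1600] state */

/* A backend that ships native code would define this flag. It is left
 * undefined here so the sketch builds without any assembly. */
/* #define MLK_USE_FIPS202_X4_XOR_BYTES_NATIVE */

#if defined(MLK_USE_FIPS202_X4_XOR_BYTES_NATIVE)
/* ASSUMED prototype, for illustration only. */
void mlk_keccakf1600_xor_bytes_x4_native(uint64_t *state, const uint8_t *d0,
                                         const uint8_t *d1, const uint8_t *d2,
                                         const uint8_t *d3, unsigned offset,
                                         unsigned length);
#endif

/* Portable fallback for one state in the plain (non-interleaved) layout:
 * byte p of the state is byte (p % 8) of little-endian lane (p / 8). */
static void ex_xor_bytes_x1(uint64_t *state, const uint8_t *data,
                            unsigned offset, unsigned length)
{
  assert(offset <= 8 * EX_LANES && length <= 8 * EX_LANES - offset);
  for (unsigned i = 0; i < length; i++)
  {
    unsigned pos = offset + i;
    state[pos / 8] ^= (uint64_t)data[i] << (8 * (pos % 8));
  }
}

/* Hypothetical front end: use the native routine when the backend provides
 * one, otherwise fall back to four x1 calls on consecutive states. */
void ex_xor_bytes_x4(uint64_t *state, const uint8_t *d0, const uint8_t *d1,
                     const uint8_t *d2, const uint8_t *d3, unsigned offset,
                     unsigned length)
{
#if defined(MLK_USE_FIPS202_X4_XOR_BYTES_NATIVE)
  mlk_keccakf1600_xor_bytes_x4_native(state, d0, d1, d2, d3, offset, length);
#else
  ex_xor_bytes_x1(state + 0 * EX_LANES, d0, offset, length);
  ex_xor_bytes_x1(state + 1 * EX_LANES, d1, offset, length);
  ex_xor_bytes_x1(state + 2 * EX_LANES, d2, offset, length);
  ex_xor_bytes_x1(state + 3 * EX_LANES, d3, offset, length);
#endif
}
```

With this shape, callers only ever see the front-end function, so a backend remains free to store the state in whatever representation its native XOR and extract routines agree on.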

Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>

Add native MVE implementations of XORBytes and ExtractBytes that perform
bit-interleaving/deinterleaving on-the-fly, enabling use of a bit-interleaved
state representation without temporary conversions in the permutation.

This improves performance by:
- Reducing the number of bit-interleaving operations
- Accelerating bit-interleaving using MVE vector instructions

The backend uses bit-interleaved state representation where each 64-bit
lane is split into even and odd 32-bit halves for efficient 32-bit
MVE processing.
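
To make the even/odd split concrete, here is a scalar C sketch of bit-interleaving a single 64-bit lane. It is not the MVE assembly from this PR; the names (`lane_bi`, `interleave`, `rotl_bi`, ...) are hypothetical, and it assumes that "even" means bits 0, 2, 4, ... of the lane and "odd" means bits 1, 3, 5, .... It also illustrates the usual motivation for this representation: a 64-bit rotation turns into two 32-bit rotations, with the halves swapped when the rotation amount is odd.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
  uint32_t even; /* bits 0, 2, 4, ... of the 64-bit lane */
  uint32_t odd;  /* bits 1, 3, 5, ... of the 64-bit lane */
} lane_bi;

/* Pack the even-indexed bits of x into the low 32 bits. */
static uint32_t gather_even(uint64_t x)
{
  x &= 0x5555555555555555ULL;
  x = (x | (x >> 1)) & 0x3333333333333333ULL;
  x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0FULL;
  x = (x | (x >> 4)) & 0x00FF00FF00FF00FFULL;
  x = (x | (x >> 8)) & 0x0000FFFF0000FFFFULL;
  x = (x | (x >> 16)) & 0x00000000FFFFFFFFULL;
  return (uint32_t)x;
}

/* Inverse of gather_even: move bit k of v to bit 2k. */
static uint64_t spread_even(uint32_t v)
{
  uint64_t x = v;
  x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
  x = (x | (x << 8)) & 0x00FF00FF00FF00FFULL;
  x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FULL;
  x = (x | (x << 2)) & 0x3333333333333333ULL;
  x = (x | (x << 1)) & 0x5555555555555555ULL;
  return x;
}

lane_bi interleave(uint64_t lane)
{
  lane_bi v = {gather_even(lane), gather_even(lane >> 1)};
  return v;
}

uint64_t deinterleave(lane_bi v)
{
  return spread_even(v.even) | (spread_even(v.odd) << 1);
}

static uint32_t rotl32(uint32_t x, unsigned n)
{
  n &= 31;
  return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Plain 64-bit rotation, used only as a reference for comparison. */
uint64_t rotl64(uint64_t x, unsigned n)
{
  n &= 63;
  return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Rotate a bit-interleaved lane left by n (0 <= n < 64) using only
 * 32-bit rotations. */
lane_bi rotl_bi(lane_bi v, unsigned n)
{
  lane_bi r;
  assert(n < 64);
  if (n % 2 == 0)
  {
    r.even = rotl32(v.even, n / 2);
    r.odd = rotl32(v.odd, n / 2);
  }
  else
  {
    /* Odd amounts move even bits to odd positions and vice versa. */
    r.even = rotl32(v.odd, (n + 1) / 2);
    r.odd = rotl32(v.even, (n - 1) / 2);
  }
  return r;
}
```

In this sketch `interleave` corresponds to the work done on input (XORBytes) and `deinterleave` to the work done on output (ExtractBytes), so the permutation itself can stay in the interleaved domain throughout.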

Co-Authored-By: Brendan Moran <brendan.moran@arm.com>
Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
@mkannwischer mkannwischer force-pushed the mve-keccakx4-bitinterleaving branch from 07baf7d to 65aec77 on February 24, 2026 08:07
@mkannwischer
Contributor Author

> Apologies for the long delay, @bremoran @mkannwischer. I wanted to look at this properly and couldn't find time until now.
>
> I have reviewed the assembly and am very happy with it. I applaud you for the clarity and documentation, which is great and will simplify any further work in the future.
>
> As mentioned, I think some uniformity regarding the presence/absence of macro arguments would be an improvement, but I leave it to you whether you want to do this as a follow-up or not.

Thanks @hanno-becker - I agree that adding the macro arguments explicitly makes the code easier to read. I added them in https://github.com/pq-code-package/mlkem-native/compare/f5de3c946af9d3419825a52bae0bc3e579fede5f..07baf7d75d5046afb1331f0db3fe92a89ac628e6.

@mkannwischer mkannwischer merged commit 566920d into main Feb 24, 2026
370 checks passed
@mkannwischer mkannwischer deleted the mve-keccakx4-bitinterleaving branch February 24, 2026 08:48
Development

Successfully merging this pull request may close these issues.

Armv8.1-M: Add native bitinterleaving x4

3 participants