Armv8.1-M: Add native Keccak x4 XORBytes and ExtractBytes #1524
Armv8.1-M: Add native Keccak x4 XORBytes and ExtractBytes #1524mkannwischer merged 3 commits intomainfrom
Conversation
f568d85 to
54dafae
Compare
CBMC Results (ML-KEM-512)Full Results (153 proofs)
|
CBMC Results (ML-KEM-1024)Full Results (153 proofs)
|
CBMC Results (ML-KEM-768)Full Results (153 proofs)
|
3ca7030 to
b35d774
Compare
1b437aa to
f2a4195
Compare
f2a4195 to
10b8525
Compare
10b8525 to
e2c35f9
Compare
|
I have opened #1531 to remind us that we still have to superoptimize this code. |
5fce885 to
a446149
Compare
hanno-becker
left a comment
There was a problem hiding this comment.
Following other backend functions, could we have a separate CBMC proof for foo_c, and then the foo and foo_native call foo_c by contract? AFAIU, we currently still inline foo_c into the proofs of foo and foo_native.
I don't see a separate proof for |
hanno-becker
left a comment
There was a problem hiding this comment.
Thank you for all the work @mkannwischer @bremoran!
I must admit that I find neither explanation of the ASM satisfactory. Could you put yourself in the shoes of someone who hasn't worked on this for months, and explain the layout of the interleaved state in plain terms? What is a "stream"? What is a "bitplane"?
Ok, we can do this as a follow-up. This seems to be inconsistent between arithmetic and FIPS backend. |
f9a67ac to
f5de3c9
Compare
hanno-becker
left a comment
There was a problem hiding this comment.
Apologies for the long delay, @bremoran @mkannwischer. I wanted to look at this properly and couldn't find time until now.
I have reviewed the assembly and am very happy with it. I applaud you for the clarity and documentation, which is great and will simplify any further work in the future.
As mentioned, I think some uniformity regarding the presence/absence of macro arguments would be an improvement, but I leave it to you whether you want to do this as a follow-up or not.
f5de3c9 to
07baf7d
Compare
Replace test_keccakf1600x4_permute with test_keccakf1600x4_xor_permute_extract that tests the full x4 Keccak flow (xor_bytes, permute, extract_bytes) against the x1 C reference implementation. Testing through the public interface rather than comparing internal state directly allows verifying backends that use custom state representations (e.g., bit-interleaved) without requiring state conversion functions. The test uses random offsets and lengths for both xor_bytes and extract_bytes, and verifies each of the 4 lanes independently against the x1 reference. Also reduce functional test iterations for M55 baremetal platform. Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
Extend the FIPS202 native backend API to support implementing XORBytes and ExtractBytes steps in native code. This is essential for backends using custom state representations (e.g., bit-interleaved state), where these functions handle conversion to/from the internal format on-the-fly. In such cases, they also account for a significant amount of processing time. New flags: - MLK_USE_FIPS202_X4_XOR_BYTES_NATIVE: Backend provides native XOR bytes - MLK_USE_FIPS202_X4_EXTRACT_BYTES_NATIVE: Backend provides native extract bytes When set, backends provide native implementations for: - mlk_keccakf1600_xor_bytes_x4_native: XOR input data into state - mlk_keccakf1600_extract_bytes_x4_native: Extract output from state Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
Add native MVE implementations of XORBytes and ExtractBytes that perform bit-interleaving/deinterleaving on-the-fly, enabling use of a bit-interleaved state representation without temporary conversions in the permutation. This improves performance by: - Reducing the number of bit-interleaving operations - Accelerating bit-interleaving using MVE vector instructions The backend uses bit-interleaved state representation where each 64-bit lane is split into even and odd 32-bit halves for efficient 32-bit MVE processing. Co-Authored-By: Brendan Moran <brendan.moran@arm.com> Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
07baf7d to
65aec77
Compare
Thanks @hanno-becker - I agree that adding the macro arguments explicilty makes it easier to read. I added it in https://github.com/pq-code-package/mlkem-native/compare/f5de3c946af9d3419825a52bae0bc3e579fede5f..07baf7d75d5046afb1331f0db3fe92a89ac628e6. |
See individual commits