Conversation
```rust
let mut buf = Array::<u8, Rate>::default();
let mut chunks = buf.chunks_exact_mut(size_of::<u64>());
for (src, dst) in state.iter().zip(&mut chunks) {
    dst.copy_from_slice(&src.to_le_bytes());
```
The implementation of xor_in and read_into for big-endian targets is quite sub-optimal, but I prioritized simplicity of implementation. Considering that big-endian targets are practically extinct, I think it's a fine trade-off. We could easily improve it in the future if someone is interested in it.
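The endianness-independent copy shown in the diff above can be sketched as a standalone function, roughly like this (a minimal sketch; `RATE` and `state_to_le_bytes` are illustrative names, not the crate's API):

```rust
// Hedged sketch of the little-endian serialization shown in the diff
// above; `RATE` and `state_to_le_bytes` are illustrative names only.
const RATE: usize = 16; // hypothetical rate in bytes (two u64 words)

fn state_to_le_bytes(state: &[u64]) -> [u8; RATE] {
    let mut buf = [0u8; RATE];
    // Walk the output buffer in u64-sized chunks, one per state word.
    let mut chunks = buf.chunks_exact_mut(core::mem::size_of::<u64>());
    for (src, dst) in state.iter().zip(&mut chunks) {
        // `to_le_bytes` gives identical output on LE and BE targets;
        // on little-endian it compiles down to a plain copy.
        dst.copy_from_slice(&src.to_le_bytes());
    }
    buf
}

fn main() {
    let state = [0x0807_0605_0403_0201u64, 0x100f_0e0d_0c0b_0a09];
    let bytes = state_to_le_bytes(&state);
    assert_eq!(bytes[0], 0x01); // least significant byte first
    assert_eq!(bytes[8], 0x09); // second word starts at offset 8
}
```

On big-endian targets this goes through an explicit byte swap per word, which is the sub-optimality being traded for simplicity.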
```rust
/// Squeeze data into `dst`.
#[allow(clippy::missing_panics_doc, reason = "the function is panic-free")]
pub fn squeeze_u64_le<const N: usize>(
```
I tested the unrolled loop code here on my own NEKO cipher work, which uses Keccak, and my own cursor impl is about twice as fast while being a lot simpler. The general construction of my "loop" is as follows:

```rust
#[inline(always)]
fn operate_mut(&'s mut self, operation: fn((&mut u8, &mut u8))) {
    const {
        assert!(S::RATE < keccak::PLEN);
        assert!(S::RATE < u8::MAX as usize);
    };
    loop {
        let block = self.neko.block();
        let state_bytes = self.neko.state[block..S::RATE].as_mut_bytes();
        let pos = state_bytes
            .iter_mut()
            .zip(self.data.iter_mut())
            .map(operation)
            .count();
        self.data = &mut self.data[pos..];
        if !self.data.is_empty() {
            // SPONGE
            self.neko.permutation_f(ops::CONT);
        } else {
            self.neko.position += pos;
            break;
        }
    }
}
```

Operations are then functions defined as `fn((&mut u8, &mut u8))`, so for squeeze, it's:
```rust
#[inline]
pub(crate) fn squeeze(neko: &'s mut NekoState<S>, data: &'s mut [u8]) {
    NekoOperateMut::new(neko, data).operate_mut(|(state, byte)| {
        *byte = *state;
        *state = 0;
    });
}
```

The simpler construction seems to optimise better; while I work with u64 blocks with padding, it can be adjusted to work with a u8 state.
There are probably other considerations here (I'm not even trying to support BE platforms), but a simpler loop seems like a better start for optimisation work.
But as evidence (with `target-cpu=x86-64-v3`), this PR's code gave me this result on my benchmarks:
```
NEKO-128 TRANSPORT ROUNDTRIP
    time:   [8.3069 µs 8.3176 µs 8.3291 µs]
    change: [+72.904% +73.293% +73.702%] (p = 0.00 < 0.05)
    Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
```
While my loop code gave me:
```
NEKO-128 TRANSPORT ROUNDTRIP
    time:   [4.8059 µs 4.8106 µs 4.8152 µs]
    change: [−42.339% −42.211% −42.087%] (p = 0.00 < 0.05)
    Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
```
Are you measuring performance by squeezing one byte at a time? I don't see how you can get such a performance difference otherwise.
The bulk of the work is done by the core loop at lines 182-190, and it's as straightforward as possible (with a higher MSRV it could be simplified a bit further with `as_chunks`) and compiles into the expected optimal code. Everything else is handling of leftover bytes, and I intentionally made it possible for the compiler to optimize out those parts, since in practice we often squeeze a fixed number of bytes immediately after finalization.
One consideration while writing this code was to keep the function panic-free (see the godbolt link for the absorption method; I need to make a similar link for squeeze as well).
This benchmark is for an absorb_set & exchange flow, but the loop code was adapted to use the steps in the PR (and I ensured it passed my KATs). What I figure is that the zip of the state bytes and input bytes is easier to optimise if done in a single step: get the number of bytes written/processed (via `.count()`), create the tail after processing, and check once whether the tail is empty or we need to loop around. It does less branching and, through the use of iterators, probably allows rustc to optimise better here, though with some help from the assertions.
If you want to check what I'm benching here, feel free to look at https://tangled.org/sachy.dev/wharrgarbl/blob/main/benches/garbl_bench.rs
> What I figure is that the zip of the state bytes + input bytes is easier to optimise if done with a single step
I am not sure I follow. The number of input/output bytes is usually greater than the rate over which we operate. There is no "looping around" in the current implementation: after we handle the leftover bytes, we just split the buffer into chunks with length equal to the rate, plus a tail.
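That "full blocks + tail" shape can be sketched roughly like this (a simplified standalone sketch with a byte-array state and a stand-in permutation; `RATE`, `absorb`, and `dummy_permute` are illustrative names, and the real code operates on `u64` words):

```rust
// Hedged sketch of the "full blocks + tail" structure: the hot loop
// handles whole rate-sized blocks; leftover bytes are handled once at
// the end, outside the hot loop.
const RATE: usize = 8; // hypothetical rate in bytes

fn absorb(state: &mut [u8; RATE], data: &[u8], permute: fn(&mut [u8; RATE])) -> usize {
    // Core loop: XOR-absorb every full rate-sized block, then permute.
    let mut chunks = data.chunks_exact(RATE);
    for block in &mut chunks {
        for (s, d) in state.iter_mut().zip(block) {
            *s ^= *d;
        }
        permute(state);
    }
    // Tail: leftover bytes shorter than the rate, handled once so the
    // compiler can keep the core loop tight.
    let tail = chunks.remainder();
    for (s, d) in state.iter_mut().zip(tail) {
        *s ^= *d;
    }
    tail.len() // new cursor position within the current block
}

fn main() {
    fn dummy_permute(state: &mut [u8; RATE]) {
        state.rotate_left(1); // stand-in for the real permutation
    }
    let mut state = [0u8; RATE];
    let pos = absorb(&mut state, &[1u8; 20], dummy_permute);
    assert_eq!(pos, 20 % RATE); // 4 leftover bytes after two full blocks
}
```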
And I am not sure how you did the performance comparison, since the current version of sponge-cursor does not provide a method for "squeeze by XORing". Are you reading a keystream block first and then XORing it with the data?
Could you provide an alternative implementation which follows the existing method API, so we could do an apples-to-apples comparison? Take a look at RustCrypto/hashes#849 for examples of how we currently use the crate.
Here is a godbolt link for squeeze_u64_le: https://rust.godbolt.org/z/5c58vPb97
As you can see, the core loop compiles into a very efficient assembly, while handling of leftover bytes is just a bunch of memcpys.
What I am doing with my code is a duplex construction, inspired by STROBE, so that is a keystream example. For a more apples-to-apples comparison, I did a quick translation of what the absorb method would look like:
```rust
#[inline(always)]
pub fn absorb_input<const N: usize>(&mut self, state: &mut [u64; N], mut data: &[u8], sponge: fn(&mut [u64; N])) {
    const {
        assert!(Rate::USIZE <= size_of::<[u64; N]>());
        assert!(Rate::USIZE < u8::MAX as usize);
        assert!(Rate::USIZE % size_of::<u64>() == 0);
    };
    loop {
        let pos = self.pos();
        let state_bytes = &mut state.as_mut_bytes()[pos..Rate::USIZE];
        let written = state_bytes
            .iter_mut()
            .zip(data.iter())
            .map(|(state, input)| *state ^= *input)
            .count();
        data = &data[written..];
        if !data.is_empty() {
            // SPONGE
            sponge(state);
            self.pos = 0;
        } else {
            self.pos += written as u8;
            return;
        }
    }
}
```

I kept `as_mut_bytes` as a stand-in for the cast, though I'm kinda iffy on the one in the code here, since `N` will not match `Rate`. But all in all, this is what I meant with my previous explanation: we incur fewer branches with this loop, mapping the block of state directly onto the available block of data via `zip`, then just operating on each byte until we have no more data.
Squeeze could be done the same way, with the permutation/sponge running at the beginning of the loop and the check at the end being just `if data.is_empty()`, I think.
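That squeeze shape, with the permutation at the top of the loop, might look roughly like this (a simplified standalone sketch with a byte state and a stand-in permutation; `RATE`, `squeeze`, and `dummy_permute` are illustrative names, not the crate's API):

```rust
// Hedged sketch of the squeeze loop described above: the permutation
// runs at the top of each iteration, then state bytes are copied out
// until the output buffer is full.
const RATE: usize = 8; // hypothetical rate in bytes

fn squeeze(state: &mut [u8; RATE], out: &mut [u8], permute: fn(&mut [u8; RATE])) {
    let mut off = 0;
    while off < out.len() {
        // SPONGE: permute before emitting each block of keystream bytes.
        permute(state);
        // Copy as many state bytes as the remaining output can take.
        let read = state
            .iter()
            .zip(out[off..].iter_mut())
            .map(|(s, o)| *o = *s)
            .count();
        off += read;
    }
}

fn main() {
    // Stand-in permutation: increment every byte so blocks differ.
    fn dummy_permute(state: &mut [u8; RATE]) {
        for b in state.iter_mut() {
            *b = b.wrapping_add(1);
        }
    }
    let mut state = [0u8; RATE];
    let mut out = [0u8; 12];
    squeeze(&mut state, &mut out, dummy_permute);
    assert_eq!(out[0], 1); // first block: after one permutation
    assert_eq!(out[8], 2); // second block: after two permutations
}
```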
I added the squeeze_xor_u64_le method. You can see the resulting assembly here. Try to repeat the benchmark with it. Also, try to run it for bigger sizes (e.g. 1 MiB).
As you can see here, your implementation indeed results in a smaller binary, but at the cost of not having a specialized core loop for processing full blocks. So it could be more performant if you process a lot of small data chunks with sizes smaller than the rate, but I think it should be less performant on large data sizes, which I believe are more frequent in practice.
Your code is also not quite correct, since it breaks the invariant that `pos` is always smaller than `RATE`, and IIUC it does not correctly handle data bigger than the rate, since it calls `sponge` based only on the `data.is_empty()` check.
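The invariant issue shows up when the input is an exact multiple of the rate. A minimal standalone adaptation of the single-check loop (illustrative names, simplified to a byte state and with the permutation elided) makes it visible:

```rust
// Hedged illustration of the invariant issue: with a single
// `data.is_empty()` check, input that is an exact multiple of the rate
// returns with `pos == RATE`, violating `pos < RATE` and skipping the
// permutation of the final full block.
const RATE: usize = 8; // hypothetical rate in bytes

fn absorb_single_check(state: &mut [u8; RATE], mut data: &[u8], mut pos: usize) -> usize {
    loop {
        // XOR input into the remaining part of the current block.
        let written = state[pos..]
            .iter_mut()
            .zip(data.iter())
            .map(|(s, d)| *s ^= *d)
            .count();
        data = &data[written..];
        if !data.is_empty() {
            // The permutation would run here before the cursor resets.
            pos = 0;
        } else {
            pos += written;
            return pos;
        }
    }
}

fn main() {
    let mut state = [0u8; RATE];
    // Exactly one full block: the loop exits with pos == RATE.
    let pos = absorb_single_check(&mut state, &[1u8; RATE], 0);
    assert_eq!(pos, RATE); // the `pos < RATE` invariant is broken
}
```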
I was benchmarking against byte strings 3348 bytes long, so still many times greater than the rate, and I still got the benched results. Plus, it improved with `target-cpu=x86-64-v3`, though I did see a regression with v4.
I am targeting smaller code by default (since I am working with MCUs and optimising for more constrained systems), but this is what I noted when testing that particular code path adapted for my impl. What I wonder here is whether the specialisation kicks in at all if we up the CPU feature level, or if the smaller loop just ends up working better until much, much larger data sizes than 4 kB.
I fixed your implementation to this variant:

```rust
while !data.is_empty() {
    let pos = self.pos();
    let state_bytes: &mut [u8; RATE] = unsafe { &mut *(state.as_mut_ptr().cast()) };
    let state_bytes = &mut state_bytes[pos..];
    let written = state_bytes
        .iter_mut()
        .zip(data.iter())
        .map(|(state, input)| *state ^= *input)
        .count();
    data = &data[written..];
    self.pos += written as u8;
    if self.pos == RATE as u8 {
        sponge(state);
        self.pos = 0;
    }
}
```

I ran our k12 (absorption-only) benchmarks on my laptop (AMD Ryzen 7 5700U) and got the following results.
Baseline:

```
# Base x86-64
test turboshake128_10    ... bench: 11.85 ns/iter (+/- 0.08) = 909 MB/s
test turboshake128_100   ... bench: 101.20 ns/iter (+/- 0.96) = 990 MB/s
test turboshake128_1000  ... bench: 983.87 ns/iter (+/- 4.83) = 1017 MB/s
test turboshake128_10000 ... bench: 9,811.05 ns/iter (+/- 61.22) = 1019 MB/s
# x86-64-v3
test turboshake128_10    ... bench: 10.58 ns/iter (+/- 0.18) = 1000 MB/s
test turboshake128_100   ... bench: 86.91 ns/iter (+/- 0.86) = 1162 MB/s
test turboshake128_1000  ... bench: 841.96 ns/iter (+/- 4.29) = 1189 MB/s
test turboshake128_10000 ... bench: 8,352.16 ns/iter (+/- 70.54) = 1197 MB/s
```
Your variant:

```
# Base x86-64
test turboshake128_10    ... bench: 13.46 ns/iter (+/- 0.07) = 769 MB/s
test turboshake128_100   ... bench: 101.00 ns/iter (+/- 0.68) = 1000 MB/s
test turboshake128_1000  ... bench: 1,001.56 ns/iter (+/- 7.39) = 999 MB/s
test turboshake128_10000 ... bench: 9,925.97 ns/iter (+/- 92.69) = 1007 MB/s
# x86-64-v3
test turboshake128_10    ... bench: 12.47 ns/iter (+/- 0.14) = 833 MB/s
test turboshake128_100   ... bench: 89.52 ns/iter (+/- 0.53) = 1123 MB/s
test turboshake128_1000  ... bench: 857.95 ns/iter (+/- 6.98) = 1166 MB/s
test turboshake128_10000 ... bench: 8,441.75 ns/iter (+/- 24.66) = 1184 MB/s
```
As you can see, we have a small win for 100-byte chunks on the base x86-64 target, but otherwise your variant is slightly slower everywhere else.
We could introduce a separate "compact" compilation flag for constrained targets, like we do in the sponge crates, but that's a separate feature.
Yeah, a compact code option would be scope creep for this PR and should be considered as something separate.
sponge-cursor is a helper crate which allows removing unnecessary buffering from sponge-based constructions.

TODO: add tests, examples, and CI config