
sponge-cursor: initial implementation #1477

Open

newpavlov wants to merge 9 commits into master from add-sponge-cursor

Conversation

@newpavlov (Member) commented May 2, 2026

sponge-cursor is a helper crate which makes it possible to remove unnecessary buffering from sponge-based constructions.
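The idea, roughly: instead of accumulating input in a separate block buffer and XORing whole blocks into the state, a cursor tracks the current byte offset inside the rate portion of the state and applies bytes to it directly. A minimal conceptual sketch (not the crate's actual API; the names, the RATE value, and the byte-wise loop are illustrative only):

// Conceptual sketch only: a cursor keeps the current byte offset inside
// the rate part of the state, XORs incoming bytes into it directly, and
// permutes once the rate is full. No intermediate block buffer is needed.
const RATE: usize = 136; // placeholder rate in bytes

struct Cursor {
    pos: usize, // invariant: pos < RATE
}

impl Cursor {
    fn absorb(
        &mut self,
        state: &mut [u64; 25],
        data: &[u8],
        permute: fn(&mut [u64; 25]),
    ) {
        for &byte in data {
            // XOR the byte into the matching lane (LE byte order within a lane).
            state[self.pos / 8] ^= u64::from(byte) << (8 * (self.pos % 8));
            self.pos += 1;
            if self.pos == RATE {
                permute(state);
                self.pos = 0;
            }
        }
    }
}

A real implementation would process full 8-byte lanes and whole rate-sized blocks instead of single bytes; that is what the discussion below is about.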

TODO: add tests, examples, and CI config

Comment thread sponge-cursor/src/lib.rs Outdated
let mut buf = Array::<u8, Rate>::default();
let mut chunks = buf.chunks_exact_mut(size_of::<u64>());
for (src, dst) in state.iter().zip(&mut chunks) {
    dst.copy_from_slice(&src.to_le_bytes());
}
@newpavlov (Member, Author)

The implementation of xor_in and read_into for big-endian targets is quite sub-optimal, but I prioritized simplicity of implementation. Considering that big-endian targets are practically extinct, I think it's a fine trade-off. We could easily improve it in the future if someone is interested in it.
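For reference, an endianness-agnostic path could look roughly like the sketch below (placeholder code, not the crate's implementation; RATE and the function name are assumptions). The serialize/XOR/deserialize round-trip is essentially free on little-endian targets but implies byte swapping on big-endian ones, which is the sub-optimal part:

// Sketch of a generic xor_in: serialize the rate part of the state to
// LE bytes, XOR the input into the byte view, then read the lanes back.
const RATE: usize = 136; // placeholder rate in bytes

fn xor_in_generic(state: &mut [u64; 25], pos: usize, data: &[u8]) {
    assert!(pos + data.len() <= RATE);
    let mut buf = [0u8; RATE];
    for (src, dst) in state.iter().zip(buf.chunks_exact_mut(8)) {
        dst.copy_from_slice(&src.to_le_bytes());
    }
    for (b, d) in buf[pos..].iter_mut().zip(data) {
        *b ^= *d;
    }
    for (dst, src) in state.iter_mut().zip(buf.chunks_exact(8)) {
        *dst = u64::from_le_bytes(src.try_into().unwrap());
    }
}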

Comment thread sponge-cursor/src/lib.rs Outdated

/// Squeeze data into `dst`.
#[allow(clippy::missing_panics_doc, reason = "the function is panic-free")]
pub fn squeeze_u64_le<const N: usize>(
@Bluefinger commented May 4, 2026

I tested the unrolled loop code here on my own NEKO cipher work, which uses Keccak, and my own cursor impl is about twice as fast while being a lot simpler. But the general construction of my "loop" is as follows:

#[inline(always)]
fn operate_mut(&'s mut self, operation: fn((&mut u8, &mut u8))) {
    const {
        assert!(S::RATE < keccak::PLEN);
        assert!(S::RATE < u8::MAX as usize);
    };

    loop {
        let block = self.neko.block();

        let state_bytes = self.neko.state[block..S::RATE].as_mut_bytes();

        let pos = state_bytes
            .iter_mut()
            .zip(self.data.iter_mut())
            .map(operation)
            .count();

        self.data = &mut self.data[pos..];

        if !self.data.is_empty() {
            // SPONGE
            self.neko.permutation_f(ops::CONT);
        } else {
            self.neko.position += pos;
            break;
        }
    }
}

Then operations are functions defined as fn((&mut u8, &mut u8)), so for squeeze, it's:

    #[inline]
    pub(crate) fn squeeze(neko: &'s mut NekoState<S>, data: &'s mut [u8]) {
        NekoOperateMut::new(neko, data).operate_mut(|(state, byte)| {
            *byte = *state;
            *state = 0;
        });
    }

The simpler construction seems to optimise better; while I work with u64 blocks with padding, it can be adjusted to work with a u8 state.

There are probably other considerations here (I'm not even trying to support BE platforms), but a simpler loop seems like a better start for optimisation work.

But as evidence (target-cpu=x86-64-v3), this PR's code gave me this result on my benchmarks:

NEKO-128 TRANSPORT ROUNDTRIP
                        time:   [8.3069 µs 8.3176 µs 8.3291 µs]
                        change: [+72.904% +73.293% +73.702%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

While my loop code gave me:

NEKO-128 TRANSPORT ROUNDTRIP
                        time:   [4.8059 µs 4.8106 µs 4.8152 µs]
                        change: [−42.339% −42.211% −42.087%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

@newpavlov (Member, Author) May 4, 2026

Are you measuring performance by squeezing one byte at a time? I don't see how you could get such a performance difference otherwise.

The bulk of the work is done by the core loop at lines 182-190; it's as straightforward as possible (with a higher MSRV it could be simplified a bit further using as_chunks) and compiles into the expected optimal code. Everything else is handling of leftover bytes, and I intentionally made it possible for the compiler to optimize those parts out, since in practice we often squeeze a fixed number of bytes immediately after finalization.
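For readers without the diff open, the shape of that core loop is roughly the following (a simplified sketch, not the exact code from the PR; RATE, permute, and the 25-lane state are placeholders):

// Sketch of a full-block squeeze loop: write one rate-sized block per
// iteration, 8 bytes per lane, permuting between blocks. The tail
// (dst.len() % RATE bytes) is left to the separate leftover handling.
const RATE: usize = 136; // placeholder rate in bytes

fn squeeze_full_blocks(
    state: &mut [u64; 25],
    dst: &mut [u8],
    permute: fn(&mut [u64; 25]),
) -> usize {
    let mut blocks = dst.chunks_exact_mut(RATE);
    for block in &mut blocks {
        for (src, out) in state.iter().zip(block.chunks_exact_mut(8)) {
            out.copy_from_slice(&src.to_le_bytes());
        }
        permute(state);
    }
    // number of bytes written as full blocks
    dst.len() - blocks.into_remainder().len()
}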

One consideration while writing this code was to keep the function panic-free (see the godbolt link for the absorption method; I need to make a similar link for squeeze as well).

@Bluefinger May 4, 2026

This benchmark is for an absorb_set & exchange flow, but the loop code was adapted to use the steps in the PR (and I ensured it passed my KATs). What I figure is that the zip of the state bytes + input bytes is easier to optimise if done as a single step; we then get the number of bytes written/processed (via .count()) to create the tail after the processing, allowing us to check once whether the tail is empty or we need to loop around. It does less branching, and through the use of iterators it probably allows rustc to optimise better here, though with some help from the assertions.

If you want to check what I'm benching here, feel free to look at https://tangled.org/sachy.dev/wharrgarbl/blob/main/benches/garbl_bench.rs

@newpavlov (Member, Author) May 4, 2026

"What I figure is that the zip of the state bytes + input bytes is easier to optimise if done as a single step"

I am not sure I follow. The number of input/output bytes is usually greater than the rate over which we operate. There is no "looping around" in the current implementation; after we handle the leftover bytes, we just split the buffer into chunks with length equal to the rate plus a tail.

And I am not sure how you did the performance comparison, since the current version of sponge-cursor does not provide a method for "squeeze by XORing". Are you reading a keystream block first and then XORing it with the data?

Could you provide an alternative implementation which follows the existing method API, so we could do an apples-to-apples comparison? Take a look at RustCrypto/hashes#849 for examples of how we currently use the crate.

@newpavlov (Member, Author)

Here is a godbolt link for squeeze_u64_le: https://rust.godbolt.org/z/5c58vPb97

As you can see, the core loop compiles into very efficient assembly, while the handling of leftover bytes is just a bunch of memcpys.

@Bluefinger May 4, 2026

What I am doing with my code is a duplex construction, inspired by STROBE, so that is a keystream example. For a more apples-to-apples comparison, I did a quick translation of what the absorb method would look like:

#[inline(always)]
pub fn absorb_input<const N: usize>(&mut self, state: &mut [u64; N], mut data: &[u8], sponge: fn(&mut [u64; N])) {
    const {
        assert!(Rate::USIZE <= size_of::<[u64; N]>());
        assert!(Rate::USIZE < u8::MAX as usize);
        assert!(Rate::USIZE % size_of::<u64>() == 0);
    };

    loop {
        let pos = self.pos();

        let state_bytes = &mut state.as_mut_bytes()[pos..Rate::USIZE];

        let written = state_bytes
            .iter_mut()
            .zip(data.iter())
            .map(|(state, input)| *state ^= *input)
            .count();

        data = &data[written..];

        if !data.is_empty() {
            // SPONGE
            sponge(state);
            self.pos = 0;
        } else {
            self.pos += written as u8;
            return;
        }
    }
}

I kept as_mut_bytes as a stand-in for the cast, though I'm kinda iffy on the one in the code here since N will not match Rate. But all in all, this is what I meant with my previous explanation: we incur fewer branches with this loop, map the block of state directly to the available block of data via zip, and then just operate on each byte until we have no more data.

Squeeze could be done the same way, with the permutation/sponge running at the beginning of the loop and the check at the end being just if data.is_empty(), I think.
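One way to read that suggestion (an interpretation only, untested; RATE, pos, and sponge are placeholders, and the emptiness check sits at the top of the loop rather than the bottom):

// Sketch of the squeeze counterpart: permute once the rate is exhausted,
// copy bytes out of the state, and stop when the output buffer is full.
fn squeeze(
    state: &mut [u64; 25],
    pos: &mut usize,
    out: &mut [u8],
    sponge: fn(&mut [u64; 25]),
) {
    const RATE: usize = 136; // placeholder rate in bytes
    let mut written = 0;

    while written < out.len() {
        if *pos == RATE {
            sponge(state);
            *pos = 0;
        }
        let n = core::cmp::min(RATE - *pos, out.len() - written);
        for i in 0..n {
            let idx = *pos + i;
            // extract byte `idx` of the rate (LE byte order within a lane)
            out[written + i] = (state[idx / 8] >> (8 * (idx % 8))) as u8;
        }
        *pos += n;
        written += n;
    }
}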

@newpavlov (Member, Author) May 4, 2026

I added a squeeze_xor_u64_le method. You can see the resulting assembly here. Try to repeat the benchmark with it. Also, try to run it for bigger sizes (e.g. 1 MiB).

As you can see here, your implementation indeed results in a smaller binary, but at the cost of not having a specialized core loop for processing full blocks. So it could be more performant if you process a lot of small data chunks with sizes smaller than the rate, but I think it should be less performant on large data sizes, which I believe are more frequent in practice.

Your code is also not quite correct: it breaks the invariant that pos is always smaller than RATE, and IIUC it does not correctly handle data bigger than the rate, since it gates the sponge call only on data.is_empty().

@Bluefinger May 4, 2026

I was benchmarking against byte strings that were 3348 bytes long, so still many times greater than the rate, and I still got the benched results. Plus, it improved with target-cpu=x86-64-v3, though I did see a regression with v4.

I am targeting smaller code by default (since I am working with MCUs and optimising for more constrained systems), but this is what I noted when testing that particular code path adapted for my impl. What I wonder here is whether the specialisation kicks in at all if we up the CPU feature level, or whether the smaller loop just ends up working better until much, much larger data sizes than 4 kB.

@newpavlov (Member, Author) May 4, 2026

I fixed your implementation to this variant:

while !data.is_empty() {
    let pos = self.pos();

    let state_bytes: &mut [u8; RATE] = unsafe { &mut *(state.as_mut_ptr().cast()) };
    let state_bytes = &mut state_bytes[pos..];

    let written = state_bytes
        .iter_mut()
        .zip(data.iter())
        .map(|(state, input)| *state ^= *input)
        .count();

    data = &data[written..];
    self.pos += written as u8;

    if self.pos == RATE as u8 {
        sponge(state);
        self.pos = 0;
    }
}

I ran our k12 (absorption-only) benchmarks on my laptop (AMD Ryzen 7 5700U) and got the following results.

Baseline:

# Base x86-64
test turboshake128_10    ... bench:          11.85 ns/iter (+/- 0.08) = 909 MB/s
test turboshake128_100   ... bench:         101.20 ns/iter (+/- 0.96) = 990 MB/s
test turboshake128_1000  ... bench:         983.87 ns/iter (+/- 4.83) = 1017 MB/s
test turboshake128_10000 ... bench:       9,811.05 ns/iter (+/- 61.22) = 1019 MB/s

# x86-64-v3
test turboshake128_10    ... bench:          10.58 ns/iter (+/- 0.18) = 1000 MB/s
test turboshake128_100   ... bench:          86.91 ns/iter (+/- 0.86) = 1162 MB/s
test turboshake128_1000  ... bench:         841.96 ns/iter (+/- 4.29) = 1189 MB/s
test turboshake128_10000 ... bench:       8,352.16 ns/iter (+/- 70.54) = 1197 MB/s

Your variant:

# Base x86-64
test turboshake128_10    ... bench:          13.46 ns/iter (+/- 0.07) = 769 MB/s
test turboshake128_100   ... bench:         101.00 ns/iter (+/- 0.68) = 1000 MB/s
test turboshake128_1000  ... bench:       1,001.56 ns/iter (+/- 7.39) = 999 MB/s
test turboshake128_10000 ... bench:       9,925.97 ns/iter (+/- 92.69) = 1007 MB/s

# x86-64-v3
test turboshake128_10    ... bench:          12.47 ns/iter (+/- 0.14) = 833 MB/s
test turboshake128_100   ... bench:          89.52 ns/iter (+/- 0.53) = 1123 MB/s
test turboshake128_1000  ... bench:         857.95 ns/iter (+/- 6.98) = 1166 MB/s
test turboshake128_10000 ... bench:       8,441.75 ns/iter (+/- 24.66) = 1184 MB/s

As you can see, we have a small win for 100-byte chunks on the base x86-64 target, but your variant is slightly slower everywhere else.

We could introduce a separate "compact" compilation flag for constrained targets like we do in the sponge crates, but it's a separate feature.

@Bluefinger

Yeah, a compact code option would be scope creep for this PR and should be considered separately.
