Conversation
```rust
let mut buf = Array::<u8, Rate>::default();
let mut chunks = buf.chunks_exact_mut(size_of::<u64>());
for (src, dst) in state.iter().zip(&mut chunks) {
    dst.copy_from_slice(&src.to_le_bytes());
```
The implementation of xor_in and read_into for big-endian targets is quite sub-optimal, but I prioritized simplicity of implementation. Considering that big-endian targets are practically extinct, I think it's a fine trade-off. We could easily improve it in the future if someone is interested in it.
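The endianness-independent copy shown in the diff above can be sketched as a standalone function, roughly like this (a minimal sketch; `RATE` and `state_to_le_bytes` are illustrative names, not the crate's API):

```rust
// Hedged sketch of the little-endian serialization shown in the diff
// above; `RATE` and `state_to_le_bytes` are illustrative names only.
const RATE: usize = 16; // hypothetical rate in bytes (two u64 words)

fn state_to_le_bytes(state: &[u64]) -> [u8; RATE] {
    let mut buf = [0u8; RATE];
    // Walk the output buffer in u64-sized chunks, one per state word.
    let mut chunks = buf.chunks_exact_mut(core::mem::size_of::<u64>());
    for (src, dst) in state.iter().zip(&mut chunks) {
        // `to_le_bytes` gives identical output on LE and BE targets;
        // on little-endian it compiles down to a plain copy.
        dst.copy_from_slice(&src.to_le_bytes());
    }
    buf
}

fn main() {
    let state = [0x0807_0605_0403_0201u64, 0x100f_0e0d_0c0b_0a09];
    let bytes = state_to_le_bytes(&state);
    assert_eq!(bytes[0], 0x01); // least significant byte first
    assert_eq!(bytes[8], 0x09); // second word starts at offset 8
}
```

On big-endian targets this goes through an explicit byte swap per word, which is the sub-optimality being traded for simplicity.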
```rust
/// Squeeze data into `dst`.
#[allow(clippy::missing_panics_doc, reason = "the function is panic-free")]
pub fn squeeze_u64_le<const N: usize>(
```
I tested the unrolled loop code here on my own NEKO cipher work, which uses Keccak, and my own cursor impl is about twice as fast while being a lot simpler. The general construction of my "loop" is as follows:

```rust
#[inline(always)]
fn operate_mut(&'s mut self, operation: fn((&mut u8, &mut u8))) {
    const {
        assert!(S::RATE < keccak::PLEN);
        assert!(S::RATE < u8::MAX as usize);
    };
    loop {
        let block = self.neko.block();
        let state_bytes = self.neko.state[block..S::RATE].as_mut_bytes();
        let pos = state_bytes
            .iter_mut()
            .zip(self.data.iter_mut())
            .map(operation)
            .count();
        self.data = &mut self.data[pos..];
        if !self.data.is_empty() {
            // SPONGE
            self.neko.permutation_f(ops::CONT);
        } else {
            self.neko.position += pos;
            break;
        }
    }
}
```

Operations are then functions defined as `fn((&mut u8, &mut u8))`, so for squeeze, it's:
```rust
#[inline]
pub(crate) fn squeeze(neko: &'s mut NekoState<S>, data: &'s mut [u8]) {
    NekoOperateMut::new(neko, data).operate_mut(|(state, byte)| {
        *byte = *state;
        *state = 0;
    });
}
```

The simpler construction seems to optimise better; while I work with u64 blocks with padding, it can be adjusted to work with a u8 state.
There are probably other considerations here (I'm not even trying to support BE platforms), but a simpler loop seems like a better start for optimisation work.
But as evidence (with `target-cpu=x86-64-v3`), this PR's code gave me this result on my benchmarks:
```
NEKO-128 TRANSPORT ROUNDTRIP
    time:   [8.3069 µs 8.3176 µs 8.3291 µs]
    change: [+72.904% +73.293% +73.702%] (p = 0.00 < 0.05)
    Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
```
While my loop code gave me:
```
NEKO-128 TRANSPORT ROUNDTRIP
    time:   [4.8059 µs 4.8106 µs 4.8152 µs]
    change: [−42.339% −42.211% −42.087%] (p = 0.00 < 0.05)
    Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
```
Are you measuring performance by squeezing one byte at a time? I don't see how you can get such a performance difference otherwise.
The bulk of the work is done by the core loop at lines 182-190, and it's as straightforward as possible (with a higher MSRV it could be simplified a bit further with `as_chunks`) and compiles into the expected optimal code. Everything else is handling of leftover bytes, and I intentionally made it possible for the compiler to optimize out those parts, since in practice we often squeeze a fixed number of bytes immediately after finalization.
One consideration while writing this code was to keep the function panic-free (see the godbolt link for the absorption method; I need to make a similar link for squeeze as well).
This benchmark is for an absorb_set & exchange flow, but the loop code was adapted to use the steps in the PR (and I ensured it passed my KATs). What I figure is that the zip of the state bytes and input bytes is easier to optimise if done in a single step: get the number of bytes written/processed (via `.count()`), create the tail after processing, and check once whether the tail is empty or we need to loop around. It does less branching and, through the use of iterators, probably allows rustc to optimise better here, though with some help from the assertions.
If you want to check what I'm benching here, feel free to look at https://tangled.org/sachy.dev/wharrgarbl/blob/main/benches/garbl_bench.rs
> What I figure is that the zip of the state bytes + input bytes is easier to optimise if done with a single step
I am not sure I follow. The number of input/output bytes is usually greater than the rate over which we operate. There is no "looping around" in the current implementation: after we handle the leftover bytes, we just split the buffer into chunks with length equal to the rate, plus a tail.
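That "full blocks + tail" shape can be sketched roughly like this (a simplified standalone sketch with a byte-array state and a stand-in permutation; `RATE`, `absorb`, and `dummy_permute` are illustrative names, and the real code operates on `u64` words):

```rust
// Hedged sketch of the "full blocks + tail" structure: the hot loop
// handles whole rate-sized blocks; leftover bytes are handled once at
// the end, outside the hot loop.
const RATE: usize = 8; // hypothetical rate in bytes

fn absorb(state: &mut [u8; RATE], data: &[u8], permute: fn(&mut [u8; RATE])) -> usize {
    // Core loop: XOR-absorb every full rate-sized block, then permute.
    let mut chunks = data.chunks_exact(RATE);
    for block in &mut chunks {
        for (s, d) in state.iter_mut().zip(block) {
            *s ^= *d;
        }
        permute(state);
    }
    // Tail: leftover bytes shorter than the rate, handled once so the
    // compiler can keep the core loop tight.
    let tail = chunks.remainder();
    for (s, d) in state.iter_mut().zip(tail) {
        *s ^= *d;
    }
    tail.len() // new cursor position within the current block
}

fn main() {
    fn dummy_permute(state: &mut [u8; RATE]) {
        state.rotate_left(1); // stand-in for the real permutation
    }
    let mut state = [0u8; RATE];
    let pos = absorb(&mut state, &[1u8; 20], dummy_permute);
    assert_eq!(pos, 20 % RATE); // 4 leftover bytes after two full blocks
}
```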
And I am not sure how you did the performance comparison, since the current version of sponge-cursor does not provide a method for "squeeze by XORing". Are you reading a keystream block first and then XORing it with the data?
Could you provide an alternative implementation which follows the existing method API, so we could do an apples-to-apples comparison? Take a look at RustCrypto/hashes#849 for examples of how we currently use the crate.
Here is a godbolt link for squeeze_u64_le: https://rust.godbolt.org/z/5c58vPb97
As you can see, the core loop compiles into a very efficient assembly, while handling of leftover bytes is just a bunch of memcpys.
What I am doing with my code is a duplex construction, inspired by STROBE, so that is a keystream example. For a more apples-to-apples comparison, I did a quick translation of what the absorb method would look like:
```rust
#[inline(always)]
pub fn absorb_input<const N: usize>(&mut self, state: &mut [u64; N], mut data: &[u8], sponge: fn(&mut [u64; N])) {
    const {
        assert!(Rate::USIZE <= size_of::<[u64; N]>());
        assert!(Rate::USIZE < u8::MAX as usize);
        assert!(Rate::USIZE % size_of::<u64>() == 0);
    };
    loop {
        let pos = self.pos();
        let state_bytes = &mut state.as_mut_bytes()[pos..Rate::USIZE];
        let written = state_bytes
            .iter_mut()
            .zip(data.iter())
            .map(|(state, input)| *state ^= *input)
            .count();
        data = &data[written..];
        if !data.is_empty() {
            // SPONGE
            sponge(state);
            self.pos = 0;
        } else {
            self.pos += written as u8;
            return;
        }
    }
}
```

I kept `as_mut_bytes` as a stand-in for the cast, though I'm kinda iffy on the one in the code here, since `N` will not match `Rate`. But all in all, this is what I meant with my previous explanation: we incur fewer branches with this loop, mapping the block of state directly onto the available block of data via `zip`, then just operating on each byte until we have no more data.
Squeeze could be done the same way, with the permutation/sponge running at the beginning of the loop and the check at the end being just `if data.is_empty()`, I think.
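That squeeze shape, with the permutation at the top of the loop, might look roughly like this (a simplified standalone sketch with a byte state and a stand-in permutation; `RATE`, `squeeze`, and `dummy_permute` are illustrative names, not the crate's API):

```rust
// Hedged sketch of the squeeze loop described above: the permutation
// runs at the top of each iteration, then state bytes are copied out
// until the output buffer is full.
const RATE: usize = 8; // hypothetical rate in bytes

fn squeeze(state: &mut [u8; RATE], out: &mut [u8], permute: fn(&mut [u8; RATE])) {
    let mut off = 0;
    while off < out.len() {
        // SPONGE: permute before emitting each block of keystream bytes.
        permute(state);
        // Copy as many state bytes as the remaining output can take.
        let read = state
            .iter()
            .zip(out[off..].iter_mut())
            .map(|(s, o)| *o = *s)
            .count();
        off += read;
    }
}

fn main() {
    // Stand-in permutation: increment every byte so blocks differ.
    fn dummy_permute(state: &mut [u8; RATE]) {
        for b in state.iter_mut() {
            *b = b.wrapping_add(1);
        }
    }
    let mut state = [0u8; RATE];
    let mut out = [0u8; 12];
    squeeze(&mut state, &mut out, dummy_permute);
    assert_eq!(out[0], 1); // first block: after one permutation
    assert_eq!(out[8], 2); // second block: after two permutations
}
```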
I added the squeeze_xor_u64_le method. You can see the resulting assembly here. Try to repeat the benchmark with it. Also, try to run it for bigger sizes (e.g. 1 MiB).
As you can see here, your implementation indeed results in a smaller binary, but at the cost of not having a specialized core loop for processing full blocks. So it could be more performant if you process a lot of small data chunks with sizes smaller than the rate, but I think it should be less performant on large data sizes, which I believe are more frequent in practice.
Your code is also not quite correct, since it breaks the invariant that `pos` is always smaller than `RATE`, and IIUC it does not correctly handle data bigger than the rate, since it calls `sponge` based only on the `data.is_empty()` check.
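The invariant issue shows up when the input is an exact multiple of the rate. A minimal standalone adaptation of the single-check loop (illustrative names, simplified to a byte state and with the permutation elided) makes it visible:

```rust
// Hedged illustration of the invariant issue: with a single
// `data.is_empty()` check, input that is an exact multiple of the rate
// returns with `pos == RATE`, violating `pos < RATE` and skipping the
// permutation of the final full block.
const RATE: usize = 8; // hypothetical rate in bytes

fn absorb_single_check(state: &mut [u8; RATE], mut data: &[u8], mut pos: usize) -> usize {
    loop {
        // XOR input into the remaining part of the current block.
        let written = state[pos..]
            .iter_mut()
            .zip(data.iter())
            .map(|(s, d)| *s ^= *d)
            .count();
        data = &data[written..];
        if !data.is_empty() {
            // The permutation would run here before the cursor resets.
            pos = 0;
        } else {
            pos += written;
            return pos;
        }
    }
}

fn main() {
    let mut state = [0u8; RATE];
    // Exactly one full block: the loop exits with pos == RATE.
    let pos = absorb_single_check(&mut state, &[1u8; RATE], 0);
    assert_eq!(pos, RATE); // the `pos < RATE` invariant is broken
}
```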
I was benchmarking against byte strings 3348 bytes long, so still many times greater than the rate, and I still got the benched results. Plus, it improved with `target-cpu=x86-64-v3`, though I did see a regression with v4.
I am targeting smaller code by default (since I am working with MCUs and optimising for more constrained systems), but this is what I noted when testing that particular code path adapted for my impl. What I wonder here is whether the specialisation kicks in at all if we up the CPU feature level, or if the smaller loop just ends up working better until much, much larger data sizes than 4 kB.
I fixed your implementation to this variant:

```rust
while !data.is_empty() {
    let pos = self.pos();
    let state_bytes: &mut [u8; RATE] = unsafe { &mut *(state.as_mut_ptr().cast()) };
    let state_bytes = &mut state_bytes[pos..];
    let written = state_bytes
        .iter_mut()
        .zip(data.iter())
        .map(|(state, input)| *state ^= *input)
        .count();
    data = &data[written..];
    self.pos += written as u8;
    if self.pos == RATE as u8 {
        sponge(state);
        self.pos = 0;
    }
}
```

I ran our k12 (absorption-only) benchmarks on my laptop (AMD Ryzen 7 5700U) and got the following results.
Baseline:

```
# Base x86-64
test turboshake128_10    ... bench: 11.85 ns/iter (+/- 0.08) = 909 MB/s
test turboshake128_100   ... bench: 101.20 ns/iter (+/- 0.96) = 990 MB/s
test turboshake128_1000  ... bench: 983.87 ns/iter (+/- 4.83) = 1017 MB/s
test turboshake128_10000 ... bench: 9,811.05 ns/iter (+/- 61.22) = 1019 MB/s
# x86-64-v3
test turboshake128_10    ... bench: 10.58 ns/iter (+/- 0.18) = 1000 MB/s
test turboshake128_100   ... bench: 86.91 ns/iter (+/- 0.86) = 1162 MB/s
test turboshake128_1000  ... bench: 841.96 ns/iter (+/- 4.29) = 1189 MB/s
test turboshake128_10000 ... bench: 8,352.16 ns/iter (+/- 70.54) = 1197 MB/s
```
Your variant:

```
# Base x86-64
test turboshake128_10    ... bench: 13.46 ns/iter (+/- 0.07) = 769 MB/s
test turboshake128_100   ... bench: 101.00 ns/iter (+/- 0.68) = 1000 MB/s
test turboshake128_1000  ... bench: 1,001.56 ns/iter (+/- 7.39) = 999 MB/s
test turboshake128_10000 ... bench: 9,925.97 ns/iter (+/- 92.69) = 1007 MB/s
# x86-64-v3
test turboshake128_10    ... bench: 12.47 ns/iter (+/- 0.14) = 833 MB/s
test turboshake128_100   ... bench: 89.52 ns/iter (+/- 0.53) = 1123 MB/s
test turboshake128_1000  ... bench: 857.95 ns/iter (+/- 6.98) = 1166 MB/s
test turboshake128_10000 ... bench: 8,441.75 ns/iter (+/- 24.66) = 1184 MB/s
```
As you can see, we have a small win for 100-byte chunks on the base x86-64 target, but otherwise your variant is slightly slower everywhere else.
We could introduce a separate "compact" compilation flag for constrained targets, like we do in the sponge crates, but that's a separate feature.
Yeah, a compact code option would be scope creep for this PR and should be considered as something separate.
sponge-cursor is a helper crate which allows removing unnecessary buffering from sponge-based constructions.

TODO: add tests, examples, and CI config