I built a benchmark tool that measures the throughput of AES-CTR on an 8k buffer using various versions of the aes crate.
I noticed a significant slowdown between versions 0.8.4 and 0.9.0-rc.2: average throughput drops by roughly 31% and 42% on the two machines below. I think it is related to inlining in autodetect.rs and to the switch from 8 to 9 blocks per run. I drafted a patch here that restores 8 blocks per run and reverts the wrappers in autodetect.rs to the versions used in 0.8.4. The VAES code is still there, i.e. the fix is not a breaking change.
Below are performance numbers on two machines.
One machine (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):
0.8.4: Avg: 3373.41 MiB/s | Median: 3478.85 | Min: 2683.36 | Max: 3677.39
0.9.0-rc.2: Avg: 2338.59 MiB/s | Median: 2393.62 | Min: 2066.86 | Max: 2459.26
fix: Avg: 3598.64 MiB/s | Median: 3713.41 | Min: 2730.50 | Max: 3864.82
Another machine (AMD EPYC-Milan Processor):
0.8.4: Avg: 7637.36 MiB/s | Median: 8301.11 | Min: 3398.54 | Max: 8330.20
0.9.0-rc.2: Avg: 4451.80 MiB/s | Median: 4979.17 | Min: 2435.00 | Max: 4986.76
fix: Avg: 7601.95 MiB/s | Median: 8267.81 | Min: 3375.63 | Max: 8278.00
The benchmark was built with cargo build --release in all cases.
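For readers who want to sanity-check the Avg/Median/Min/Max figures above, here is a minimal sketch of how such a summary could be derived from per-run timings. It is not the code of the benchmark tool; the names `mib_per_sec`, `summarize`, and `Summary` are hypothetical, and it assumes each sample is the throughput of one timed run in MiB/s, with 1 MiB = 1024 * 1024 bytes.

```rust
use std::time::Duration;

/// Summary of per-run throughput samples, all in MiB/s.
/// (Hypothetical type, not taken from the benchmark tool.)
#[derive(Debug, Clone, PartialEq)]
pub struct Summary {
    pub avg: f64,
    pub median: f64,
    pub min: f64,
    pub max: f64,
}

/// Convert "`bytes` processed in `elapsed`" into MiB/s.
/// Assumes `elapsed` is non-zero.
pub fn mib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / elapsed.as_secs_f64()
}

/// Compute average, median, minimum, and maximum of the samples.
/// Returns `None` for an empty slice.
pub fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    // Sort a copy so the caller's data is left untouched.
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let n = sorted.len();
    // Median: middle element, or the mean of the two middle elements.
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    let avg = sorted.iter().sum::<f64>() / n as f64;

    Some(Summary {
        avg,
        median,
        min: sorted[0],
        max: sorted[n - 1],
    })
}
```

In a harness like this, each run would time one or more passes over the buffer with `std::time::Instant`, pass the byte count and elapsed time to `mib_per_sec`, and hand the collected samples to `summarize`. Reporting the median alongside the average helps here because the Min values above suggest occasional slow runs that pull the average down.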
To reproduce this, run the following commits of my benchmark tool:
- 0.8.4 starius/rust-aes-ctr-bench@f680618
- 0.9.0-rc.2 starius/rust-aes-ctr-bench@2ef76fc
- fix starius/rust-aes-ctr-bench@1dc16db
The only difference between them is the versions of aes and ctr used.
I'm attaching the flamegraph generated for version 0.9.0-rc.2. It shows that 24% of the time is spent in <cipher::stream::wrapper::StreamCipherCoreWrapper<T> as cipher::stream::StreamCipher>::try_apply_keystream_inout.