perf: make BufList::remaining() O(1) with a cached byte count #4031
DioCrafts wants to merge 2 commits into hyperium:master
Add a field to BufList that tracks the total number of bytes across all buffers. This avoids iterating the entire VecDeque on every call to remaining(), which is invoked from hot paths like poll_flush, can_buffer, and advance. Previously, remaining() was O(n) where n is the number of buffers in the queue (up to 16 in Queue write strategy). Now it is O(1) — a simple field read.
|
@paolobarbolini You're right, added an assert guard. Appreciate the review! 🙏 |
|
Do you have benchmarks or something that is able to measure the change of this with more and less write buffers? |
|
Not yet. I can write a micro-benchmark that pushes N small buffers and calls remaining() in a loop to show the O(n) vs O(1) difference. For typical connections with few queued buffers the gain is small, but it becomes visible with many small writes. Do you want me to add the benchmark and post the numbers? Kind regards |
|
I was interested in the motivation, and to be sure we measure that things were improved. Was this showing high in profiles in a server you had, or something? |
|
@seanmonstar exactly, in my team we are running several critical connections with hyper in different applications, so reducing latency is very important for us. Perfect, I already have the benchmark.

BufList::remaining() O(1) Optimization: Benchmark Results

Benchmarks run with

| Buffers | Before (O(n)) | After (O(1)) | Speedup |
|---|---|---|---|
| 1 | 0.99 ns/iter | 0.32 ns/iter | 3x |
| 4 | 1.14 ns/iter | 0.27 ns/iter | 4x |
| 16 | 2.48 ns/iter | 0.40 ns/iter | 6x |
| 128 | 20.54 ns/iter | 0.35 ns/iter | 59x |
| 1024 | 302.59 ns/iter | 0.33 ns/iter | 917x |

remaining() is now constant-time (~0.3 ns) regardless of buffer count.

push() + remaining() cycle (simulates real write loop)

| Buffers | Before | After | Improvement |
|---|---|---|---|
| 16 | 148.50 ns/iter | 118.62 ns/iter | ~20% |
| 128 | 2,032.20 ns/iter | 468.55 ns/iter | ~77% |
With hyper's current MAX_BUF_LIST_BUFFERS = 16, the push+remaining cycle sees a ~20% speedup, growing to ~77% at higher buffer counts.
Raw output
Before (O(n) iteration)
test common::buf::bench::buflist_remaining_1_buf ... bench: 0.99 ns/iter (+/- 0.15)
test common::buf::bench::buflist_remaining_4_bufs ... bench: 1.14 ns/iter (+/- 0.19)
test common::buf::bench::buflist_remaining_16_bufs ... bench: 2.48 ns/iter (+/- 0.35)
test common::buf::bench::buflist_remaining_128_bufs ... bench: 20.54 ns/iter (+/- 2.53)
test common::buf::bench::buflist_remaining_1024_bufs ... bench: 302.59 ns/iter (+/- 19.95)
test common::buf::bench::buflist_push_and_remaining_16 ... bench: 148.50 ns/iter (+/- 25.87)
test common::buf::bench::buflist_push_and_remaining_128 ... bench: 2,032.20 ns/iter (+/- 285.50)
After (O(1) cached field)
test common::buf::bench::buflist_remaining_1_buf ... bench: 0.32 ns/iter (+/- 0.05)
test common::buf::bench::buflist_remaining_4_bufs ... bench: 0.27 ns/iter (+/- 0.05)
test common::buf::bench::buflist_remaining_16_bufs ... bench: 0.40 ns/iter (+/- 0.15)
test common::buf::bench::buflist_remaining_128_bufs ... bench: 0.35 ns/iter (+/- 0.08)
test common::buf::bench::buflist_remaining_1024_bufs ... bench: 0.33 ns/iter (+/- 0.13)
test common::buf::bench::buflist_push_and_remaining_16 ... bench: 118.62 ns/iter (+/- 19.20)
test common::buf::bench::buflist_push_and_remaining_128 ... bench: 468.55 ns/iter (+/- 76.48)
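
For anyone who wants to reproduce the shape of this comparison outside hyper, here is a minimal sketch using only the standard library. It is not the benchmark that produced the numbers above: `make_queue`, `scan_remaining`, and `ns_per_call` are hypothetical helpers, the queue holds plain `Vec<u8>` values rather than hyper's buffer types, and a hand-rolled `Instant` timer is far noisier than a proper bench harness.

```rust
use std::collections::VecDeque;
use std::hint::black_box;
use std::time::Instant;

/// Build a queue of `n` small buffers (a stand-in for BufList's contents).
fn make_queue(n: usize, buf_len: usize) -> VecDeque<Vec<u8>> {
    (0..n).map(|_| vec![0u8; buf_len]).collect()
}

/// O(n) variant: sum the buffer lengths on every call.
fn scan_remaining(bufs: &VecDeque<Vec<u8>>) -> usize {
    bufs.iter().map(|b| b.len()).sum()
}

/// Average nanoseconds per call of `f` over `iters` calls.
/// `black_box` keeps the optimizer from discarding the result.
fn ns_per_call<F: FnMut() -> usize>(iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}
```

Timing `ns_per_call(1_000_000, || scan_remaining(black_box(&queue)))` against a closure that returns a precomputed total gives a rough view of the scan cost for different queue sizes. Absolute values will differ from the figures reported here.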
|
I don't have review powers on this repository. For what it's worth, I think Sean was asking about the overall improvement to the client and/or server when these patches are applied on top of hyper, rather than the micro-benchmarks on the BufList APIs. While these are good, they don't really demonstrate how they would help hyper. |
Description:
BufList::remaining() iterated over all queued buffers to sum their lengths. For write-heavy workloads with many small buffers queued, this O(n) scan ran on every poll cycle.

This PR adds a remaining: usize field that tracks the total byte count incrementally. push() adds, advance() and copy_to_bytes() subtract. remaining() returns the cached value directly.

What changed
- buf.rs: added remaining field to BufList<T>
- push(), advance(), and copy_to_bytes() now maintain the counter
- remaining() now returns the field instead of iterating

No public API changes. All tests pass, including the existing BufList unit tests.
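
The bookkeeping described above can be illustrated with a small, self-contained sketch. This is not the code in this PR: `CountedBufList` is a hypothetical stand-in that stores `Vec<u8>` values with a read cursor instead of `bytes::Buf` implementations, and it omits `copy_to_bytes()`. It only shows the invariant that the cached field must always equal the sum a full scan would produce.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for a buffer list with a cached byte count
/// (hypothetical; not hyper's actual BufList).
struct CountedBufList {
    bufs: VecDeque<(Vec<u8>, usize)>, // (data, bytes already consumed)
    remaining: usize,                 // cached total of unread bytes
}

impl CountedBufList {
    fn new() -> Self {
        Self { bufs: VecDeque::new(), remaining: 0 }
    }

    /// Queue a buffer and add its length to the cached count.
    fn push(&mut self, buf: Vec<u8>) {
        self.remaining += buf.len();
        self.bufs.push_back((buf, 0));
    }

    /// O(1): return the cached count.
    fn remaining(&self) -> usize {
        self.remaining
    }

    /// O(n) reference scan; the cached field must always equal this.
    fn remaining_by_scan(&self) -> usize {
        self.bufs.iter().map(|(data, pos)| data.len() - pos).sum()
    }

    /// Consume `cnt` bytes from the front, dropping exhausted buffers.
    fn advance(&mut self, mut cnt: usize) {
        assert!(cnt <= self.remaining, "advance past end of buffer list");
        self.remaining -= cnt;
        while cnt > 0 {
            let (data, pos) = self.bufs.front_mut().expect("count and queue out of sync");
            let left = data.len() - *pos;
            if cnt < left {
                // Partially consume the front buffer and stop.
                *pos += cnt;
                return;
            }
            // Front buffer is fully consumed; drop it and continue.
            cnt -= left;
            self.bufs.pop_front();
        }
    }
}
```

Keeping a scan-based method next to the cached one is a design choice for the sketch only: it makes it cheap to check in tests that every mutation updates the counter consistently.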