
perf: eliminate ParserConfig clones on every H1 request #4028

Closed
DioCrafts wants to merge 2 commits into hyperium:master from DioCrafts:perf/h1-parser-config-zero-copy

Conversation


@DioCrafts DioCrafts commented Mar 3, 2026

Problem

Every HTTP/1.1 request triggered two unnecessary ParserConfig::clone() calls in the hot parsing path:

  1. Conn::poll_read_head() cloned self.state.h1_parser_config to build ParseContext
  2. Buffered::parse() cloned it again inside the retry loop on each iteration

ParserConfig is a read-only config struct. Nothing in the parse chain mutates it. Cloning it on every request was pure waste.

On top of that, Buffered::new() allocated the read buffer with BytesMut::with_capacity(0). This forced the first poll_read_from_io() to hit the allocator before any data could be read. The buffer always grows to INIT_BUFFER_SIZE (8192) on the first read anyway.

Changes

src/proto/h1/mod.rs

  • ParseContext.h1_parser_config: ParserConfig → &'a ParserConfig

src/proto/h1/conn.rs

  • Removed .clone(). Now passes &self.state.h1_parser_config directly.

src/proto/h1/io.rs

  • Removed .clone() in the parse loop. The reference is Copy, so it passes through each retry iteration without cloning.
  • Changed BytesMut::with_capacity(0) → BytesMut::with_capacity(INIT_BUFFER_SIZE).

src/proto/h1/role.rs

  • Updated 20 test call sites to pass &ParserConfig instead of owned values.

Why this matters

These two clones ran on every single HTTP/1.1 request. For a server handling thousands of requests per second, that adds up. The fix is simple: pass a reference instead of copying the struct. All httparse::ParserConfig methods already take &self, so this works without any API change.

The buffer pre-allocation removes one guaranteed allocation from every new connection. The buffer was going to be 8KB anyway. Now it starts there.

Testing

All 270 tests pass across 5 test suites.

No public API changes. No breaking changes.

Update: BufList::remaining() O(n) → O(1)

src/common/buf.rs

BufList::remaining() was iterating the entire VecDeque on every call to sum up byte counts:

fn remaining(&self) -> usize {
    self.bufs.iter().map(|buf| buf.remaining()).sum()
}

This method is called from multiple hot paths on every flush cycle:

  • WriteBuf::remaining() — called in poll_flush to check if there's data to write
  • WriteBuf::can_buffer() — called before buffering each new chunk
  • WriteBuf::advance() — called after every partial write to the socket

In Queue write strategy (the default when the OS supports writev, i.e. most Linux servers), the queue can hold up to 16 buffers. Every call walked all of them.

Changes

  • Added a remaining: usize field to BufList that tracks total bytes across all buffers
  • push(): increments the cached total
  • advance(): decrements the cached total
  • copy_to_bytes(): decrements in the optimized front-buffer paths; the fallback path goes through advance() which handles it
  • remaining(): now returns the cached field directly — O(1)

Impact

For a server at 100K req/s with ~8 buffers in queue and ~8 calls to remaining() per request:

  • Before: 6.4M iterator traversals/sec just to count bytes
  • After: 800K field reads (effectively free — single register read)

The overhead of maintaining the counter is ~0.3ns per push/advance (one integer add/sub).

Testing

All 162 tests pass (61 client + 14 integration + 86 server + 12 doc-tests). No public API changes.

- Change ParseContext.h1_parser_config from owned ParserConfig to &'a ParserConfig
- Remove .clone() in Conn::poll_read_head() (conn.rs)
- Remove .clone() in Buffered::parse() retry loop (io.rs)
- Pre-allocate read buffer with INIT_BUFFER_SIZE instead of capacity(0) (io.rs)
- Update all test call sites in role.rs and io.rs

Eliminates 2 unnecessary ParserConfig copies per HTTP/1.1 request
in the hot parsing path.
Add a remaining: usize field to BufList that tracks the total number
of bytes across all buffers. This avoids iterating the entire
VecDeque on every call to remaining(), which is invoked from
hot paths like poll_flush, can_buffer, and advance.

Previously, remaining() was O(n) where n is the number of
buffers in the queue (up to 16 in Queue write strategy).
Now it is O(1) — a simple field read.
@0x676e67
Contributor

0x676e67 commented Mar 3, 2026

The title of the submission here seems a bit messy. I think it would be better to split the different optimizations into two separate PRs.

@DioCrafts DioCrafts requested a review from 0x676e67 March 3, 2026 18:27
@seanmonstar
Member

Thanks! I believe it originally used a clone for lifetimes. I assumed the compiler could inline and eliminate the code; does this result in a difference?

(Also, could you keep one logical change per PR, please?)

@DioCrafts
Author

@seanmonstar @0x676e67 Sure, apologies, I will split the changes into separate pull requests.

@DioCrafts DioCrafts closed this Mar 3, 2026