
add generic list api with slab and rc impls #48291

Merged
jkarneges merged 12 commits into main from jkarneges/generic-list on Mar 13, 2026

Conversation

@jkarneges (Member) commented Jan 9, 2026

This is a new linked list implementation that allows for a non-fixed capacity. It is added alongside the existing implementation, and a fixed-capacity variant of it is used in a couple of places to confirm it works as expected. After it is merged, we can look at actually using dynamic capacities and eventually removing the old implementation.

Background

Currently, our linked list implementation stores nodes in preallocated slabs in order to avoid heap operations at runtime. This approach is performant but requires knowing in advance how many nodes will be needed, and this isn't always easy to know. Notably, reactor registrations are kept in a linked list, and determining the number of needed registrations in the whole app pretty much requires reading the entire codebase since registrations can occur anywhere.

Approach

This PR aims to provide a linked list that has a dynamic capacity while remaining performant. It introduces a new type, RcList, capable of working with ref-counted nodes using memorypool::Rc from our core lib. memorypool::Rc is like std::rc::Rc, but it supports allocating with either the system allocator or a slab (pool). This approach lets us continue to use preallocated pools for node memory as desired, with the advantage that not every node in a list has to live in the same pool, or even in a pool at all.

RcList is naturally !Send, and there is at least one place where this would cause us trouble: connmgr's Pool (TCP connection pool), which is shared between threads. The simplest solution to that problem is to continue offering an implementation based on a single slab as well, for that specific use case. In order to avoid having multiple linked list implementations, the new list is generic over a Backend which supplies indexing and linking logic.

Notable types/impls (a rough code sketch follows the list):

  • trait Backend - Abstract API for node/link management.
  • impl<T> Backend for Slab<SlabNode<T>> - A backend using usize indexes with a single slab for node memory, implemented directly on Slab.
  • RcBackend - A backend using Rc for indexes with node memory living wherever. It is a zero-sized type.
  • struct GenericList<B: Backend> - A linked list that operates on an arbitrary backend. Note that the type does not wrap a Backend instance; instead, a mutable reference to a Backend must be passed to every method call. This allows multiple lists to share the same backend, mainly in order to share node storage.
  • struct BoundGenericList<B: Backend> - Like GenericList, but it owns a Backend instance and so it is not necessary to provide one in method calls. This is handy for when the backend instance does not need to be shared, either because only one list uses the backend or the backend is a zero-sized type.
  • type SlabList<T> = GenericList<Slab<SlabNode<T>>> - Fixed-capacity list using a single slab for storage. Essentially a drop-in substitute for the current linked list implementation.
  • type RcList<T> = BoundGenericList<RcBackend<T>> - Dynamic-capacity list using Rc.
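
To make the shape concrete, here is a minimal sketch of these pieces, with assumed names and signatures. The real trait has more methods, doubly-linked nodes, and the index-reference machinery described under "Compatibility / perf" below:

```rust
// A sketch only, not the PR's actual API. Links are singly-linked here for
// brevity; names and signatures are assumptions.
trait Backend {
    type Index: Clone;

    fn next(&self, node: &Self::Index) -> Option<Self::Index>;
    fn set_next(&mut self, node: &Self::Index, next: Option<Self::Index>);
}

// The list stores only head/tail indexes. It does not own the backend; the
// backend is passed into every call, so multiple lists can share one backend
// (and therefore share node storage).
struct GenericList<B: Backend> {
    head: Option<B::Index>,
    tail: Option<B::Index>,
}

impl<B: Backend> GenericList<B> {
    fn new() -> Self {
        Self { head: None, tail: None }
    }

    fn push_back(&mut self, backend: &mut B, node: B::Index) {
        backend.set_next(&node, None);
        match self.tail.take() {
            Some(tail) => backend.set_next(&tail, Some(node.clone())),
            None => self.head = Some(node.clone()),
        }
        self.tail = Some(node);
    }

    fn pop_front(&mut self, backend: &mut B) -> Option<B::Index> {
        let node = self.head.take()?;
        self.head = backend.next(&node);
        if self.head.is_none() {
            self.tail = None;
        }
        backend.set_next(&node, None); // detach the popped node
        Some(node)
    }
}

// BoundGenericList owns its backend, so callers don't have to thread one
// through every call. Useful when the backend isn't shared or is zero-sized.
struct BoundGenericList<B: Backend> {
    backend: B,
    list: GenericList<B>,
}

impl<B: Backend> BoundGenericList<B> {
    fn push_back(&mut self, node: B::Index) {
        self.list.push_back(&mut self.backend, node);
    }

    fn pop_front(&mut self) -> Option<B::Index> {
        self.list.pop_front(&mut self.backend)
    }
}
```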

RcList should be preferred for everything going forward, except for a couple of special cases:

  • When the list needs to be sharable from multiple threads, such as with connmgr's Pool. In that case, use SlabList.
  • When code needs to be generic over the list type, for example TimerWheel, which needs to be sharable from multiple threads in one case (Pool) but not in others (Reactor). In that case, use GenericList/BoundGenericList and let the caller provide the appropriate backend (a sketch of this pattern follows the list).
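
Under the sketch above, the generic case might look like this. The function and its logic are illustrative (a hypothetical TimerWheel-style consumer), not code from the PR:

```rust
// Hypothetical generic consumer, building on the Backend/GenericList sketch
// above: the same code runs against the slab backend (Send) or the rc
// backend (!Send); the caller chooses by supplying the backend.
fn drain<B: Backend>(list: &mut GenericList<B>, backend: &mut B) -> usize {
    let mut count = 0;
    while let Some(node) = list.pop_front(backend) {
        let _ = node; // process the entry here
        count += 1;
    }
    count
}
```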

Compatibility / perf

The new API is basically the same as the current one. RcList/SlabList don't implement Copy, but they do still implement Clone. SlabList methods take &mut Backend instead of &mut IndexMut. However, since Slab implements Backend, it is possible to keep passing &mut Slab in the same argument position.
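
Continuing the sketch above, and using the slab crate purely as a stand-in for our own Slab type, implementing Backend directly on Slab is what lets call sites keep passing &mut Slab unchanged:

```rust
use slab::Slab;

// Stand-in node type: the stored value plus its link(s). The real SlabNode
// lives in our codebase; this is illustrative only.
struct SlabNode<T> {
    value: T,
    next: Option<usize>,
}

// Backend implemented directly on Slab, so existing call sites can keep
// passing &mut Slab where they previously passed &mut impl IndexMut.
impl<T> Backend for Slab<SlabNode<T>> {
    type Index = usize;

    fn next(&self, node: &usize) -> Option<usize> {
        self[*node].next
    }

    fn set_next(&mut self, node: &usize, next: Option<usize>) {
        self[*node].next = next;
    }
}
```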

Care is taken to ensure the API doesn't require unnecessary cloning of ref-counted nodes, mainly in case we ever want to add an Arc-based backend. For example, the remove() method takes an index reference (&RcNode when using the rc-based backend) rather than an owned index.

At the same time, we don't want to have to pass a &usize when using the single slab backend as this adds unnecessary indirection. To work around that, the index reference type is made generic. For the single slab backend, the index type is usize and the index reference type is also usize, whereas for the rc-based backend, the index type is RcNode and the index reference type is &RcNode.
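
A sketch of how that can be expressed, with assumed names (the PR writes it as <Backend::Index as Index>::Ref): a generic associated type lets usize declare that its "reference" form is itself, passed by value, while an rc index declares a borrowed form:

```rust
use std::rc::Rc;

// Sketch of a generic index-reference type (assumed names). usize passes
// itself by value; an Rc-based index passes a borrow, avoiding a clone.
trait ListIndex {
    type Ref<'a>: Copy
    where
        Self: 'a;

    fn as_ref(&self) -> Self::Ref<'_>;
}

impl ListIndex for usize {
    // No indirection: the "reference" is just the usize itself.
    type Ref<'a> = usize;

    fn as_ref(&self) -> usize {
        *self
    }
}

// Toy stand-in for the rc node index (the real type is built on
// memorypool::Rc, not std::rc::Rc).
struct RcNode(Rc<()>);

impl ListIndex for RcNode {
    // Methods like remove() can take this without cloning the Rc.
    type Ref<'a> = &'a RcNode;

    fn as_ref(&self) -> &RcNode {
        self
    }
}
```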

In theory, being able to index using usize by value should enable the single slab backend to remain as performant as the current list implementation which does the same, though the generified code is a bit noisy (<Backend::Index as Index>::Ref all over the place). The benchmarks appear to support this.

Benchmarks

Benchmarks are included that do 10k pushes/pops against the various implementations. Results on Linux:

slab-push-pop-x10k      time:   [86.588 µs 86.607 µs 86.630 µs]
gen-slab-push-pop-x10k  time:   [84.222 µs 84.298 µs 84.393 µs]
mp-push-pop-x10k        time:   [238.39 µs 238.45 µs 238.52 µs]
sys-push-pop-x10k       time:   [432.94 µs 433.11 µs 433.31 µs]
pre-mp-push-pop-x10k    time:   [96.806 µs 96.829 µs 96.854 µs]
pre-sys-push-pop-x10k   time:   [101.55 µs 101.83 µs 102.31 µs]

The first benchmark is of the current implementation (single slab), and the second is of the new implementation with the single slab backend. The numbers are very close, which makes sense since they're both the same logic: with static dispatch and good inlining, they should compile down to more or less the same thing.

The rest are rc-based benches. mp- uses a memory pool and sys- uses the system allocator. With the pool, it's about 3x slower than rc-less slab. With the system allocator, it's about 5x slower. The pre- benches exclude node allocation and deallocation from the measurements, so that only the link manipulations are measured. In that case, the overhead is much smaller (~1.1x slower).

Overhead

These results mean there is a cost for lists with dynamic capacity (the rc overhead). However, it is mostly due to the memory management and not the list operations themselves, and we still retain good control over how that memory management works. Technically, when using rc'd nodes backed by a memory pool, there is no overhead from "conventionally costly operations" (e.g. heap allocations), which is usually our highest bar for performance. Therefore, the overhead of dynamic capacity should be considered acceptable when the ergonomics of dynamic capacity are desired.

Per-task memory pool opportunity

Since nodes in RcList can optionally live in different memory pools, this opens up the possibility of having per-task pools for node memory. Each task could create nodes within its own pools, even if the nodes will be added to lists shared among multiple tasks. If a task wants to create a node in a pool that's full, it can create it on the heap instead. There could be an automatic right-sizing mechanism too: if a node needs to go to the heap, it could be noted somewhere so that the next spawned task gets a larger pool.
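
To illustrate the fallback policy only, here is a toy sketch. None of these names are real API, and the right-sizing signal is purely hypothetical; the real version would build on memorypool::Rc and the task system:

```rust
// Toy sketch of pool-or-heap node creation with a right-sizing hint.
// Everything here is hypothetical.
enum Node<T> {
    Pooled(usize),  // index into the task's pool
    Heap(Box<T>),   // heap fallback when the pool is full
}

struct TaskPool<T> {
    slots: Vec<Option<T>>, // fixed-capacity task-local storage
    overflowed: bool,      // hint: size the next spawned task's pool larger
}

impl<T> TaskPool<T> {
    fn with_capacity(n: usize) -> Self {
        Self {
            slots: (0..n).map(|_| None).collect(),
            overflowed: false,
        }
    }

    fn create_node(&mut self, value: T) -> Node<T> {
        if let Some(i) = self.slots.iter().position(|s| s.is_none()) {
            self.slots[i] = Some(value);
            Node::Pooled(i)
        } else {
            // Pool is full: fall back to the heap and remember it happened.
            self.overflowed = true;
            Node::Heap(Box::new(value))
        }
    }
}
```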

@deg4uss3r (Member) commented:

Benchmarks run on an M2 Pro:

index slab list push pop 1000
                        time:   [5.4506 µs 5.4555 µs 5.4605 µs]
[...]
generic slab list push pop 1000
                        time:   [5.3215 µs 5.3258 µs 5.3308 µs]
[...]
arena rc list push pop 1000
                        time:   [34.838 µs 34.866 µs 34.898 µs]
[...]
std rc list push pop 1000
                        time:   [34.515 µs 34.548 µs 34.582 µs]

@jkarneges (Member, Author) commented:

> Maybe the rc logic could be more optimized.

There is indeed room for improvement. arena::Rc is written in mostly safe Rust and hits a RefCell for every operation, whereas std::rc::Rc uses raw pointers and has better inlining. Reworking arena::Rc to be like std's makes clone, drop-without-destruct (refs > 1), and deref operations perform similarly. I've added benchmarks using preallocated nodes to show this.

As expected, benchmarks that don't preallocate the nodes, so that new and drop-with-destruct (refs == 1) rc operations are included in the measurements, show the arena-based impl outperforming std's.

Now the results make more sense. cargo bench --bench list:

index slab list push pop 10000
                        time:   [86.453 µs 86.479 µs 86.508 µs]
[...]
list push pop 10000
                        time:   [81.789 µs 82.121 µs 82.535 µs]
[...]
arena rc list push pop 10000
                        time:   [766.01 µs 766.56 µs 767.26 µs]
[...]
std rc list push pop 10000
                        time:   [875.06 µs 875.56 µs 876.14 µs]
[...]
arena rc list push pop 10000 (preallocated nodes)
                        time:   [208.33 µs 208.37 µs 208.42 µs]
[...]
std rc list push pop 10000 (preallocated nodes)
                        time:   [208.98 µs 209.02 µs 209.06 µs]
[...]

I also switched to using Cell instead of RefCell for node links. I wasn't able to measure any improvement from this, but it's better for the optimizer: Cell ops are always zero cost, whereas eliding RefCell's borrow tracking requires the optimizer to see the full context.
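
For illustration, here is roughly what Cell-based links look like. The node shape is assumed, and the real links hold memorypool::Rc indexes rather than std::rc::Rc:

```rust
use std::cell::Cell;
use std::rc::Rc;

// Illustrative node shape only. Cell lets links be rewired through a shared
// reference with plain moves, with no runtime borrow flag to maintain.
struct Node<T> {
    value: T,
    next: Cell<Option<Rc<Node<T>>>>,
}

// Rc isn't Copy, so reading a link out of a Cell is take-then-restore today;
// this is the dance the unstable Cell::get_cloned (mentioned below) would
// collapse into a single call.
fn next_of<T>(node: &Node<T>) -> Option<Rc<Node<T>>> {
    let next = node.next.take(); // move the link out of the Cell
    let copy = next.clone();     // clone it for the caller
    node.next.set(next);         // put the original back
    copy
}
```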

Looking at the assembly, the overhead of twiddling link refs is about as minimal as one would expect. The chain of method calls, including trait methods, gets inlined down to a handful of instructions. Maybe the unstable Cell::get_cloned could knock off an instruction or two someday. In any case, the 8x overhead seems like the cost of doing business if we want ref-counted nodes.

That said, nodes very often get reused. The overhead of the actual list operations is more like 2.5x. That may be a more acceptable trade for the convenience of ref-counting.

> Maybe the allocator on Linux can be super fast (for small types?).

Rust uses glibc's allocator on Linux. Cursory investigation of glibc suggests it does contain some optimizations for "small" allocations, likely using thread-local storage with a slab-like algorithm, i.e. nearly the same thing our arena is doing. This could explain the competitive performance.

It would be good to investigate these "small" allocation optimizations more deeply and understand how they may apply in our case. I would bet most of our list nodes are at most a few hundred bytes, within the threshold of the optimized pathway.

@jkarneges jkarneges force-pushed the jkarneges/generic-list branch from 5a5811d to c236f9a on March 10, 2026 00:05
@jkarneges (Member, Author) commented:

Tweaked the API a bit further, and updated the benchmarks with the latest memorypool::Rc optimizations. Things look a lot better. This is ready to go. I've updated the initial comment.

@jkarneges jkarneges marked this pull request as ready for review March 11, 2026 19:23
@jkarneges jkarneges requested a review from a team March 11, 2026 19:23
@jkarneges jkarneges merged commit 8592292 into main Mar 13, 2026
19 checks passed
@jkarneges jkarneges deleted the jkarneges/generic-list branch March 13, 2026 17:28
