
add generic list api with slab and rc impls #48291

Merged
jkarneges merged 12 commits into main from jkarneges/generic-list on Mar 13, 2026

Conversation

@jkarneges (Member) commented Jan 9, 2026

This is a new linked list implementation that allows for a non-fixed capacity. It is added alongside the existing implementation, and a fixed-capacity variant of it is used in a couple of places to confirm it works as expected. After it is merged, we can look at actually using dynamic capacities and eventually removing the old implementation.

Background

Currently, our linked list implementation stores nodes in preallocated slabs in order to avoid heap operations at runtime. This approach is performant but requires knowing in advance how many nodes will be needed, and this isn't always easy to know. Notably, reactor registrations are kept in a linked list, and determining the number of needed registrations in the whole app pretty much requires reading the entire codebase since registrations can occur anywhere.

Approach

This PR aims to provide a linked list that has a dynamic capacity while remaining performant. It introduces a new type, RcList, capable of working with ref-counted nodes using memorypool::Rc from our core lib. memorypool::Rc is like std::rc::Rc, but it supports allocating with either the system allocator or a slab (pool). This approach lets us continue to use preallocated pools for node memory as desired, with the advantage that not every node in a list has to live in the same pool, or even in a pool at all.

RcList is naturally !Send, and there is at least one place where this would cause us trouble: connmgr's Pool (TCP connection pool), which is shared between threads. The simplest solution to that problem is to continue offering an implementation based on a single slab as well, for that specific use case. In order to avoid having multiple linked list implementations, the new list is generic over a Backend which supplies indexing and linking logic.

Notable types/impls (a rough code sketch follows the list):

  • trait Backend - Abstract API for node/link management.
  • impl<T> Backend for Slab<SlabNode<T>> - A backend using usize indexes with a single slab for node memory, implemented directly on Slab.
  • RcBackend - A backend using Rc for indexes with node memory living wherever. It is a zero-sized type.
  • struct GenericList<B: Backend> - A linked list that operates on an arbitrary backend. Note that the type does not wrap a Backend instance; instead, a mutable reference to a Backend must be passed to every method call. This allows multiple lists to share the same backend, mainly in order to share node storage.
  • struct BoundGenericList<B: Backend> - Like GenericList, but it owns a Backend instance and so it is not necessary to provide one in method calls. This is handy for when the backend instance does not need to be shared, either because only one list uses the backend or the backend is a zero-sized type.
  • type SlabList<T> = GenericList<Slab<SlabNode<T>>> - Fixed-capacity list using a single slab for storage. Essentially a drop-in substitute for the current linked list implementation.
  • type RcList<T> = BoundGenericList<RcBackend<T>> - Dynamic-capacity list using Rc.
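
To make the shape concrete, here is a minimal sketch of these pieces, with assumed names and signatures. The real trait has more methods, doubly-linked nodes, and the index-reference machinery described under "Compatibility / perf" below:

```rust
// A sketch only, not the PR's actual API. Links are singly-linked here for
// brevity; names and signatures are assumptions.
trait Backend {
    type Index: Clone;

    fn next(&self, node: &Self::Index) -> Option<Self::Index>;
    fn set_next(&mut self, node: &Self::Index, next: Option<Self::Index>);
}

// The list stores only head/tail indexes. It does not own the backend; the
// backend is passed into every call, so multiple lists can share one backend
// (and therefore share node storage).
struct GenericList<B: Backend> {
    head: Option<B::Index>,
    tail: Option<B::Index>,
}

impl<B: Backend> GenericList<B> {
    fn new() -> Self {
        Self { head: None, tail: None }
    }

    fn push_back(&mut self, backend: &mut B, node: B::Index) {
        backend.set_next(&node, None);
        match self.tail.take() {
            Some(tail) => backend.set_next(&tail, Some(node.clone())),
            None => self.head = Some(node.clone()),
        }
        self.tail = Some(node);
    }

    fn pop_front(&mut self, backend: &mut B) -> Option<B::Index> {
        let node = self.head.take()?;
        self.head = backend.next(&node);
        if self.head.is_none() {
            self.tail = None;
        }
        backend.set_next(&node, None); // detach the popped node
        Some(node)
    }
}

// BoundGenericList owns its backend, so callers don't have to thread one
// through every call. Useful when the backend isn't shared or is zero-sized.
struct BoundGenericList<B: Backend> {
    backend: B,
    list: GenericList<B>,
}

impl<B: Backend> BoundGenericList<B> {
    fn push_back(&mut self, node: B::Index) {
        self.list.push_back(&mut self.backend, node);
    }

    fn pop_front(&mut self) -> Option<B::Index> {
        self.list.pop_front(&mut self.backend)
    }
}
```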

RcList should be preferred for everything going forward, except for a couple of special cases:

  • When the list needs to be sharable from multiple threads, such as with connmgr's Pool. In that case, use SlabList.
  • When code needs to be generic over the list type, for example TimerWheel, which needs to be sharable from multiple threads in one case (Pool) but not in others (Reactor). In that case, use GenericList/BoundGenericList and let the caller provide the appropriate backend (a sketch of this pattern follows the list).
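
Under the sketch above, the generic case might look like this. The function and its logic are illustrative (a hypothetical TimerWheel-style consumer), not code from the PR:

```rust
// Hypothetical generic consumer, building on the Backend/GenericList sketch
// above: the same code runs against the slab backend (Send) or the rc
// backend (!Send); the caller chooses by supplying the backend.
fn drain<B: Backend>(list: &mut GenericList<B>, backend: &mut B) -> usize {
    let mut count = 0;
    while let Some(node) = list.pop_front(backend) {
        let _ = node; // process the entry here
        count += 1;
    }
    count
}
```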

Compatibility / perf

The new API is basically the same as the current one. RcList/SlabList don't implement Copy, but they do still implement Clone. SlabList methods take &mut Backend instead of &mut IndexMut. However, since Slab implements Backend, it is possible to keep passing &mut Slab in the same argument position.
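
Continuing the sketch above, and using the slab crate purely as a stand-in for our own Slab type, implementing Backend directly on Slab is what lets call sites keep passing &mut Slab unchanged:

```rust
use slab::Slab;

// Stand-in node type: the stored value plus its link(s). The real SlabNode
// lives in our codebase; this is illustrative only.
struct SlabNode<T> {
    value: T,
    next: Option<usize>,
}

// Backend implemented directly on Slab, so existing call sites can keep
// passing &mut Slab where they previously passed &mut impl IndexMut.
impl<T> Backend for Slab<SlabNode<T>> {
    type Index = usize;

    fn next(&self, node: &usize) -> Option<usize> {
        self[*node].next
    }

    fn set_next(&mut self, node: &usize, next: Option<usize>) {
        self[*node].next = next;
    }
}
```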

Care is taken to ensure the API doesn't require unnecessary cloning of ref-counted nodes, mainly in case we ever want to add an Arc-based backend. For example, the remove() method takes an index reference (&RcNode when using the rc-based backend) rather than an owned index.

At the same time, we don't want to have to pass a &usize when using the single slab backend as this adds unnecessary indirection. To work around that, the index reference type is made generic. For the single slab backend, the index type is usize and the index reference type is also usize, whereas for the rc-based backend, the index type is RcNode and the index reference type is &RcNode.
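
A sketch of how that can be expressed, with assumed names (the PR writes it as <Backend::Index as Index>::Ref): a generic associated type lets usize declare that its "reference" form is itself, passed by value, while an rc index declares a borrowed form:

```rust
use std::rc::Rc;

// Sketch of a generic index-reference type (assumed names). usize passes
// itself by value; an Rc-based index passes a borrow, avoiding a clone.
trait ListIndex {
    type Ref<'a>: Copy
    where
        Self: 'a;

    fn as_ref(&self) -> Self::Ref<'_>;
}

impl ListIndex for usize {
    // No indirection: the "reference" is just the usize itself.
    type Ref<'a> = usize;

    fn as_ref(&self) -> usize {
        *self
    }
}

// Toy stand-in for the rc node index (the real type is built on
// memorypool::Rc, not std::rc::Rc).
struct RcNode(Rc<()>);

impl ListIndex for RcNode {
    // Methods like remove() can take this without cloning the Rc.
    type Ref<'a> = &'a RcNode;

    fn as_ref(&self) -> &RcNode {
        self
    }
}
```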

In theory, being able to index using usize by value should enable the single slab backend to remain as performant as the current list implementation which does the same, though the generified code is a bit noisy (<Backend::Index as Index>::Ref all over the place). The benchmarks appear to support this.

Benchmarks

Benchmarks are included that do 10k pushes/pops against the various implementations. Results on Linux:

slab-push-pop-x10k      time:   [86.588 µs 86.607 µs 86.630 µs]
gen-slab-push-pop-x10k  time:   [84.222 µs 84.298 µs 84.393 µs]
mp-push-pop-x10k        time:   [238.39 µs 238.45 µs 238.52 µs]
sys-push-pop-x10k       time:   [432.94 µs 433.11 µs 433.31 µs]
pre-mp-push-pop-x10k    time:   [96.806 µs 96.829 µs 96.854 µs]
pre-sys-push-pop-x10k   time:   [101.55 µs 101.83 µs 102.31 µs]

The first benchmark is of the current implementation (single slab), and the second is of the new implementation with the single slab backend. The numbers are very close, which makes sense since they're both the same logic: with static dispatch and good inlining, they should compile down to more or less the same thing.

The rest are rc-based benches. mp- uses a memory pool and sys- uses the system allocator. With the pool, it's about 3x slower than rc-less slab. With the system allocator, it's about 5x slower. The pre- benches exclude node allocation and deallocation from the measurements, so that only the link manipulations are measured. In that case, the overhead is much smaller (~1.1x slower).

Overhead

These results mean there is a cost for lists with dynamic capacity (the rc overhead). However, it is mostly due to the memory management and not the list operations themselves, and we still retain good control over how that memory management works. Technically, when using rc'd nodes backed by a memory pool, there is no overhead from "conventionally costly operations" (e.g. heap allocations), which is usually our highest bar for performance. Therefore, the overhead of dynamic capacity should be considered acceptable when the ergonomics of dynamic capacity are desired.

Per-task memory pool opportunity

Since nodes in RcList can optionally live in different memory pools, this opens up the possibility of having per-task pools for node memory. Each task could create nodes within its own pools, even if the nodes will be added to lists shared among multiple tasks. If a task wants to create a node in a pool that's full, it can create it on the heap instead. There could be an automatic right-sizing mechanism too: if a node needs to go to the heap, it could be noted somewhere so that the next spawned task gets a larger pool.
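
To illustrate the fallback policy only, here is a toy sketch. None of these names are real API, and the right-sizing signal is purely hypothetical; the real version would build on memorypool::Rc and the task system:

```rust
// Toy sketch of pool-or-heap node creation with a right-sizing hint.
// Everything here is hypothetical.
enum Node<T> {
    Pooled(usize),  // index into the task's pool
    Heap(Box<T>),   // heap fallback when the pool is full
}

struct TaskPool<T> {
    slots: Vec<Option<T>>, // fixed-capacity task-local storage
    overflowed: bool,      // hint: size the next spawned task's pool larger
}

impl<T> TaskPool<T> {
    fn with_capacity(n: usize) -> Self {
        Self {
            slots: (0..n).map(|_| None).collect(),
            overflowed: false,
        }
    }

    fn create_node(&mut self, value: T) -> Node<T> {
        if let Some(i) = self.slots.iter().position(|s| s.is_none()) {
            self.slots[i] = Some(value);
            Node::Pooled(i)
        } else {
            // Pool is full: fall back to the heap and remember it happened.
            self.overflowed = true;
            Node::Heap(Box::new(value))
        }
    }
}
```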

@deg4uss3r (Member) commented:

Benchmarks run on an M2 Pro:

index slab list push pop 1000
                        time:   [5.4506 µs 5.4555 µs 5.4605 µs]
[...]
generic slab list push pop 1000
                        time:   [5.3215 µs 5.3258 µs 5.3308 µs]
[...]
arena rc list push pop 1000
                        time:   [34.838 µs 34.866 µs 34.898 µs]
[...]
std rc list push pop 1000
                        time:   [34.515 µs 34.548 µs 34.582 µs]

@jkarneges (Member, Author) commented:

> Maybe the rc logic could be more optimized.

There is indeed room for improvement. arena::Rc is written in mostly safe Rust and hits a RefCell for every operation, whereas std::rc::Rc uses raw pointers and has better inlining. Reworking arena::Rc to be like std's makes clone, drop-without-destruct (refs > 1), and deref operations perform similarly. I've added benchmarks using preallocated nodes to show this.

As expected, benchmarks that don't preallocate the nodes, so that new and drop-with-destruct (refs == 1) rc operations are included in the measurements, show the arena-based impl outperforming std's.

Now the results make more sense. cargo bench --bench list:

index slab list push pop 10000
                        time:   [86.453 µs 86.479 µs 86.508 µs]
[...]
list push pop 10000
                        time:   [81.789 µs 82.121 µs 82.535 µs]
[...]
arena rc list push pop 10000
                        time:   [766.01 µs 766.56 µs 767.26 µs]
[...]
std rc list push pop 10000
                        time:   [875.06 µs 875.56 µs 876.14 µs]
[...]
arena rc list push pop 10000 (preallocated nodes)
                        time:   [208.33 µs 208.37 µs 208.42 µs]
[...]
std rc list push pop 10000 (preallocated nodes)
                        time:   [208.98 µs 209.02 µs 209.06 µs]
[...]

I also switched to using Cell instead of RefCell for node links. I wasn't able to measure any improvement from this, but it's better for the optimizer: Cell ops are always zero cost, whereas eliding RefCell's borrow tracking requires the optimizer to see the full context.
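
For illustration, here is roughly what Cell-based links look like. The node shape is assumed, and the real links hold memorypool::Rc indexes rather than std::rc::Rc:

```rust
use std::cell::Cell;
use std::rc::Rc;

// Illustrative node shape only. Cell lets links be rewired through a shared
// reference with plain moves, with no runtime borrow flag to maintain.
struct Node<T> {
    value: T,
    next: Cell<Option<Rc<Node<T>>>>,
}

// Rc isn't Copy, so reading a link out of a Cell is take-then-restore today;
// this is the dance the unstable Cell::get_cloned (mentioned below) would
// collapse into a single call.
fn next_of<T>(node: &Node<T>) -> Option<Rc<Node<T>>> {
    let next = node.next.take(); // move the link out of the Cell
    let copy = next.clone();     // clone it for the caller
    node.next.set(next);         // put the original back
    copy
}
```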

Looking at the assembly, the overhead of twiddling link refs is about as minimal as one would expect. The chain of method calls, including trait methods, gets inlined down to a handful of instructions. Maybe the unstable Cell::get_cloned could knock off an instruction or two someday. In any case, the 8x overhead seems like the cost of doing business if we want ref-counted nodes.

That said, nodes very often get reused. The overhead of the actual list operations is more like 2.5x. That may be a more acceptable trade for the convenience of ref-counting.

> Maybe the allocator on Linux can be super fast (for small types?).

Rust uses glibc's allocator on Linux. Cursory investigation of glibc suggests it does contain some optimizations for "small" allocations, likely using thread-local storage with a slab-like algorithm, i.e. nearly the same thing our arena is doing. This could explain the competitive performance.

It would be good to investigate these "small" allocation optimizations more deeply and understand how they may apply in our case. I would bet most of our list nodes are at most a few hundred bytes, within the threshold of the optimized pathway.

@jkarneges jkarneges force-pushed the jkarneges/generic-list branch from 5a5811d to c236f9a on March 10, 2026 00:05
@jkarneges (Member, Author) commented:

Tweaked the API a bit further, and updated the benchmarks with the latest memorypool::Rc optimizations. Things look a lot better. This is ready to go. I've updated the initial comment.

@jkarneges jkarneges marked this pull request as ready for review March 11, 2026 19:23
@jkarneges jkarneges requested a review from a team March 11, 2026 19:23
@jkarneges jkarneges merged commit 8592292 into main Mar 13, 2026
19 checks passed
@jkarneges jkarneges deleted the jkarneges/generic-list branch March 13, 2026 17:28
