add generic list api with slab and rc impls #48291
Conversation
Benchmarks ran on an M2 Pro.
There is indeed room for improvement. As expected, list operations that don't preallocate the nodes, such that new and drop-with-destruct (refs == 1) rc operations are included in the measurements, show the arena-based impl outperforming. Now the results make more sense. I also switched to using

Looking at the assembly, the overhead of twiddling link refs is about as minimal as one would expect. The chain of method calls, including trait methods, gets inlined down to a handful of instructions. Maybe the unstable `Cell::get_cloned` could knock off an instruction or two someday. In any case, the 8x overhead seems like the cost of doing business if we want ref-counted nodes.

That said, nodes very often get reused, and the overhead of the actual list operations alone is more like 2.5x. That may be a more acceptable trade for the convenience of ref-counting.
By default, Rust uses the system allocator, which on Linux is glibc's malloc. Cursory investigation of glibc suggests it does contain some optimizations for "small" allocations, likely using thread-local storage with a slab-like algorithm, i.e. nearly the same thing our arena is doing. This could explain the competitive performance. It would be good to investigate these "small" allocation optimizations more deeply and understand how they may apply in our case. I would bet most of our list nodes are at most a few hundred bytes, within the threshold of the optimized pathway.
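One quick way to sanity-check the "nodes are small" bet is to just print the sizes of our node types with `size_of`. A sketch, with a purely hypothetical node layout (the real node types in the PR will differ); glibc's thread-local fast path (tcache) covers small allocations, on the order of 1 KiB by default:

```rust
use std::mem::size_of;

// Hypothetical stand-in for a ref-counted list node's fixed overhead.
// The field layout is illustrative only, not the PR's actual types.
struct RcNodeHeader {
    strong: usize,            // refcount word
    next: Option<*const ()>,  // forward link
    prev: Option<*const ()>,  // backward link
}

fn main() {
    // If this (plus the payload) stays well under ~1 KiB, allocations should
    // remain on glibc's thread-local fast path.
    println!("node header: {} bytes", size_of::<RcNodeHeader>());
    assert!(size_of::<RcNodeHeader>() < 1024);
}
```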
Tweaked the API a bit further and updated the benchmarks with the latest.
This is a new linked list implementation that allows for a non-fixed capacity. It is added alongside the existing implementation, and a fixed-capacity variant of it is used in a couple of places to confirm it works as expected. Once merged, we can look at actually using dynamic capacities and eventually removing the old implementation.
Background
Currently, our linked list implementation stores nodes in preallocated slabs in order to avoid heap operations at runtime. This approach is performant but requires knowing in advance how many nodes will be needed, and this isn't always easy to know. Notably, reactor registrations are kept in a linked list, and determining the number of needed registrations in the whole app pretty much requires reading the entire codebase since registrations can occur anywhere.
Approach
This PR aims to provide a linked list that has a dynamic capacity while remaining performant. It introduces a new type, `RcList`, capable of working with ref-counted nodes using `memorypool::Rc` from our core lib. `memorypool::Rc` is like `std::rc::Rc` but it supports allocating with either the system allocator or a slab (pool). This approach lets us continue to use preallocated pools for node memory as desired, with the advantage that not every node in a list has to live in the same pool, nor live in a pool at all.

`RcList` is naturally `!Send`, and there is at least one place where this would cause us trouble: connmgr's `Pool` (TCP connection pool), which is shared between threads. The simplest solution to that problem is to continue offering an implementation based on a single slab as well, for that specific use-case. In order to avoid having multiple linked list implementations, the new list is generic over a `Backend` which supplies indexing and linking logic.

Notable types/impls:

- `trait Backend` - Abstract API for node/link management.
- `impl<T> Backend for Slab<SlabNode<T>>` - A backend using `usize` indexes with a single slab for node memory, implemented directly on `Slab`.
- `RcBackend` - A backend using `Rc` for indexes, with node memory living wherever. It is a zero-sized type.
- `struct GenericList<B: Backend>` - A linked list that operates on an arbitrary backend. Note that the type does not wrap a `Backend` instance; instead, a mutable reference to a `Backend` must be passed in every method call. This allows multiple lists to share the same backend, mainly in order to share node storage.
- `struct BoundGenericList<B: Backend>` - Like `GenericList`, but it owns a `Backend` instance, so it is not necessary to provide one in method calls. This is handy when the backend instance does not need to be shared, either because only one list uses the backend or because the backend is a zero-sized type.
- `type SlabList<T> = GenericList<Slab<SlabNode<T>>>` - Fixed-capacity list using a single slab for storage. Essentially a drop-in substitute for the current linked list implementation.
- `type RcList<T> = BoundGenericList<RcBackend<T>>` - Dynamic-capacity list using `Rc`.

`RcList` should be preferred for everything going forward, except for a couple of special cases:

- `Pool`. In that case, use `SlabList`.
- `TimerWheel`, which needs to be sharable from multiple threads in one case (in `Pool`) but not others (`Reactor`). In that case, use `GenericList`/`BoundGenericList` and let the caller provide the appropriate backend.

Compatibility / perf
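To make the shape of the abstraction concrete, here is a minimal sketch of what a `Backend` trait and a storage-free `GenericList` could look like. This is a simplified singly-linked illustration with hypothetical method names, not the PR's actual API; the real trait also covers the index-reference genericity discussed below.

```rust
// A backend knows how to create nodes, read/write their links, and destroy
// them. The list itself stores no nodes, only a head index.
trait Backend {
    type Value;
    type Index: Clone;

    fn insert(&mut self, value: Self::Value) -> Self::Index;
    fn remove(&mut self, idx: &Self::Index) -> Self::Value;
    fn next(&self, idx: &Self::Index) -> Option<Self::Index>;
    fn set_next(&mut self, idx: &Self::Index, next: Option<Self::Index>);
}

// A minimal slab-style backend: nodes live in a Vec, indexed by usize,
// with freed slots recycled through a free list.
struct SlabNode<T> {
    value: Option<T>,
    next: Option<usize>,
}

struct Slab<T> {
    nodes: Vec<SlabNode<T>>,
    free: Vec<usize>,
}

impl<T> Slab<T> {
    fn new() -> Self {
        Slab { nodes: Vec::new(), free: Vec::new() }
    }
}

impl<T> Backend for Slab<T> {
    type Value = T;
    type Index = usize;

    fn insert(&mut self, value: T) -> usize {
        match self.free.pop() {
            Some(i) => {
                self.nodes[i] = SlabNode { value: Some(value), next: None };
                i
            }
            None => {
                self.nodes.push(SlabNode { value: Some(value), next: None });
                self.nodes.len() - 1
            }
        }
    }
    fn remove(&mut self, idx: &usize) -> T {
        self.free.push(*idx);
        self.nodes[*idx].value.take().unwrap()
    }
    fn next(&self, idx: &usize) -> Option<usize> {
        self.nodes[*idx].next
    }
    fn set_next(&mut self, idx: &usize, next: Option<usize>) {
        self.nodes[*idx].next = next;
    }
}

// The list holds only a head index; the backend is passed into every call,
// so multiple lists can share one backend (and thus one slab).
struct GenericList<B: Backend> {
    head: Option<B::Index>,
}

impl<B: Backend> GenericList<B> {
    fn new() -> Self {
        GenericList { head: None }
    }
    fn push_front(&mut self, b: &mut B, value: B::Value) -> B::Index {
        let idx = b.insert(value);
        b.set_next(&idx, self.head.take());
        self.head = Some(idx.clone());
        idx
    }
    fn pop_front(&mut self, b: &mut B) -> Option<B::Value> {
        let idx = self.head.take()?;
        self.head = b.next(&idx);
        Some(b.remove(&idx))
    }
}

fn main() {
    let mut slab: Slab<u32> = Slab::new();
    let mut list: GenericList<Slab<u32>> = GenericList::new();
    list.push_front(&mut slab, 1);
    list.push_front(&mut slab, 2);
    assert_eq!(list.pop_front(&mut slab), Some(2));
    assert_eq!(list.pop_front(&mut slab), Some(1));
    assert_eq!(list.pop_front(&mut slab), None);
}
```

An `Rc`-based backend would implement the same trait with `Index = RcNode`, leaving the list logic untouched.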
The new API is basically the same as the current one.
`RcList`/`SlabList` don't implement `Copy`, but they do still implement `Clone`. `SlabList` methods take `&mut Backend` instead of `&mut IndexMut`. However, since `Slab` implements `Backend`, it is possible to keep passing `&mut Slab` in the same argument position.

Care is taken to ensure the API doesn't require unnecessary cloning of ref-counted nodes, mainly in case we ever want to add an `Arc`-based backend. For example, the `remove()` method takes an index reference (`&RcNode` when using the rc-based backend) rather than an owned index.

At the same time, we don't want to have to pass a `&usize` when using the single slab backend, as this adds unnecessary indirection. To work around that, the index reference type is made generic. For the single slab backend, the index type is `usize` and the index reference type is also `usize`, whereas for the rc-based backend, the index type is `RcNode` and the index reference type is `&RcNode`.

In theory, being able to index using `usize` by value should enable the single slab backend to remain as performant as the current list implementation, which does the same, though the generified code is a bit noisy (`<Backend::Index as Index>::Ref` all over the place). The benchmarks appear to support this.

Benchmarks
Benchmarks are included that do 10k pushes/pops against the various implementations. Results on Linux:
The first benchmark is of the current implementation (single slab) and the second is of the new implementation with the single slab backend; the numbers are very close. This makes sense since they're both the same logic: with static dispatch and good inlining, they should compile down to more or less the same thing.
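For reference, the kind of measurement described above can be sketched with a bare `Instant`-based loop. This is not the PR's actual bench harness (names and structure are hypothetical), and it uses a preallocated `Vec` as a stand-in for the slab-backed list; a real harness would warm up and average many iterations:

```rust
use std::time::{Duration, Instant};

// Time a single closure invocation. A real harness would repeat and average.
fn bench<F: FnMut()>(label: &str, mut op: F) -> Duration {
    let start = Instant::now();
    op();
    let elapsed = start.elapsed();
    println!("{}: {:?}", label, elapsed);
    elapsed
}

// 10k pushes followed by 10k pops, with storage preallocated up front
// (slab-like: no allocator calls inside the measured loop).
fn pushes_pops(n: usize) {
    let mut v: Vec<u64> = Vec::with_capacity(n);
    for i in 0..n as u64 {
        v.push(i);
    }
    for _ in 0..n {
        v.pop();
    }
}

fn main() {
    bench("vec 10k push/pop", || pushes_pops(10_000));
}
```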
The rest are rc-based benches.
`mp-` uses a memory pool and `sys-` uses the system allocator. With the pool, it's about 3x slower than the rc-less slab. With the system allocator, it's about 5x slower. The `pre-` benches exclude node allocation and deallocation from the measurements, so that only the link manipulations are measured. In that case, the overhead is much smaller (~1.1x slower).

Overhead
These results mean there is a cost for lists with dynamic capacity (the rc overhead). However, it is mostly due to the memory management and not the list operations themselves, and we still retain good control over how that memory management works. Technically, when using rc'd nodes backed by a memory pool there is no overhead from "conventionally costly operations", which is usually our highest bar for performance. Therefore, the overhead of dynamic capacity should be considered acceptable when the ergonomics of dynamic capacity are desired.
Per-task memory pool opportunity
Since nodes in `RcList` can optionally live in different memory pools, this opens up the possibility of having per-task pools for node memory. Each task could create nodes within its own pools, even if the nodes will be added to lists shared among multiple tasks. If a task wants to create a node in a pool that's full, it can create it on the heap instead. There could be an automatic right-sizing mechanism too: if a node needs to go to the heap, it could be noted somewhere that the next spawned task should have a larger pool.
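The heap-fallback and right-sizing idea above can be sketched as follows. Everything here is hypothetical (a toy pool, not `memorypool`): allocation tries the task's free list first, spills to a `Box` when the pool is full, and counts overflows so a future task could be given a bigger pool.

```rust
// Where a node's memory ended up: in the task's pool, or on the heap.
enum NodeBox<T> {
    Pooled(usize),  // slot index in the task's pool
    Heap(Box<T>),   // fallback when the pool is full
}

struct TaskPool<T> {
    slots: Vec<Option<T>>,
    free: Vec<usize>,
    overflowed: usize, // nodes that spilled to the heap; hint for right-sizing
}

impl<T> TaskPool<T> {
    fn with_capacity(n: usize) -> Self {
        TaskPool {
            slots: (0..n).map(|_| None).collect(),
            free: (0..n).collect(),
            overflowed: 0,
        }
    }

    // Pool-first allocation with heap fallback.
    fn alloc(&mut self, value: T) -> NodeBox<T> {
        match self.free.pop() {
            Some(i) => {
                self.slots[i] = Some(value);
                NodeBox::Pooled(i)
            }
            None => {
                // Note the overflow: the next spawned task could get a
                // larger pool based on this count.
                self.overflowed += 1;
                NodeBox::Heap(Box::new(value))
            }
        }
    }

    // Suggested capacity for the next task's pool.
    fn suggested_capacity(&self) -> usize {
        self.slots.len() + self.overflowed
    }
}

fn main() {
    let mut pool: TaskPool<u64> = TaskPool::with_capacity(2);
    let a = pool.alloc(1);
    let b = pool.alloc(2);
    let c = pool.alloc(3); // pool is full, so this one lands on the heap
    assert!(matches!(a, NodeBox::Pooled(_)));
    assert!(matches!(b, NodeBox::Pooled(_)));
    assert!(matches!(c, NodeBox::Heap(_)));
    assert_eq!(pool.suggested_capacity(), 3);
}
```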