1 change: 1 addition & 0 deletions .gitignore
@@ -27,3 +27,4 @@

# rust target
/target
PLAN.md
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -548,6 +548,7 @@ if(NOT SNMALLOC_HEADER_ONLY_LIBRARY)
# These are mitigation-independent and can be compiled once, then linked
# against both fast and check testlib variants.
set(TESTLIB_ONLY_TESTS
aligned_dealloc arena arenabins largearenarange smallarenarange
bits first_operation memory memory_usage multi_atexit multi_threadatexit
redblack statistics teardown
contention external_pointer large_alloc lotsofthreads post_teardown
43 changes: 19 additions & 24 deletions docs/AddressSpace.md
@@ -26,14 +26,14 @@ For simplicity, we gloss over much of the "lazy initialization" that would actua
Because the two exercise similar bits of machinery, we now track them in parallel in prose despite their sequential nature.

4. The `BackendAllocator` has a chain of "range" types that it uses to manage address space.
By default (and in the case we are considering), that chain begins with a per-thread "small buddy allocator range".
By default (and in the case we are considering), that chain begins with a per-thread *small arena range*.

1. For the metadata allocation, the size is (well) below `MIN_CHUNK_SIZE` and so this allocator, which by supposition is empty, attempts to `refill` itself from its parent.
This results in a request for a `MIN_CHUNK_SIZE` chunk from the parent allocator.

2. For the chunk allocation, the size is `MIN_CHUNK_SIZE` or larger, so this allocator immediately forwards the request to its parent.

5. The next range allocator in the chain is a per-thread *large* buddy allocator that refills in 2 MiB granules.
5. The next range allocator in the chain is a per-thread `LargeArenaRange` that refills in 2 MiB granules.
(2 MiB chosen because it is a typical superpage size.)
At this point, both requests are for at least one and no more than a few times `MIN_CHUNK_SIZE` bytes.

@@ -48,7 +48,7 @@ For simplicity, we gloss over much of the "lazy initialization" that would actua
8. The next entry in the chain is a `StatsRange` which serves to accumulate statistics.
We ignore this stage and continue onwards.

9. The next entry in the chain is another *large* buddy allocator which refills at 16 MiB but can hold regions
9. The next entry in the chain is another `LargeArenaRange` which refills at 16 MiB but can hold regions
of any size up to the entire address space.
The first request triggers a `refill`, continuing along the chain as a 16 MiB request.
(Recall that the second allocation will be handled at an earlier point on the chain.)
@@ -61,15 +61,15 @@ For simplicity, we gloss over much of the "lazy initialization" that would actua
12. Having wound the chain onto our stack, we now unwind!
The `PagemapRegisterRange` ensures that the Pagemap entries for allocations passing through it are mapped and returns the allocation unaltered.

13. The global large buddy allocator splits the 16 MiB refill into 8, 4, and 2 MiB regions it retains as well as returning the remaining 2 MiB back along the chain.
13. The global `LargeArenaRange` carves the request out of its 16 MiB refill and keeps the unused remainder as a single free block in its internal red-black trees of free ranges, returning the carved portion back along the chain.

14. The `StatsRange` makes its observations, the `GlobalRange` now unlocks the global component of the chain, and the `CommitRange` ensures that the allocation is mapped.
Aside from these side effects, these propagate the allocation along the chain unaltered.

15. We now arrive back at the thread-local large buddy allocator, which takes its 2 MiB refill and breaks it down into powers of two down to the requested `MIN_CHUNK_SIZE`.
The second allocation (of the chunk), will either return or again break down one of these intermediate chunks.
15. We now arrive back at the thread-local `LargeArenaRange`, which takes its 2 MiB refill and carves out the requested chunk(s); the unused remainder stays in its free-range trees.
The second allocation (of the chunk) will either be satisfied from this leftover or trigger another carve.

16. For the first (metadata) allocation, the thread-local *small* allocator breaks the `MIN_CHUNK_SIZE` allocation down into powers of two down to `PAGEMAP_METADATA_STRUCT_SIZE` and returns one of that size.
16. For the first (metadata) allocation, the thread-local *small arena range* takes its `MIN_CHUNK_SIZE` refill, hands back a sub-chunk fragment large enough for `PAGEMAP_METADATA_STRUCT_SIZE`, and tracks the remainder as free sub-chunk space using tree nodes stored inside the free fragments themselves.
The second allocation will have been forwarded and so is not additionally handled here.
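
The refill behaviour traced in the steps above can be sketched as a toy model. This is an illustration only, not snmalloc's range code: it counts bytes rather than tracking addresses, it omits the intermediate `CommitRange`, `GlobalRange`, `StatsRange` and `PagemapRegisterRange` stages, and the names `OsSource`, `RefillingRange` and `SmallRange` are hypothetical. It also assumes `MIN_CHUNK_SIZE` is 16 KiB.

```cpp
#include <cassert>
#include <cstddef>

// Assumed constants, for illustration only.
const std::size_t MIN_CHUNK_SIZE = std::size_t(16) << 10; // 16 KiB (assumption)
const std::size_t LOCAL_REFILL = std::size_t(2) << 20;    // 2 MiB granule
const std::size_t GLOBAL_REFILL = std::size_t(16) << 20;  // 16 MiB granule

// Hypothetical end of the chain: stands in for the platform's address space.
struct OsSource
{
  std::size_t requests = 0;
  void alloc(std::size_t) { requests++; }
};

// Hypothetical range that refills from its parent in fixed granules and
// serves later requests from whatever it has left over.
template<typename Parent>
struct RefillingRange
{
  Parent& parent;
  std::size_t granule;
  std::size_t available;
  std::size_t refills;

  RefillingRange(Parent& p, std::size_t g)
  : parent(p), granule(g), available(0), refills(0)
  {}

  void alloc(std::size_t size)
  {
    if (size > available)
    {
      // Ask the parent for at least one granule.
      std::size_t ask = size > granule ? size : granule;
      parent.alloc(ask);
      available += ask;
      refills++;
    }
    available -= size;
  }
};

// Hypothetical small range: forwards chunk-sized requests to its parent and
// refills itself (one MIN_CHUNK_SIZE at a time) for anything smaller.
template<typename Parent>
struct SmallRange
{
  RefillingRange<Parent> inner;
  std::size_t forwarded;

  explicit SmallRange(Parent& p) : inner(p, MIN_CHUNK_SIZE), forwarded(0) {}

  void alloc(std::size_t size)
  {
    if (size >= MIN_CHUNK_SIZE)
    {
      forwarded++;
      inner.parent.alloc(size);
    }
    else
    {
      inner.alloc(size);
    }
  }
};
```

Chaining `SmallRange` over a 2 MiB `RefillingRange` over a 16 MiB `RefillingRange` over `OsSource`, a small metadata request followed by a `MIN_CHUNK_SIZE` request reaches the end of the chain only once; the second request is satisfied from the thread-local leftover.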

Exciting, no?
@@ -98,26 +98,19 @@ For chunks owned by the *frontend* (`REMOTE_BACKEND_MARKER` not asserted),

2. A bit (`META_BOUNDARY_BIT`) that serves to limit chunk coalescing on platforms where that may not be possible, such as CHERI.

See `src/backend/metatypes.h` and `src/mem/metaslab.h`.
See `src/snmalloc/mem/metadata.h`.

For chunks owned by a *backend* (`REMOTE_BACKEND_MARKER` asserted), there are again multiple possibilities.

For chunks owned by a *small buddy allocator*, the remainder of the `MetaEntry` is zero.
For chunks owned by a *small arena range* (`SmallArenaRange`), the remainder of the `MetaEntry` is zero.
That is, it appears to have small sizeclass 0 and an implausible `RemoteAllocator*`.
The free-fragment tree itself is stored in-band, inside the free space of the chunk, rather than in the pagemap (see `InplaceRep` in `src/snmalloc/backend_helpers/inplacerep.h`).

For chunks owned by a *large buddy allocator*, the `MetaEntry` is instead a node in a red-black tree of all such chunks.
Its contents can be decoded as follows:
For chunks owned by a `LargeArenaRange`, the `MetaEntry` is instead a node in the red-black trees of free ranges.
A free block of *N* units consumes the `MetaEntry`s of its first *min(N, 3)* unit-aligned addresses; their words encode the bin-tree node (unit 0), the range-tree node (unit 1, for blocks of two or more units), and the large-chunk count (unit 2, for blocks of three or more units).
The pagemap reserves the low `MetaEntryBase::BACKEND_LAYOUT_FIRST_FREE_BIT` bits of each word for the meta-entry layout itself; the tree-node encoding (left/right pointers, red bit, variant tag, large-size count) lives at or above that bit.

1. The `meta` field's `META_BOUNDARY_BIT` is preserved, with the same meaning as in the frontend case, above.

2. `meta` (resp. `remote_and_sizeclass`) includes a pointer to the left (resp. right) *chunk* of address space.
(The corresponding child *node* in this tree is found by taking the *address* of this chunk and looking up the `MetaEntry` in the Pagemap.
This trick of pointing at the child's chunk rather than at the child `MetaEntry` is particularly useful on CHERI:
it allows us to capture the authority to the chunk without needing another pointer and costs just a shift and add.)

3. The `meta` field's `LargeBuddyRep::RED_BIT` is used to carry the red/black color of this node.

See `src/backend/largebuddyrange.h`.
See `PagemapRep` in `src/snmalloc/backend_helpers/largearenarange.h`.

### Encoding a MetaEntry

@@ -131,18 +124,20 @@ The following cases apply:
* has "small" sizeclass 0, which has size 0.
* has no associated metadata structure.

2. The address is part of a free chunk in a backend's Large Buddy Allocator:
2. The address is part of a free chunk in a backend `LargeArenaRange`:
The `MetaEntry`...
* has `REMOTE_BACKEND_MARKER` asserted in `remote_and_sizeclass`.
* has "small" sizeclass 0, which has size 0.
* the remainder of its `MetaEntry` structure will be a Large Buddy Allocator rbtree node.
* the remainder of its `MetaEntry` structure (and those of the next one or two unit-aligned `MetaEntry`s if the free block spans them) carries the `Arena`'s red-black-tree node encoding.
* has no associated metadata structure.

3. The address is part of a free chunk inside a backend's Small Buddy Allocator:
3. The address is part of a free fragment inside a backend `SmallArenaRange`:
Here, the `MetaEntry` is zero aside from the asserted `REMOTE_BACKEND_MARKER` bit, and so it...
* has "small" sizeclass 0, which has size 0.
* has no associated metadata structure.

The tree of free sub-chunk fragments for this chunk is stored inside the free fragments themselves (`InplaceRep`), not in the pagemap.

4. The address is part of a live large allocation (spanning one or more 16KiB chunks):
Here, the `MetaEntry`...
* has `REMOTE_BACKEND_MARKER` clear in `remote_and_sizeclass`.
152 changes: 152 additions & 0 deletions docs/Arena.md
@@ -0,0 +1,152 @@
# The Arena: A Bitmap-Indexed Coalescing Range

`Arena` is snmalloc's address-space range that stores free blocks at their
**natural** size — no power-of-two rounding — and serves any request from the
full snmalloc size-class sequence. It sits in the per-thread range pipeline
underneath the slab caches and replaces the historical buddy-based ranges.

This document is the conceptual introduction. For where `Arena` plugs into
the wider range chain, see [`AddressSpace.md`](AddressSpace.md).

## The problem

A buddy allocator only stores power-of-two blocks. A request for 5 chunks
must be served from an 8-chunk buddy block, wasting 3 chunks. We wanted a
range that

* stores blocks at their actual size,
* uses snmalloc's full `(exponent, mantissa)` size-class sequence at the
range level, and
* still answers "find a block that can serve this request" in O(1).

## The core idea: search upward, mask out exceptions

Free blocks are binned by the *set of size classes they can serve*, the
**servable set**. To allocate, walk a per-arena non-empty-bins bitmap
upward through the bins; any larger block can be carved down. This works
in all but one case: alignment. Some bins hold blocks whose address
alignment is too poor to serve certain smaller, *more* aligned size
classes. Those bins must be excluded from the search for those requests.

The implementation builds the per-request filter *positively* as a **serve
mask** — bit `k` set means bin `k` can serve this request — and the lookup
is `find_first_set(bitmap & serve_mask, start_word)`. The serve mask
depends only on the requested size class, not on the block, so it is
precomputed at compile time.

(The original sketch of this design used the equivalent inverse framing of
a "skip mask" with `bitmap & ~skip_mask`; see `arenabins.h` for the
in-tree explanation of why positive is preferred.)
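
A minimal sketch of the lookup, assuming the whole bitmap fits in a single 64-bit word (the `start_word` argument above suggests the real bitmap spans several words). The helper names `lowest_set_bit`, `pick_bin` and `pick_bin_skip` are hypothetical and are not taken from `arenabins.h`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: index of the lowest set bit, or -1 if the word is 0.
int lowest_set_bit(std::uint64_t word)
{
  for (int i = 0; i < 64; i++)
  {
    if ((word >> i) & 1u)
      return i;
  }
  return -1;
}

// Positive framing: bit k of serve_mask set means bin k can serve the request.
// Returns the lowest bin that is both non-empty and able to serve, or -1.
int pick_bin(std::uint64_t non_empty_bins, std::uint64_t serve_mask)
{
  return lowest_set_bit(non_empty_bins & serve_mask);
}

// Inverse framing: bit k of skip_mask set means bin k must NOT be used.
int pick_bin_skip(std::uint64_t non_empty_bins, std::uint64_t skip_mask)
{
  return lowest_set_bit(non_empty_bins & ~skip_mask);
}
```

The two framings agree whenever `skip_mask == ~serve_mask`; the positive form simply avoids the extra negation at lookup time.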

## Why the exceptions exist

snmalloc's size classes follow `S = 2^e + m · 2^(e−B)`, where `B` is the
mantissa-bit width (`INTERMEDIATE_BITS`, 2 in production). Each size class
has a natural alignment `align(S) = S & -S`.
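
The formula and the alignment rule can be written out directly. This is a sketch with hypothetical function names, assuming `e >= B` so that every size class is a whole number of units.

```cpp
#include <cassert>
#include <cstddef>

// S = 2^e + m * 2^(e - B), for mantissa 0 <= m < 2^B.
// Assumes e >= B so the result is integral.
std::size_t size_class(unsigned e, unsigned m, unsigned B)
{
  assert(e >= B && m < (1u << B));
  return (std::size_t(1) << e) + (std::size_t(m) << (e - B));
}

// Natural alignment: the lowest set bit of S.
// (~s + 1) is the two's-complement negation, so this is S & -S.
std::size_t natural_align(std::size_t s)
{
  assert(s != 0);
  return s & (~s + 1);
}
```

With `B = 2` and `e = 2` this yields the size classes 4, 5, 6, 7 with natural alignments 4, 1, 2, 1.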

A size class with high alignment needs padding to reach an aligned address
within a block. A block of a *larger* size class with *lower* alignment may
not have room for that padding. Concretely: a block of size 5 at address 1
covers addresses 1 through 5. It can serve size 5 (alignment 1) but cannot
serve size 4 (alignment 4): the first 4-aligned address is 4, which leaves
only 2 units before the block ends.

The same size-5 block at address 0 could serve both sizes: same block
size, different address, different servable set. This is why each distinct
servable set needs its own bin.
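
The check behind that example can be sketched as follows. The function `can_serve` is a hypothetical name, and the model (pad up to the request's natural alignment, then see whether the request still fits) is an assumption drawn from the description above rather than from the in-tree classifier.

```cpp
#include <cassert>
#include <cstddef>

// Natural alignment of a size class: its lowest set bit (S & -S).
std::size_t natural_align(std::size_t s)
{
  assert(s != 0);
  return s & (~s + 1);
}

// True if the block [addr, addr + block_size) can hold `request` units
// starting at an address aligned to the request's natural alignment.
bool can_serve(std::size_t addr, std::size_t block_size, std::size_t request)
{
  std::size_t align = natural_align(request);
  // Round addr up to the next multiple of align (align is a power of two).
  std::size_t start = (addr + align - 1) & ~(align - 1);
  return start + request <= addr + block_size;
}
```

For the block of size 5 at address 1, `can_serve(1, 5, 5)` holds but `can_serve(1, 5, 4)` does not; moving the same block to address 0 makes both hold.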

## Bin count grows slowly in B

At each exponent, the distinct servable sets are enumerated exhaustively:

| B | Mantissas/exponent | Bins/exponent | Max mask bits |
|---|-------------------:|--------------:|--------------:|
| 1 | 2 | 2 | 0 |
| 2 | 4 | 5 | 1 |
| 3 | 8 | 13 | 4 |
| 4 | 16 | 34 | 11 |

Most requests need no exceptions at all. Only size classes whose alignment
exceeds the expected alignment for their position in the sequence have any
bits to mask. The whole structure is constant-folded into a few small tables.
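
One way to reproduce the "Bins/exponent" column is a brute-force enumeration. The sketch below is one reading of "enumerated exhaustively", not the in-tree algorithm: it assumes a block serves a size class when the class fits after padding to its natural alignment, and it considers every block size in `[2^e, 2^(e+1))` at every start address modulo `2^e`. The names `fits` and `bins_per_exponent` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// True if a block [addr, addr + size) has room for `request` units placed at
// the request's natural alignment (its lowest set bit).
bool fits(std::size_t addr, std::size_t size, std::size_t request)
{
  std::size_t align = request & (~request + 1);
  std::size_t start = (addr + align - 1) & ~(align - 1);
  return start + request <= addr + size;
}

// Counts the distinct non-empty servable sets at exponent e for mantissa
// width B. Each set is recorded as a bitmask over the 2^B mantissas.
std::size_t bins_per_exponent(unsigned B, unsigned e)
{
  assert(B <= 4 && e >= B && e < 16);
  const std::size_t base = std::size_t(1) << e;       // smallest class, 2^e
  const std::size_t step = std::size_t(1) << (e - B); // gap between classes
  std::set<std::uint32_t> servable_sets;
  for (std::size_t size = base; size < 2 * base; size++)
  {
    for (std::size_t addr = 0; addr < base; addr++)
    {
      std::uint32_t served = 0;
      for (unsigned m = 0; m < (1u << B); m++)
      {
        if (fits(addr, size, base + m * step))
          served |= 1u << m;
      }
      if (served != 0)
        servable_sets.insert(served);
    }
  }
  return servable_sets.size();
}
```

Under this model `bins_per_exponent(1, 1)` gives 2, `bins_per_exponent(2, 2)` gives 5 and `bins_per_exponent(3, 3)` gives 13, matching the first three rows of the table.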

## The two-tree structure

A bitmap alone is not enough — when a bin is non-empty, the arena still has
to *retrieve* and *coalesce* blocks. Each `Arena` therefore maintains:

* **One red-black tree per non-empty bin** (the "bin trees"), keyed by
block address, giving O(log n) selection within a bin. The non-empty-bins
bitmap is the index over these trees.

* **One red-black tree of all free blocks** (the "range tree"), keyed by
address, used to find a block's left/right neighbours for coalescing on
free.

On allocation: bitmap lookup → choose the bin → pop a block from its
bin tree → `carve` returns pre-pad / aligned request / post-pad → pre and
post (if any) re-enter the arena via the bin and range trees.

On free: range tree lookup → coalesce with neighbours if their tags allow
→ insert the resulting (possibly merged) block.
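
The carve step on the allocation path can be sketched like this. The struct `Carved` and the signature of `carve` are hypothetical (the real primitive lives in `arenabins.h` and may differ); the sketch assumes the request is placed at the first address aligned to its natural alignment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: the three pieces a carve splits a block into.
struct Carved
{
  std::size_t pre_size;  // padding before the aligned request (may be 0)
  std::size_t start;     // address handed to the caller
  std::size_t post_size; // space left after the request (may be 0)
};

// Splits the free block [addr, addr + block_size) into pre-pad, aligned
// request, and post-pad. The caller must have checked that the request fits.
Carved carve(std::size_t addr, std::size_t block_size, std::size_t request)
{
  std::size_t align = request & (~request + 1); // natural alignment, S & -S
  std::size_t start = (addr + align - 1) & ~(align - 1);
  assert(start + request <= addr + block_size);
  Carved result;
  result.pre_size = start - addr;
  result.start = start;
  result.post_size = (addr + block_size) - (start + request);
  return result;
}
```

Carving a request of 4 units out of a 12-unit block at address 1 returns address 4, with 3 units of pre-pad and 5 units of post-pad; both leftovers would then be reinserted as free blocks.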

## Two variants over the same Arena

`Arena` is parameterised by a **Rep** (representation) that decides where
the per-block tree-node state lives. Two reps ship today:

* **`PagemapRep`** — node state lives in the pagemap entry that already
covers the block. Used by **`LargeArenaRange`**, which manages whole
chunks and larger. Node access is a pagemap lookup; no in-band space is
consumed.

* **`InplaceRep`** — node state lives *in the free block itself*, in the
first units. Used by **`SmallArenaRange`**, which manages sub-chunk
metadata fragments where no pagemap entry exists for the fragment. The
layout packs the bin tree pointers, the range tree pointers, and (for
blocks ≥ 3 units) a large-size word into the leading units of the free
block. Unit size is `next_pow2(2 · sizeof(CapPtr))` — 16 B without
CHERI, 32 B with pure-capability CHERI/Morello — large enough to hold
the two pointers a free block must store.
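
The unit-size rule for `InplaceRep` works out as below. This is a sketch with hypothetical helper names; the pointer sizes passed in (8 bytes for a conventional 64-bit pointer, 16 bytes for a Morello capability) are assumptions consistent with the 16 B and 32 B figures above.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is >= x (for x >= 1).
std::size_t next_pow2(std::size_t x)
{
  assert(x >= 1);
  std::size_t p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// Unit size: room for the two pointers a free block must store, rounded up
// to a power of two. cap_ptr_bytes stands in for sizeof(CapPtr).
std::size_t unit_size(std::size_t cap_ptr_bytes)
{
  return next_pow2(2 * cap_ptr_bytes);
}
```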

Both reps drive the same bin / range tree logic in `arena.h`; the bin
classifier and bitmap in `arenabins.h` are shared.

## Why this matters for metadata

Slab metadata typically wants a pow2 client structure (e.g. a 128 B
bitmap) plus a fixed ~32 B header. A buddy-based small range rounds
`160 B → 256 B` (96 B wasted per slab). `SmallArenaRange` rounds to a unit
multiple (`MIN_META_ALIGN`), so the same allocation costs ~160 B. That is
96 B, or 37.5% of the buddy-rounded size, saved per slab, which adds up
across many slabs and large heaps.
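
The arithmetic behind those figures, as a small sketch. The helper names are hypothetical, and the 16 B unit is an assumption (the text gives `MIN_META_ALIGN` by name only).

```cpp
#include <cassert>
#include <cstddef>

// Buddy-style rounding: smallest power of two that is >= x (for x >= 1).
std::size_t next_pow2(std::size_t x)
{
  assert(x >= 1);
  std::size_t p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// Arena-style rounding: next multiple of `unit`, which must be a power of two.
std::size_t round_up_to_unit(std::size_t size, std::size_t unit)
{
  assert(unit != 0 && (unit & (unit - 1)) == 0);
  return (size + unit - 1) & ~(unit - 1);
}
```

For a 128 B structure plus a 32 B header, `next_pow2(160)` is 256 while `round_up_to_unit(160, 16)` stays at 160, so the buddy scheme wastes 96 B on that slab.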

## Concrete example (B = 2, as in production)

At exponent `e = 2` the size classes are 4, 5, 6, 7, and there are 5 bins,
each labeled by the set of sizes it can serve at this exponent:

    Bin 0: serves {4}
    Bin 1: serves {5}
    Bin 2: serves {4, 5}
    Bin 3: serves {4, 5, 6}
    Bin 4: serves {4, 5, 6, 7}

The per-request serve masks (within this exponent — higher exponents
always serve, so their bits are set):

    Request for 7: serve bins {4}
    Request for 6: serve bins {3, 4}
    Request for 5: serve bins {1, 2, 3, 4}
    Request for 4: serve bins {0, 2, 3, 4} (bin 1 holds only {5} blocks)

Only the size-4 request has an exception: bin 1 must not be picked. All
other requests get the simple "everything at or above" mask.
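
The serve masks in this example follow mechanically from the bin table: bin `k` belongs in a request's mask exactly when the request's size is in bin `k`'s servable set. A sketch with hypothetical names, hard-coding the five bins listed above and covering only this one exponent:

```cpp
#include <cassert>
#include <cstdint>

// Servable sets of the five bins at exponent e = 2 (B = 2), copied from the
// example. Bit i of an entry is set when the bin can serve size 4 + i.
const std::uint32_t BIN_SERVES[5] = {
  0x1, // bin 0: {4}
  0x2, // bin 1: {5}
  0x3, // bin 2: {4, 5}
  0x7, // bin 3: {4, 5, 6}
  0xF, // bin 4: {4, 5, 6, 7}
};

// Builds the serve mask for a request of size 4..7: bit k is set when bin k
// can serve the request. Bits for higher exponents are not modelled here.
std::uint32_t serve_mask(unsigned request)
{
  assert(request >= 4 && request <= 7);
  std::uint32_t mask = 0;
  for (unsigned k = 0; k < 5; k++)
  {
    if (BIN_SERVES[k] & (1u << (request - 4)))
      mask |= 1u << k;
  }
  return mask;
}
```

The mask for size 4 comes out as `0x1D` (bins 0, 2, 3, 4), with the gap at bin 1 being the single exception described above; the other three masks are contiguous runs of high bits.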

## Where to look in the code

* `src/snmalloc/backend_helpers/arenabins.h` — bin classification, serve
masks, the non-empty-bins bitmap, the `carve` primitive.
* `src/snmalloc/backend_helpers/arena.h` — bin-tree-per-bin + range-tree
structure, allocation and free / coalesce paths.
* `src/snmalloc/backend_helpers/largearenarange.h` — `Arena<PagemapRep>`
for whole-chunk allocations.
* `src/snmalloc/backend_helpers/smallarenarange.h`,
`inplacerep.h` — `Arena<InplaceRep>` for sub-chunk metadata.
2 changes: 1 addition & 1 deletion src/snmalloc/README.md
@@ -20,7 +20,7 @@ These are arranged in a hierarchy such that each of the directories may include
- `mem/` provides the core allocator abstractions.
The code here is templated over a back-end, which defines a particular embedding of snmalloc.
- `backend_helpers/` provides helper classes for use in defining a back end.
This includes data structures such as pagemap implementations (efficient maps from a chunk address to associated metadata) and buddy allocators for managing address-space ranges.
This includes data structures such as pagemap implementations (efficient maps from a chunk address to associated metadata) and range allocators for managing address-space ranges.
- `backend/` provides some example implementations for snmalloc embeddings that provide a global memory allocator for an address space.
Users may ignore this entirely and use the types in `mem/` with a custom back end to expose an snmalloc instance with specific behaviour.
Layers above this can be used with a custom configuration by defining `SNMALLOC_PROVIDE_OWN_CONFIG` and exporting a type as `snmalloc::Config` that defines the configuration.