refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229

Open
wjones127 wants to merge 2 commits intolance-format:mainfrom
wjones127:feat/custom-deepsize-trait

Conversation

@wjones127
Contributor

The external deepsize crate does not correctly account for Arrow buffers
that are shared across Arc references, causing double-counting in cache
size calculations.

This PR introduces a custom DeepSizeOf trait in lance-core::deepsize
with a Context that tracks both Arc and raw buffer pointers in a
unified HashSet. It also adds a lance-derive proc-macro crate for
#[derive(DeepSizeOf)] and removes the dependency on the external
deepsize crate.

Changes:

  • Add lance-core::deepsize module with custom DeepSizeOf trait and pointer-tracking Context
  • Add lance-derive proc-macro crate with #[derive(DeepSizeOf)]
  • Remove external deepsize crate from all crates; update all impls to use lance_core::deepsize
  • Fix all deep_size_of_children implementations to thread context through (previously some ignored the parameter)
  • Add DeepSizeOf for arrow_buffer::ScalarBuffer<T> with proper pointer tracking
  • Fix PlainPostingList and Positions to use deep_size_of_children with context
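A minimal sketch of the mechanism the bullets above describe (trait and method names taken from the PR description; the actual `lance-core::deepsize` API may differ in detail):

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Sketch of a pointer-tracking context; the real API may differ.
pub struct Context {
    /// Addresses (Arc data pointers or raw buffer pointers) already counted.
    seen: HashSet<usize>,
}

impl Context {
    pub fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if `addr` has not been counted before.
    pub fn try_insert(&mut self, addr: usize) -> bool {
        self.seen.insert(addr)
    }
}

pub trait DeepSizeOf {
    /// Heap memory owned by children, deduplicated via `context`.
    fn deep_size_of_children(&self, context: &mut Context) -> usize;

    /// Total size: the shallow size plus deduplicated children.
    fn deep_size_of(&self) -> usize {
        std::mem::size_of_val(self) + self.deep_size_of_children(&mut Context::new())
    }
}

impl<T: DeepSizeOf> DeepSizeOf for Arc<T> {
    fn deep_size_of_children(&self, context: &mut Context) -> usize {
        if context.try_insert(Arc::as_ptr(self) as usize) {
            // First time we see this allocation: count the pointee.
            std::mem::size_of::<T>() + self.as_ref().deep_size_of_children(context)
        } else {
            0 // Already counted through another clone of this Arc.
        }
    }
}

impl DeepSizeOf for Vec<u8> {
    fn deep_size_of_children(&self, _context: &mut Context) -> usize {
        self.capacity() // Vec<u8> owns `capacity` heap bytes.
    }
}
```

Because Arc data pointers and raw buffer pointers land in the same `seen` set, a buffer reached through two different handles is still counted once.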

@github-actions
Contributor

Review

Good motivation — the external deepsize crate doesn't track Arc/buffer pointer identity, so shared Arrow buffers get double-counted in cache size calculations. The custom trait with a pointer-tracking Context is the right approach.

Issues to address

P0: dyn Array::deep_size_of_children calls self.to_data() which clones all buffers

Array::to_data() can be expensive — it materializes an ArrayData by cloning Arc<Buffer>s and for some array types does non-trivial work. This is called every time you measure a column's size in a RecordBatch, which happens on every cache insertion/eviction. Consider using Array::get_array_memory_size() for the total and only using the Context for dedup at the Arc<dyn Array> / Arc<RecordBatch> level. Alternatively, use array.to_data() only once and cache the result, or see if ArrayData can be obtained without cloning (e.g. array.into_data() if you own it, or accessing internal data references).
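One way to realize that suggestion, sketched with a mock in place of `arrow_array::Array` (only `get_array_memory_size()` is assumed, which arrow does provide): dedup on the Arc's address and take the cheap total, so `to_data()` is never called:

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Mock stand-in for arrow_array::Array; the real trait also exposes
/// get_array_memory_size(), which is what this sketch relies on.
trait Array {
    fn get_array_memory_size(&self) -> usize;
}

struct Int32Column {
    values: Vec<i32>,
}

impl Array for Int32Column {
    fn get_array_memory_size(&self) -> usize {
        self.values.capacity() * std::mem::size_of::<i32>()
    }
}

/// Dedup at the Arc<dyn Array> level: count the array's memory once
/// per distinct allocation, without materializing ArrayData.
fn array_deep_size(array: &Arc<dyn Array>, seen: &mut HashSet<usize>) -> usize {
    // Cast the fat trait-object pointer down to a thin address.
    let addr = Arc::as_ptr(array) as *const () as usize;
    if seen.insert(addr) {
        array.get_array_memory_size()
    } else {
        0 // Same column shared by another batch: already counted.
    }
}
```

Caveat: this only dedups identical `Arc` allocations, not two distinct arrays slicing the same underlying buffer; the latter case is what the Context's raw buffer-pointer tracking is for.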

P1: CompressedPostingList still uses get_buffer_memory_size() without context tracking

CompressedPostingList::deep_size_of_children (rust/lance-index/src/scalar/inverted/index.rs) was only half-migrated — it still calls self.blocks.get_buffer_memory_size() and self.positions.as_ref().map(Array::get_buffer_memory_size) directly, bypassing the Context dedup. This means shared buffers in compressed posting lists can still be double-counted.

P1: ScalarBuffer reports len * size_of::<T>() instead of capacity

In deepsize.rs, ScalarBuffer::deep_size_of_children returns self.len() * size_of::<T>(), but the underlying Buffer may have been allocated with more capacity (e.g. after slicing). Other impls in the same file (like Vec) correctly use capacity. For consistency and accuracy, consider using the underlying buffer's capacity.
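The len-vs-capacity distinction, sketched with `Vec` as a stand-in for the buffer behind `ScalarBuffer<T>`:

```rust
use std::mem::size_of;

/// What the current impl reports: only the elements in view.
fn size_by_len<T>(v: &[T]) -> usize {
    v.len() * size_of::<T>()
}

/// What the allocation actually pins in memory.
fn size_by_capacity<T>(v: &Vec<T>) -> usize {
    v.capacity() * size_of::<T>()
}
```

A sliced or truncated buffer keeps the whole allocation alive, so capacity-based accounting is what matches actual memory pressure.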

Minor

  • The HashMap size estimate uses capacity * (size_of::<K>() + size_of::<V>() + 1), which is a reasonable approximation for hashbrown/Swiss tables: the actual overhead is about 1 byte of control metadata per slot (not per occupied entry), and since capacity() already reflects the total slot count, this works out fine as an estimate. Just noting it's approximate.

  • No tests for the derive macro on enums or generics. Consider adding at least one test struct/enum in deepsize.rs tests that uses #[derive(DeepSizeOf)] to verify the proc macro generates correct code.
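For reference, the estimate mentioned in the first bullet works out like this (a hypothetical helper, not code from the PR):

```rust
use std::mem::size_of;

/// capacity slots, each holding a key, a value, and ~1 byte of
/// Swiss-table control metadata.
fn hashmap_estimate<K, V>(capacity: usize) -> usize {
    capacity * (size_of::<K>() + size_of::<V>() + 1)
}
```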

@wjones127 wjones127 marked this pull request as ready for review March 19, 2026 17:44
…ow-aware memory accounting

The external `deepsize` crate does not correctly account for Arrow buffers
that are shared across `Arc` references, causing double-counting in cache
size calculations.

Introduces a custom `DeepSizeOf` trait in `lance-core::deepsize` with a
`Context` that tracks both `Arc` and raw buffer pointers in a unified
`HashSet`. Also adds a `lance-derive` proc-macro crate for
`#[derive(DeepSizeOf)]` and removes the dependency on the external
`deepsize` crate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/custom-deepsize-trait branch from 1c70985 to 23df3cb April 1, 2026 02:26
…uplicate definitions, add derive test

- Fix CompressedPositionStorage and Positions to use (array as &dyn Array).deep_size_of_children(context)
  instead of get_buffer_memory_size(), so shared ListArray buffers are deduplicated correctly
- Remove duplicate CompressedPositionStorage and SharedPositionStream definitions introduced by bad merge
- Fix lance-io uring readers to use lance_core::deepsize instead of external deepsize crate
- Add test_derive_macro test exercising #[derive(DeepSizeOf)] end-to-end

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

@westonpace westonpace left a comment


This is nice. I didn't look too closely at all the changes since I think most were find/replace.

Should we add some blanket impl like this?

impl<T: lance_core::deepsize::DeepSizeOf> deepsize::DeepSizeOf for T {
  ...
}

Although maybe better to keep the dependency away and let callers do that themselves if they are using deepsize I suppose.

///
/// Generates an implementation that sums the `deep_size_of_children` of all
/// fields (for structs) or the active variant's fields (for enums).
#[proc_macro_derive(DeepSizeOf)]
Member


Are there tests for this macro somewhere? I guess we get some testing via usage? You can, in theory, do integration tests with proc-macros if I recall (unit tests don't make a ton of sense)

Maybe not important for an internal utility.

Contributor Author


You are right, I should add some more tests.

@wjones127
Contributor Author

This is nice. I didn't look too closely at all the changes since I think most were find/replace.

The critical change from the deepsize crate is that it tracks just Arc and Rc pointers, in two separate sets, in its Context [1]:

pub struct Context {
    /// A set of all [`Arc`](std::sync::Arc)s that have already been counted
    arcs: GenericSet<usize>,
    /// A set of all [`Rc`](std::sync::Arc)s that have already been counted
    rcs: GenericSet<usize>,
}

We instead track a single set of pointers:

pub struct Context {
    seen: HashSet<usize>,
}

This is important because Arrow doesn't give you access to the Arc, just the underlying pointers. So if you want to mark Arrow buffers as seen, you can't use the Context from deepsize. (I initially tried to implement deepsize::DeepSizeOf for a new type wrapper around Arrow types, but ran into this. That's what motivated me to make this change.)

Should we do some blanket impl like?

impl<T: lance_core::deepsize::DeepSizeOf> deepsize::DeepSizeOf for T {
  ...
}

I think this might be difficult, since I would need to create some sort of adapter between our Context and their Context. Plus, I don't get the sense that the deepsize crate is actively maintained: all its PRs have gone stale, and it's our only dependency that still uses syn 1.x. So I lean towards dropping it.

Footnotes

  1. https://github.com/Aeledfyr/deepsize/blob/5deebe687695610e6038dc70753defbe7634d549/src/lib.rs#L151-L156
