refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229

Open
wjones127 wants to merge 2 commits intolance-format:mainfrom
wjones127:feat/custom-deepsize-trait

Conversation

@wjones127
Contributor

The external deepsize crate does not correctly account for Arrow buffers
that are shared across Arc references, causing double-counting in cache
size calculations.

This PR introduces a custom DeepSizeOf trait in lance-core::deepsize
with a Context that tracks both Arc and raw buffer pointers in a
unified HashSet. It also adds a lance-derive proc-macro crate for
#[derive(DeepSizeOf)] and removes the dependency on the external
deepsize crate.

Changes:

  • Add lance-core::deepsize module with custom DeepSizeOf trait and pointer-tracking Context
  • Add lance-derive proc-macro crate with #[derive(DeepSizeOf)]
  • Remove external deepsize crate from all crates; update all impls to use lance_core::deepsize
  • Fix all deep_size_of_children implementations to thread context through (previously some ignored the parameter)
  • Add DeepSizeOf for arrow_buffer::ScalarBuffer<T> with proper pointer tracking
  • Fix PlainPostingList and Positions to use deep_size_of_children with context
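A minimal sketch of the mechanism the bullets above describe (trait and method names taken from the PR description; the actual `lance-core::deepsize` API may differ in detail):

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Sketch of a pointer-tracking context; the real API may differ.
pub struct Context {
    /// Addresses (Arc data pointers or raw buffer pointers) already counted.
    seen: HashSet<usize>,
}

impl Context {
    pub fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if `addr` has not been counted before.
    pub fn try_insert(&mut self, addr: usize) -> bool {
        self.seen.insert(addr)
    }
}

pub trait DeepSizeOf {
    /// Heap memory owned by children, deduplicated via `context`.
    fn deep_size_of_children(&self, context: &mut Context) -> usize;

    /// Total size: the shallow size plus deduplicated children.
    fn deep_size_of(&self) -> usize {
        std::mem::size_of_val(self) + self.deep_size_of_children(&mut Context::new())
    }
}

impl<T: DeepSizeOf> DeepSizeOf for Arc<T> {
    fn deep_size_of_children(&self, context: &mut Context) -> usize {
        if context.try_insert(Arc::as_ptr(self) as usize) {
            // First time we see this allocation: count the pointee.
            std::mem::size_of::<T>() + self.as_ref().deep_size_of_children(context)
        } else {
            0 // Already counted through another clone of this Arc.
        }
    }
}

impl DeepSizeOf for Vec<u8> {
    fn deep_size_of_children(&self, _context: &mut Context) -> usize {
        self.capacity() // Vec<u8> owns `capacity` heap bytes.
    }
}
```

Because Arc data pointers and raw buffer pointers land in the same `seen` set, a buffer reached through two different handles is still counted once.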

@github-actions
Contributor

Review

Good motivation — the external deepsize crate doesn't track Arc/buffer pointer identity, so shared Arrow buffers get double-counted in cache size calculations. The custom trait with a pointer-tracking Context is the right approach.

Issues to address

P0: dyn Array::deep_size_of_children calls self.to_data() which clones all buffers

Array::to_data() can be expensive — it materializes an ArrayData by cloning Arc<Buffer>s and for some array types does non-trivial work. This is called every time you measure a column's size in a RecordBatch, which happens on every cache insertion/eviction. Consider using Array::get_array_memory_size() for the total and only using the Context for dedup at the Arc<dyn Array> / Arc<RecordBatch> level. Alternatively, use array.to_data() only once and cache the result, or see if ArrayData can be obtained without cloning (e.g. array.into_data() if you own it, or accessing internal data references).
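One way to realize that suggestion, sketched with a mock in place of `arrow_array::Array` (only `get_array_memory_size()` is assumed, which arrow does provide): dedup on the Arc's address and take the cheap total, so `to_data()` is never called:

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Mock stand-in for arrow_array::Array; the real trait also exposes
/// get_array_memory_size(), which is what this sketch relies on.
trait Array {
    fn get_array_memory_size(&self) -> usize;
}

struct Int32Column {
    values: Vec<i32>,
}

impl Array for Int32Column {
    fn get_array_memory_size(&self) -> usize {
        self.values.capacity() * std::mem::size_of::<i32>()
    }
}

/// Dedup at the Arc<dyn Array> level: count the array's memory once
/// per distinct allocation, without materializing ArrayData.
fn array_deep_size(array: &Arc<dyn Array>, seen: &mut HashSet<usize>) -> usize {
    // Cast the fat trait-object pointer down to a thin address.
    let addr = Arc::as_ptr(array) as *const () as usize;
    if seen.insert(addr) {
        array.get_array_memory_size()
    } else {
        0 // Same column shared by another batch: already counted.
    }
}
```

Caveat: this only dedups identical `Arc` allocations, not two distinct arrays slicing the same underlying buffer; the latter case is what the Context's raw buffer-pointer tracking is for.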

P1: CompressedPostingList still uses get_buffer_memory_size() without context tracking

CompressedPostingList::deep_size_of_children (rust/lance-index/src/scalar/inverted/index.rs) was only half-migrated — it still calls self.blocks.get_buffer_memory_size() and self.positions.as_ref().map(Array::get_buffer_memory_size) directly, bypassing the Context dedup. This means shared buffers in compressed posting lists can still be double-counted.

P1: ScalarBuffer reports len * size_of::<T>() instead of capacity

In deepsize.rs, ScalarBuffer::deep_size_of_children returns self.len() * size_of::<T>(), but the underlying Buffer may have been allocated with more capacity (e.g. after slicing). Other impls in the same file (like Vec) correctly use capacity. For consistency and accuracy, consider using the underlying buffer's capacity.
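The len-vs-capacity distinction, sketched with `Vec` as a stand-in for the buffer behind `ScalarBuffer<T>`:

```rust
use std::mem::size_of;

/// What the current impl reports: only the elements in view.
fn size_by_len<T>(v: &[T]) -> usize {
    v.len() * size_of::<T>()
}

/// What the allocation actually pins in memory.
fn size_by_capacity<T>(v: &Vec<T>) -> usize {
    v.capacity() * size_of::<T>()
}
```

A sliced or truncated buffer keeps the whole allocation alive, so capacity-based accounting is what matches actual memory pressure.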

Minor

  • The HashMap size estimate uses capacity * (size_of::<K>() + size_of::<V>() + 1), which is a reasonable approximation for hashbrown/Swiss tables: the actual overhead is about 1 byte of control metadata per slot (not per occupied entry), and since capacity() already reflects the total slot count, this works out fine as an estimate. Just noting it's approximate.

  • No tests for the derive macro on enums or generics. Consider adding at least one test struct/enum in deepsize.rs tests that uses #[derive(DeepSizeOf)] to verify the proc macro generates correct code.
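For reference, the estimate mentioned in the first bullet works out like this (a hypothetical helper, not code from the PR):

```rust
use std::mem::size_of;

/// capacity slots, each holding a key, a value, and ~1 byte of
/// Swiss-table control metadata.
fn hashmap_estimate<K, V>(capacity: usize) -> usize {
    capacity * (size_of::<K>() + size_of::<V>() + 1)
}
```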

@wjones127 wjones127 marked this pull request as ready for review March 19, 2026 17:44
…ow-aware memory accounting

The external `deepsize` crate does not correctly account for Arrow buffers
that are shared across `Arc` references, causing double-counting in cache
size calculations.

Introduces a custom `DeepSizeOf` trait in `lance-core::deepsize` with a
`Context` that tracks both `Arc` and raw buffer pointers in a unified
`HashSet`. Also adds a `lance-derive` proc-macro crate for
`#[derive(DeepSizeOf)]` and removes the dependency on the external
`deepsize` crate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/custom-deepsize-trait branch from 1c70985 to 23df3cb April 1, 2026 02:26
…uplicate definitions, add derive test

- Fix CompressedPositionStorage and Positions to use (array as &dyn Array).deep_size_of_children(context)
  instead of get_buffer_memory_size(), so shared ListArray buffers are deduplicated correctly
- Remove duplicate CompressedPositionStorage and SharedPositionStream definitions introduced by bad merge
- Fix lance-io uring readers to use lance_core::deepsize instead of external deepsize crate
- Add test_derive_macro test exercising #[derive(DeepSizeOf)] end-to-end

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

@westonpace westonpace left a comment


This is nice. I didn't look too closely at all the changes since I think most were find/replace.

Should we add some blanket impl like this?

impl<T: lance_core::deepsize::DeepSizeOf> deepsize::DeepSizeOf for T {
  ...
}

Although maybe better to keep the dependency away and let callers do that themselves if they are using deepsize I suppose.

///
/// Generates an implementation that sums the `deep_size_of_children` of all
/// fields (for structs) or the active variant's fields (for enums).
#[proc_macro_derive(DeepSizeOf)]
Member


Are there tests for this macro somewhere? I guess we get some testing via usage? You can, in theory, do integration tests with proc-macros if I recall (unit tests don't make a ton of sense)

Maybe not important for an internal utility.

Contributor Author


You are right, I should add some more tests.

@wjones127
Contributor Author

This is nice. I didn't look too closely at all the changes since I think most were find/replace.

The critical change from the deepsize crate is that it tracks just Arc and Rc pointers, in two separate sets, in its Context [1]:

pub struct Context {
    /// A set of all [`Arc`](std::sync::Arc)s that have already been counted
    arcs: GenericSet<usize>,
    /// A set of all [`Rc`](std::sync::Arc)s that have already been counted
    rcs: GenericSet<usize>,
}

We instead track a single set of pointers:

pub struct Context {
    seen: HashSet<usize>,
}

This is important because Arrow doesn't give you access to the Arc, just the underlying pointers. So if you want to mark Arrow buffers as seen, you can't use the Context from deepsize. (I initially tried to implement deepsize::DeepSizeOf for a new type wrapper around Arrow types, but ran into this. That's what motivated me to make this change.)

Should we do some blanket impl like?

impl<T: lance_core::deepsize::DeepSizeOf> deepsize::DeepSizeOf for T {
  ...
}

I think this might be difficult, since I would need to create some sort of adapter between our Context and their Context. Plus, I don't get the sense that the deepsize crate is actively maintained: all its PRs have gone stale, and it's our only dependency that still uses syn 1.x. So I lean towards dropping it.

Footnotes

  1. https://github.com/Aeledfyr/deepsize/blob/5deebe687695610e6038dc70753defbe7634d549/src/lib.rs#L151-L156
