refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229
refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229wjones127 wants to merge 2 commits intolance-format:mainfrom
Conversation
ReviewGood motivation — the external Issues to addressP0:
P1:
P1: In Minor
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
8566f00 to
1c70985
Compare
…ow-aware memory accounting The external `deepsize` crate does not correctly account for Arrow buffers that are shared across `Arc` references, causing double-counting in cache size calculations. Introduces a custom `DeepSizeOf` trait in `lance-core::deepsize` with a `Context` that tracks both `Arc` and raw buffer pointers in a unified `HashSet`. Also adds a `lance-derive` proc-macro crate for `#[derive(DeepSizeOf)]` and removes the dependency on the external `deepsize` crate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1c70985 to
23df3cb
Compare
…uplicate definitions, add derive test - Fix CompressedPositionStorage and Positions to use (array as &dyn Array).deep_size_of_children(context) instead of get_buffer_memory_size(), so shared ListArray buffers are deduplicated correctly - Remove duplicate CompressedPositionStorage and SharedPositionStream definitions introduced by bad merge - Fix lance-io uring readers to use lance_core::deepsize instead of external deepsize crate - Add test_derive_macro test exercising #[derive(DeepSizeOf)] end-to-end Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
westonpace
left a comment
There was a problem hiding this comment.
This is nice, I didn't look too closely at all the changes as I think most were find/replace.
Should we do some blanket impl like?
impl<T: lance_core::deepsize::DeepSizeOf> deepsize::DeepSizeOf for T {
...
}
Although maybe better to keep the dependency away and let callers do that themselves if they are using deepsize I suppose.
| /// | ||
| /// Generates an implementation that sums the `deep_size_of_children` of all | ||
| /// fields (for structs) or the active variant's fields (for enums). | ||
| #[proc_macro_derive(DeepSizeOf)] |
There was a problem hiding this comment.
Are there tests for this macro somewhere? I guess we get some testing via usage? You can, in theory, do integration tests with proc-macros if I recall (unit tests don't make a ton of sense)
Maybe not important for an internal utility.
There was a problem hiding this comment.
You are right, I should add some more tests.
The critical change from pub struct Context {
/// A set of all [`Arc`](std::sync::Arc)s that have already been counted
arcs: GenericSet<usize>,
/// A set of all [`Rc`](std::sync::Arc)s that have already been counted
rcs: GenericSet<usize>,
}We instead just track one list of pointers: pub struct Context {
seen: HashSet<usize>,
}This is important because Arrow doesn't give you access to the
I think this might be difficult, given I would need to create some sort of adapter between our Footnotes |
The external
deepsizecrate does not correctly account for Arrow buffersthat are shared across
Arcreferences, causing double-counting in cachesize calculations.
This PR introduces a custom
DeepSizeOftrait inlance-core::deepsizewith a
Contextthat tracks bothArcand raw buffer pointers in aunified
HashSet. It also adds alance-deriveproc-macro crate for#[derive(DeepSizeOf)]and removes the dependency on the externaldeepsizecrate.Changes:
lance-core::deepsizemodule with customDeepSizeOftrait and pointer-trackingContextlance-deriveproc-macro crate with#[derive(DeepSizeOf)]deepsizecrate from all crates; update all impls to uselance_core::deepsizedeep_size_of_childrenimplementations to threadcontextthrough (previously some ignored the parameter)DeepSizeOfforarrow_buffer::ScalarBuffer<T>with proper pointer trackingPlainPostingListandPositionsto usedeep_size_of_childrenwith context