
feat: introduce CacheCodec for serializing index cache entries#6223

Merged
wjones127 merged 13 commits into lance-format:main from wjones127:feat/partition-entry-serde
Apr 7, 2026

Conversation

@wjones127 wjones127 commented Mar 18, 2026

Introduce a codec: CacheCodec parameter to CacheBackend methods, providing the ability to serialize and deserialize entries, if desired. This struct is just a pair of function pointers (serialize and deserialize), so it is very cheap when unused.

To provide this for a type, implement the CacheKey::codec() method to return a CacheCodec. By default, this trait method returns None. We only implement serialization for some types right now, but the eventual goal is to implement it for all types. To keep this PR focused, I only implement this for vector indexes, but the same approach could be used for FTS and other indexes.

To make it easy to implement, we provide a CacheCodecImpl trait:

pub trait CacheCodecImpl: Send + Sync {
    fn serialize(&self, writer: &mut dyn std::io::Write) -> Result<()>;

    fn deserialize(reader: &mut dyn std::io::Read) -> Result<Self>
    where
        Self: Sized;
}

The implementors of CacheKey can then write:

    fn codec() -> Option<CacheCodec> {
        Some(CacheCodec::from_impl::<Self::ValueType>())
    }

The serialization / deserialization implementation is designed to have very low overhead. We use serde-serialized strings for small metadata, and Arrow IPC for Arrow data. When deserializing, we use zero-copy deserialization to re-use the buffers read in.

One important change is we no longer cache Arc<dyn VectorIndex> for V2+ vector indices. This is because IVFIndex (which is the V3 implementation of VectorIndex) contains non-serializable data structures like FileReader. Instead we cache IvfIndexState, which is a serializable representation of the vector index. Importantly, this does not require any IO to reconstruct, which is validated by the pre-existing test test_prewarm_ivf_pq.

@github-actions github-actions bot added the enhancement New feature or request label Mar 18, 2026
@github-actions
Contributor

PR Review: feat: add serialize/deserialize for IVF PQ partition cache entries

Clean implementation overall — the zero-copy IPC approach is well-suited for cache serde. A few issues to flag:

P1: Integer overflow in deserialization offset arithmetic

In deserialize(), the section boundary calculations use unchecked addition:

let sub_index_end = sub_index_start + header.sub_index_len as usize;
let codebook_end = codebook_start + header.codebook_len as usize;
let storage_end = storage_start + header.storage_len as usize;

If the header contains corrupted or adversarial values (e.g., lengths close to u64::MAX), these additions can wrap around, causing storage_end to be a small value that passes the data.len() < storage_end check. The subsequent Buffer::slice_with_length calls would then read incorrect regions or panic.

Use checked_add and return an error on overflow, same as the defensive pattern already used for trailer_start / footer_start in read_ipc_zero_copy.

P1: Missing Hamming in distance type roundtrip test

test_roundtrip_preserves_distance_type covers L2, Cosine, and Dot but omits Hamming, which is handled in the distance_type_to_u8/u8_to_distance_type mapping. Either add it to the test or document why it's excluded (e.g., PQ storage doesn't support Hamming).

Minor

  • The module is pub mod partition_serde — if this is only intended for internal caching, consider pub(crate) to avoid leaking it as public API.

🤖 Generated with Claude Code

Comment on lines +16 to +18
//! Each IPC section is a complete Arrow IPC file. On deserialization, the IPC
//! sections are read zero-copy using [`FileDecoder`] so that Arrow arrays
//! reference the original buffer directly.
Contributor Author

My big worry with this is that the DeepSizeOf will no longer be accurate after a roundtrip.

Contributor Author

This concern will be addressed by #6229. As long as we aren't sharing buffers across cache entries, we should be fine.

@wjones127 wjones127 marked this pull request as ready for review March 19, 2026 19:34
wjones127 added a commit to wjones127/lance that referenced this pull request Mar 20, 2026


Move VectorIndexData, IvfIndexState, partition_serde, cacheable_state,
and zero-IO reconstruction out of this PR to keep it focused on the
pluggable cache backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/partition-entry-serde branch from 3cffe37 to 3ead7ca Compare March 20, 2026 23:49
@wjones127 wjones127 changed the title feat: add serialize/deserialize for IVF PQ partition cache entries feat: VectorIndex serialization and zero-IO reconstruction Mar 20, 2026
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wjones127 wjones127 force-pushed the feat/partition-entry-serde branch from 60d7a31 to 5872a51 Compare March 25, 2026 20:21
@wjones127 wjones127 changed the title feat: VectorIndex serialization and zero-IO reconstruction feat: introduce CacheCodec for serializing index cache entries Mar 27, 2026
wjones127 added a commit that referenced this pull request Mar 29, 2026
Previously the Session's index cache was hardcoded to Moka. This adds a
`CacheBackend` trait so users can provide their own cache implementation,
with `MokaCacheBackend` as the default.

Serialization and zero-IO reconstruction are in the follow-up PR #6223.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/partition-entry-serde branch from a9d86bc to 895756b Compare March 29, 2026 22:51
wjones127 added a commit to wjones127/lance that referenced this pull request Mar 29, 2026


Move VectorIndexData, IvfIndexState, partition_serde, cacheable_state,
and zero-IO reconstruction out of this PR to keep it focused on the
pluggable cache backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@westonpace
Member

I think this PR might be in need of a rebase or something?

@wjones127 wjones127 force-pushed the feat/partition-entry-serde branch 3 times, most recently from d5acf1b to 86e25fc Compare March 30, 2026 17:34
Add CacheCodec infrastructure and IvfIndexState for serializing IVF
index cache entries. This enables disk-backed cache backends to persist
index state across process restarts.

Key changes:
- CacheCodecImpl trait + CacheCodec type-erased codec in lance-core
- IvfIndexState: serializable snapshot of IVF index metadata that
  reconstructs into a live VectorIndex without IO
- PartitionEntry serde via Arrow IPC (lance-arrow::ipc utilities)
- CacheKey::codec() method for opt-in serialization per cache key
- Fix FileMetadataCacheKey prefix mismatch for zero-IO reconstruction
- Fix FlatBin quantizer type handling in index type strings and remap

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/partition-entry-serde branch from 1c187b3 to 7e85853 Compare April 1, 2026 00:01
@wjones127
Contributor Author

@westonpace Okay, I've finally finished rebasing and refactoring. I think it's good to review again.

Member

@westonpace westonpace left a comment


Some questions but overall I think this is some great work. I guess I didn't appreciate how much complexity there would be in serializing all the different kinds of cache keys.

Comment thread rust/lance-core/src/cache/backend.rs Outdated
/// Structured cache key passed to [`CacheBackend`] methods.
///
/// Composed of three parts:
/// Backend authors receive these ready-made from [`LanceCache`](super::LanceCache)
Member

Suggested change
/// Backend authors receive these ready-made from [`LanceCache`](super::LanceCache)
/// CacheBackend impls receive these ready-made from [`LanceCache`](super::LanceCache)

Comment on lines +56 to +59
/// `CacheCodec` is two plain function pointers — it is `Copy` and has no
/// heap allocation. Construct one via [`CacheCodec::from_impl`] for types
/// that implement [`CacheCodecImpl`], or [`CacheCodec::new`] for custom
/// cases (e.g. when the orphan rule prevents a direct impl).
Member

Huh...why not Box<dyn CacheCodecImpl>? Is it because you're expecting to need to violate the orphan rule?

Contributor Author

Box<dyn CacheCodecImpl> only works for serialize(). deserialize() returns Result<Self>, so it can't be called from &dyn CacheCodecImpl. This is why there is a where Self: Sized bound on the trait method.

In CacheCodec::deserialize, it returns Result<ArcAny>, which works without having to know the concrete type.

I went through a lot of iterations on these traits and structs. This is one place Rust gets really complicated 🫣

Comment on lines +107 to +109
/// Returns `None` by default. Cache backends that support persistence
/// (e.g. disk-backed caches) use this to serialize entries on insert and
/// deserialize on get. Types without a codec will only be stored in-memory.
Member

What happens in a disk cache if this returns None? Do these caches fail? Can we document that fact?

Contributor Author

That's up to the cache backend. If it's a disk-only cache, the entry simply would not be cached. If it's a combined memory-and-disk cache, the entry can be stored in memory, but when evicted it won't spill to disk.

Comment thread rust/lance-index/src/vector.rs Outdated
///
/// Produced by [`VectorIndex::cacheable_state`] and consumed by
/// `reconstruct_vector_index` in the `lance` crate. All fields are `pub`
/// because the type is shared across crate boundaries; they are **not**
Member

Why is this in lance-core?

Comment thread rust/lance-index/src/vector.rs Outdated
Comment on lines +212 to +213
/// `[header_json_len: u64 LE][header JSON][ivf_pb_len: u64 LE][ivf protobuf]
/// [extra_len: u64 LE][extra bytes][aux_ivf_pb_len: u64 LE][aux_ivf protobuf]`
Member

I guess there is no need to worry about stability here? Do we plan for disk caches to be completely lost on upgrades? Or are we actually thinking we might want to keep these stable between versions?

Contributor Author

For now, I'm marking them as unstable and experimental. I think we can see about making them stable in the future. I'd probably put some magic bytes and version in the header when we do.

Do we plan for disk caches to be completely lost on upgrades?

Given we have a pre-warming mechanism, I feel okay with letting disk caches be ephemeral to start.

// ---------------------------------------------------------------------------

#[derive(Serialize, Deserialize)]
struct RabitPartitionHeader {
Member

Would it make more sense to have these structs in their respective modules? E.g. rq, pq, etc.?

Comment on lines +287 to +296
file_metadata_cache
.with_key_prefix(uri.as_ref())
.insert_with_key(&FileMetadataCacheKey, index_reader.metadata().clone())
.await;
let aux_path = index_dir
.child(uuid.as_str())
.child(INDEX_AUXILIARY_FILE_NAME);
file_metadata_cache
.with_key_prefix(aux_path.as_ref())
.insert_with_key(&FileMetadataCacheKey, storage.reader().metadata().clone())
Member

This is nice

Comment thread rust/lance/src/index/vector/ivf/v2.rs Outdated
Comment on lines +4634 to +4636
/// A cache backend that serializes entries to bytes on insert and
/// deserializes on get. Entries without a codec are stored in-memory
/// as-is. This simulates a disk or remote cache backend.
Member

Should this comment be on SerializingBackend?

wjones127 and others added 5 commits April 1, 2026 11:59
IvfIndexState was defined in lance-index but only produced and consumed
within the lance crate, forcing all fields to be pub and requiring the
VectorIndexData trait as type-erasure glue. This moves IvfIndexState
into lance, removes cacheable_state() from the VectorIndex trait, and
extracts the index state before type erasure instead of going through
a trait method + downcast.

Also converts LegacyVectorIndexCacheKey from UnsizedCacheKey to sized
CacheKey, removing the VectorIndexData trait entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each IvfIndex reconstruction called open_reader_cached twice, which
did spawn_blocking(|| File::open(...)) even when all partitions were
already cached and the readers were never used. This caused ~56%
regression in single-threaded cached query throughput.

Cache Arc<dyn EncodingsIo> for both index and aux files under
CachedIndexEncodingsIoKey. On warm cache hits, FileReader is
reconstructed from the cached IO + already-cached CachedFileMetadata,
with zero spawn_blocking calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace `CachedIndexEncodingsIo` (which wrapped `Arc<dyn EncodingsIo>`)
with `CachedIndexReaders`, which caches the `Arc<FileReader>` values
directly. This avoids repeated `spawn_blocking(File::open)` calls on
warm-path reconstructions without the complexity of the IO handle layer.

Violates the no-shared-data cache constraint (FileReader shares
CachedFileMetadata with the separate FileMetadataCacheKey entry), but
this is an acceptable temporary simplification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The moka cache backend uses `optionally_get_with` which already
deduplicates concurrent loads for the same key, making the manual
mutex + double-check pattern redundant.

Also moves the bounds check before the cache lookup, and moves
`metrics.record_part_load()` into the loader so it only fires on
actual loads (previously it fired even when the double-check after
the lock found a cached entry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wjones127 and others added 2 commits April 6, 2026 17:00
Replace the non-generic `IvfIndexState` with `IvfIndexState<Q>` that
stores `Q::Metadata` directly, avoiding JSON re-parsing on every
warm-path reconstruction from cache.

Add `IvfStateEntry` trait and `IvfStateEntryBox` wrapper:
- `IvfStateEntry::reconstruct` replaces `reconstruct_vector_index`,
  dispatching only on `sub_index_type` (S) since Q is already encoded
  in the concrete type. The previous 9-arm match (all S×Q pairs) is
  split into a 2-arm match on S per Q variant.
- `IvfStateEntryBox` is the `CacheKey::ValueType`; its `CacheCodecImpl`
  holds the full Q-dispatch for deserialization so callers get a typed
  entry back and call `.reconstruct()` directly — no match at the call
  site, no downcast.

The wire format (header JSON + protobuf sections) is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…SizeOf for CachedIndexReaders

`CachedFileSize::new(0)` already equals `unknown()` by the type's documented
invariant, so remove the redundant `if size > 0` guard in `open_reader_cached`
and document the 0=unknown semantic on `new()`.

`CachedIndexReaders::deep_size_of_children` previously returned 0. Now returns
`size_of::<FileReader>() * 2` plus the `Arc<CachedFileMetadata>` heap contents
for both readers. May over-count vs `FileMetadataCacheKey` entries across cache
entries, but over-counting is safer than 0 for eviction purposes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wjones127 and others added 5 commits April 7, 2026 13:36
…ding

Previously read_ipc_stream / read_ipc_stream_single / read_one_ipc_message
took &mut dyn Read and copied each IPC message body into a fresh
MutableBuffer before decoding. Now they take &Bytes and use Bytes::slice
(a reference-count increment) to produce each message Buffer. FileDecoder
then creates array buffers as sub-slices of the same allocation — no body
copy.

Add read_ipc_stream_to_bytes for callers that receive data from a streaming
dyn Read source (partition_serde deserializers). These do one copy into
memory, then decode zero-copy from the resulting Bytes.

Fix the zero-copy test assertion, which previously compared array buffers
to each other. It now checks that each buffer's pointer falls within the
input Bytes allocation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously `CacheCodecImpl::deserialize` accepted `&mut dyn Read`,
requiring callers to wrap byte slices in a `Cursor`. Now it takes
`&bytes::Bytes` directly, enabling zero-copy slice operations in
implementors (e.g. IVF partition deserialization reuses sub-slices of
the input buffer without copying).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wjones127 wjones127 merged commit 774d1bd into lance-format:main Apr 7, 2026
28 checks passed
Labels: enhancement (New feature or request), python

2 participants