feat: introduce CacheCodec for serializing index cache entries #6223
PR Review: feat: add serialize/deserialize for IVF PQ partition cache entries

Clean implementation overall — the zero-copy IPC approach is well-suited for cache serde. A few issues to flag:

**P1: Integer overflow in deserialization offset arithmetic**

```rust
let sub_index_end = sub_index_start + header.sub_index_len as usize;
let codebook_end = codebook_start + header.codebook_len as usize;
let storage_end = storage_start + header.storage_len as usize;
```

If the header contains corrupted or adversarial values (e.g., lengths close to `u64::MAX`), these additions can overflow. Use checked arithmetic and return an error instead of wrapping.

**P1: Missing**
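A minimal sketch of the checked-arithmetic fix the review suggests. The field semantics come from the quoted snippet; the `section_end` helper and its `String` error type are hypothetical.

```rust
// Hypothetical helper sketching the suggested fix: compute a section's end
// offset with checked arithmetic so corrupted headers produce an error
// instead of wrapping (release) or panicking (debug).
fn section_end(start: usize, len: u64, total: usize) -> Result<usize, String> {
    let len = usize::try_from(len).map_err(|_| "length overflows usize".to_string())?;
    let end = start
        .checked_add(len)
        .ok_or_else(|| "offset arithmetic overflowed".to_string())?;
    if end > total {
        return Err(format!("section end {end} exceeds buffer length {total}"));
    }
    Ok(end)
}

fn main() {
    // e.g. a 16-byte sub_index section starting at offset 8 of a 64-byte buffer.
    assert_eq!(section_end(8, 16, 64), Ok(24));
    // Adversarial header: a length near u64::MAX is rejected, not wrapped.
    assert!(section_end(8, u64::MAX, 64).is_err());
    // A section that runs past the buffer is also rejected.
    assert!(section_end(60, 16, 64).is_err());
}
```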
```rust
//! Each IPC section is a complete Arrow IPC file. On deserialization, the IPC
//! sections are read zero-copy using [`FileDecoder`] so that Arrow arrays
//! reference the original buffer directly.
```
My big worry with this is that the DeepSizeOf will no longer be accurate after a roundtrip.
This concern will be addressed by #6229. As long as we aren't sharing buffers across cache entries, we should be fine.
CacheCodec for serializing index cache entries
Previously the Session's index cache was hardcoded to Moka. This adds a `CacheBackend` trait so users can provide their own cache implementation, with `MokaCacheBackend` as the default. Serialization and zero-IO reconstruction are in the follow-up PR #6223.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I think this PR might be in need of a rebase or something?
Add CacheCodec infrastructure and IvfIndexState for serializing IVF index cache entries. This enables disk-backed cache backends to persist index state across process restarts.

Key changes:

- CacheCodecImpl trait + CacheCodec type-erased codec in lance-core
- IvfIndexState: serializable snapshot of IVF index metadata that reconstructs into a live VectorIndex without IO
- PartitionEntry serde via Arrow IPC (lance-arrow::ipc utilities)
- CacheKey::codec() method for opt-in serialization per cache key
- Fix FileMetadataCacheKey prefix mismatch for zero-IO reconstruction
- Fix FlatBin quantizer type handling in index type strings and remap

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@westonpace Okay, I've finally finished rebasing and refactoring. I think it's good to review again.
westonpace left a comment:
Some questions but overall I think this is some great work. I guess I didn't appreciate how much complexity there would be in serializing all the different kinds of cache keys.
```rust
/// Structured cache key passed to [`CacheBackend`] methods.
///
/// Composed of three parts:
/// Backend authors receive these ready-made from [`LanceCache`](super::LanceCache)
```
Suggested change:

```diff
- /// Backend authors receive these ready-made from [`LanceCache`](super::LanceCache)
+ /// CacheBackend impls receive these ready-made from [`LanceCache`](super::LanceCache)
```
```rust
/// `CacheCodec` is two plain function pointers — it is `Copy` and has no
/// heap allocation. Construct one via [`CacheCodec::from_impl`] for types
/// that implement [`CacheCodecImpl`], or [`CacheCodec::new`] for custom
/// cases (e.g. when the orphan rule prevents a direct impl).
```
Huh...why not Box<dyn CacheCodecImpl>? Is it because you're expecting to need to violate the orphan rule?
`Box<dyn CacheCodecImpl>` only works for `serialize()`. `deserialize()` returns `Result<Self>`, so it can't be called from `&dyn CacheCodecImpl`. This is why there is a `where Self: Sized` bound on the trait method.

In `CacheCodec::deserialize`, it returns `Result<ArcAny>`, which works without having to know the concrete type.
I went through a lot of iterations on these traits and structs. This is one place Rust gets really complicated 🫣
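A minimal, self-contained sketch of the shape being discussed (simplified names and signatures, not the actual Lance API): a method returning `Self` is excluded from the vtable by `where Self: Sized`, so a `Box<dyn CacheCodecImpl>` could serialize but never deserialize; two non-capturing function pointers erase the type instead.

```rust
use std::any::Any;
use std::sync::Arc;

// Simplified stand-in for the trait under discussion (real signatures differ).
trait CacheCodecImpl {
    fn serialize(&self) -> Vec<u8>;
    // Returns Self, so this method cannot appear in a vtable; `where Self:
    // Sized` excludes it, meaning Box<dyn CacheCodecImpl> can serialize
    // but never deserialize.
    fn deserialize(bytes: &[u8]) -> Option<Self>
    where
        Self: Sized;
}

// Type-erased codec: two plain function pointers. It is Copy and allocates
// nothing; non-capturing closures coerce to fn pointers.
#[derive(Clone, Copy)]
struct CacheCodec {
    serialize: fn(&dyn Any) -> Vec<u8>,
    deserialize: fn(&[u8]) -> Option<Arc<dyn Any + Send + Sync>>,
}

impl CacheCodec {
    fn from_impl<T: CacheCodecImpl + Send + Sync + 'static>() -> Self {
        CacheCodec {
            serialize: |v| v.downcast_ref::<T>().expect("type mismatch").serialize(),
            deserialize: |b| {
                T::deserialize(b).map(|v| Arc::new(v) as Arc<dyn Any + Send + Sync>)
            },
        }
    }
}

struct Entry(u32);

impl CacheCodecImpl for Entry {
    fn serialize(&self) -> Vec<u8> {
        self.0.to_le_bytes().to_vec()
    }
    fn deserialize(bytes: &[u8]) -> Option<Self> {
        Some(Entry(u32::from_le_bytes(bytes.try_into().ok()?)))
    }
}

fn main() {
    let codec = CacheCodec::from_impl::<Entry>();
    let bytes = (codec.serialize)(&Entry(42));
    let restored = (codec.deserialize)(&bytes).unwrap();
    assert_eq!(restored.downcast_ref::<Entry>().unwrap().0, 42);
}
```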
```rust
/// Returns `None` by default. Cache backends that support persistence
/// (e.g. disk-backed caches) use this to serialize entries on insert and
/// deserialize on get. Types without a codec will only be stored in-memory.
```
What happens in a disk cache if this returns None? Do these caches fail? Can we document that fact?
Up to the cache backend. If it's a disk-only cache, then it simply won't cache the entry. If it's a memory-and-disk cache, then it can store the entry in memory, but the entry won't spill to disk when evicted.
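The policy in this reply can be written down as a tiny sketch (the `Placement` enum and `placement` function are hypothetical; real backends decide this internally).

```rust
// Hypothetical placement policy for an entry, depending on whether
// CacheKey::codec() returned Some and what tiers the backend has.
#[derive(Debug, PartialEq)]
enum Placement {
    MemoryThenDisk, // hybrid cache, codec available: spills to disk on eviction
    MemoryOnly,     // hybrid cache, no codec: dropped entirely when evicted
    DiskOnly,       // disk-only cache, codec available
    NotCached,      // disk-only cache, no codec: the entry is simply not cached
}

fn placement(has_codec: bool, backend_has_memory_tier: bool) -> Placement {
    match (backend_has_memory_tier, has_codec) {
        (true, true) => Placement::MemoryThenDisk,
        (true, false) => Placement::MemoryOnly,
        (false, true) => Placement::DiskOnly,
        (false, false) => Placement::NotCached,
    }
}

fn main() {
    // Disk-only backend, no codec: nothing is cached.
    assert_eq!(placement(false, false), Placement::NotCached);
    // Hybrid backend, no codec: in-memory only, lost on eviction.
    assert_eq!(placement(false, true), Placement::MemoryOnly);
    // Hybrid backend with codec: eviction can spill to disk.
    assert_eq!(placement(true, true), Placement::MemoryThenDisk);
}
```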
```rust
///
/// Produced by [`VectorIndex::cacheable_state`] and consumed by
/// `reconstruct_vector_index` in the `lance` crate. All fields are `pub`
/// because the type is shared across crate boundaries; they are **not**
```
```text
[header_json_len: u64 LE][header JSON][ivf_pb_len: u64 LE][ivf protobuf]
[extra_len: u64 LE][extra bytes][aux_ivf_pb_len: u64 LE][aux_ivf protobuf]
```
I guess there is no need to worry about stability here? Do we plan for disk caches to be completely lost on upgrades? Or are we actually thinking we might want to keep these stable between versions?
For now, I'm marking them as unstable and experimental. I think we can see about making them stable in the future. I'd probably put some magic bytes and version in the header when we do.
> Do we plan for disk caches to be completely lost on upgrades?
Given we have a pre-warming mechanism, I feel okay with letting disk caches be ephemeral to start.
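The length-prefixed framing described in this thread can be sketched in a few std-only lines (the real encoder, its error handling, and the magic-bytes/version header mentioned above may differ).

```rust
// Each section is a u64 little-endian length prefix followed by its bytes.
fn write_section(out: &mut Vec<u8>, section: &[u8]) {
    out.extend_from_slice(&(section.len() as u64).to_le_bytes());
    out.extend_from_slice(section);
}

// Reads one section, advancing `buf`; None on truncated input.
fn read_section<'a>(buf: &mut &'a [u8]) -> Option<&'a [u8]> {
    if buf.len() < 8 {
        return None;
    }
    let len = u64::from_le_bytes(buf[..8].try_into().unwrap()) as usize;
    let rest = &buf[8..];
    if rest.len() < len {
        return None;
    }
    *buf = &rest[len..];
    Some(&rest[..len])
}

fn main() {
    let mut out = Vec::new();
    // Section order from the quoted doc comment: header JSON, ivf protobuf,
    // extra bytes, aux_ivf protobuf (contents here are placeholders).
    for section in [
        b"header json".as_slice(),
        b"ivf pb".as_slice(),
        b"extra".as_slice(),
        b"aux ivf pb".as_slice(),
    ] {
        write_section(&mut out, section);
    }
    let mut cursor = out.as_slice();
    assert_eq!(read_section(&mut cursor), Some(b"header json".as_slice()));
    assert_eq!(read_section(&mut cursor), Some(b"ivf pb".as_slice()));
    assert_eq!(read_section(&mut cursor), Some(b"extra".as_slice()));
    assert_eq!(read_section(&mut cursor), Some(b"aux ivf pb".as_slice()));
    assert_eq!(read_section(&mut cursor), None);
}
```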
```rust
// ---------------------------------------------------------------------------

#[derive(Serialize, Deserialize)]
struct RabitPartitionHeader {
```
Would it make more sense to have these structs in their respective modules? E.g. rq, pq, etc.?
```rust
file_metadata_cache
    .with_key_prefix(uri.as_ref())
    .insert_with_key(&FileMetadataCacheKey, index_reader.metadata().clone())
    .await;
let aux_path = index_dir
    .child(uuid.as_str())
    .child(INDEX_AUXILIARY_FILE_NAME);
file_metadata_cache
    .with_key_prefix(aux_path.as_ref())
    .insert_with_key(&FileMetadataCacheKey, storage.reader().metadata().clone())
```
```rust
/// A cache backend that serializes entries to bytes on insert and
/// deserializes on get. Entries without a codec are stored in-memory
/// as-is. This simulates a disk or remote cache backend.
```
Should this comment be on SerializingBackend?
IvfIndexState was defined in lance-index but only produced and consumed within the lance crate, forcing all fields to be pub and requiring the VectorIndexData trait as type-erasure glue.

This moves IvfIndexState into lance, removes cacheable_state() from the VectorIndex trait, and extracts the index state before type erasure instead of going through a trait method + downcast. Also converts LegacyVectorIndexCacheKey from UnsizedCacheKey to sized CacheKey, removing the VectorIndexData trait entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each IvfIndex reconstruction called open_reader_cached twice, which did spawn_blocking(|| File::open(...)) even when all partitions were already cached and the readers were never used. This caused ~56% regression in single-threaded cached query throughput.

Cache Arc<dyn EncodingsIo> for both index and aux files under CachedIndexEncodingsIoKey. On warm cache hits, FileReader is reconstructed from the cached IO + already-cached CachedFileMetadata, with zero spawn_blocking calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace `CachedIndexEncodingsIo` (which wrapped `Arc<dyn EncodingsIo>`) with `CachedIndexReaders`, which caches the `Arc<FileReader>` values directly. This avoids repeated `spawn_blocking(File::open)` calls on warm-path reconstructions without the complexity of the IO handle layer.

Violates the no-shared-data cache constraint (FileReader shares CachedFileMetadata with the separate FileMetadataCacheKey entry), but this is an acceptable temporary simplification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The moka cache backend uses `optionally_get_with` which already deduplicates concurrent loads for the same key, making the manual mutex + double-check pattern redundant.

Also moves the bounds check before the cache lookup, and moves `metrics.record_part_load()` into the loader so it only fires on actual loads (previously it fired even when the double-check after the lock found a cached entry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the non-generic `IvfIndexState` with `IvfIndexState<Q>` that stores `Q::Metadata` directly, avoiding JSON re-parsing on every warm-path reconstruction from cache.

Add `IvfStateEntry` trait and `IvfStateEntryBox` wrapper:

- `IvfStateEntry::reconstruct` replaces `reconstruct_vector_index`, dispatching only on `sub_index_type` (S) since Q is already encoded in the concrete type. The previous 9-arm match (all S×Q pairs) is split into a 2-arm match on S per Q variant.
- `IvfStateEntryBox` is the `CacheKey::ValueType`; its `CacheCodecImpl` holds the full Q-dispatch for deserialization so callers get a typed entry back and call `.reconstruct()` directly — no match at the call site, no downcast.

The wire format (header JSON + protobuf sections) is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
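The dispatch change in this commit can be illustrated with a toy sketch (hypothetical names; the real types carry far more state): once Q is fixed by the concrete type, reconstruction matches only on the sub-index type.

```rust
use std::marker::PhantomData;

#[derive(Clone, Copy)]
enum SubIndexType {
    Flat,
    Hnsw,
}

trait Quantizer {
    fn name() -> &'static str;
}

struct Pq;
impl Quantizer for Pq {
    fn name() -> &'static str {
        "pq"
    }
}

// Q is encoded in the concrete type, so it needs no runtime dispatch.
struct IvfIndexState<Q: Quantizer> {
    sub_index_type: SubIndexType,
    _quantizer: PhantomData<Q>,
}

impl<Q: Quantizer> IvfIndexState<Q> {
    // Two arms on S instead of an arm for every (S, Q) pair.
    fn reconstruct(&self) -> String {
        match self.sub_index_type {
            SubIndexType::Flat => format!("ivf_flat_{}", Q::name()),
            SubIndexType::Hnsw => format!("ivf_hnsw_{}", Q::name()),
        }
    }
}

fn main() {
    let state = IvfIndexState::<Pq> {
        sub_index_type: SubIndexType::Hnsw,
        _quantizer: PhantomData,
    };
    assert_eq!(state.reconstruct(), "ivf_hnsw_pq");
}
```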
…SizeOf for CachedIndexReaders

`CachedFileSize::new(0)` already equals `unknown()` by the type's documented invariant, so remove the redundant `if size > 0` guard in `open_reader_cached` and document the 0=unknown semantic on `new()`.

`CachedIndexReaders::deep_size_of_children` previously returned 0. Now returns `size_of::<FileReader>() * 2` plus the `Arc<CachedFileMetadata>` heap contents for both readers. May over-count vs `FileMetadataCacheKey` entries across cache entries, but over-counting is safer than 0 for eviction purposes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ding

Previously read_ipc_stream / read_ipc_stream_single / read_one_ipc_message took &mut dyn Read and copied each IPC message body into a fresh MutableBuffer before decoding. Now they take &Bytes and use Bytes::slice (a reference-count increment) to produce each message Buffer. FileDecoder then creates array buffers as sub-slices of the same allocation — no body copy.

Add read_ipc_stream_to_bytes for callers that receive data from a streaming dyn Read source (partition_serde deserializers). These do one copy into memory, then decode zero-copy from the resulting Bytes.

Fix the zero-copy test assertion, which previously compared array buffers to each other. It now checks that each buffer's pointer falls within the input Bytes allocation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously `CacheCodecImpl::deserialize` accepted `&mut dyn Read`, requiring callers to wrap byte slices in a `Cursor`. Now it takes `&bytes::Bytes` directly, enabling zero-copy slice operations in implementors (e.g. IVF partition deserialization reuses sub-slices of the input buffer without copying).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
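The zero-copy effect these commits rely on can be illustrated with std alone (`SharedBytes` is a stand-in, not the `bytes` crate): a slice is just an offset pair over a refcounted allocation, so producing one copies no bytes.

```rust
use std::ops::Range;
use std::sync::Arc;

// Minimal stand-in for bytes::Bytes: a refcounted allocation plus a view
// range. Slicing bumps the refcount and adjusts offsets; no bytes move.
#[derive(Clone)]
struct SharedBytes {
    data: Arc<Vec<u8>>,
    range: Range<usize>,
}

impl SharedBytes {
    fn new(data: Vec<u8>) -> Self {
        let range = 0..data.len();
        Self { data: Arc::new(data), range }
    }

    fn slice(&self, range: Range<usize>) -> Self {
        let start = self.range.start + range.start;
        let end = self.range.start + range.end;
        assert!(start <= end && end <= self.range.end, "slice out of bounds");
        Self { data: Arc::clone(&self.data), range: start..end }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }
}

fn main() {
    let stream = SharedBytes::new(b"headerBODY1BODY2".to_vec());
    let body1 = stream.slice(6..11);
    assert_eq!(body1.as_slice(), b"BODY1".as_slice());
    // The assertion style the commit describes: the slice's pointer falls
    // within the original allocation, proving nothing was copied.
    let base = stream.data.as_ptr() as usize;
    let ptr = body1.as_slice().as_ptr() as usize;
    assert!(ptr >= base && ptr < base + stream.data.len());
}
```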
Introduce a `codec: CacheCodec` parameter to `CacheBackend` methods, providing the ability to serialize and deserialize entries, if desired. This struct is just a pair of function pointers: `serialize` and `deserialize`. So it's very cheap if not used.

To provide this for a type, implement the `CacheKey::codec()` method to return a `CacheCodec`. By default, this trait method returns `None`. We only implement serialization for some types right now, but the eventual goal is to implement it for all types. To keep this PR focused, I only implement this for vector indexes, but the same approach could be used for FTS and other indexes.

To make it easy to implement, we provide a `CacheCodecImpl` trait that implementors of `CacheKey` can use.

The implementation of serialization / deserialization is designed to have very low overhead. We use serde-derived strings for small metadata, and use Arrow IPC for Arrow data. When deserializing, we use zero-copy deserialization to re-use the buffers read in.
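A rough sketch of the opt-in shape described here (the actual Lance trait signatures differ; `PartitionKey`, `PartitionEntry`, and the unit `CacheCodec` are placeholders):

```rust
// Stand-in for the real CacheCodec (in Lance it is a pair of fn pointers).
#[derive(Clone, Copy)]
struct CacheCodec;

trait CacheKey {
    type ValueType;
    // Default: no codec, so the entry is only ever stored in memory.
    fn codec(&self) -> Option<CacheCodec> {
        None
    }
}

struct PartitionKey {
    partition_id: u32,
}
struct PartitionEntry;

impl CacheKey for PartitionKey {
    type ValueType = PartitionEntry;
    // Opting in: entries under this key can be serialized by the backend.
    fn codec(&self) -> Option<CacheCodec> {
        Some(CacheCodec)
    }
}

fn main() {
    let key = PartitionKey { partition_id: 7 };
    let _ = key.partition_id;
    assert!(key.codec().is_some());
}
```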
One important change is we no longer cache `Arc<dyn VectorIndex>` for V2+ vector indices. This is because `IVFIndex` (which is the V3 implementation of `VectorIndex`) contains non-serializable data structures like `FileReader`. Instead we cache `IvfIndexState`, which is a serializable representation of the vector index. Importantly, this does not require any IO to reconstruct, which is validated by the pre-existing test `test_prewarm_ivf_pq`.