
feat: pluggable index cache via CacheBackend trait#6222

Merged

wjones127 merged 27 commits into lance-format:main from wjones127:feat/pluggable-index-cache on Mar 29, 2026


Conversation


@wjones127 wjones127 commented Mar 18, 2026

Previously the Session's index cache was hardcoded to Moka. This adds a
CacheBackend trait so users can provide their own cache implementation
(e.g. Redis-backed, disk-backed, shared across processes), with
MokaCacheBackend as the default.

Serialization and zero-IO reconstruction are in the follow-up PR #6223.

Two-layer design:
- CacheBackend: object-safe async trait with opaque byte keys. This is
  what plugin authors implement (get, insert, invalidate_prefix, clear,
  num_entries, size_bytes).
- LanceCache: typed wrapper handling key construction (prefix + type
  tag), type-safe get/insert, DeepSizeOf size computation, hit/miss
  stats, and concurrent load deduplication.

MokaCacheBackend is the default, preserving existing behavior. Custom
backends are wired through Session::with_index_cache_backend() or
DatasetBuilder::with_index_cache_backend().
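As a rough illustration of the split, here is a simplified synchronous sketch (the real CacheBackend is an async, object-safe trait; all names below are illustrative, not the actual API):

```rust
use std::any::Any;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Implementor-facing layer: opaque byte keys, type-erased values.
trait CacheBackend: Send + Sync {
    fn get(&self, key: &[u8]) -> Option<Arc<dyn Any + Send + Sync>>;
    fn insert(&self, key: Vec<u8>, value: Arc<dyn Any + Send + Sync>);
}

/// HashMap-backed stand-in for a backend such as MokaCacheBackend.
#[derive(Default)]
struct MapBackend {
    entries: Mutex<HashMap<Vec<u8>, Arc<dyn Any + Send + Sync>>>,
}

impl CacheBackend for MapBackend {
    fn get(&self, key: &[u8]) -> Option<Arc<dyn Any + Send + Sync>> {
        self.entries.lock().unwrap().get(key).cloned()
    }
    fn insert(&self, key: Vec<u8>, value: Arc<dyn Any + Send + Sync>) {
        self.entries.lock().unwrap().insert(key, value);
    }
}

/// User-facing layer: builds the full key (prefix + user key + type tag)
/// and handles the type-safe downcast on the way out.
struct TypedCache {
    backend: Arc<dyn CacheBackend>,
    prefix: Vec<u8>,
}

impl TypedCache {
    fn make_key<T: 'static>(&self, user_key: &str) -> Vec<u8> {
        let mut k = self.prefix.clone();
        k.extend_from_slice(user_key.as_bytes());
        k.push(0); // separator between user key and type tag
        k.extend_from_slice(std::any::type_name::<T>().as_bytes());
        k
    }
    fn insert<T: Send + Sync + 'static>(&self, key: &str, value: Arc<T>) {
        self.backend.insert(self.make_key::<T>(key), value);
    }
    fn get<T: Send + Sync + 'static>(&self, key: &str) -> Option<Arc<T>> {
        self.backend
            .get(&self.make_key::<T>(key))
            .and_then(|v| v.downcast::<T>().ok())
    }
}
```

Because the type tag is part of the key, a `get::<T>` with the wrong `T` simply misses rather than panicking on downcast.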

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the enhancement New feature or request label Mar 18, 2026
@github-actions

PR Review: feat: pluggable index cache via CacheBackend trait

P0: TOCTOU race in get_or_insert — dedup is broken

The check-then-register pattern in get_or_insert has a race between the two separately-locked critical sections:

// Block 1: check (drops lock at end of block)
{
    let map = self.in_flight.lock().await;
    if let Some(rx) = map.get(&cache_key) { ... }
}
// <-- gap: another task can also see an empty map here

// Block 2: register as leader (re-acquires lock)
let (tx, rx) = tokio::sync::watch::channel(None);
{
    let mut map = self.in_flight.lock().await;
    map.insert(cache_key.clone(), rx);
}

Between dropping the lock in block 1 and re-acquiring it in block 2, N tasks can all see an empty in-flight map and all decide they are the "leader." The last one to insert overwrites the earlier receivers, so:

  1. The loader runs N times (dedup completely defeated).
  2. Tasks that cloned an earlier receiver get RecvError when that sender drops without the value propagating through the map.

Fix: Merge the check and register into a single critical section:

let mut map = self.in_flight.lock().await;
if let Some(rx) = map.get(&cache_key) {
    let mut rx = rx.clone();
    drop(map);
    // ... wait for leader ...
} else {
    let (tx, rx) = tokio::sync::watch::channel(None);
    map.insert(cache_key.clone(), rx);
    drop(map);
    // ... run loader ...
}

The test_get_or_insert_dedup test passes by luck because tokio::task::yield_now() with a broadcast barrier doesn't reliably interleave into this gap. A test with a longer sleep or explicit synchronization at the gap point would expose this.

P1: type_tag relies on unspecified pointer identity

fn type_tag<T: 'static>() -> [u8; 8] {
    (std::any::type_name::<T>().as_ptr() as u64).to_le_bytes()
}

The Rust spec does not guarantee that type_name::<T>().as_ptr() returns the same address across calls, nor that distinct types produce distinct addresses (string deduplication/interning is compiler-implementation-defined). The old code used TypeId as a key component, which has correct identity semantics.

If two different types happen to collide (same pointer or same string literal address), the cache will silently return a wrong type and the downcast::<T>().unwrap() will panic at runtime.

Consider using TypeId bytes instead. TypeId implements Hash and Eq, so you could hash it to get a stable discriminator, or use transmute to extract its bytes (it's currently 128 bits).
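The suggested TypeId-based discriminator could look something like this (a hedged sketch, not code from the PR; the hash is stable within a process, which is all an in-memory cache key needs):

```rust
use std::any::TypeId;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 8-byte type discriminator derived from TypeId's Hash impl rather than
/// from type_name().as_ptr(). TypeId has real identity semantics, so two
/// distinct types collide only if their 64-bit hashes collide.
fn type_tag<T: 'static>() -> [u8; 8] {
    let mut hasher = DefaultHasher::new();
    TypeId::of::<T>().hash(&mut hasher);
    hasher.finish().to_le_bytes()
}
```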

P1: invalidate_prefix is fire-and-forget

pub fn invalidate_prefix(&self, prefix: &str) {
    let prefix_bytes = self.make_prefix(prefix);
    let cache = self.cache.clone();
    tokio::spawn(async move {
        cache.invalidate_prefix(&prefix_bytes).await;
    });
}

This means invalidation is not guaranteed to complete before subsequent cache reads. A caller that invalidates then immediately reads could get stale data. The old code was synchronous. If this must remain sync (non-async signature), at minimum document this caveat prominently, or consider changing the signature to async fn.

Minor notes

  • DeepSizeOf for LanceCache now returns 0, which is a silent behavioral regression for any code that relies on deep_size_of() for memory accounting. Consider at least delegating to approx_size_bytes().
  • WeakLanceCache::get_or_insert lost all dedup behavior — concurrent loads for the same key will all run the loader independently. The old code used moka's optionally_get_with which handled this.

wjones127 and others added 2 commits March 18, 2026 20:04
Add type_name()/type_id() to CacheKey and UnsizedCacheKey traits so
backends can identify the type of cached entries. Add parse_cache_key()
utility for backends to extract (user_key, type_id) from opaque key
bytes.

CacheKey-based methods now pipe the key's type_id through to the
backend. Non-CacheKey methods use type_id_of::<T>() as a sentinel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Remove #[cfg(test)] convenience methods; tests now use CacheKey
   via a TestKey helper, eliminating the parallel method hierarchy.

2. Fix dedup race condition: re-check the cache while holding the
   in-flight lock so no two tasks can both become leader for the
   same key.

3. Use Arc::try_unwrap on the leader error path to preserve the
   original error type when possible.

4. Make invalidate_prefix async instead of fire-and-forget spawn.

5. Replace type_name().as_ptr() with a hash of std::any::TypeId for
   stable type discrimination. Defined once in type_id_of() and used
   by CacheKey::type_id() default.

6. Add dedup to WeakLanceCache::get_or_insert, sharing the in-flight
   map from the parent LanceCache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Address feedback:

1. Move get_or_insert() onto CacheBackend. The method takes a pinned
   future (not a closure), so LanceCache can type-erase the user's
   non-'static loader before passing it to the backend. Default impl
   does simple get-then-insert; MokaCacheBackend uses moka's built-in
   optionally_get_with for dedup. This eliminates duplicated dedup
   logic and the manual watch-channel machinery.

2. Restore type_name().as_ptr() for type_id derivation on CacheKey.
   Remove standalone type_id_of() function. The derivation lives in
   one place: CacheKey::type_id()/UnsizedCacheKey::type_id().

3. Remove approx_size_bytes from CacheBackend trait and Session debug
   output. Only approx_num_entries remains.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
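A sketch of the shape of item 1 above, with illustrative names: the loader arrives as an already-boxed future, which keeps the trait object-safe (a generic closure parameter would forbid `dyn CacheBackend`). The value type is fixed to `u64` here purely to keep the sketch short; the real trait is type-erased.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

type BoxFut<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

trait CacheBackend: Send + Sync {
    /// Return the cached value, or run `loader` and cache its result.
    /// This impl mirrors the simple get-then-insert default; a Moka-backed
    /// impl could instead delegate to moka's optionally_get_with so that
    /// concurrent loads for the same key are deduplicated.
    fn get_or_insert<'a>(&'a self, key: Vec<u8>, loader: BoxFut<'a, u64>) -> BoxFut<'a, u64>;
}

struct MapBackend {
    entries: Mutex<HashMap<Vec<u8>, u64>>,
}

impl CacheBackend for MapBackend {
    fn get_or_insert<'a>(&'a self, key: Vec<u8>, loader: BoxFut<'a, u64>) -> BoxFut<'a, u64> {
        Box::pin(async move {
            if let Some(v) = self.entries.lock().unwrap().get(&key) {
                return *v;
            }
            let v = loader.await; // the lock is not held across this await
            self.entries.lock().unwrap().insert(key, v);
            v
        })
    }
}

/// Minimal busy-polling executor so the sketch runs without an async runtime.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}
```

Boxing once at the call site, in the typed wrapper, is also what lets a non-`'static` loader be type-erased before it reaches the backend, as the commit describes.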
@wjones127 wjones127 force-pushed the feat/pluggable-index-cache branch from 56a3273 to 00867ad on March 19, 2026 16:56
wjones127 and others added 3 commits March 19, 2026 10:05
Remove all methods that bypass CacheKey from WeakLanceCache (get,
insert, get_or_insert, get_unsized, insert_unsized). Remove
insert_unsized/get_unsized from LanceCache. Remove type_tag helper.
All cache access now goes through CacheKey/UnsizedCacheKey.

Make parse_cache_key return (empty, 0) instead of panicking on short
keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore approx_size_bytes on CacheBackend so DeepSizeOf on LanceCache
reports actual cache memory usage (used by Session::size_bytes). Fixes
test_metadata_cache_size Python test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/pluggable-index-cache branch from 369afe2 to 1ba4ac3 on March 19, 2026 18:29
@wjones127 wjones127 marked this pull request as ready for review March 19, 2026 19:05
wjones127 and others added 6 commits March 19, 2026 17:07
The type_name().as_ptr() approach for type discrimination was unstable
across crate boundaries due to monomorphization. Replace with an
explicit fn type_id() -> &'static str that each CacheKey impl provides
as a short human-readable literal (e.g. 'Vec<IndexMetadata>', 'Manifest').

Key format changes from user_key\0<8 LE bytes> to user_key\0<type_id str>.
parse_cache_key() now returns (&[u8], &str).
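Under this scheme, a key impl might look roughly like the following (illustrative names; `Vec<u8>` stands in for the real value type, and the later `type_name` spelling of this method is used):

```rust
/// Sketch of the explicit-literal scheme: each key type supplies a short,
/// stable string, and the encoded key becomes user_key\0type_name.
trait CacheKey {
    type ValueType;
    fn key(&self) -> String;
    /// Must be unique per value type and stable across crate boundaries.
    fn type_name() -> &'static str;
}

struct ManifestKey {
    version: u64,
}

impl CacheKey for ManifestKey {
    type ValueType = Vec<u8>; // stand-in for the real Manifest type
    fn key(&self) -> String {
        format!("manifest/{}", self.version)
    }
    fn type_name() -> &'static str {
        "Manifest"
    }
}

fn encode<K: CacheKey>(k: &K) -> Vec<u8> {
    let mut out = k.key().into_bytes();
    out.push(0); // NUL separator, as in the key format above
    out.extend_from_slice(K::type_name().as_bytes());
    out
}
```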
Add IvfIndexState struct and serialization to lance-index, enabling
IVFIndex to export its reconstructable state (IVF model, quantizer
metadata) without non-serializable handles. Add reconstruct_vector_index()
which rebuilds an IVFIndex from cached state by re-opening FileReaders
(cheap with warm metadata cache) instead of re-fetching global buffers
from object storage.

Also adds IvfQuantizationStorage::from_cached() to skip global buffer
reads during reconstruction, and Session::file_metadata_cache() to
expose the metadata cache for the reconstruction context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reconstructed VectorIndex instances need the original cache key prefix
to share partition entries with the two-tier cache backend. Also adds
LanceCache::with_backend_and_prefix() and WeakLanceCache::prefix().

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@wjones127 wjones127 marked this pull request as draft March 20, 2026 18:58
wjones127 and others added 3 commits March 20, 2026 12:32
Previously, the disk cache codec reconstructed `Arc<dyn VectorIndex>`
from `IvfIndexState` during deserialization, requiring a
`ReconstructionContext` with deferred OnceLock initialization and
sync-to-async runtime juggling. The ObjectStore in that context also
lacked proper credential wrappers.

Now the cache stores `Arc<dyn VectorIndexData>` (serializable state)
instead of `Arc<dyn VectorIndex>` (live index). Lance's
`open_vector_index()` detects cached state and reconstructs using its
own ObjectStore (with credentials) and metadata cache. This eliminates
the ReconstructionContext, OnceLock pattern, and runtime juggling.

Changes:
- Add VectorIndexData trait (lance-index) with write_to/as_any/tag
- Add DeepSizeOf impl for IvfIndexState
- Change VectorIndexCacheKey::ValueType to dyn VectorIndexData
- Add reconstruction-from-cache path in open_vector_index()
- Fix panicking downcast in LanceCache::get_with_id (return None)
- Add Debug/Clone/Copy derives to SubIndexType

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wjones127 and others added 2 commits March 20, 2026 16:06
- Split cache.rs into submodules (backend, keys, moka, mod)
- Rename CacheKey::type_id() to type_name() across all implementors
- Improve CacheBackend and get_or_insert docs
- Add Spillable trait with writer-based serialize for partition_serde
- Cache file metadata and file sizes to enable zero-IO reconstruction
- Add test_reconstruct_from_cache_zero_io test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Move VectorIndexData, IvfIndexState, partition_serde, cacheable_state,
and zero-IO reconstruction out of this PR to keep it focused on the
pluggable cache backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wjones127 and others added 3 commits March 20, 2026 16:59
The method was renamed in lance-format#6209 but the test call site in v2.rs was not
updated during the merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The file_sizes parameter on IVFIndex::try_new and the file_size_map()
usage in open_vector_index were from merged PR lance-format#5497, not the
serialization PR. Restoring them avoids unnecessary HEAD requests.
Also restores vector index cache check in open_generic_index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wjones127 and others added 4 commits March 26, 2026 14:11
… contract

- Presize `make_cache_key` Vec to avoid intermediate String allocation
- Extract `cache_entry_size` helper to replace magic `+ 8` pattern
- Document `type_name` uniqueness requirement on CacheKey/UnsizedCacheKey
- Remove unused derives from SubIndexType (confirmed compiles without them)
- Use moka's `weighted_size()` instead of iterating in `approx_size_bytes`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inline `get_or_insert_with_id`, `insert_unsized_with_id`, and
`get_unsized_with_id` into their sole public callers. Keep
`insert_with_id` and `get_with_id` as shared helpers since they're
used by both sized and unsized paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace `&[u8]` keys in `CacheBackend` with a structured `InternalCacheKey`
type that provides direct access to prefix, key, and type_name fields.
This eliminates the need for `parse_cache_key()` and `make_cache_key()`.

Also change `get_or_insert` to return `(CacheEntry, bool)` where the bool
indicates whether the entry was already cached. This enables accurate
hit/miss tracking instead of counting all get_or_insert calls as misses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
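The `(entry, was_cached)` contract described above can be illustrated with a toy synchronous version (illustrative names only):

```rust
use std::collections::HashMap;

struct Stats {
    hits: u64,
    misses: u64,
}

/// Returns the value plus a flag saying whether it was already cached,
/// so the caller can record a hit or a miss accurately instead of
/// counting every get_or_insert call as a miss.
fn get_or_insert(
    map: &mut HashMap<String, u64>,
    stats: &mut Stats,
    key: &str,
    load: impl FnOnce() -> u64,
) -> (u64, bool) {
    if let Some(v) = map.get(key) {
        stats.hits += 1;
        return (*v, true);
    }
    stats.misses += 1;
    let v = load();
    map.insert(key.to_string(), v);
    (v, false)
}
```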
`weighted_size()` can be stale without `run_pending_tasks()` (which is
async). Revert `approx_size_bytes` to iterating entries so the
synchronous `DeepSizeOf` path returns accurate values.

Also remove reference to private `keys` module in module-level doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 self-assigned this Mar 26, 2026
@wjones127 wjones127 added rust Rust related tasks indexes Related to secondary index implementations need triaging labels Mar 26, 2026
@wjones127 wjones127 removed their assignment Mar 26, 2026
@wjones127 wjones127 marked this pull request as ready for review March 26, 2026 22:23
wjones127 added a commit to wjones127/lance that referenced this pull request Mar 26, 2026
Revert cache module split, CacheBackend trait, and related changes
that are in the pluggable-index-cache PR. Keep only the partition
serde and FlatBin quantizer changes unique to this branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…acheKey::type_name

No implementation uses `self`, so make `type_name` an associated function
instead of a method. Update all call sites to use `K::type_name()`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member

@westonpace westonpace left a comment

Just some nit picking

Comment thread rust/lance-core/src/cache/backend.rs Outdated
/// caller. Boxing the future once at the call site (in `LanceCache`)
/// avoids this lifetime conflict while keeping the trait object-safe.
///
/// The future borrows from the caller's scope and will be `.await`ed within
Member

All this description feels kind of redundant / obvious. Like, if I'm a cache and I keep the loader around I'm gonna start leaking memory or file handles or something.

My first instinct is that it's clutter, but maybe in this day of agents it is better safe than sorry?

async fn num_entries(&self) -> usize;

/// Total weighted size in bytes of all stored entries (may flush pending operations).
async fn size_bytes(&self) -> usize;
Member

We should assume here that no two entries share buffers? It's probably the best we can do.

Contributor Author

Yes, we do. There are ways we can detect this in the future. You may be interested in this PR: #6229


/// Approximate number of entries, callable from synchronous contexts.
/// Backends that cannot provide this cheaply should return 0.
fn approx_num_entries(&self) -> usize {
Member

Would it be better for these to return Option with default of None?

Comment thread rust/lance-core/src/cache/keys.rs Outdated
Comment thread rust/lance-core/src/cache/keys.rs Outdated
}
}

pub trait CacheKey {
Member

Why is there both InternalCacheKey and CacheKey? Should this trait be documented?

Comment thread rust/lance-core/src/cache/keys.rs Outdated
fn type_name() -> &'static str;
}

pub trait UnsizedCacheKey {
Member

Document this? Why is it needed? (Yes, I know this is just copied from existing but still)

Comment thread rust/lance-core/src/cache/keys.rs Outdated
/// Short, stable string that distinguishes this value type from others in
/// the cache. Used as the suffix in the encoded cache key (`user_key\0type_name`).
///
/// **Must be unique per value type.** If two `CacheKey` impls return the
Member

"Unique per value type" is strange wording. I think you're trying to say that two type names are equal only if their value types are equal?

Or, to put another way, if two value types are not equal they must not have equal type names?

Contributor Author

Yes. This is meant to be used as a key in a registry of types we'll use later for deserialization of cache entries.

}

fn type_name() -> &'static str {
std::any::type_name::<T>()
Member

Didn't you say not to use this?

Contributor Author

Added a comment. We can use this internally, but shouldn't call it directly from other crates.

Comment thread rust/lance-core/src/cache/keys.rs Outdated
/// same `type_name` but different `ValueType`s, entries will collide and
/// downcasts will fail silently (returning `None` on get).
///
/// Must be consistent across crate boundaries — use a short literal, not
Member

What does consistent across crate boundaries mean?

Restructure the cache module so the two audiences are clear:
- mod.rs: user-facing API (LanceCache, CacheKey, UnsizedCacheKey)
- backend.rs: implementor-facing API (CacheBackend, InternalCacheKey)

Delete keys.rs, moving CacheKey/UnsizedCacheKey into mod.rs and
InternalCacheKey into backend.rs.

Also: rename has_prefix -> starts_with, trim verbose get_or_insert doc,
add shared-buffer note to approx_size_bytes, improve type_name docs,
explain std::any::type_name usage in previous/reader.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 merged commit 7342abf into lance-format:main Mar 29, 2026
33 of 34 checks passed
wjones127 added a commit to wjones127/lance that referenced this pull request Mar 29, 2026
Revert cache module split, CacheBackend trait, and related changes
that are in the pluggable-index-cache PR. Keep only the partition
serde and FlatBin quantizer changes unique to this branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
