Skip to content

feat: vortex-datafusion segment cache#7809

Draft
ch-sc wants to merge 2 commits intovortex-data:developfrom
ch-sc:ch-sc/vortex-datafusion-segment-cache
Draft

feat: vortex-datafusion segment cache#7809
ch-sc wants to merge 2 commits intovortex-data:developfrom
ch-sc:ch-sc/vortex-datafusion-segment-cache

Conversation

@ch-sc
Copy link
Copy Markdown

@ch-sc ch-sc commented May 6, 2026

Summary

vortex-layout already provides a SegmentCache trait and a MokaSegmentCache implementation, and vortex-file's OpenOptions already has with_segment_cache() that wires it into every file open. vortex-datafusion just did not thread this through and so every new query re-reads data segments.

This PR:

  • Adds segment_cache: Option<Arc<dyn SegmentCache>> to VortexOpener and VortexSource.
  • Passes the cache into OpenOptions::with_segment_cache() when opening each file in VortexOpener::open().
  • Exposes VortexSource::with_segment_cache() so callers can inject a shared cache instance.

Closes: #000

API Changes

 impl VortexSource {
      /// Sets a [`SegmentCache`] to reuse decoded segment bytes across scans of the same file.
      pub fn with_segment_cache(self, segment_cache: Arc<dyn SegmentCache>) -> Self;
  }

Testing

The segment_cache field defaults to None, so all existing behavior is unchanged when the cache is not configured.

ch-sc added 2 commits May 6, 2026 13:15
Signed-off-by: Christoph Schulze <christoph.schulze@polygon.io>
…rtex-datafusion-segment-cache

Signed-off-by: Christoph Schulze <christoph.schulze@polygon.io>
@AdamGS
Copy link
Copy Markdown
Contributor

AdamGS commented May 6, 2026

I think there's an issue here where that the segment caches uses SegmentId for its keys, so you can't be re-used across files as they are not unique IIRC.
In some cases this will probably work correctly, but often a single VortexSource will be shared between multiple files.

@ch-sc
Copy link
Copy Markdown
Author

ch-sc commented May 6, 2026

Ah, I see. SegmentId is file dependent, but only consists of a u32.

I don't have a clear solution in mind, but if SegmentId would contain file information (maybe path and an e-tag for updates to the file) we could resolve the file dependency.

@ch-sc ch-sc marked this pull request as draft May 6, 2026 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants