[FSTORE-2030] Add support for specifying lookback windows for PIT queries #583

Open
manu-sj wants to merge 1 commit into logicalclocks:main from manu-sj:FSTORE-2030

@manu-sj manu-sj commented May 21, 2026

Summary

  • Adds a user-guide section, "Lookback window for PIT joins", to feature_view/batch-data.md covering the two modes, the dict and dataclass call shapes, partition pruning behavior, and the one-sided lower-only form.
  • Cross-links from feature_view/training-data.md so users hitting create_training_data find the same reference.

JIRA

FSTORE-2030

Test plan

  • One-sentence-per-line convention respected.
  • Python code blocks are valid Python (run through ruff via the workspace policy).
  • Reviewer to verify the page renders correctly in the mkdocs preview.

Companion PRs

  • Backend: logicalclocks/hopsworks-ee → branch FSTORE-2030
  • SDK: logicalclocks/hopsworks-api → branch FSTORE-2030
  • Integration tests: logicalclocks/loadtest → branch FSTORE-2030

@manu-sj manu-sj marked this pull request as ready for review May 23, 2026 23:55
…ries

https://hopsworks.atlassian.net/browse/FSTORE-2030

PIT joins for Hopsworks training datasets and batch feature retrieval
emit the predicate `feature_fg.event_time <= root_fg.event_time` to
pick the latest matching record. This range predicate has no constant
bound, so it defeats partition pruning: every historical partition of
every joined feature group is scanned on every read, and the scan
grows without bound as data is ingested daily.
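
The join semantics described above can be sketched in plain Python.
This is an illustration only, not the Hopsworks implementation:
`pit_join` and the tuple layouts are hypothetical names chosen for the
example.

```python
from bisect import bisect_right
from datetime import date


def pit_join(root_rows, feature_rows):
    """Point-in-time join sketch (hypothetical helper, not Hopsworks code).

    root_rows:    iterable of (key, event_time)
    feature_rows: iterable of (key, event_time, value)
    For each root row, pick the latest feature value whose
    event_time <= the root row's event_time, or None if there is none.
    """
    # Group feature records by key, ordered by event time.
    by_key = {}
    for key, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for key, ts in root_rows:
        times, values = by_key.get(key, ([], []))
        # Count of feature records with event_time <= ts.
        idx = bisect_right(times, ts)
        joined.append((key, ts, values[idx - 1] if idx else None))
    return joined


features = [
    ("a", date(2026, 1, 1), 10),
    ("a", date(2026, 1, 5), 20),
    ("b", date(2026, 1, 3), 7),
]
roots = [("a", date(2026, 1, 4)), ("a", date(2026, 1, 5)), ("b", date(2026, 1, 1))]
result = pit_join(roots, features)
```

Every feature record for a key is a candidate for every root row, which
is why the comparison alone gives the reader no fixed bound to prune on.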

Add an optional `lookback` parameter to the PIT join API (on
`feature_view.get_batch_data`, the three in-memory training-data
methods `training_data` / `train_test_split` /
`train_validation_test_split`, and the three materialised create
variants `create_training_data` / `create_train_test_split` /
`create_train_validation_test_split`). When set, the backend emits an
additional constant-bound predicate on a single DATE partition column
of the root and each joined feature group, so Spark Catalyst's
PartitionFilters, Hudi's HoodieFileIndex, and flyingduck's directory
walker can each prune partitions before opening files.
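
The effect of a constant-bound predicate on a DATE partition column can
be sketched as follows. This is a plain-Python model of the pruning
idea, not the backend's code: `prune_partitions`, the way the bounds
are derived from a 30-day window, and the sample dates are assumptions
made for illustration.

```python
from datetime import date, timedelta


def prune_partitions(partition_dates, lower=None, upper=None):
    """Return the DATE partitions a reader would still have to open.

    Hypothetical helper: with no constant bounds nothing can be pruned,
    so every partition is returned. A lower-only bound models the
    one-sided form.
    """
    return [
        d
        for d in partition_dates
        if (lower is None or d >= lower) and (upper is None or d <= upper)
    ]


# One partition per day for 2025 (365 partitions).
all_partitions = [date(2025, 1, 1) + timedelta(days=i) for i in range(365)]

# Assumed window: 30 days back from the last day of the year.
window_end = date(2025, 12, 31)
window_start = window_end - timedelta(days=30)

unbounded = prune_partitions(all_partitions)
two_sided = prune_partitions(all_partitions, lower=window_start, upper=window_end)
lower_only = prune_partitions(all_partitions, lower=window_start)
```

In this model the unbounded read touches all 365 partitions, while
either bounded form touches only the 31 partitions inside the window.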

Adds a "Lookback window for PIT joins" section to the batch-data and
training-data user guides covering uniform and per-feature-group
shapes, with concrete instance and dict examples and the partition-
column eligibility caveat (single DATE partition column). Adds a
"Combining `lookback` with other filters" section explaining the
sub-query vs outer-filter pruning behaviour, including the mixed-FG
outer-filter case where Catalyst inlines the dim wrap and the root
loses FileScan-level pruning.

Reviewed-by: OpenAI Codex (GPT-5 via codex-plugin-cc 1.0.4) <codex@openai.com>
Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>