From f0d81ebf4b9036673cc424a0252bf652b4cb0354 Mon Sep 17 00:00:00 2001
From: Jim Dowling
Date: Thu, 21 May 2026 14:48:19 +0200
Subject: [PATCH 1/2] [HWORKS-2802] Document partitioned_by parameter on feature group creation https://hopsworks.atlassian.net/browse/HWORKS-2802
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a section to docs/user_guides/fs/feature_group/create.md describing
the storage-engine-native partitioned_by parameter for Delta feature
groups. Covers:
- Usage example with create_feature_group / get_or_create_feature_group.
- The CREATE TABLE … USING DELTA … GENERATED ALWAYS AS … contract: the
  storage layer derives the partition columns; the user's dataframe
  never carries them.
- Validation rules: mutual exclusion with partition_key, requires
  event_time.
- Partition pruning table: Delta auto-derives partition predicates from
  the GENERATED expressions for hierarchical specs (year / year+month /
  year+month+day / year+month+day+hour), so
  `fg.read(start_time=..., end_time=...)` and
  `fg.filter(fg.event_time >= ...)` prune at the partition level.
  Non-hierarchical specs (e.g. ["month"], ["year","week"]) are valid but
  skip the auto-derivation: only direct predicates on the grain columns
  prune. Recommend hierarchical specs.
- Online feature store behavior: derived columns live offline-only by
  default; online_partition_columns=true opts into online
  materialization. Until the onlinefs consumer filter ships, the backend
  rejects partitioned_by + online_enabled=true with the default
  online_partition_columns=false. Document both workarounds.
- Hudi: partitioned_by + HUDI is rejected at creation; Hudi support is
  tracked under a separate follow-up ticket.

Signed-off-by: Jim Dowling
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/user_guides/fs/feature_group/create.md | 54 +++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index c6db36f3ef..c7c6a91d0f 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -102,6 +102,60 @@ MaxDirectoryItemsExceededException - The directory item limit is exceeded: limit
 
 By using partitioning the system will write the feature data in different subdirectories, thus allowing you to write 10240 files per partition.
 
+##### Time-grain partitioning with `partitioned_by` (Delta only)
+
+When the partition columns are derived from the feature group's `event_time`, the Python client can hand the backend the desired time grains and let the storage engine generate the partition columns automatically.
+Pass `partitioned_by=[...]` with one or more grains drawn from `hour`, `day`, `week`, `month`, and `year`.
+
+```python
+fg = fs.get_or_create_feature_group(
+    name="transactions",
+    version=1,
+    primary_key=["tx_id"],
+    event_time="tx_ts",
+    partitioned_by=["year", "month", "day"],
+    time_travel_format="DELTA",
+)
+fg.insert(df)  # df does not need year/month/day — Delta derives them
+```
+
+The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
+The backend creates the table via `CREATE TABLE … USING DELTA … GENERATED ALWAYS AS …`, so the derived columns live entirely inside the storage layer; the source dataframe never carries them.
+
+`partitioned_by` and `partition_key` are mutually exclusive.
+`partitioned_by` requires `event_time` to be set.
+
+###### Partition pruning
+
+Delta auto-derives partition predicates from the GENERATED expressions when the user filters on the source column.
+Filtering on `event_time` ranges therefore prunes partitions for free on hierarchical specs:
+
+| `partitioned_by` | Prunes on `event_time` range? | Prunes on `year` / `month` / `day` filter? |
+| --- | --- | --- |
+| `["year"]` | ✅ | ✅ |
+| `["year", "month"]` | ✅ | ✅ |
+| `["year", "month", "day"]` | ✅ | ✅ |
+| `["year", "month", "day", "hour"]` | ✅ | ✅ |
+| `["month"]` (no year) | ⚠️ no — month alone is ambiguous across years | ✅ filter on month works |
+| `["year", "week"]` | ⚠️ year only — week isn't directly derivable from a date range | ✅ both columns prune |
+| `["day"]` (no year/month) | ⚠️ no — day-of-month is ambiguous | ✅ filter on day works |
+
+Prefer hierarchical specs (`["year"]`, `["year", "month"]`, `["year", "month", "day"]`) — they line up with the typical batch-pipeline access pattern and prune naturally.
+
+###### Online feature store
+
+By default, the derived partition columns live only in the offline storage; the online feature store does not get them.
+Pass `online_partition_columns=True` to materialize them in the online row as well.
+
+While the online-store filter (the `onlinefs` consumer that drops `offline_only` columns from the RonDB write) is still pending, the backend rejects `partitioned_by` together with `online_enabled=true` and the default `online_partition_columns=false` to avoid writing the grain columns to RonDB by accident.
+The two workarounds: keep the feature group offline-only, or set `online_partition_columns=True` to materialize the grains online explicitly.
+
+###### Hudi
+
+`partitioned_by` on `time_travel_format="HUDI"` feature groups is not yet supported and the backend rejects it at creation.
+Hudi needs a different mechanism (a `CustomKeyGenerator` + server-side `Transformer`) and is tracked under a separate follow-up ticket.
+Until that lands, use `time_travel_format="DELTA"` to get time-grain partitioning, or partition Hudi groups explicitly via `partition_key=["year"]` with a `year` column the upstream pipeline computes.
+
 ##### Table format
 
 When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter.

From 6b0c36317e3d698c6f82eaee2b64952b6a4267ef Mon Sep 17 00:00:00 2001
From: Jim Dowling
Date: Sun, 31 May 2026 15:18:16 +0200
Subject: [PATCH 2/2] [HWORKS-2802] Update partitioned_by docs for the real-column design https://hopsworks.atlassian.net/browse/HWORKS-2802

The partitioned_by section described Delta GENERATED ALWAYS AS columns
and storage-engine-side derivation, which is no longer how it works.
Document the real design: the client derives the grain columns from
event_time and writes them as real partition columns; pruning works
natively on grain filters and via predicate translation on event_time
ranges.
Correct the online-store note: online-enabled partitioned_by feature
groups are rejected entirely until HWORKS-2808, not only with the
default online_partition_columns.

Signed-off-by: Jim Dowling
Co-Authored-By: Claude Opus 4.8
---
 docs/user_guides/fs/feature_group/create.md | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index c7c6a91d0f..8197c9245f 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -104,8 +104,8 @@ By using partitioning the system will write the feature data in different subdir
 
 ##### Time-grain partitioning with `partitioned_by` (Delta only)
 
-When the partition columns are derived from the feature group's `event_time`, the Python client can hand the backend the desired time grains and let the storage engine generate the partition columns automatically.
-Pass `partitioned_by=[...]` with one or more grains drawn from `hour`, `day`, `week`, `month`, and `year`.
+When the partition columns are derived from the feature group's `event_time`, hand the backend the desired time grains with `partitioned_by=[...]` and the Python client derives the partition columns for you.
+Pass one or more grains drawn from `hour`, `day`, `week`, `month`, and `year`.
 
 ```python
 fg = fs.get_or_create_feature_group(
@@ -116,19 +116,20 @@ fg = fs.get_or_create_feature_group(
     partitioned_by=["year", "month", "day"],
     time_travel_format="DELTA",
 )
-fg.insert(df)  # df does not need year/month/day — Delta derives them
+fg.insert(df)  # df does not need year/month/day — the client derives them
 ```
 
 The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
-The backend creates the table via `CREATE TABLE … USING DELTA … GENERATED ALWAYS AS …`, so the derived columns live entirely inside the storage layer; the source dataframe never carries them.
+The grain columns are ordinary materialized partition columns: the client computes them from `event_time` on each write and the backend registers them as partition columns through the normal table-creation path.
+The source dataframe does not need to carry them.
 
 `partitioned_by` and `partition_key` are mutually exclusive.
 `partitioned_by` requires `event_time` to be set.
 
 ###### Partition pruning
 
-Delta auto-derives partition predicates from the GENERATED expressions when the user filters on the source column.
-Filtering on `event_time` ranges therefore prunes partitions for free on hierarchical specs:
+The grain columns are real partition columns, so a filter on a grain column (for example `year == 2026`) prunes partitions natively.
+A filter on an `event_time` range is rewritten into equivalent grain-column predicates by the query layer, so it prunes too on hierarchical specs:
 
 | `partitioned_by` | Prunes on `event_time` range? | Prunes on `year` / `month` / `day` filter? |
 | --- | --- | --- |
@@ -144,11 +145,9 @@ Prefer hierarchical specs (`["year"]`, `["year", "month"]`, `["year", "month", "
 
 ###### Online feature store
 
-By default, the derived partition columns live only in the offline storage; the online feature store does not get them.
-Pass `online_partition_columns=True` to materialize them in the online row as well.
-
-While the online-store filter (the `onlinefs` consumer that drops `offline_only` columns from the RonDB write) is still pending, the backend rejects `partitioned_by` together with `online_enabled=true` and the default `online_partition_columns=false` to avoid writing the grain columns to RonDB by accident.
-The two workarounds: keep the feature group offline-only, or set `online_partition_columns=True` to materialize the grains online explicitly.
+Online-enabled feature groups do not yet support `partitioned_by`.
+The online ingestion path does not exclude the offline-only grain columns from the Kafka/Avro schema, nor materialize them for the online write, so the backend rejects `partitioned_by` together with `online_enabled=true` until that work lands (tracked under a separate follow-up ticket).
+Keep the feature group offline-only to use `partitioned_by`.
 
 ###### Hudi
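
The client-side derivation and the `event_time` range translation that PATCH 2/2 documents can be sketched in plain Python. This is an illustration under stated assumptions, not the Hopsworks implementation: `derive_grains` and `year_month_partitions` are hypothetical names that do not exist in the client, and `week` is assumed to mean the ISO week number, which the patch does not specify.

```python
from datetime import datetime

# Hypothetical helper, NOT a Hopsworks client function.
def derive_grains(event_time: datetime, partitioned_by: list[str]) -> dict[str, int]:
    """Compute one integer partition value per requested time grain."""
    extractors = {
        "year": lambda ts: ts.year,
        "month": lambda ts: ts.month,
        "week": lambda ts: ts.isocalendar()[1],  # assumption: ISO week number
        "day": lambda ts: ts.day,
        "hour": lambda ts: ts.hour,
    }
    unknown = [g for g in partitioned_by if g not in extractors]
    if unknown:
        raise ValueError(f"unsupported grains: {unknown}")
    return {g: extractors[g](event_time) for g in partitioned_by}


# Hypothetical helper, NOT a Hopsworks client function.
def year_month_partitions(start: datetime, end: datetime) -> list[tuple[int, int]]:
    """List the (year, month) partitions an inclusive event_time range touches."""
    partitions = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        partitions.append((year, month))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return partitions


# A row gains its grain columns before it is written.
row = {"tx_id": 1, "tx_ts": datetime(2026, 5, 21, 14, 48)}
row.update(derive_grains(row["tx_ts"], ["year", "month", "day"]))
print(row)

# An event_time range becomes a finite set of partitions to scan.
print(year_month_partitions(datetime(2025, 11, 3), datetime(2026, 2, 10)))
```

With a hierarchical spec, the range maps to an exact list of partitions. With `["month"]` alone, the same range would yield month values that repeat every year, which is consistent with the table marking that spec as not pruning on `event_time` ranges.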