From f0d81ebf4b9036673cc424a0252bf652b4cb0354 Mon Sep 17 00:00:00 2001
From: Jim Dowling <jim@hopsworks.ai>
Date: Thu, 21 May 2026 14:48:19 +0200
Subject: [PATCH 1/2] [HWORKS-2802] Document partitioned_by parameter on
 feature group creation https://hopsworks.atlassian.net/browse/HWORKS-2802
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a section to docs/user_guides/fs/feature_group/create.md
describing the storage-engine-native partitioned_by parameter for
Delta feature groups. Covers:

- Usage example with create_feature_group / get_or_create_feature_group.
- The CREATE TABLE … USING DELTA … GENERATED ALWAYS AS … contract:
  the storage layer derives the partition columns; the user's
  dataframe never carries them.
- Validation rules: mutual exclusion with partition_key, requires
  event_time.
- Partition pruning table — Delta auto-derives partition predicates
  from the GENERATED expressions for hierarchical specs (year /
  year+month / year+month+day / year+month+day+hour), so
  `fg.read(start_time=..., end_time=...)` and
  `fg.filter(fg.event_time >= ...)` prune at the partition level.
  Non-hierarchical specs (e.g. ["month"], ["year","week"]) are valid
  but skip the auto-derivation — only direct predicates on the
  grain columns prune. Recommend hierarchical specs.
- Online feature store behavior: derived columns live offline-only
  by default; online_partition_columns=true opts into online
  materialization. Until the onlinefs consumer filter ships, the
  backend rejects partitioned_by + online_enabled=true with the
  default online_partition_columns=false. Document both
  workarounds.
- Hudi: partitioned_by + HUDI is rejected at creation; Hudi support
  is tracked under a separate follow-up ticket.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/user_guides/fs/feature_group/create.md | 54 +++++++++++++++++++++
 1 file changed, 54 insertions(+)
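
Note (sits outside the commit message and the diff): the validation rules listed in the commit message can be sketched as a small pre-flight check. This is only an illustration under stated assumptions: `validate_partitioned_by` and `VALID_GRAINS` are hypothetical names, not part of the Python client, and the real checks may be enforced by the backend instead.

```python
# Hypothetical helper for illustration only; it restates the rules from the
# commit message and is not the client's or the backend's implementation.
VALID_GRAINS = ("hour", "day", "week", "month", "year")


def validate_partitioned_by(partitioned_by, partition_key=None, event_time=None):
    """Raise ValueError if the arguments break one of the documented rules."""
    if not partitioned_by:
        return  # nothing to check when partitioned_by is unset
    if partition_key:
        raise ValueError("partitioned_by and partition_key are mutually exclusive")
    if not event_time:
        raise ValueError("partitioned_by requires event_time to be set")
    unknown = [g for g in partitioned_by if g not in VALID_GRAINS]
    if unknown:
        raise ValueError(f"unknown time grains: {unknown}")
```

Calling `validate_partitioned_by(["year", "month"], event_time="tx_ts")` returns without error, while passing `partition_key` alongside `partitioned_by`, or omitting `event_time`, raises `ValueError`.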

diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index c6db36f3ef..c7c6a91d0f 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -102,6 +102,60 @@ MaxDirectoryItemsExceededException - The directory item limit is exceeded: limit
 
 By using partitioning the system will write the feature data in different subdirectories, thus allowing you to write 10240 files per partition.
 
+##### Time-grain partitioning with `partitioned_by` (Delta only)
+
+When the partition columns are derived from the feature group's `event_time`, the Python client can hand the backend the desired time grains and let the storage engine generate the partition columns automatically.
+Pass `partitioned_by=[...]` with one or more grains drawn from `hour`, `day`, `week`, `month`, and `year`.
+
+```python
+fg = fs.get_or_create_feature_group(
+    name="transactions",
+    version=1,
+    primary_key=["tx_id"],
+    event_time="tx_ts",
+    partitioned_by=["year", "month", "day"],
+    time_travel_format="DELTA",
+)
+fg.insert(df)  # df does not need year/month/day — Delta derives them
+```
+
+The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
+The backend creates the table via `CREATE TABLE … USING DELTA … GENERATED ALWAYS AS …`, so the derived columns live entirely inside the storage layer; the source dataframe never carries them.
+
+`partitioned_by` and `partition_key` are mutually exclusive.
+`partitioned_by` requires `event_time` to be set.
+
+###### Partition pruning
+
+Delta auto-derives partition predicates from the GENERATED expressions when the user filters on the source column.
+Filtering on `event_time` ranges therefore prunes partitions for free on hierarchical specs:
+
+| `partitioned_by` | Prunes on `event_time` range? | Prunes on `year` / `month` / `day` filter? |
+| --- | --- | --- |
+| `["year"]` | ✅ | ✅ |
+| `["year", "month"]` | ✅ | ✅ |
+| `["year", "month", "day"]` | ✅ | ✅ |
+| `["year", "month", "day", "hour"]` | ✅ | ✅ |
+| `["month"]` (no year) | ⚠️ no — month alone is ambiguous across years | ✅ filter on month works |
+| `["year", "week"]` | ⚠️ year only — week isn't directly derivable from a date range | ✅ both columns prune |
+| `["day"]` (no year/month) | ⚠️ no — day-of-month is ambiguous | ✅ filter on day works |
+
+Prefer hierarchical specs (`["year"]`, `["year", "month"]`, `["year", "month", "day"]`) — they line up with the typical batch-pipeline access pattern and prune naturally.
+
+###### Online feature store
+
+By default, the derived partition columns live only in the offline storage; the online feature store does not get them.
+Pass `online_partition_columns=True` to materialize them in the online row as well.
+
+While the online-store filter (the `onlinefs` consumer that drops `offline_only` columns from the RonDB write) is still pending, the backend rejects `partitioned_by` together with `online_enabled=true` and the default `online_partition_columns=false` to avoid writing the grain columns to RonDB by accident.
+The two workarounds: keep the feature group offline-only, or set `online_partition_columns=True` to materialize the grains online explicitly.
+
+###### Hudi
+
+`partitioned_by` on `time_travel_format="HUDI"` feature groups is not yet supported and the backend rejects it at creation.
+Hudi needs a different mechanism (a `CustomKeyGenerator` + server-side `Transformer`) and is tracked under a separate follow-up ticket.
+Until that lands, use `time_travel_format="DELTA"` to get time-grain partitioning, or partition Hudi groups explicitly via `partition_key=["year"]` with a `year` column the upstream pipeline computes.
+
 ##### Table format
 
 When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter.

From 6b0c36317e3d698c6f82eaee2b64952b6a4267ef Mon Sep 17 00:00:00 2001
From: Jim Dowling <jim@hopsworks.ai>
Date: Sun, 31 May 2026 15:18:16 +0200
Subject: [PATCH 2/2] [HWORKS-2802] Update partitioned_by docs for the
 real-column design https://hopsworks.atlassian.net/browse/HWORKS-2802

The partitioned_by section described Delta GENERATED ALWAYS AS columns and
storage-engine-side derivation, which is no longer how it works. Document
the real design: the client derives the grain columns from event_time and
writes them as real partition columns. Pruning works natively on grain
filters and via predicate translation on event_time ranges. Correct the
online-store note: online-enabled partitioned_by feature groups are
rejected entirely until HWORKS-2808, not only when
online_partition_columns is left at its default.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/user_guides/fs/feature_group/create.md | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)
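
Note (sits outside the commit message and the diff): a minimal sketch of the two ideas in the commit message, deriving grain values from an `event_time` value and mapping an `event_time` range onto the partitions it touches. The function names are hypothetical, the use of ISO week numbers for `week` is an assumption, and the real query layer may build range predicates rather than enumerate partitions.

```python
from datetime import datetime, timedelta


def derive_grains(ts, grains):
    """Return {grain: value} computed from a single event_time value."""
    values = {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "week": ts.isocalendar()[1],  # assumption: ISO week number
    }
    return {g: values[g] for g in grains}


def partitions_for_range(start, end, grains):
    """List the distinct grain tuples touched by the inclusive range [start, end]."""
    if "hour" in grains:
        step = timedelta(hours=1)
        current = start.replace(minute=0, second=0, microsecond=0)
    else:
        step = timedelta(days=1)
        current = start.replace(hour=0, minute=0, second=0, microsecond=0)
    seen, out = set(), []
    # Walk the range from the truncated start so the partition that
    # contains `end` is always visited.
    while current <= end:
        key = tuple(derive_grains(current, grains).values())
        if key not in seen:
            seen.add(key)
            out.append(key)
        current += step
    return out
```

For a hierarchical spec such as `["year", "month"]`, the tuples returned by `partitions_for_range` are the partitions a reader would need to scan for that time range. For a spec such as `["month"]` the same tuples no longer identify a single year, which is why the docs recommend hierarchical specs.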

diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index c7c6a91d0f..8197c9245f 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -104,8 +104,8 @@ By using partitioning the system will write the feature data in different subdir
 
 ##### Time-grain partitioning with `partitioned_by` (Delta only)
 
-When the partition columns are derived from the feature group's `event_time`, the Python client can hand the backend the desired time grains and let the storage engine generate the partition columns automatically.
-Pass `partitioned_by=[...]` with one or more grains drawn from `hour`, `day`, `week`, `month`, and `year`.
+When the partition columns can be derived from the feature group's `event_time`, set `partitioned_by=[...]` to the desired time grains and the Python client derives the partition columns for you.
+Pass one or more grains drawn from `hour`, `day`, `week`, `month`, and `year`.
 
 ```python
 fg = fs.get_or_create_feature_group(
@@ -116,19 +116,20 @@ fg = fs.get_or_create_feature_group(
     partitioned_by=["year", "month", "day"],
     time_travel_format="DELTA",
 )
-fg.insert(df)  # df does not need year/month/day — Delta derives them
+fg.insert(df)  # df does not need year/month/day — the client derives them
 ```
 
 The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
-The backend creates the table via `CREATE TABLE … USING DELTA … GENERATED ALWAYS AS …`, so the derived columns live entirely inside the storage layer; the source dataframe never carries them.
+The grain columns are ordinary materialized partition columns: the client computes them from `event_time` on each write and the backend registers them as partition columns through the normal table-creation path.
+The source dataframe does not need to carry them.
 
 `partitioned_by` and `partition_key` are mutually exclusive.
 `partitioned_by` requires `event_time` to be set.
 
 ###### Partition pruning
 
-Delta auto-derives partition predicates from the GENERATED expressions when the user filters on the source column.
-Filtering on `event_time` ranges therefore prunes partitions for free on hierarchical specs:
+The grain columns are real partition columns, so a filter on a grain column (for example `year == 2026`) prunes partitions natively.
+The query layer rewrites a filter on an `event_time` range into equivalent grain-column predicates, so that filter also prunes on hierarchical specs:
 
 | `partitioned_by` | Prunes on `event_time` range? | Prunes on `year` / `month` / `day` filter? |
 | --- | --- | --- |
@@ -144,11 +145,9 @@ Prefer hierarchical specs (`["year"]`, `["year", "month"]`, `["year", "month", "
 
 ###### Online feature store
 
-By default, the derived partition columns live only in the offline storage; the online feature store does not get them.
-Pass `online_partition_columns=True` to materialize them in the online row as well.
-
-While the online-store filter (the `onlinefs` consumer that drops `offline_only` columns from the RonDB write) is still pending, the backend rejects `partitioned_by` together with `online_enabled=true` and the default `online_partition_columns=false` to avoid writing the grain columns to RonDB by accident.
-The two workarounds: keep the feature group offline-only, or set `online_partition_columns=True` to materialize the grains online explicitly.
+Online-enabled feature groups do not yet support `partitioned_by`.
+The online ingestion path neither excludes the offline-only grain columns from the Kafka/Avro schema nor materializes them for the online write, so the backend rejects `partitioned_by` together with `online_enabled=True` until that work lands (tracked under a separate follow-up ticket).
+Keep the feature group offline-only to use `partitioned_by`.
 
 ###### Hudi