diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index c6db36f3ef..8197c9245f 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -102,6 +102,59 @@ MaxDirectoryItemsExceededException - The directory item limit is exceeded: limit
 
 By using partitioning the system will write the feature data in different subdirectories, thus allowing you to write 10240 files per partition.
 
+##### Time-grain partitioning with `partitioned_by` (Delta only)
+
+When the partition columns are derived from the feature group's `event_time`, hand the backend the desired time grains with `partitioned_by=[...]` and the Python client derives the partition columns for you.
+Pass one or more grains drawn from `hour`, `day`, `week`, `month`, and `year`.
+
+```python
+fg = fs.get_or_create_feature_group(
+    name="transactions",
+    version=1,
+    primary_key=["tx_id"],
+    event_time="tx_ts",
+    partitioned_by=["year", "month", "day"],
+    time_travel_format="DELTA",
+)
+fg.insert(df)  # df does not need year/month/day — the client derives them
+```
+
+The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
+The grain columns are ordinary materialized partition columns: the client computes them from `event_time` on each write and the backend registers them as partition columns through the normal table-creation path.
+The source dataframe does not need to carry them.
+
+`partitioned_by` and `partition_key` are mutually exclusive.
+`partitioned_by` requires `event_time` to be set.
+
+###### Partition pruning
+
+The grain columns are real partition columns, so a filter on a grain column (for example `year == 2026`) prunes partitions natively.
+A filter on an `event_time` range is rewritten into equivalent grain-column predicates by the query layer, so it prunes too on hierarchical specs:
+
+| `partitioned_by` | Prunes on `event_time` range? | Prunes on `year` / `month` / `day` filter? |
+| --- | --- | --- |
+| `["year"]` | ✅ | ✅ |
+| `["year", "month"]` | ✅ | ✅ |
+| `["year", "month", "day"]` | ✅ | ✅ |
+| `["year", "month", "day", "hour"]` | ✅ | ✅ |
+| `["month"]` (no year) | ⚠️ no — month alone is ambiguous across years | ✅ filter on month works |
+| `["year", "week"]` | ⚠️ year only — week isn't directly derivable from a date range | ✅ both columns prune |
+| `["day"]` (no year/month) | ⚠️ no — day-of-month is ambiguous | ✅ filter on day works |
+
+Prefer hierarchical specs (`["year"]`, `["year", "month"]`, `["year", "month", "day"]`) — they line up with the typical batch-pipeline access pattern and prune naturally.
+
+###### Online feature store
+
+Online-enabled feature groups do not yet support `partitioned_by`.
+The online ingestion path does not exclude the offline-only grain columns from the Kafka/Avro schema, nor materialize them for the online write, so the backend rejects `partitioned_by` together with `online_enabled=true` until that work lands (tracked under a separate follow-up ticket).
+Keep the feature group offline-only to use `partitioned_by`.
+
+###### Hudi
+
+`partitioned_by` on `time_travel_format="HUDI"` feature groups is not yet supported and the backend rejects it at creation.
+Hudi needs a different mechanism (a `CustomKeyGenerator` + server-side `Transformer`) and is tracked under a separate follow-up ticket.
+Until that lands, use `time_travel_format="DELTA"` to get time-grain partitioning, or partition Hudi groups explicitly via `partition_key=["year"]` with a `year` column the upstream pipeline computes.
+
 ##### Table format
 
 When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter.
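
Reviewer note on the `partitioned_by` section added in this hunk: the text says the grain columns are equivalent to manually decomposing `event_time`, so a small sketch of that decomposition may help readers who must do it by hand (for example, the Hudi fallback with `partition_key=["year"]`). The helper name `derive_grain_columns` is hypothetical and is not part of the Hopsworks client; the use of ISO-8601 week numbers for the `week` grain is an assumption, since the diff does not say how weeks are defined.

```python
from __future__ import annotations

from datetime import datetime


def derive_grain_columns(event_time: datetime, grains: list[str]) -> dict[str, int]:
    """Hypothetical helper: compute one partition value per requested grain.

    This only illustrates the manual decomposition described in the docs;
    it is not the client's actual implementation.
    """
    extractors = {
        "year": lambda ts: ts.year,
        "month": lambda ts: ts.month,
        # Assumption: ISO-8601 week number; the real client may differ.
        "week": lambda ts: ts.isocalendar()[1],
        "day": lambda ts: ts.day,
        "hour": lambda ts: ts.hour,
    }
    unknown = [g for g in grains if g not in extractors]
    if unknown:
        raise ValueError(f"unsupported grains: {unknown}")
    # Preserve the caller's grain order, mirroring the order of partition columns.
    return {g: extractors[g](event_time) for g in grains}


# Usage: add the grain values to a single record keyed by its event time.
row = {"tx_id": 1, "tx_ts": datetime(2026, 3, 14, 9, 30)}
row.update(derive_grain_columns(row["tx_ts"], ["year", "month", "day"]))
```

With the columns computed this way upstream, the same names can be passed as `partition_key=[...]`, which is the explicit route the Hudi paragraph describes.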