feat(spark): SparkSource query+path and pre-computed offline read for BatchFeatureView by abhijeet-dhumal · Pull Request #6440 · feast-dev/feast

abhijeet-dhumal · 2026-05-27T11:59:35Z

What this PR does / why we need it

get_historical_features() on a BatchFeatureView re-runs the full UDF on raw data every call. For embedding pipelines, that's 20–40 min of compute per training run even though features already exist from the last materialize.

Fix: Route get_historical_features() to read pre-computed parquet from batch_source.path instead of re-executing the UDF.

To support this, SparkSource now accepts query + path together:

query — raw data read during materialize()
path — write-back target and pre-computed read source for get_historical_features()

SparkSource(
    query="SELECT id, text, event_timestamp FROM bronze.documents",
    path="s3://my-bucket/feast/features/document_embeddings/",
)

Also allows BatchFeatureView with online=False, offline=True (offline-only) to skip the online validation check in get_historical_features(), so it can be used purely for training data without configuring an online store.

Falls back to live query if path doesn't exist yet (first run before any materialization).
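
The routing described above can be sketched in plain Python. This is only an illustration of the decision, not the code in this PR: `SourceConfig`, `choose_read_source`, and the injected `path_exists` callback are hypothetical names, and no Spark session is involved.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SourceConfig:
    """Hypothetical stand-in for a SparkSource that sets both query and path."""

    query: str
    path: Optional[str] = None


def choose_read_source(
    source: SourceConfig,
    offline: bool,
    path_exists: Callable[[str], bool],
) -> str:
    """Return "path" when pre-computed output is usable, otherwise "query"."""
    if offline and source.path and path_exists(source.path):
        return "path"
    # First run before any materialization, or offline write-back disabled.
    return "query"


source = SourceConfig(
    query="SELECT id, text, event_timestamp FROM bronze.documents",
    path="s3://my-bucket/feast/features/document_embeddings/",
)
print(choose_read_source(source, offline=True, path_exists=lambda p: False))
print(choose_read_source(source, offline=True, path_exists=lambda p: True))
```

Passing the existence check in as a callback keeps the decision testable without object storage access; how the PR actually probes the path is not shown in this description.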

Which issue(s) this PR fixes

N/A. Enables efficient training data retrieval for BatchFeatureView embedding pipelines without re-running UDFs.

Checks

I've made sure the tests are passing.
My commits are signed off (git commit -s)
My PR title follows conventional commits format

Testing Strategy

Unit tests — offline path routing, SparkSource constraint, graceful fallback
Manual tests — get_historical_features() reads from parquet, not UDF, after materialization

devin-ai-integration

Devin Review found 1 potential issue.

…orical_features The function and its call were removed in this PR but the replacement (_apply_bfv_transformations_for_historical) lives in a separate PR (feast-dev#6440). Removing it here would silently return raw untransformed features for any BatchFeatureView with a Python UDF via the standard get_historical_features() API path (FeatureStore → passthrough_provider → SparkOfflineStore). Restoring the function and its call until feast-dev#6440 lands. Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal · 2026-06-01T07:48:49Z

@ntkathole May I request your review here too?

jyejare

This PR adds support for SparkSource with combined query+path configuration and pre-computed offline reads for BatchFeatureView. The changes enable reading from materialized offline stores to avoid expensive UDF re-execution. While the feature is useful, there are several security vulnerabilities and error handling gaps that need attention.

jyejare · 2026-06-01T12:36:52Z

+            file_format = fv.batch_source.file_format or "parquet"
+            try:
+                df = spark_session.read.format(file_format).load(fv.batch_source.path)
+                df.createOrReplaceTempView(tmp_view)
+                ctx = replace(ctx, table_subquery=tmp_view)
+                new_contexts.append(ctx)


[Critical] Path traversal vulnerability in file loading

The code directly loads files from fv.batch_source.path without any validation. This creates a path traversal security vulnerability where malicious paths could access unauthorized files on the system.

Suggested:

Suggested change

             file_format = fv.batch_source.file_format or "parquet"
             try:
                 df = spark_session.read.format(file_format).load(fv.batch_source.path)
                 df.createOrReplaceTempView(tmp_view)
                 ctx = replace(ctx, table_subquery=tmp_view)
                 new_contexts.append(ctx)
+            # Validate and sanitize the path
+            import os
+            normalized_path = os.path.normpath(fv.batch_source.path)
+            if '..' in normalized_path or normalized_path.startswith('/'):
+                warnings.warn(f"Invalid path '{fv.batch_source.path}' for '{ctx.name}'", RuntimeWarning)
+                new_contexts.append(ctx)
+                continue
+            try:
+                df = spark_session.read.format(file_format).load(normalized_path)
+                df.createOrReplaceTempView(tmp_view)
+                ctx = replace(ctx, table_subquery=tmp_view)
+                new_contexts.append(ctx)
+                continue

Thanks for flagging this. fv.batch_source.path is a trusted configuration value set by the feature store admin at definition time — it's not runtime user input flowing from an API boundary, so path traversal in the traditional web security sense doesn't apply here.

Also, the suggested fix using os.path.normpath() would silently corrupt S3/GCS URIs (s3://bucket/path → s3:/bucket/path) since normpath collapses double slashes. Happy to add a lightweight guard for local paths only if you'd like, but I'd keep it separate from object storage paths.

Makes sense.
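
The normpath behaviour raised in this thread is easy to reproduce. The sketch below uses `posixpath.normpath` (what `os.path.normpath` resolves to on POSIX systems) so the result does not depend on the host OS. The `normalize_if_local` helper is a hypothetical illustration of a "local paths only" guard, not code from this PR.

```python
import posixpath
from urllib.parse import urlparse

uri = "s3://my-bucket/feast/features/document_embeddings/"

# normpath treats the URI as a plain path and collapses the "//" after "s3:".
print(posixpath.normpath(uri))
# → s3:/my-bucket/feast/features/document_embeddings


def normalize_if_local(path: str) -> str:
    """Normalize only scheme-less paths; leave object storage URIs untouched.

    Assumption for this sketch: anything with a URI scheme (s3, gs, ...)
    is passed through unchanged.
    """
    if urlparse(path).scheme:
        return path
    return posixpath.normpath(path)


print(normalize_if_local(uri))
print(normalize_if_local("data//features/../embeddings"))
```

The scheme check is deliberately crude; for example, it would also treat a Windows drive letter as a scheme, so a real guard would need more care.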

jyejare · 2026-06-01T12:36:52Z

+                continue
+            except Exception:
+                warnings.warn(
+                    f"Offline path '{fv.batch_source.path}' not readable for "
+                    f"'{ctx.name}'; falling back to source query.",
+                    RuntimeWarning,


[Warning] Overly broad exception handling masks specific errors

Catching all exceptions with 'except Exception:' is too broad and could mask important errors like permission issues, file corruption, or configuration problems. This makes debugging difficult and could hide security issues.
Suggested:

Suggested change

                 continue
             except Exception:
                 warnings.warn(
                     f"Offline path '{fv.batch_source.path}' not readable for "
                     f"'{ctx.name}'; falling back to source query.",
                     RuntimeWarning,
+            except (FileNotFoundError, PermissionError) as e:
+                warnings.warn(
+                    f"Offline path '{fv.batch_source.path}' not accessible for "
+                    f"'{ctx.name}': {str(e)}; falling back to source query.",
+                    RuntimeWarning,
+                    stacklevel=2,
+                )
+            except Exception as e:
+                # Log unexpected errors but continue with fallback
+                import logging
+                logging.warning(f"Unexpected error loading '{fv.batch_source.path}': {e}")
+                warnings.warn(
+                    f"Failed to load offline path for '{ctx.name}'; falling back to source query.",
+                    RuntimeWarning,
+                    stacklevel=2,
+                )

Good catch — updated to catch FileNotFoundError and PermissionError explicitly for the expected fallback cases, with a separate except Exception block emitting a distinct warning so unexpected errors aren't silently swallowed.
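
A minimal sketch of the two-tier fallback described in this reply, written without Spark so it can run anywhere. `read_with_fallback` and its callback parameters are hypothetical names for illustration; which exception types a real Spark read raises for a missing or unreadable path is not shown in this thread and is not assumed here.

```python
import warnings
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_fallback(
    read_precomputed: Callable[[], T],
    read_source: Callable[[], T],
    path: str,
    name: str,
) -> T:
    """Try the pre-computed read; fall back to the source query on failure."""
    try:
        return read_precomputed()
    except (FileNotFoundError, PermissionError) as e:
        # Expected cases: nothing materialized yet, or no access to the path.
        warnings.warn(
            f"Offline path '{path}' not accessible for '{name}': {e}; "
            "falling back to source query.",
            RuntimeWarning,
            stacklevel=2,
        )
    except Exception as e:
        # Unexpected cases get a distinct message so they stand out in logs.
        warnings.warn(
            f"Unexpected error loading offline path for '{name}': {e!r}; "
            "falling back to source query.",
            RuntimeWarning,
            stacklevel=2,
        )
    return read_source()
```

Keeping the two messages distinct means a corrupt file or misconfiguration does not look identical to the benign "not materialized yet" case.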

jyejare · 2026-06-01T12:36:52Z

+            if udf is not None:
+                temp_view_name = f"__feast_bfv_{ctx.name}_{uuid.uuid4().hex[:8]}"
+                spark_session.conf.set("spark.sql.runSQLOnFiles", "true")
+                raw_df = spark_session.sql(f"SELECT * FROM {ctx.table_subquery}")


[Warning] UDF execution without sandboxing poses security risks

Executing user-defined functions without proper sandboxing or validation could allow arbitrary code execution. This is a significant security risk in production environments.

Suggested:

Suggested change

             if udf is not None:
                 temp_view_name = f"__feast_bfv_{ctx.name}_{uuid.uuid4().hex[:8]}"
                 spark_session.conf.set("spark.sql.runSQLOnFiles", "true")
                 raw_df = spark_session.sql(f"SELECT * FROM {ctx.table_subquery}")
+                # Add UDF validation and execution controls
+                if not hasattr(udf, '__call__'):
+                    raise ValueError(f"Invalid UDF for {ctx.name}")
+                raw_df = spark_session.sql(f"SELECT * FROM {ctx.table_subquery}")
+                # Consider adding timeout and resource limits here
+                transformed_df = udf(raw_df)
+                transformed_df.createOrReplaceTempView(temp_view_name)

The UDF here is feature_transformation.udf registered in the Feast registry by the ML engineer at feature view definition time — same function that runs during materialize(). It's not arbitrary user input. The hasattr(udf, '__call__') check doesn't add a meaningful boundary since any Python object can implement __call__. Executor sandboxing is a cluster-level concern (Spark resource limits, executor isolation) — outside Feast's scope.
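
The point about `hasattr(udf, '__call__')` can be seen in a few lines. The class below is a made-up example: the check only confirms that the object is callable and says nothing about what the call does.

```python
class NotReallyAUdf:
    """Hypothetical object that is callable but is not a feature transformation."""

    def __call__(self, df):
        # Could do anything at all; the callable check cannot tell.
        return "not a DataFrame"


obj = NotReallyAUdf()

print(hasattr(obj, "__call__"))  # True
print(callable(obj))             # True, the idiomatic spelling of the same check
print(hasattr(42, "__call__"))   # False: only non-callables are rejected
```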

SparkSource previously required exactly one of table/query/path. This relaxes the constraint to allow query + path together:
- query: used for reading raw data during materialization
- path: used for offline write-back (offline=True) and as pre-computed read source in get_historical_features

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
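
The relaxed constraint from this commit message could look roughly like the following. This is a sketch under assumptions: `validate_source_options` is a hypothetical standalone function, and the actual check inside SparkSource is not shown on this page.

```python
from typing import Optional


def validate_source_options(
    table: Optional[str] = None,
    query: Optional[str] = None,
    path: Optional[str] = None,
) -> None:
    """Raise ValueError unless exactly one option is set, or query + path."""
    provided = {
        name
        for name, value in (("table", table), ("query", query), ("path", path))
        if value
    }
    if len(provided) == 1 or provided == {"query", "path"}:
        return
    raise ValueError(
        "Specify exactly one of table, query, or path, or query and path "
        f"together; got: {sorted(provided) or 'none'}"
    )


# Accepted: the new combination introduced by this PR.
validate_source_options(
    query="SELECT id, text, event_timestamp FROM bronze.documents",
    path="s3://my-bucket/feast/features/document_embeddings/",
)
```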


ntkathole · 2026-06-03T08:14:28Z


-        query_context = _apply_bfv_transformations(
-            spark_session, feature_views, query_context
+        query_context = _apply_bfv_transformations_for_historical(


So the previous _apply_bfv_transformations helper is not removed from the code?

ntkathole · 2026-06-03T08:15:50Z

+                )
+
+        if (
+            hasattr(fv, "feature_transformation")


Use has_transformation() and get_transformation_function() instead

ntkathole · 2026-06-03T08:21:00Z

    max_date_partition: str


+def _apply_bfv_transformations_for_historical(


Instead of writing a new helper, I think a better approach is to extend the existing _apply_bfv_transformations with a pre-computed-path shortcut (a "read from parquet if offline=True and path exists" branch at the top). That avoids creating a parallel function that reimplements UDF execution.
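
One possible shape for that suggestion, reduced to plain Python so it runs without Spark. Everything here is a hypothetical stand-in: `QueryContext`, `ViewSpec`, and the `load_precomputed` callback are not Feast types, and the UDF is modelled as a function on the subquery string rather than on a DataFrame.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class QueryContext:
    """Hypothetical, reduced stand-in for the per-view query context."""

    name: str
    table_subquery: str


@dataclass
class ViewSpec:
    """Hypothetical, reduced stand-in for a BatchFeatureView."""

    name: str
    offline: bool = False
    path: Optional[str] = None
    udf: Optional[Callable[[str], str]] = None


def apply_bfv_transformations(
    views: List[ViewSpec],
    contexts: List[QueryContext],
    load_precomputed: Callable[[str], Optional[str]],
) -> List[QueryContext]:
    """Use pre-computed data when available, otherwise run the UDF.

    load_precomputed(path) returns a temp view name, or None if unreadable.
    """
    by_name: Dict[str, ViewSpec] = {v.name: v for v in views}
    result: List[QueryContext] = []
    for ctx in contexts:
        fv = by_name.get(ctx.name)
        if fv is None:
            result.append(ctx)
            continue
        # Shortcut branch at the top: read pre-computed output if possible.
        if fv.offline and fv.path:
            view_name = load_precomputed(fv.path)
            if view_name is not None:
                result.append(replace(ctx, table_subquery=view_name))
                continue
        # Existing behaviour: a single place where the UDF is executed.
        if fv.udf is not None:
            ctx = replace(ctx, table_subquery=fv.udf(ctx.table_subquery))
        result.append(ctx)
    return result
```

With one helper, the materialization path and the historical path share the same UDF branch, and the pre-computed read is just an early exit.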

abhijeet-dhumal changed the title ~~Feat/spark bfv offline historical features~~ feat(spark): SparkSource query+path and pre-computed offline read for BatchFeatureView May 27, 2026

abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from a98e23b to 57b2489 Compare May 27, 2026 14:50

abhijeet-dhumal marked this pull request as ready for review May 28, 2026 06:23

abhijeet-dhumal requested a review from a team as a code owner May 28, 2026 06:23

devin-ai-integration Bot reviewed May 28, 2026

Comment thread sdk/python/feast/feature_store.py

abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from 57b2489 to e30e146 Compare May 29, 2026 08:57

abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from e30e146 to 4316349 Compare June 1, 2026 07:48

jyejare reviewed Jun 1, 2026

abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from 4316349 to 5312075 Compare June 2, 2026 12:45

abhijeet-dhumal requested a review from jyejare June 2, 2026 13:38

jyejare approved these changes Jun 2, 2026

abhijeet-dhumal added 5 commits June 3, 2026 13:33

feat: read from offline path in get_historical_features for BFVs

03a617e

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

fix: graceful fallback when offline path is not readable

c0c8da3

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

style: ruff format spark.py and spark_source.py

94ad0fb

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

fix: allow offline-only BatchFeatureView to skip online validation in…

9eb3d29

… get_historical_features Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

ntkathole force-pushed the feat/spark-bfv-offline-historical-features branch from 2f6910e to 9eb3d29 Compare June 3, 2026 08:03

ntkathole added the ok-to-test label Jun 3, 2026

ntkathole reviewed Jun 3, 2026

