[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source#54362

Open
Yicong-Huang wants to merge 3 commits into apache:master from Yicong-Huang:SPARK-55583/wrap-arrow-error-python-datasource

Conversation

Contributor

@Yicong-Huang Yicong-Huang commented Feb 18, 2026

What changes were proposed in this pull request?

This PR adds Arrow schema type validation for the pa.RecordBatch code path in Python data source reads. The fix adds a pa_schema.equals(first_element.schema) check after the existing column name validation in records_to_arrow_batches(), raising a clear DATA_SOURCE_RETURN_SCHEMA_MISMATCH error with the expected and actual Arrow schemas.

Why are the changes needed?

When a Python data source returns a pa.RecordBatch with data types that don't match the declared schema, the resulting JVM-side errors are confusing and do not indicate the root cause. For example:

  • IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed from VectorLoader.load()
  • UnsupportedOperationException: Cannot call the method "getUTF8String" of ArrowColumnVector$ArrowVectorAccessor

These errors give no indication that the issue is a schema type mismatch in the Python data source's read() method.

Does this PR introduce any user-facing change?

Yes. Previously, returning a pa.RecordBatch with mismatched types from a Python data source would result in cryptic JVM errors. Now it raises a clear DATA_SOURCE_RETURN_SCHEMA_MISMATCH error showing the expected and actual Arrow schemas.

How was this patch tested?

Added a test case in test_python_datasource.py::test_arrow_batch_data_source.

Was this patch authored or co-authored using generative AI tooling?

No

@Yicong-Huang
Contributor Author

cc @allisonwang-db

@Yicong-Huang Yicong-Huang changed the title [SPARK-55583][PYTHON] Validate Arrow schema types in Python data source RecordBatch path [SPARK-55583][PYTHON] Validate Arrow schema types in Python data source Feb 18, 2026
@Yicong-Huang Yicong-Huang force-pushed the SPARK-55583/wrap-arrow-error-python-datasource branch from 3e695b7 to 1055807 on February 19, 2026 00:52
Comment on lines -449 to +455
  - condition = "ARROW_TYPE_MISMATCH",
  + condition = "PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR",
    parameters = Map(
  -   "outputTypes" -> "StructType\\(StructField\\(id,IntegerType,false\\)\\)",
  -   "actualDataTypes" -> "StructType\\(StructField\\(id,StringType,true\\)\\)"
  +   "operation" -> "Python streaming data source read",
  +   "action" -> "planPartitions",
  +   "msg" -> "(?s).*DATA_SOURCE_RETURN_SCHEMA_MISMATCH.*"
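A note on the new `msg` value in this test: it is a regular expression, and the `(?s)` flag (DOTALL) lets `.` match newlines, so the pattern can find the inner error class anywhere in a multi-line message. A quick standalone check, using a message shaped like the one quoted later in this thread:

```python
import re

msg = (
    "[PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR]\n"
    "Failed when Python streaming data source perform planPartitions:\n"
    "PySparkRuntimeError: [DATA_SOURCE_RETURN_SCHEMA_MISMATCH] ..."
)
# With (?s), ".*" spans line breaks and reaches the inner error class.
assert re.match(r"(?s).*DATA_SOURCE_RETURN_SCHEMA_MISMATCH.*", msg)
# Without it, "." stops at the first newline and the pattern fails.
assert re.match(r".*DATA_SOURCE_RETURN_SCHEMA_MISMATCH.*", msg) is None
```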
Contributor

What's the full error message here? Does it contain two error classes?

Contributor Author


The full message would look like:

  [PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR]
  Failed when Python streaming data source perform planPartitions:
  PySparkRuntimeError: [DATA_SOURCE_RETURN_SCHEMA_MISMATCH]
  Return schema mismatch in the result from 'read' method.
  Expected: <expected_schema>, Found: <actual_schema>

Yes, the error contains two error classes. The outer PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR wraps the inner DATA_SOURCE_RETURN_SCHEMA_MISMATCH error. We use a general outer class to provide consistent error handling across all operations in streaming, while preserving the original Python error message for debugging.
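The wrap-and-preserve pattern described here can be sketched as plain Python. The class and function names below are illustrative, not Spark's actual implementation:

```python
class StreamingDataSourceRuntimeError(RuntimeError):
    """Illustrative stand-in for the outer error class."""

def plan_partitions(read_partitions):
    try:
        return read_partitions()
    except Exception as e:
        # The outer error class gives every streaming operation a consistent
        # shape; str(e) preserves the original Python error for debugging.
        raise StreamingDataSourceRuntimeError(
            "[PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR] Failed when Python "
            f"streaming data source perform planPartitions: {e}"
        ) from e

def bad_read():
    raise RuntimeError("[DATA_SOURCE_RETURN_SCHEMA_MISMATCH] Return schema mismatch ...")

try:
    plan_partitions(bad_read)
except StreamingDataSourceRuntimeError as e:
    # Both error classes appear in the final message.
    print(e)
```

The `from e` chaining also keeps the original traceback attached, which is why the inner error class remains visible (and matchable by the test's regex) in the outer message.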
