[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source#54362

Open
Yicong-Huang wants to merge 3 commits into apache:master from Yicong-Huang:SPARK-55583/wrap-arrow-error-python-datasource

Conversation

Contributor

@Yicong-Huang Yicong-Huang commented Feb 18, 2026

What changes were proposed in this pull request?

This PR adds Arrow schema type validation for the pa.RecordBatch code path in Python data source reads. The fix adds a pa_schema.equals(first_element.schema) check after the existing column name validation in records_to_arrow_batches(), raising a clear DATA_SOURCE_RETURN_SCHEMA_MISMATCH error with the expected and actual Arrow schemas.

Why are the changes needed?

When a Python data source returns a pa.RecordBatch with data types that don't match the declared schema, the resulting JVM-side errors are confusing and do not indicate the root cause. For example:

  • IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed from VectorLoader.load()
  • UnsupportedOperationException: Cannot call the method "getUTF8String" of ArrowColumnVector$ArrowVectorAccessor

These errors give no indication that the issue is a schema type mismatch in the Python data source's read() method.

Does this PR introduce any user-facing change?

Yes. Previously, returning a pa.RecordBatch with mismatched types from a Python data source would result in cryptic JVM errors. Now it raises a clear DATA_SOURCE_RETURN_SCHEMA_MISMATCH error showing the expected and actual Arrow schemas.

How was this patch tested?

Added a test case in test_python_datasource.py::test_arrow_batch_data_source.

Was this patch authored or co-authored using generative AI tooling?

No

@Yicong-Huang
Contributor Author

cc @allisonwang-db

@Yicong-Huang Yicong-Huang changed the title [SPARK-55583][PYTHON] Validate Arrow schema types in Python data source RecordBatch path [SPARK-55583][PYTHON] Validate Arrow schema types in Python data source Feb 18, 2026
@Yicong-Huang Yicong-Huang force-pushed the SPARK-55583/wrap-arrow-error-python-datasource branch from 3e695b7 to 1055807 on February 19, 2026 00:52
Comment on lines -449 to +455
  - condition = "ARROW_TYPE_MISMATCH",
  + condition = "PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR",
    parameters = Map(
  -   "outputTypes" -> "StructType\\(StructField\\(id,IntegerType,false\\)\\)",
  -   "actualDataTypes" -> "StructType\\(StructField\\(id,StringType,true\\)\\)"
  +   "operation" -> "Python streaming data source read",
  +   "action" -> "planPartitions",
  +   "msg" -> "(?s).*DATA_SOURCE_RETURN_SCHEMA_MISMATCH.*"
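A note on the new `msg` value in this test: it is a regular expression, and the `(?s)` flag (DOTALL) lets `.` match newlines, so the pattern can find the inner error class anywhere in a multi-line message. A quick standalone check, using a message shaped like the one quoted later in this thread:

```python
import re

msg = (
    "[PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR]\n"
    "Failed when Python streaming data source perform planPartitions:\n"
    "PySparkRuntimeError: [DATA_SOURCE_RETURN_SCHEMA_MISMATCH] ..."
)
# With (?s), ".*" spans line breaks and reaches the inner error class.
assert re.match(r"(?s).*DATA_SOURCE_RETURN_SCHEMA_MISMATCH.*", msg)
# Without it, "." stops at the first newline and the pattern fails.
assert re.match(r".*DATA_SOURCE_RETURN_SCHEMA_MISMATCH.*", msg) is None
```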
Contributor

What's the full error message here? Does it contain two error classes?

Contributor Author


The full message would look like:

  [PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR]
  Failed when Python streaming data source perform planPartitions:
  PySparkRuntimeError: [DATA_SOURCE_RETURN_SCHEMA_MISMATCH]
  Return schema mismatch in the result from 'read' method.
  Expected: <expected_schema>, Found: <actual_schema>

Yes, the error contains two error classes. The outer PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR wraps the inner DATA_SOURCE_RETURN_SCHEMA_MISMATCH error. We use a general outer class to provide consistent error handling across all operations in streaming, while preserving the original Python error message for debugging.
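The wrap-and-preserve pattern described here can be sketched as plain Python. The class and function names below are illustrative, not Spark's actual implementation:

```python
class StreamingDataSourceRuntimeError(RuntimeError):
    """Illustrative stand-in for the outer error class."""

def plan_partitions(read_partitions):
    try:
        return read_partitions()
    except Exception as e:
        # The outer error class gives every streaming operation a consistent
        # shape; str(e) preserves the original Python error for debugging.
        raise StreamingDataSourceRuntimeError(
            "[PYTHON_STREAMING_DATA_SOURCE_RUNTIME_ERROR] Failed when Python "
            f"streaming data source perform planPartitions: {e}"
        ) from e

def bad_read():
    raise RuntimeError("[DATA_SOURCE_RETURN_SCHEMA_MISMATCH] Return schema mismatch ...")

try:
    plan_partitions(bad_read)
except StreamingDataSourceRuntimeError as e:
    # Both error classes appear in the final message.
    print(e)
```

The `from e` chaining also keeps the original traceback attached, which is why the inner error class remains visible (and matchable by the test's regex) in the outer message.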
