[SPARK-57195][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException in CSV schema inference by yashtc · Pull Request #56260 · apache/spark

yashtc · 2026-06-01T23:24:03Z

What changes were proposed in this pull request?

CSV schema inference can fail with an uncaught java.lang.ArrayIndexOutOfBoundsException when a row has more columns than maxColumns. SPARK-49444 added handling for the per-line UnivocityParser.parseLine path, but the schema-inference paths that tokenize with a raw Univocity CsvParser were never covered, so they still surface the internal exception.

This PR translates that exception into a MALFORMED_CSV_RECORD error across the remaining paths:

Adds a shared UnivocityParser.parseLine(tokenizer, line) helper that converts Univocity's ArrayIndexOutOfBoundsException (raised bare or wrapped in a TextParsingException) into MALFORMED_CSV_RECORD.
Guards the streaming tokenizer UnivocityParser.convertStream, used by multiLine reads and multiLine schema inference.
Routes the non-multiLine inference path (TextInputCSVDataSource.inferFromDataset) and the single-variant-column header read through the same helper.
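
The translation performed by the shared helper can be sketched roughly as follows. This is a minimal, self-contained Java illustration and not the code from this PR: `LineTokenizer`, `StandInTextParsingException`, `MalformedCsvRecordException`, and `withMaxColumns` are hypothetical stand-ins for a raw Univocity `CsvParser`, its `TextParsingException`, Spark's MALFORMED_CSV_RECORD error, and the `maxColumns` limit, so the snippet compiles without Spark or Univocity on the classpath.

```java
// Hypothetical stand-in for Univocity's TextParsingException (not the real class).
class StandInTextParsingException extends RuntimeException {
    StandInTextParsingException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical stand-in for Spark's MALFORMED_CSV_RECORD error.
class MalformedCsvRecordException extends RuntimeException {
    final String badRecord;

    MalformedCsvRecordException(String badRecord, Throwable cause) {
        super("MALFORMED_CSV_RECORD: " + badRecord, cause);
        this.badRecord = badRecord;
    }
}

// Hypothetical stand-in for a raw tokenizer such as Univocity's CsvParser.
interface LineTokenizer {
    String[] parseLine(String line);
}

final class ParseLineSketch {
    private ParseLineSketch() {}

    // Tokenize one line, translating the internal exception into a user-facing error.
    static String[] parseLine(LineTokenizer tokenizer, String line) {
        try {
            return tokenizer.parseLine(line);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Case 1: the out-of-bounds error is raised bare by the tokenizer.
            throw new MalformedCsvRecordException(line, e);
        } catch (StandInTextParsingException e) {
            // Case 2: wrapped. Translate only when the direct cause is the out-of-bounds error.
            if (e.getCause() instanceof ArrayIndexOutOfBoundsException) {
                throw new MalformedCsvRecordException(line, e);
            }
            throw e; // Other parsing failures propagate unchanged.
        }
    }

    // Toy tokenizer that mimics a maxColumns limit by writing into a fixed-size array.
    static LineTokenizer withMaxColumns(int maxColumns) {
        return line -> {
            String[] parts = line.split(",", -1);
            String[] out = new String[maxColumns];
            for (int i = 0; i < parts.length; i++) {
                out[i] = parts[i]; // Throws ArrayIndexOutOfBoundsException past maxColumns.
            }
            return java.util.Arrays.copyOf(out, parts.length);
        };
    }
}
```

Under these assumptions, `ParseLineSketch.parseLine(ParseLineSketch.withMaxColumns(2), "a,b,c")` throws the stand-in `MalformedCsvRecordException` carrying the offending line, while a line with at most two columns is tokenized normally.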

Why are the changes needed?

Schema inference and multiLine reads crash with an internal ArrayIndexOutOfBoundsException for input that should produce a clean MALFORMED_CSV_RECORD error. SPARK-49444 only covered UnivocityParser.parseLine; the inference paths construct a raw CsvParser and call parseLine on it directly, bypassing that handling.

Does this PR introduce any user-facing change?

Yes. When a CSV row has more columns than maxColumns during schema inference (or a multiLine read), the surfaced error changes from an internal java.lang.ArrayIndexOutOfBoundsException to MALFORMED_CSV_RECORD (SQLSTATE KD000), consistent with the non-multiLine per-row read path since SPARK-49444. This is a change relative to all released versions.

How was this patch tested?

Added unit tests in CSVSuite covering schema inference for both the multiLine and non-multiLine paths, asserting MALFORMED_CSV_RECORD instead of a raw ArrayIndexOutOfBoundsException.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), Claude Opus 4.8

…OutOfBoundsException in CSV schema inference

CSV schema inference threw a raw java.lang.ArrayIndexOutOfBoundsException for a row with more columns than maxColumns. SPARK-49444 fixed the per-line parseLine path; this also covers the inference paths that tokenize with a raw CsvParser: the streaming convertStream path (multiLine) and TextInputCSVDataSource.inferFromDataset (non-multiLine), routing them through a shared UnivocityParser.parseLine helper that translates the exception to MALFORMED_CSV_RECORD.

…pt SparkRuntimeException

Revert the broadened getThrowables catch back to the direct-cause check so the maxCharsPerColumn TextParsingException (SPARK-28431) is not converted to MALFORMED_CSV_RECORD. Fix the new inference tests to intercept SparkRuntimeException, which is thrown directly rather than wrapped in SparkException.
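
The difference between a direct-cause check and a check over the whole cause chain can be illustrated with a small sketch. `CauseCheckSketch` and both method names are hypothetical, and the exception shapes used in the usage note are assumptions made for illustration, not a statement about what Univocity actually throws for maxCharsPerColumn.

```java
final class CauseCheckSketch {
    private CauseCheckSketch() {}

    // Narrow check: only the immediate cause counts.
    static boolean directCauseIsOutOfBounds(Throwable t) {
        return t.getCause() instanceof ArrayIndexOutOfBoundsException;
    }

    // Broad check: any throwable further down the cause chain counts.
    // The depth bound guards against a cyclic cause chain.
    static boolean anyCauseIsOutOfBounds(Throwable t) {
        int depth = 0;
        for (Throwable c = t.getCause(); c != null && depth < 32; c = c.getCause(), depth++) {
            if (c instanceof ArrayIndexOutOfBoundsException) {
                return true;
            }
        }
        return false;
    }
}
```

For an exception whose out-of-bounds error sits two levels down (outer exception, then an intermediate cause, then `ArrayIndexOutOfBoundsException`), the narrow check returns false and the broad check returns true. A catch built on the broad check would therefore translate more failures than one built on the narrow check, which is the kind of over-matching the revert is meant to avoid.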

yashtc marked this pull request as draft June 1, 2026 23:24

yashtc marked this pull request as ready for review June 3, 2026 05:32

yashtc wants to merge 2 commits into apache:master from yashtc:csv-infer-parsenext-malformed
