
[SPARK-57195][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException in CSV schema inference#56260

Open
yashtc wants to merge 2 commits into apache:master from yashtc:csv-infer-parsenext-malformed

@yashtc yashtc commented Jun 1, 2026

What changes were proposed in this pull request?

CSV schema inference can fail with an uncaught java.lang.ArrayIndexOutOfBoundsException when a row has more columns than maxColumns. SPARK-49444 added handling for the per-line UnivocityParser.parseLine path, but the schema-inference paths that tokenize with a raw Univocity CsvParser were never covered, so they still surface the internal exception.

This PR translates that exception into a MALFORMED_CSV_RECORD error across the remaining paths:

  • Adds a shared UnivocityParser.parseLine(tokenizer, line) helper that converts Univocity's ArrayIndexOutOfBoundsException (raised bare or wrapped in a TextParsingException) into MALFORMED_CSV_RECORD.
  • Guards the streaming tokenizer UnivocityParser.convertStream, used by multiLine reads and multiLine schema inference.
  • Routes the non-multiLine inference path (TextInputCSVDataSource.inferFromDataset) and the single-variant-column header read through the same helper.
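The translation pattern described in the bullets above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the PR's actual code: `WrappedParsingException` and `MalformedCsvRecordException` are hypothetical stand-ins for Univocity's `TextParsingException` and Spark's MALFORMED_CSV_RECORD error, and the toy tokenizer only imitates a `maxColumns` overflow with a fixed-size buffer.

```java
import java.util.Arrays;
import java.util.function.Function;

final class ParseLineSketch {
    /** Hypothetical stand-in for Univocity's TextParsingException. */
    static class WrappedParsingException extends RuntimeException {
        WrappedParsingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for Spark's MALFORMED_CSV_RECORD error. */
    static class MalformedCsvRecordException extends RuntimeException {
        final String badRecord;

        MalformedCsvRecordException(String badRecord, Throwable cause) {
            super("MALFORMED_CSV_RECORD: " + badRecord, cause);
            this.badRecord = badRecord;
        }
    }

    /**
     * Runs the tokenizer on one line and translates an
     * ArrayIndexOutOfBoundsException, raised bare or as the direct cause
     * of the wrapper exception, into the domain error. Anything else
     * propagates unchanged.
     */
    static String[] parseLine(Function<String, String[]> tokenizer, String line) {
        try {
            return tokenizer.apply(line);
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new MalformedCsvRecordException(line, e);
        } catch (WrappedParsingException e) {
            if (e.getCause() instanceof ArrayIndexOutOfBoundsException) {
                throw new MalformedCsvRecordException(line, e);
            }
            throw e;
        }
    }

    /** Toy tokenizer: copies fields into a buffer of maxColumns slots. */
    static Function<String, String[]> tokenizer(int maxColumns, boolean wrap) {
        return line -> {
            String[] buffer = new String[maxColumns];
            try {
                String[] parts = line.split(",", -1);
                for (int i = 0; i < parts.length; i++) {
                    buffer[i] = parts[i]; // overflows when parts.length > maxColumns
                }
                return Arrays.copyOf(buffer, parts.length);
            } catch (ArrayIndexOutOfBoundsException e) {
                if (wrap) {
                    throw new WrappedParsingException("parse failed", e);
                }
                throw e;
            }
        };
    }
}
```

Routing every call site through one helper of this shape keeps the translation rule in a single place, which is the motivation the bullets give for sharing `UnivocityParser.parseLine(tokenizer, line)` across the inference paths.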

Why are the changes needed?

Schema inference and multiLine reads crash with an internal ArrayIndexOutOfBoundsException for input that should produce a clean MALFORMED_CSV_RECORD error. SPARK-49444 only covered UnivocityParser.parseLine; the inference paths construct a raw CsvParser and call parseLine on it directly, bypassing that handling.

Does this PR introduce any user-facing change?

Yes. When a CSV row has more columns than maxColumns during schema inference (or a multiLine read), the surfaced error changes from an internal java.lang.ArrayIndexOutOfBoundsException to MALFORMED_CSV_RECORD (SQLSTATE KD000), consistent with the non-multiLine per-row read path since SPARK-49444. This is a change relative to all released versions.

How was this patch tested?

Added unit tests in CSVSuite covering schema inference for both the multiLine and non-multiLine paths, asserting MALFORMED_CSV_RECORD instead of a raw ArrayIndexOutOfBoundsException.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), Claude Opus 4.8

…OutOfBoundsException in CSV schema inference

CSV schema inference threw a raw java.lang.ArrayIndexOutOfBoundsException for a row with more columns than maxColumns. SPARK-49444 fixed the per-line parseLine path; this also covers the inference paths that tokenize with a raw CsvParser: the streaming convertStream path (multiLine) and TextInputCSVDataSource.inferFromDataset (non-multiLine), routing them through a shared UnivocityParser.parseLine helper that translates the exception to MALFORMED_CSV_RECORD.
@yashtc yashtc marked this pull request as draft June 1, 2026 23:24
…pt SparkRuntimeException

Revert the broadened getThrowables catch back to the direct-cause check so the maxCharsPerColumn TextParsingException (SPARK-28431) is not converted to MALFORMED_CSV_RECORD. Fix the new inference tests to intercept SparkRuntimeException, which is thrown directly rather than wrapped in SparkException.
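The difference between the two checks named in this commit message can be sketched in plain Java. This is a generic illustration, not the code in the PR: the method names are hypothetical, and the nested exception built in the usage note is an assumed shape chosen only to show where the two checks diverge, not a claim about how Univocity structures its `maxCharsPerColumn` failure.

```java
final class CauseCheckSketch {
    /** Narrow check: only the immediate cause is inspected. */
    static boolean directCauseIsAioobe(Throwable t) {
        return t.getCause() instanceof ArrayIndexOutOfBoundsException;
    }

    /**
     * Broad check: walks the whole cause chain, in the spirit of a
     * getThrowables-style scan. The depth bound guards against a
     * cyclic cause chain.
     */
    static boolean anyCauseIsAioobe(Throwable t) {
        Throwable current = t.getCause();
        int depth = 0;
        while (current != null && depth < 64) {
            if (current instanceof ArrayIndexOutOfBoundsException) {
                return true;
            }
            current = current.getCause();
            depth++;
        }
        return false;
    }
}
```

For an assumed chain such as `new RuntimeException("outer", new IllegalStateException("mid", new ArrayIndexOutOfBoundsException("3")))`, the broad check matches while the narrow one does not. A broader match converts more failures, which is the over-conversion risk the revert addresses.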
@yashtc yashtc marked this pull request as ready for review June 3, 2026 05:32
