[SPARK-57195][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException in CSV schema inference #56260
Open
yashtc wants to merge 2 commits into
…OutOfBoundsException in CSV schema inference
CSV schema inference threw a raw java.lang.ArrayIndexOutOfBoundsException for a row with more columns than maxColumns. SPARK-49444 fixed the per-line parseLine path; this also covers the inference paths that tokenize with a raw CsvParser: the streaming convertStream path (multiLine) and TextInputCSVDataSource.inferFromDataset (non-multiLine), routing them through a shared UnivocityParser.parseLine helper that translates the exception to MALFORMED_CSV_RECORD.
…pt SparkRuntimeException
Revert the broadened getThrowables catch back to the direct-cause check so the maxCharsPerColumn TextParsingException (SPARK-28431) is not converted to MALFORMED_CSV_RECORD. Fix the new inference tests to intercept SparkRuntimeException, which is thrown directly rather than wrapped in SparkException.
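The difference between the direct-cause check and the broadened whole-chain check mentioned in the commits above can be illustrated with a small, dependency-free sketch. Everything here is hypothetical and is not Spark's or Univocity's actual code: `MalformedCsvRecordException` stands in for the MALFORMED_CSV_RECORD error, `WrappedParsingException` stands in for a wrapper such as TextParsingException, and `Tokenizer` and `toyTokenizer` stand in for a raw CsvParser. The exception-chain shapes are invented for illustration only.

```java
import java.util.Arrays;

public class CsvErrorTranslationSketch {

    /** Hypothetical stand-in for the MALFORMED_CSV_RECORD error (not a Spark class). */
    static final class MalformedCsvRecordException extends RuntimeException {
        MalformedCsvRecordException(String badRecord, Throwable cause) {
            super("MALFORMED_CSV_RECORD: " + badRecord, cause);
        }
    }

    /** Hypothetical stand-in for a parser-library wrapper such as TextParsingException. */
    static final class WrappedParsingException extends RuntimeException {
        WrappedParsingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical tokenizer abstraction; the PR itself works with a raw CsvParser. */
    interface Tokenizer {
        String[] parseLine(String line);
    }

    /** Direct-cause check: the exception itself, or the wrapper's immediate cause only. */
    static boolean isColumnOverflow(Throwable t) {
        return t instanceof ArrayIndexOutOfBoundsException
            || (t instanceof WrappedParsingException
                && t.getCause() instanceof ArrayIndexOutOfBoundsException);
    }

    /** Broadened check: true if the overflow appears anywhere in the cause chain. */
    static boolean hasOverflowAnywhere(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof ArrayIndexOutOfBoundsException) {
                return true;
            }
        }
        return false;
    }

    /** Shared helper: tokenize one line, translating only column-overflow failures. */
    static String[] parseLine(Tokenizer tokenizer, String line) {
        try {
            return tokenizer.parseLine(line);
        } catch (RuntimeException e) {
            if (isColumnOverflow(e)) {
                throw new MalformedCsvRecordException(line, e);
            }
            throw e; // unrelated parser failures propagate unchanged
        }
    }

    /** Toy tokenizer that overflows a fixed-size buffer when a row is too wide. */
    static Tokenizer toyTokenizer(int maxColumns) {
        return line -> {
            String[] fields = line.split(",", -1);
            String[] buffer = new String[maxColumns];
            for (int i = 0; i < fields.length; i++) {
                buffer[i] = fields[i]; // throws ArrayIndexOutOfBoundsException once i >= maxColumns
            }
            return Arrays.copyOf(buffer, fields.length);
        };
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = toyTokenizer(2);
        System.out.println(Arrays.toString(parseLine(tokenizer, "a,b")));
        try {
            parseLine(tokenizer, "a,b,c");
        } catch (MalformedCsvRecordException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch, a wrapper whose overflow sits two levels down the cause chain is matched by `hasOverflowAnywhere` but not by `isColumnOverflow`. That is the kind of over-matching a narrower direct-cause check avoids, under the assumption that unrelated failures carry the overflow deeper in their chain.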
What changes were proposed in this pull request?
CSV schema inference can fail with an uncaught java.lang.ArrayIndexOutOfBoundsException when a row has more columns than maxColumns. SPARK-49444 added handling for the per-line UnivocityParser.parseLine path, but the schema-inference paths that tokenize with a raw Univocity CsvParser were never covered, so they still surface the internal exception.
This PR translates that exception into a MALFORMED_CSV_RECORD error across the remaining paths:
- Adds a shared UnivocityParser.parseLine(tokenizer, line) helper that converts Univocity's ArrayIndexOutOfBoundsException (raised bare or wrapped in a TextParsingException) into MALFORMED_CSV_RECORD.
- Routes UnivocityParser.convertStream, used by multiLine reads and multiLine schema inference, through that helper.
- Routes the non-multiLine inference path (TextInputCSVDataSource.inferFromDataset) and the single-variant-column header read through the same helper.
Why are the changes needed?
Schema inference and multiLine reads crash with an internal ArrayIndexOutOfBoundsException for input that should produce a clean MALFORMED_CSV_RECORD error. SPARK-49444 only covered UnivocityParser.parseLine; the inference paths construct a raw CsvParser and call parseLine on it directly, bypassing that handling.
Does this PR introduce any user-facing change?
Yes. When a CSV row has more columns than maxColumns during schema inference (or a multiLine read), the surfaced error changes from an internal java.lang.ArrayIndexOutOfBoundsException to MALFORMED_CSV_RECORD (SQLSTATE KD000), consistent with the non-multiLine per-row read path since SPARK-49444. This is a change relative to all released versions.
How was this patch tested?
Added unit tests in CSVSuite covering schema inference for both the multiLine and non-multiLine paths, asserting MALFORMED_CSV_RECORD instead of a raw ArrayIndexOutOfBoundsException.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), Claude Opus 4.8