branch-4.1: [fix](fe) Reject lone UTF-16 surrogates in JSONB literals (RFC 8259 §8.2) #63255 #63347
Open
github-actions[bot] wants to merge 1 commit into
…8.2) (#63255)

## Summary

**Problem fixed:** `JsonLiteral` (Nereids/Jackson path) and `analysis.JsonLiteral` (legacy/Gson path) silently accepted lone UTF-16 surrogates (e.g. `'"\uD800"'::JSONB`) as valid JSONB literals. RFC 8259 §8.2 explicitly forbids unpaired surrogates in JSON strings because they cannot be represented as valid UTF-8.

**How it was fixed:** Added a recursive `validateNoLoneSurrogate` post-parse check in both `JsonLiteral` constructors. After Jackson/Gson parses the JSON tree, the method walks all string nodes and immediately throws `AnalysisException` on any lone high or low surrogate.

## What problem does this PR solve?

**Before this fix:** Passing a lone surrogate like `'"\uD800"'::JSONB` was silently accepted at the FE layer. The invalid value would be stored in the BE JSONB column. The error would only surface later (during `EXPORT`, `SELECT INTO OUTFILE`, or cross-system transfer), making it hard to diagnose. This is a data-correctness (SEV-2) issue.

**After this fix:** Constructing a `JsonLiteral` with a lone surrogate immediately throws `AnalysisException: Invalid jsonb literal: JSON string contains lone high surrogate` (or `lone low surrogate`), giving the user a clear error at write time.

## Behavior change

| Scenario | Before | After |
|---|---|---|
| `'"\uD800"'::JSONB` | Accepted silently | AnalysisException at parse time |
| `INSERT INTO t VALUES (1, '"\uD800"')` | Stored in BE, may fail on export | AnalysisException at FE |
| `'"\uD83D\uDE00"'::JSONB` (valid pair 😀) | Accepted | Still accepted (no change) |
| `'"hello"'::JSONB` (plain ASCII) | Accepted | Still accepted (no change) |

## Why both paths?

Doris has two `JsonLiteral` implementations:

- **Nereids** (`fe-core`): uses Jackson `ObjectMapper.readTree`; Jackson accepts lone surrogates by default
- **Legacy** (`fe-catalog`, `analysis`): uses Gson `JsonParser.parse`; Gson also accepts lone surrogates by default

Both needed the same fix to ensure consistent behavior regardless of which query path is used.

## Release note

JSONB literal expressions now reject strings containing lone UTF-16 surrogates (e.g. `'"\uD800"'::JSONB`) with an `AnalysisException` at parse time, conforming to RFC 8259 §8.2. Previously such literals were silently accepted, which could cause errors during export or cross-system data transfer.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
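
For reviewers who want a concrete picture of the post-parse check described in the summary, here is a minimal sketch of a recursive lone-surrogate walk. It is not the code in this PR. The class `LoneSurrogateCheck` and the helper `findLoneSurrogate` are hypothetical names. Plain `Map`/`List`/`String` values stand in for the Jackson or Gson tree, and `IllegalArgumentException` stands in for `AnalysisException` so the sketch compiles without Doris or a JSON library on the classpath. Checking object keys as well as values is an assumption; the description only mentions string nodes.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and tree representation are assumptions,
// not the implementation merged in this PR.
final class LoneSurrogateCheck {
    private LoneSurrogateCheck() {}

    /**
     * Scans a string for unpaired UTF-16 surrogates.
     * Returns null if every surrogate is part of a valid high+low pair,
     * otherwise a short description of the first problem found.
     */
    static String findLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low half
                } else {
                    return "lone high surrogate";
                }
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate not consumed by the branch above has no preceding high half.
                return "lone low surrogate";
            }
        }
        return null;
    }

    /** Recursively walks a parsed JSON-like value and rejects any string with a lone surrogate. */
    static void validateNoLoneSurrogate(Object node) {
        if (node instanceof String) {
            String problem = findLoneSurrogate((String) node);
            if (problem != null) {
                // Stand-in for AnalysisException in the real FE code.
                throw new IllegalArgumentException("JSON string contains " + problem);
            }
        } else if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                validateNoLoneSurrogate(e.getKey());   // assumption: keys are checked too
                validateNoLoneSurrogate(e.getValue());
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                validateNoLoneSurrogate(child);
            }
        }
        // Numbers, booleans and null carry no string data, so nothing to check.
    }
}
```

In this sketch, a value such as `"\uD83D\uDE00"` passes because the high surrogate is immediately followed by a low surrogate, while `"\uD800"` alone, or nested anywhere inside an array or object, is rejected.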
Cherry-picked from #63255