branch-4.1: [fix](fe) Reject lone UTF-16 surrogates in JSONB literals (RFC 8259 §8.2) #63255 #63347

Open: github-actions[bot] wants to merge 1 commit into branch-4.1 from auto-pick-63255-branch-4.1

@github-actions
Contributor

Cherry-picked from #63255

…8.2) (#63255)

## Summary

**Problem fixed:** `JsonLiteral` (Nereids/Jackson path) and
`analysis.JsonLiteral` (legacy/Gson path) silently accepted lone UTF-16
surrogates (e.g. `'"\uD800"'::JSONB`) as valid JSONB literals. RFC 8259
§8.2 warns that the behavior of software receiving unpaired surrogates
is unpredictable: they do not encode a Unicode character and cannot be
represented as well-formed UTF-8.

**How it was fixed:** Added a recursive `validateNoLoneSurrogate`
post-parse check in both `JsonLiteral` constructors. After Jackson/Gson
parses the JSON tree, the method walks all string nodes and immediately
throws `AnalysisException` on any lone high or low surrogate.
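
The check described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from the PR: the class name `LoneSurrogateCheck` and the helper `findLoneSurrogate` are hypothetical, the parsed tree is modeled with plain `Map`/`List`/`String` values instead of Jackson or Gson nodes, `IllegalArgumentException` stands in for Doris's `AnalysisException`, and validating object keys as well as values is an assumption.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the post-parse check; not the actual Doris class.
final class LoneSurrogateCheck {
    private LoneSurrogateCheck() {}

    /** Returns "high" or "low" for the first unpaired surrogate, or null if none. */
    static String findLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low half
                } else {
                    return "high";
                }
            } else if (Character.isLowSurrogate(c)) {
                return "low"; // low surrogate with no preceding high surrogate
            }
        }
        return null;
    }

    /** Recursively walks a parsed tree (Map / List / String) and rejects lone surrogates. */
    static void validateNoLoneSurrogate(Object node) {
        if (node instanceof String) {
            String kind = findLoneSurrogate((String) node);
            if (kind != null) {
                // Assumption: the real code throws AnalysisException here.
                throw new IllegalArgumentException(
                        "JSON string contains lone " + kind + " surrogate");
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                validateNoLoneSurrogate(child);
            }
        } else if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                validateNoLoneSurrogate(e.getKey());   // assumption: keys are checked too
                validateNoLoneSurrogate(e.getValue());
            }
        }
        // Numbers, booleans and nulls carry no string data and are ignored.
    }
}
```

Running the check after parsing, rather than on the raw literal text, means escape sequences such as `\uD800` have already been decoded into `char` values, so a single pass over each string is enough.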

## What problem does this PR solve?

**Before this fix:** Passing a lone surrogate like `'"\uD800"'::JSONB`
was silently accepted at the FE layer. The invalid value would be stored
in the BE JSONB column. The error would only surface later — during
`EXPORT`, `SELECT INTO OUTFILE`, or cross-system transfer — making it
hard to diagnose. This is a data-correctness (SEV-2) issue.
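
To illustrate why such a value tends to fail only at an encoding boundary, the sketch below shows how the JVM's own UTF-8 encoder treats a lone surrogate. It is an illustration on the Java side only and makes no claim about how the BE or the export path encodes data; the class name `StrictUtf8` and its methods are hypothetical.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing strict vs. lenient UTF-8 encoding of a Java String.
final class StrictUtf8 {
    private StrictUtf8() {}

    /** True if the string encodes to well-formed UTF-8 under a strict encoder. */
    static boolean encodesCleanly(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // a lone surrogate is reported as malformed input
        }
    }

    /** String.getBytes never throws; it silently substitutes a replacement byte. */
    static byte[] lenientBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With this helper, `StrictUtf8.encodesCleanly("\uD83D\uDE00")` succeeds, while a string holding only U+D800 is rejected by the strict encoder and silently altered by the lenient one. Either outcome is a correctness problem once the value has already been stored.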

**After this fix:** Constructing a `JsonLiteral` with a lone surrogate
immediately throws `AnalysisException: Invalid jsonb literal: JSON
string contains lone high surrogate` (or `lone low surrogate`), giving
the user a clear error at write time.

## Behavior change

| Scenario | Before | After |
|---|---|---|
| `'"\uD800"'::JSONB` | Accepted silently | AnalysisException at parse time |
| `INSERT INTO t VALUES (1, '"\uD800"')` | Stored in BE, may fail on export | AnalysisException at FE |
| `'"\uD83D\uDE00"'::JSONB` (valid pair 😀) | Accepted | Still accepted (no change) |
| `'"hello"'::JSONB` (plain ASCII) | Accepted | Still accepted (no change) |

## Why both paths?

Doris has two `JsonLiteral` implementations:
- **Nereids** (`fe-core`): uses Jackson `ObjectMapper.readTree` —
Jackson accepts lone surrogates by default
- **Legacy** (`fe-catalog`, `analysis`): uses Gson `JsonParser.parse` —
Gson also accepts lone surrogates by default

Both needed the same fix to ensure consistent behavior regardless of
which query path is used.

## Release note

JSONB literal expressions now reject strings containing lone UTF-16
surrogates (e.g. `'"\uD800"'::JSONB`) with an `AnalysisException` at
parse time, following the interoperability guidance in RFC 8259 §8.2.
Previously such literals were silently accepted, which could cause
errors during export or cross-system data transfer.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions[bot] requested a review from yiguolei as a code owner (May 18, 2026, 07:11)