Skip to content

chore(qwp): simplify the wire protocol to one version with inline schemas#39

Merged
bluestreak01 merged 10 commits into
mainfrom
mt_remove-schema-ref
Jun 9, 2026
Merged

chore(qwp): simplify the wire protocol to one version with inline schemas#39
bluestreak01 merged 10 commits into
mainfrom
mt_remove-schema-ref

Conversation

@mtopolnik

@mtopolnik mtopolnik commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

What

Removes the QWP schema-reference mechanism and the protocol-v2 designation from the Java client. Column schemas now always travel inline on the wire, and the client runs a single protocol version.

This is the client (Phase 2) half of a tandem change; the server half is questdb/questdb#7200. The QWP wire format is breaking, but QWP is beta so there is no compatibility window to preserve.

Why

QWP shipped with a schema-reference optimisation: a sender could transmit a column schema once, receive a schema id, and reference that id on later frames. Store-and-forward retry made this unworkable -- a recorded frame can replay against a fresh server connection that never saw the registering frame, leaving the reference dangling. Senders already stopped using reference mode in production, so the wire kept paying for a mechanism nobody used.

Wire format

Ingress table block (sender -> server):

before: name | row_count | col_count | mode | schema_id | [cols if FULL]
after:  name | row_count | col_count | columns (always inline)

Egress RESULT_BATCH (server -> client): the schema rides only the first batch of a query (batch_seq == 0); continuation batches carry rows and bind against the schema parsed there.

The per-frame version byte stays at offset 4, hardcoded to 1 and validated == 1. The X-QWP-Max-Version handshake header and negotiation machinery stay so a future version bump can re-introduce a range.

Notable changes

  • Decoder: the connection-scoped schema registry is replaced by per-query schema state, invalidated when a new query starts. A continuation batch that arrives before its schema-bearing batch_seq == 0 is now rejected rather than binding rows to a stale schema.
  • API removal (breaking): Sender.builder().maxSchemasPerConnection(...) and the max_schemas_per_connection connection-string option are gone. Code or config strings that reference them now fail -- with a compile error and an "unknown configuration key" error respectively.
  • SERVER_INFO is always delivered post-upgrade (no v2 gate), so the failover role filter no longer has a v1 "no role available" branch.

Trade-offs

  • Every query result re-ships its column schema in batch 0, even when the same client re-runs the same query. The old per-connection registry deduped this; the cost is now a few dozen bytes plus a fresh column-name String per column on each batch 0.
  • The maxSchemasPerConnection builder method disappears; external callers that set it will no longer compile. Acceptable for beta.

Test plan

  • mvn -pl core test (client unit + decoder-hardening suites)
  • Schema-reference and schema-id-exhaustion tests removed (mechanism gone); other decoder-hardening coverage (truncation, corrupt zstd, oversized varints) retained.
  • SelfSufficientFramesTest trimmed to its symbol-dict-delta invariant.
  • End-to-end multi-batch / wide-schema coverage lives in the server repo's cutlass-test-suite, which runs this client against the matching server branch.

Coordination

Tandem with questdb/questdb#7200. The wire format is breaking; merge the two together. This PR's branch is named to match the server branch so the client CI builds against the server PR instead of master.

🤖 Generated with Claude Code

mtopolnik and others added 3 commits June 2, 2026 16:48
Mirror the server-side QWP simplification on the Java client. The wire
format drops the per-table-block schema mode byte and schema id; column
schemas always travel inline. Egress RESULT_BATCH carries the schema only
on the first batch (batch_seq == 0); continuation batches reuse it.

Ingress encode: QwpColumnWriter writes columns inline (no mode byte or
schema_id), QwpWebSocketEncoder and QwpUdpSender drop the useSchemaRef
parameter, and QwpWebSocketSender drops the schema-id allocator and the
maxSchemasPerConnection plumbing across its connect overloads.
QwpTableBuffer loses its schemaId field.

Egress decode: QwpResultBatchDecoder parses the inline schema only on
batch_seq == 0 into a reused per-query schema, replacing the connection
schema registry; QwpEgressIoThread invalidates it when a new query starts.
QwpQueryClient always reads SERVER_INFO now that the version gate is gone.

Constants collapse to a single VERSION = 1, and CACHE_RESET keeps only
RESET_MASK_DICT. The Sender builder loses maxSchemasPerConnection() and the
max_schemas_per_connection connect-string key (a breaking change; QWP is
beta). Client unit tests updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up doc cleanup after removing QWP schema references and protocol
v2. No logic changes.

Reword comments that still described the removed two-version,
schema-reference world: the CACHE_RESET doc no longer advertises a
schema-cache mask bit, SERVER_INFO is "every connection" rather than
"every v2 connection", and the capability-trailer comments are
version-neutral instead of naming a nonexistent v2.0/v2.1.

Expand the QwpResultBatchDecoder querySchema comment to record two
consequences of dropping the schema registry: every query's batch 0 now
re-parses the inline schema and allocates a column-name String per
column, and the single schema slot is correct only while the IoThread
runs one query at a time (pipelining would need request_id keying).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Un-glue eight connect() parameter lists in QwpWebSocketSender where
removing the maxSchemasPerConnection parameter had collapsed two
arguments onto a single line with stray indentation. No behaviour
change -- whitespace only.

Repoint three Javadoc/comment references that pointed at the deleted
docs/qwp/*.md design docs to their published equivalents under
https://questdb.com/docs/ -- in QwpColumnBatch, ColumnView, and
WebSocketResponse.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mtopolnik and others added 5 commits June 5, 2026 14:21
Add client-module coverage for the per-query schema state machine that
replaced the connection-scoped registry when schema-reference mode was
removed, plus the decoder and connect edge cases the change exposed:

- testResetQuerySchemaRejectsContinuationFromPriorQuery drives
  resetQuerySchema() and asserts a continuation no longer binds to the
  prior query's schema (the IoThread call site was at 0% coverage).
- testSchemaSwitchesAcrossQueriesWithDifferentColumnCount and
  testSchemaSlotsReusedAcrossQueriesWithDifferentTypes exercise the
  ensureQuerySchema grow/shrink branches, querySchema.size() on
  continuation batches, and pooled-slot reuse across type changes.
- testEmptyResultSetCarriesSchema covers a result with the schema
  present but zero rows (per-column null flags, no bodies).
- testColumnCountOutOfRangeIsRejected pins the surviving column_count
  guard, and testVersionTwoIsRejected pins the version-2 rejection the
  flattened version check introduced.
- testWalk_ServerInfoTimeoutIsTransportNotTerminal verifies the walk
  treats a post-upgrade SERVER_INFO timeout as a transport error and
  falls through to a healthy node.

Also convert a raw NUL byte and raw U+FFFF bytes in a char[] fixture to
\u escapes. The stray NUL (pre-existing) made the source read as binary
data, so grep and similar tooling silently skipped the whole file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
QwpResultBatchDecoder.close() now clears querySchemaValid alongside the
other connection-scoped state it tears down. The decoder is single-use
per connection today, so this is future-proofing: if a decoder instance
were ever pooled and reused across connections, a surviving
querySchemaValid would let the first continuation batch (batch_seq > 0)
on the reused decoder bind rows to the previous connection's schema.

No behaviour change in the current single-use lifecycle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three test comments survived the schema-reference / protocol-v2 removal
with stale wording. One was actively contradictory: the fall-through
reset test claimed the test server "doesn't emit" SERVER_INFO directly
above the setSendServerInfo(true) call the scenario now relies on -- once
the rejection clears, the rehabilitated node must send the frame because
the client always expects it.

Reword that comment to explain why target=any keeps the test focused on
the fall-through reset rather than the role filter, and drop the obsolete
"v2 server" / "v2 SERVER_INFO" qualifiers from the WalkTracker class
javadoc and the getServerInfo unit-test note. Also refresh the decoder
hardening class javadoc, which still listed "growing the schema registry
without bound" as a guarded scenario; that guard is gone, so it now names
the surviving guard (out-of-range column_count).

Comment-only changes; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The connect-walk role filter rejects an endpoint whose SERVER_INFO
frame advertises a role the target= filter excludes, but no in-repo
test exercised that branch: the existing REPLICA tests all drive the
421 X-QuestDB-Role header path, and the only test that sent a real
SERVER_INFO role used target=any, so matchesTarget(info.getRole(),
target) was never reached. Removing the v1-mismatch branch left this
surviving branch uncovered.

Add two tests against TestWebSocketServer with a clean 101 upgrade and
setSendServerInfo(true), so the role is carried only by the decoded
SERVER_INFO frame (a clean 101 ignores the upgrade-time role header):

- a single REPLICA endpoint under target=primary throws
  QwpRoleMismatchException with the observed role taken from the frame;
- a REPLICA-then-PRIMARY endpoint pair under target=primary skips the
  replica and binds the primary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mtopolnik

Copy link
Copy Markdown
Contributor Author

Review: PR #39refactor(qwp): simplify the wire protocol to one version with inline schemas

Level 3 (full mission-critical pass: change-surface map, 9 parallel agents + manual lead verification, per-finding source confirmation). Reviewed in the java-questdb-client submodule repo.

Verdict: Approve with minor changes. No Critical correctness, concurrency, or resource-leak issues. The decoder state-machine rewrite (connection-scoped schema registry → per-query inline schema) is sound, leak-safe, and correctly confined to the single IO thread. Module test-compiles cleanly; agents ran the affected suites green. The actionable items are PR-metadata accuracy and a missing regression test for the headline breaking change.

Critical

None.

Moderate

M1 — PR description misstates the error for the removed config key (in-diff: description).
The body claims removed config strings fail "with ... an 'unknown option' error." The actual behavior on both paths is "unknown configuration key":

  • Egress QwpQueryClient.fromConfigIllegalArgumentException("unknown configuration key: max_schemas_per_connection") (QwpQueryClient.java:624).
  • Ingress Sender.fromConfigLineSenderException("unknown configuration key [key=max_schemas_per_connection]") (Sender.java:3309).

Fix the description wording so it matches what users will actually see. (The behavior itself is correct and consistent across both paths — only the description is wrong.)

M2 — Title type understates a breaking change (in-diff: title/labels).
refactor(qwp): signals a no-behavior-change cleanup, but this PR changes the wire format and removes a public builder method (Sender.builder().maxSchemasPerConnection) and a connection-string option. The squash title lands on the default branch and feeds release notes. Recommend feat(qwp)!: (or a BREAKING CHANGE: footer) to mark the public-API/wire break. Note: this submodule repo only carries GitHub's default label set (no Compatibility/ILP/QWP labels), so "enhancement" is the best available — consider creating a breaking/compatibility label if the team wants these flagged here.

M3 — No regression test for the removed max_schemas_per_connection key (in-diff: tests).
The old tests (testMaxSchemasPerConnection*, testWsConfigString_withMaxSchemasPerConnection, and the max_schemas_per_connection=65535 entry in testIngressOnlyKeysSilentlyAcceptedOnEgress) were deleted with nothing replacing them. The PR's headline breaking behavior — the key now rejects — has zero coverage on either path. Add an egress reject test to QwpQueryClientFromConfigTest and an ingress one to LineSenderBuilderWebSocketTest, asserting the exact message. An exact-message assertion would also have caught M1.

Minor

m4 — testEmptyResultSetCarriesSchema under-verifies its own claim. It decodes an empty batch_seq==0 (row_count 0, 2 columns) and asserts column count/names, but never feeds a follow-up continuation to prove the schema was actually captured (querySchemaValid set) and bindable. The production behavior is correct (the schema parse + querySchemaValid = true at QwpResultBatchDecoder.java:488 runs regardless of row_count), but the test would still pass if a future change failed to set the flag. One extra continuation decode closes the gap.

m5 — The new close() reset of querySchemaValid (QwpResultBatchDecoder.java:211) is untested. resetQuerySchema()'s effect is covered, but the close()-path reset is not. It's defensive (the decoder is single-use per connection today), so low priority, but it's net-new behavior the PR introduced.

m6 — getServerInfo() Javadoc imprecision (PR rewrote this exact comment). The doc (QwpQueryClient.java:1046-1049) says it returns null "if the client is not connected," but close() (762–820) doesn't null serverInfo — only cleanupFailedConnect() does (line 1577). So after a normal close(), the accessor returns the stale last-connected value while connected==false. This is a pre-existing imprecision, but since the PR rewrote this Javadoc, consider tightening it (e.g., "null before the first successful connect").

m7 — Cosmetic comment staleness (optional cleanup).

  • QwpConstants.java:28,68 — "QWP v1 binary protocol" / "Magic bytes for QWP v1" read as stale version references now that v1/v2 collapsed to a single VERSION (the magic family "QWP1" is unchanged; reword to drop "v1").
  • QwpResultBatchDecoder.java:92-93 — "post CACHE_RESET re-registration" now refers only to symbol-dict rebuild, not schema re-registration.
  • QwpResultBatchDecoder.java:100-109 — the conn-dict comment block is now separated from the fields it documents by the inserted querySchema block (pre-existing structure, widened by this PR).

Design / compatibility notes (not defects, but call out at merge)

  • The client now always blocks on receiveServerInfoSync() (the negotiatedQwpVersion >= VERSION_2 gate is gone), making it hard-dependent on the tandem server (#7200) always emitting SERVER_INFO post-upgrade. Against any server that completes the 101 upgrade but doesn't send SERVER_INFO, every endpoint stalls for serverInfoTimeoutMs then fails the walk as a transport error. This is by-design for the breaking, merge-together change and is covered by testWalk_ServerInfoTimeoutIsTransportNotTerminal — but it's the central behavioral cliff. Confirm the server half truly always emits SERVER_INFO on egress before merging the pair.
  • Continuation batches (batch_seq>0) no longer carry column_count and the schema-mode byte is gone. A version-1 server that violates the new continuation format would silently mis-decode rows rather than erroring (the removed SCHEMA_MODE path had an explicit "unknown schema mode" reject). Acceptable: the per-frame version byte (!= VERSION reject) gates the wire and the tandem PR defines the format — but it's a real reduction in wire self-description, relying entirely on the version contract.

Downgraded (false positives / non-issues, after source verification)

  • querySchemaValid should be isQuerySchemaValid (boolean naming). Dropped — this module's instance-field booleans consistently use predicate suffixes (columnDefsCacheValid, failoverEnabled, creditEnabled, gorillaMode); the is/has rule is applied at the method level here. Renaming would break local consistency.
  • Stale {@code VERSION_1}/{@code MAX_SUPPORTED_VERSION} in QwpResultBatchDecoderHardeningTest.java:186. Dropped — it's prose in a {@code} block deliberately describing the pre-collapse range check to explain why v2 is now rejected; not a symbol link, doesn't affect compilation, reads correctly.
  • "cleanupFailedConnect never nulls serverInfo" (Agent 10). Dropped — it does, at line 1577. The valid part of that finding (normal close() not nulling it) is captured in m6.
  • "Duplicate truncated batch_seq==0 leaves a partial schema with querySchemaValid still true." Dropped — unreachable on the integrated path: querySchemaValid = true is the last statement after all throw-capable parsing, and a decode failure makes the IO thread set currentQueryDone=true and stop the receive loop, so no continuation is processed after a failure within the same query.
  • Per-query column-name String allocation is a hot-path regression. Dropped to a documented trade-off — batch_seq==0 occurs once per query (not per batch); colNameScratch already avoids the throwaway byte[]; the cost is off the per-row/per-batch path and is the explicitly stated trade-off of dropping schema ids.
  • Cross-connection stale state / leak from removing sender schema fields. Dropped — the removed nextSchemaId/maxSentSchemaId only governed the deleted reference mode; symbol-dict correctness depends on confirmedMaxId=-1 + full inline schema per frame, unaffected. QwpEgressColumnInfo holds no native memory; retention is strictly less than the old registry.

Summary

  • One-line verdict: Approve with minor changes — no blocking correctness/concurrency/leak issues; fix the PR description wording (M1), reconsider the title type (M2), and add the missing config-rejection regression test (M3).
  • Regressions/tradeoffs: every query now re-ships its inline schema on batch 0 (a few dozen bytes + one String per column per query) — documented and off the hot path. Wire self-description on continuation frames is reduced. The client is now hard-dependent on the tandem server always sending SERVER_INFO.
  • Draft findings: ~13 raised across agents; 10 verified/retained (3 Moderate, 4 Minor, plus 2 design notes), 6 dropped as false positives after source verification.
  • In-diff vs out-of-diff: all 10 retained findings are in-diff; 0 out-of-diff. This is legitimate, not an under-run: Agent 9 ran a repo-wide grep over every changed symbol — including the long positional QwpWebSocketSender.connect(...) arg chains (verified no arg-order shift past the dropped maxSchemasPerConnection), the reflection caller in SegmentedNativeBufferWriterTest, and the public-API removal — and found every callsite SAFE. The parent server repo has no references to the removed client symbols. The PR is a self-contained refactor that updated all its own callsites, and the module compiles.

@mtopolnik mtopolnik changed the title refactor(qwp): simplify the wire protocol to one version with inline schemas chore(qwp): simplify the wire protocol to one version with inline schemas Jun 8, 2026
mtopolnik and others added 2 commits June 8, 2026 10:06
The max_schemas_per_connection connection-string option was removed
along with the QWP schema-reference mechanism, but the tests that
exercised it were deleted with nothing asserting the new behaviour.
The key's removal is a breaking change, so its rejection deserves
explicit coverage on both transports.

Add a test on each fromConfig path that the key now fails:

- Egress (QwpQueryClient.fromConfig): asserts the exact
  IllegalArgumentException "unknown configuration key:
  max_schemas_per_connection". This restores the coverage lost when
  the key was dropped from testIngressOnlyKeysSilentlyAcceptedOnEgress,
  pinning that egress now rejects the key rather than silently
  accepting it.
- Ingress (Sender.fromConfig): asserts the LineSenderException
  "unknown configuration key [key=max_schemas_per_connection]".

The exact-message assertions also pin the wording so a future refactor
that tweaks the diagnostic must update the tests deliberately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mtopolnik

Copy link
Copy Markdown
Contributor Author

[PR Coverage check]

😍 pass : 38 / 43 (88.37%)

file detail

path covered line new line coverage
🔵 io/questdb/client/cutlass/qwp/client/QwpEgressIoThread.java 0 1 00.00%
🔵 io/questdb/client/cutlass/qwp/client/QwpQueryClient.java 3 5 60.00%
🔵 io/questdb/client/cutlass/qwp/client/QwpWebSocketSender.java 4 6 66.67%
🔵 io/questdb/client/cutlass/qwp/client/QwpWebSocketEncoder.java 4 4 100.00%
🔵 io/questdb/client/cutlass/qwp/client/QwpUdpSender.java 2 2 100.00%
🔵 io/questdb/client/cutlass/qwp/client/QwpResultBatchDecoder.java 23 23 100.00%
🔵 io/questdb/client/cutlass/qwp/client/QwpColumnWriter.java 2 2 100.00%

@bluestreak01 bluestreak01 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve (Level 3 review).

No Critical, correctness, concurrency, or resource-leak issues. The decoder rewrite (connection-scoped schemaRegistry → per-query inline querySchema + querySchemaValid) is sound, leak-safe, and correctly confined to the single I/O thread (resetQuerySchema() runs per query before any decode; a decode error sets currentQueryDone=true so no continuation can bind a stale schema). Module compiles; affected suites pass. All changed-symbol callsites in this repo and the parent questdb repo are SAFE — parent-repo hits for maxSchemasPerConnection/SCHEMA_MODE_REFERENCE/MAX_SUPPORTED_VERSION are the server's own io.questdb.cutlass.qwp.* symbols (tandem #7200), not callers of the removed client API.

Non-blocking items:

  • Title: chore(qwp): understates a breaking wire + public-API change (removes Sender.builder().maxSchemasPerConnection(...) + the max_schemas_per_connection option). Since the squash title lands on main and feeds release notes, prefer feat(qwp)!: or add a BREAKING CHANGE: footer.
  • Nits: stale comment "post CACHE_RESET re-registration" (QwpResultBatchDecoder.java:93); getServerInfo() Javadoc says null "if not connected" but a clean close() leaves serverInfo non-null; testEmptyResultSetCarriesSchema doesn't feed a continuation to prove querySchemaValid was set; the close()-path querySchemaValid reset is untested.

Note vs the earlier review: the latest commits resolved its M1 (description now says "unknown configuration key", which is accurate) and M3 (egress + ingress reject tests now exist with exact-message assertions).

Design dependency to confirm at merge: the client now always blocks on receiveServerInfoSync(), so it is hard-dependent on the tandem server (#7200) always emitting SERVER_INFO post-upgrade — merge the pair together.

@bluestreak01 bluestreak01 merged commit 77709a3 into main Jun 9, 2026
12 checks passed
@bluestreak01 bluestreak01 deleted the mt_remove-schema-ref branch June 9, 2026 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants