[WIP]: Integrate Stored Procedure and Query Plan request routing to Gateway V2 endpoint#47759

Open
jeet1995 wants to merge 59 commits into Azure:main from jeet1995:AzCosmos_GatewayV2_QueryPlanSupport

Conversation

@jeet1995 (Member) commented Jan 21, 2026

Summary

This PR routes Gateway V2 thin-client query/stored-procedure paths and related query-plan metadata, with validation focused on thin-client parity against Direct TCP and on diagnostics correctness for hybrid/full-text query responses.

Validation performed

Latest validation on this branch

| Area | Command / scope | Result |
| --- | --- | --- |
| Core SDK build | `mvn --batch-mode -f sdk\cosmos\azure-cosmos\pom.xml install -DskipTests ...` | PASS |
| Query feature header unit test | `mvn --batch-mode --fail-at-end -f sdk\cosmos\azure-cosmos-tests\pom.xml test "-Dtest=QueryPlanRetrieverSupportedFeaturesTest"` | PASS |
| Focused hybrid/full-text diagnostics regression | `mvn --batch-mode --fail-at-end -f sdk\cosmos\azure-cosmos-tests\pom.xml verify -Pthinclient "-Dit.test=ThinClientQueryE2ETest#testFullTextScoreRanking+testHybridSearchGatewayVsThinClient" ...` | PASS. Tests run: 2, Failures: 0, Errors: 0, Skipped: 0 |
| Full thin-client query parity suite | `mvn --batch-mode --fail-at-end -f sdk\cosmos\azure-cosmos-tests\pom.xml verify -Pthinclient "-Dit.test=ThinClientQueryE2ETest" ...` | PASS. Tests run: 81, Failures: 0, Errors: 0, Skipped: 0 |

The thin-client runs used:

-Pthinclient "-DCOSMOS.THINCLIENT_ENABLED=true" "-DCOSMOS.HTTP2_ENABLED=true" "-Dmaven.wagon.http.pool=false"

Query test methodology

ThinClientQueryE2ETest runs each query shape against the same data through two paths:

  1. Direct TCP — baseline path to backend partition replicas.
  2. Gateway V2 thin client — system-under-test path through the thin-client proxy.

The tests assert:

  1. Thin-client diagnostics include a request targeting the :10250 proxy endpoint.
  2. Direct and thin-client result counts match.
  3. Direct and thin-client result contents match, preserving order for ordered queries.
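
The three assertions above can be sketched as plain comparisons. This is a minimal illustration only: `QueryParityCheck` and its methods are hypothetical names, results are modelled as lists of strings, and the endpoint check is a simple substring match rather than the diagnostics API the real tests use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not part of the SDK or of ThinClientQueryE2ETest.
final class QueryParityCheck {

    private QueryParityCheck() {
    }

    // Assertion 1: at least one contacted endpoint targets the proxy port.
    // A substring match is an assumption made for brevity.
    static boolean usedThinClientProxy(List<String> contactedEndpoints) {
        for (String endpoint : contactedEndpoints) {
            if (endpoint.contains(":10250")) {
                return true;
            }
        }
        return false;
    }

    // Assertions 2 and 3: counts match and contents match.
    // Ordered queries must match element by element; other queries are
    // compared after sorting copies of both result lists.
    static boolean resultsMatch(List<String> direct, List<String> thinClient, boolean ordered) {
        if (direct.size() != thinClient.size()) {
            return false;
        }
        if (ordered) {
            return direct.equals(thinClient);
        }
        List<String> left = new ArrayList<>(direct);
        List<String> right = new ArrayList<>(thinClient);
        Collections.sort(left);
        Collections.sort(right);
        return left.equals(right);
    }
}
```

Sorting copies keeps the unordered comparison sensitive to duplicates, which a set-based comparison would hide.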

Query coverage validated

The passing 81-test thin-client query suite covers:

  • filters and projections
  • ORDER BY, DISTINCT, TOP, OFFSET / LIMIT
  • aggregates and GROUP BY
  • JOIN, EXISTS, LIKE, BETWEEN
  • string, math, type, array, and conditional functions
  • vector search
  • full-text ranking
  • hybrid RRF(...) queries
  • multi-range EPK routing
  • continuation-token behavior

Diagnostics regression covered

The focused 2-test run validates the diagnostics fix for hybrid/full-text paths where SDK code builds a synthetic final FeedResponse from internal component query responses. The fix propagates component query client-side request statistics into that final response so endpoint diagnostics still show the core response path.

Query feature header validation

QueryPlanRetrieverSupportedFeaturesTest verifies Java now advertises the safe CountIf query feature while intentionally not advertising:

  • ListAndSetAggregate — Java does not yet implement MAKELIST / MAKESET aggregation support.
  • HybridSearchSkipOrderByRewrite — enabling this currently fails the Java thin-client hybrid query validation against the staging account with a backend 400 / SC1001 syntax error.
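
A rough sketch of the advertise/withhold decision described above, assuming the feature list is sent as a comma-separated string. The `Feature` enum here is a stand-in that lists only the three features named in this description, and the delimiter is an assumption; neither is taken from the SDK.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical stand-in for the SDK's query feature list.
final class SupportedQueryFeaturesSketch {

    enum Feature { CountIf, ListAndSetAggregate, HybridSearchSkipOrderByRewrite }

    private SupportedQueryFeaturesSketch() {
    }

    // Everything except the two features intentionally withheld.
    static Set<Feature> advertised() {
        return EnumSet.complementOf(
            EnumSet.of(Feature.ListAndSetAggregate, Feature.HybridSearchSkipOrderByRewrite));
    }

    // Joins the feature names into a single header value.
    static String headerValue(Set<Feature> features) {
        StringJoiner joiner = new StringJoiner(",");
        for (Feature feature : features) {
            joiner.add(feature.name());
        }
        return joiner.toString();
    }
}
```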

Existing thin-client E2E test classes in this PR

Test class Coverage
ThinClientQueryE2ETest Query parity, endpoint diagnostics, full-text, hybrid, vector, continuation
ThinClientChangeFeedE2ETest forFullRange(), forLogicalPartition(), incremental change feed
ThinClientPointOperationE2ETest CRUD + Patch, bulk, batch
ThinClientStoredProcedureE2ETest Stored procedure execute, no-PK error, PartitionKey.NONE
PartitionKeyInternalTest Client-side conversion of PartitionKeyInternal ranges to sorted EPK ranges

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 changed the title from "Az cosmos gateway v2 query plan support" to "[Gateway V2 / DO NOT MERGE]: Integrate Stored Procedure and Query Plan request routing to a Gateway V2 endpoint." on Jan 29, 2026
@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 changed the title from "[Gateway V2 / DO NOT MERGE]: Integrate Stored Procedure and Query Plan request routing to a Gateway V2 endpoint." to "[Gateway V2][DO NOT MERGE]: Integrate Stored Procedure and Query Plan request routing to a Gateway V2 endpoint." on Jan 30, 2026
@BeforeClass(groups = {"thinclient"})
public void beforeClass() {
// If running locally, uncomment these lines
// System.setProperty("COSMOS.THINCLIENT_ENABLED", "true");

Member

do we still need these? probably can clean up now

Member Author

At least COSMOS.THINCLIENT_ENABLED is always needed from an environment variable perspective (I'll clean the HTTP/2 enabled environment variable).


if (allDocs != null && !allDocs.isEmpty()) {
for (ObjectNode doc : allDocs) {
String id = doc.get(ID_FIELD).asText();

@xinlian12 xinlian12 Jan 30, 2026

Member

Seems like this could use the deleteDocuments() method below. Also, not sure how many docs need to be cleaned up each time; using bulk could be faster.

Member Author

Fixed — the old ThinClientE2ETest.java has been removed and replaced with separate test classes (ThinClientQueryE2ETest, ThinClientPointOperationE2ETest, ThinClientChangeFeedE2ETest, ThinClientStoredProcedureE2ETest). Cleanup uses bulkDelete via executeBulkOperations in @AfterClass.

for (Map.Entry<String, AggregateOperator> aliasToAggregate : aggregateAliasToAggregateType.entrySet()) {
String alias = aliasToAggregate.getKey();
AggregateOperator aggregateOperator = null;
Object aliasAggregateOperator = aliasToAggregate.getValue();

Member

defined but not used?

Member Author

Cleaned up in subsequent refactoring — the SingleGroupAggregator change is no longer in the diff.

requestHeaders);
queryPlanRequest.useGatewayMode = true;

// queryPlanRequest.useGatewayMode = true;

Member

nit: remove

Member Author

Removed in subsequent cleanup.

}

// 3. Execute SELECT * FROM C WHERE c.id = @id query
String query = "SELECT * FROM c WHERE c.id = @id";

Member

Curious: is full-text search etc. supported? I don't see tests for it here.

Member Author

Unsure - (likewise about vector search / hybrid search). I'll confirm with the proxy service team.

Member

Should work because there is no specific logic in Gateway

Member

We should have a pipeline running all the query tests in Gateway mode with thin client - let's sync-up at our JVM meeting today.

@xinlian12 xinlian12 left a comment

Member

LGTM, thanks

@@ -46,80 +54,360 @@
public class ThinClientE2ETest {

Member

Just a thought: why not use the existing tests with the clientBuilders pattern and add a thinproxy clientbuilder? That way, we can gradually add this provider to as many existing tests as we want and have them run on thinproxy.

@jeet1995 jeet1995 Jan 30, 2026

Member Author

@mbhaskar - sounds good. I'll make this change.

try {
// Use Jackson to deserialize using PartitionKeyInternal's custom deserializer
return Utils.getSimpleObjectMapper().treeToValue(node, PartitionKeyInternal.class);
} catch (Exception e) {

Member

Catch specific exception types, like JsonProcessingException?

@FabianMeiswinkel FabianMeiswinkel left a comment

Member

LGTM - Thanks

@xinlian12

Member

Review complete (39:11)

⚠️ Synthesis step failed — review findings may be incomplete or missing.

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, test-coverage | ✗ synthesis

jeet1995 added a commit that referenced this pull request Jun 4, 2026
…teway V2 (#48787)

* Add ReadConsistencyStrategy RNTBD header and enable for all connection modes (#48094)

- Add ReadConsistencyStrategy RNTBD header (0x00F0, String) to RntbdConstants
  and RntbdRequestHeaders for thin client proxy propagation
- Replace Gateway-mode warn+ignore with client-side GLOBAL_STRONG validation
  that works across all modes (direct, gateway, thin client)
- Update ReadConsistencyStrategy javadoc to reflect all-modes support
- Add unit tests for RNTBD token encoding and round-trip
- Add E2E tests for thin client and compute gateway ReadConsistencyStrategy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix: prevent x-ms-consistency-level rewrite in gateway mode when RCS is set (#48094)

RxGatewayStoreModel.applySessionToken() called RequestHelper.getReadConsistencyStrategyToUse()
which had a side-effect of rewriting x-ms-consistency-level header (e.g., LATEST_COMMITTED
mapped to BoundedStaleness). The compute gateway rejected this because BoundedStaleness is
stricter than the Session account default.

Fix: Use a copy of the headers map so the original x-ms-consistency-level is preserved.
Gateway/proxy now sees:
- x-ms-consistency-level: Session (original, unchanged)
- x-ms-cosmos-read-consistency-strategy: LatestCommitted (RCS intent)

Verified E2E: LATEST_COMMITTED, EVENTUAL, SESSION, client-level RCS all return 200.
GLOBAL_STRONG correctly throws BadRequestException on Session account.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix: strip x-ms-consistency-level when ReadConsistencyStrategy is set (#48094)

Compute gateway rejects requests containing both x-ms-consistency-level and
x-ms-cosmos-read-consistency-strategy headers. When RCS is non-DEFAULT, remove
the consistency-level header — RCS takes precedence.

Applied to both client-level and request-options-level RCS paths in
RxDocumentClientImpl.getRequestHeaders().

Verified E2E against test4 compute gateway (swkrish-session, Session account):
LATEST_COMMITTED, EVENTUAL, SESSION, client-level RCS all return 200.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix: change ReadConsistencyStrategy RNTBD token type from String to Byte (#48094)

The proxy expects ReadConsistencyStrategy as a Byte enum, not a String:
  Eventual=1, Session=2, LatestCommitted=3, GlobalStrong=4

With String type, the proxy couldn't parse the RNTBD frame and hung.
With Byte type, thin client reads work correctly through the proxy.

Added RntbdReadConsistencyStrategy enum to RntbdConstants matching
the proxy's ReadConsistencyStrategy.h enum values.

Verified E2E against test4 thin client proxy (swkrish-session):
SESSION, EVENTUAL, LATEST_COMMITTED all return 200 with tc=true.
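
For illustration, a byte-valued enum with the mapping listed above might look like the sketch below. The type name `RntbdReadConsistencyStrategySketch` and the `fromId` helper are hypothetical; only the four name/value pairs come from this commit message.

```java
// Hypothetical sketch; the real enum lives in RntbdConstants and may differ.
enum RntbdReadConsistencyStrategySketch {
    EVENTUAL((byte) 0x01),
    SESSION((byte) 0x02),
    LATEST_COMMITTED((byte) 0x03),
    GLOBAL_STRONG((byte) 0x04);

    private final byte id;

    RntbdReadConsistencyStrategySketch(byte id) {
        this.id = id;
    }

    // The single byte written into the RNTBD token.
    byte id() {
        return id;
    }

    // Reverse lookup, e.g. when decoding a frame in a test.
    static RntbdReadConsistencyStrategySketch fromId(byte id) {
        for (RntbdReadConsistencyStrategySketch value : values()) {
            if (value.id == id) {
                return value;
            }
        }
        throw new IllegalArgumentException("Unknown ReadConsistencyStrategy id: " + id);
    }
}
```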

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update E2E tests: dynamic database/container creation, use TestConfigurations (#48094)

Removed hardcoded database/container names. Tests now:
- Create a unique database and container in @BeforeClass
- Clean up in @AfterClass
- Use TestConfigurations.HOST/MASTER_KEY from cosmos-v4.properties
- Use /pk as partition key path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix E2E tests for serverless accounts: remove throughput parameter (#48094)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Comprehensive E2E tests for ReadConsistencyStrategy across all request option types (#48094)

Cover all 5 surfaces: ItemRequestOptions, QueryRequestOptions,
ChangeFeedRequestOptions, ReadManyRequestOptions, CosmosClientBuilder (client-level).
Plus write-ignored, GLOBAL_STRONG validation, and CL+RCS precedence tests.

Tests use dynamic database/container creation, TestConfigurations, serverless-safe.
Follows ThinClientTestBase pattern from PR #47759.

Verified E2E: 21/21 PASS (10 thin client + 11 gateway V1) against
swkrish-session (test4, Session, North Europe).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix test-resources.json: bump Cosmos DB API version to 2023-04-15

The Cosmos DB RP now rejects API version 2022-08-15 for accounts
with EnableNoSQLVectorSearch capability, requiring 2023-04-15 or later.

Error: 'Please use api version 2022-02-15-preview or 2023-04-15 or later'
ActivityId: 3e0f30d8-548a-408f-9827-09c3b81a7166

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Test: revert to 2022-08-15 with EnableNoSQLVectorSearch removed

Verifying that the API version rejection was caused by EnableNoSQLVectorSearch.
This commit will be reverted after pipeline verification.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix test-resources.json: bump API version to 2023-04-15 for EnableNoSQLVectorSearch (#48094)

Cosmos DB RP now requires API version 2023-04-15+ for accounts with
EnableNoSQLVectorSearch capability. Confirmed by testing:
- 2022-08-15 + EnableNoSQLVectorSearch = 400 BadRequest
- 2022-08-15 without EnableNoSQLVectorSearch = passes
- 2023-04-15 + EnableNoSQLVectorSearch = passes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Test: re-verify 2022-08-15 + EnableNoSQLVectorSearch (expect 400)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Centralize consistency flag contention resolution in RxGatewayStoreModel

- Add resolveEffectiveConsistencyHeaders() in RxGatewayStoreModel that
  strips x-ms-consistency-level when ReadConsistencyStrategy wins.
  Called before wrapInHttpRequest — affects both GW V1 (HTTP) and
  GW V2/ThinClientStoreModel (RNTBD).

- Fix contention bug in RxDocumentClientImpl.getRequestHeaders():
  options.getConsistencyLevel() no longer re-adds CL header when RCS
  is already present (Option A guard).

- Rules: request-level RCS > client-level RCS; RCS > ConsistencyLevel.
  Only one consistency header survives on the wire.

- 10 unit tests (ConsistencyFlagContentionTest): both-set, request-ctx
  priority, header-level, DEFAULT transparent, null, idempotency.

- 4 new E2E tests: request-level contention and request-level RCS
  override for both GW V1 and GW V2 paths.
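
The contention rules above can be sketched on a plain header map. This is an assumption-laden illustration, not the production method: `ConsistencyHeaderSketch.resolve` is a hypothetical name, strategies are modelled as strings, and a null or `Default` strategy is treated as unset.

```java
import java.util.Map;

// Hypothetical sketch of the rules; not RxGatewayStoreModel itself.
final class ConsistencyHeaderSketch {

    static final String CONSISTENCY_LEVEL = "x-ms-consistency-level";
    static final String READ_CONSISTENCY_STRATEGY = "x-ms-cosmos-read-consistency-strategy";

    private ConsistencyHeaderSketch() {
    }

    // requestLevelStrategy may be null; the map may already carry a
    // client-level strategy header.
    static void resolve(Map<String, String> headers, String requestLevelStrategy) {
        // Rule 1: request-level strategy beats client-level strategy.
        String effective = isSet(requestLevelStrategy)
            ? requestLevelStrategy
            : headers.get(READ_CONSISTENCY_STRATEGY);

        if (isSet(effective)) {
            // Rule 2: strategy beats consistency level, so only one survives.
            headers.put(READ_CONSISTENCY_STRATEGY, effective);
            headers.remove(CONSISTENCY_LEVEL);
        } else {
            // No effective strategy: leave the consistency level untouched.
            headers.remove(READ_CONSISTENCY_STRATEGY);
        }
    }

    private static boolean isSet(String strategy) {
        return strategy != null && !"Default".equalsIgnoreCase(strategy);
    }
}
```

Calling `resolve` twice on the same map yields the same headers, which matches the idempotency case the unit tests list.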

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add spy-wire tests to verify ReadConsistencyStrategy headers on the wire (#48094)

Verify actual HTTP headers sent by the SDK using the SpyClientUnderTest
(Mockito spy on HttpClient) pattern from RequestHeadersSpyWireTest.

5 new tests:
- Request-level RCS: x-ms-cosmos-read-consistency-strategy on wire, CL stripped
- Client-level RCS: same header verification via builder-level config
- Both RCS + CL: RCS wins, CL stripped (contention resolution)
- DEFAULT RCS: no RCS header emitted (transparent)
- Write with client RCS: no RCS header on write operations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix RNTBD header tests for Byte type + add ThinClient RNTBD spy-wire tests (#48094)

- Fix existing RntbdReadConsistencyStrategyHeaderTests: update token type
  assertions from String to Byte, matching the proxy-compatible encoding
  (Eventual=0x01, Session=0x02, LatestCommitted=0x03, GlobalStrong=0x04)

- Add 7 new RNTBD spy-wire tests that simulate ThinClientStoreModel.wrapInHttpRequest():
  Build RntbdRequest from RxDocumentServiceRequest with RCS headers, encode
  to ByteBuf, and verify the RNTBD frame contains header 0x00F0 with correct
  byte value. Covers all 4 RCS strategies, absent RCS, and CL-stripped scenario.

- Fix GLOBAL_STRONG E2E test: disable with TODO — BadRequestException from
  validateReadConsistencyStrategy() is swallowed by availability strategy and
  does not propagate to the caller. Validation works (unit tests prove it).

- Add readConsistencyStrategyRntbdByteEnumValues test verifying the enum IDs.

All 42 tests pass: 11 GW E2E + 10 contention unit + 21 RNTBD unit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Extract RCS HTTP spy-wire tests into standalone serverless-safe class (#48094)

Move 5 ReadConsistencyStrategy HTTP spy-wire tests from RequestHeadersSpyWireTest
(which extends TestSuiteBase and requires provisioned throughput) into a new
standalone ReadConsistencyStrategyHttpSpyWireTest class that creates its own
serverless-safe resources (no throughput). This allows the spy-wire tests to
run on serverless accounts (e.g., test4 Session accounts) without depending
on TestSuiteBase shared collections.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Rename RCS abbreviation to readConsistencyStrategy throughout (#48094)

Replace all occurrences of the RCS abbreviation with the full
readConsistencyStrategy name in comments, variable names, method
names, test method names, and string literals across 8 files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add spy-wire, mock-pipeline, and thin-client GLOBAL_STRONG tests (#48094)

- Spy-wire: request-level readConsistencyStrategy overrides client-level on wire
- Spy-wire: client-level DEFAULT does not leak header to wire
- RxGatewayStoreModel mock: readConsistencyStrategy survives applySessionToken
  + resolveEffectiveConsistencyHeaders pipeline, CL stripped
- Thin client E2E: GLOBAL_STRONG on Session account throws BadRequestException

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix unused strategy parameter in RntbdReadConsistencyStrategyHeaderTests (#48094)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove test-output build artifacts from tracking (#48094)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Replace FQN GatewayConnectionConfig with import (#48094)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Replace FQN HttpRequest with import in RntbdReadConsistencyStrategyHeaderTests (#48094)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Eliminate simulated resolveEffectiveConsistencyHeaders in tests (#48094)

Extract static resolveEffectiveConsistencyHeaders(headers, readConsistencyStrategy)
from RxGatewayStoreModel so tests call the real production code instead of
duplicated simulation logic. Removes ~60 lines of test-only copies that could
silently diverge from the production method.

- ConsistencyFlagContentionTest: calls RxGatewayStoreModel.resolveEffectiveConsistencyHeaders directly
- RntbdReadConsistencyStrategyHeaderTests: delegates to the same method
- RxGatewayStoreModel: private instance method delegates to public static overload

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add operation policy E2E tests and document RntbdReadConsistencyStrategy tests (#48094)

- Gateway E2E: readConsistencyStrategy set via CosmosOperationPolicy propagates
  end-to-end and is reflected in CosmosDiagnostics
- Gateway E2E: operation policy readConsistencyStrategy overrides request-level
- RntbdReadConsistencyStrategyHeaderTests: add class-level Javadoc explaining
  purpose, consistency headers decision matrix, and test regions; move data
  providers to top; remove duplicate data provider; add region comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add thin client operation policy E2E tests for readConsistencyStrategy (#48094)

- Thin client E2E: readConsistencyStrategy set via CosmosOperationPolicy
  propagates through RNTBD proxy and is reflected in CosmosDiagnostics
- Thin client E2E: operation policy readConsistencyStrategy overrides
  request-level readConsistencyStrategy

Mirrors the Gateway V1 operation policy tests for coverage parity.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix RxGatewayStoreModelTest mock verification for retry behavior (#48094)

ConnectTimeoutException triggers retries, so httpClient.send() is called
multiple times. Use atLeastOnce() and capture the last request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Use Gateway V1/V2 terminology in customer-facing docs (#48094)

Update CHANGELOG and ReadConsistencyStrategy Javadoc to say
"Direct, Gateway V1, and Gateway V2" instead of "all connection modes".
Clarify that Gateway V1 uses HTTP headers and Gateway V2 uses RNTBD headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix CHANGELOG/Javadoc: Direct mode was already supported before this PR (#48094)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add readAllItems E2E tests for readConsistencyStrategy on Gateway V1 and V2 (#48094)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Replace HashMap copy in applySessionToken with pure isEffectiveSessionConsistency (#48094)

applySessionToken only needs to know if the effective consistency is Session.
The previous approach called RequestHelper.getReadConsistencyStrategyToUse()
which mutates x-ms-consistency-level (a Direct-mode telemetry concern), requiring
a defensive HashMap copy on every non-master request.

New approach: a pure-read isEffectiveSessionConsistency() that checks
requestContext.readConsistencyStrategy > header readConsistencyStrategy >
header consistencyLevel > account default, with no side-effects and no copy.
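
A side-effect-free version of that priority chain could look roughly like this. The signature is hypothetical (the real method reads from the request and its context), values are modelled as strings, and treating a null or `Default` strategy as unset is an assumption.

```java
// Hypothetical sketch of the priority chain; it reads its inputs and
// mutates nothing, so no defensive copy of the headers is needed.
final class SessionConsistencySketch {

    private SessionConsistencySketch() {
    }

    static boolean isEffectiveSessionConsistency(
        String requestContextStrategy,
        String headerStrategy,
        String headerConsistencyLevel,
        String accountDefaultConsistency) {

        // 1. Strategy set on the request context.
        if (isSet(requestContextStrategy)) {
            return "Session".equalsIgnoreCase(requestContextStrategy);
        }
        // 2. Strategy carried in the request headers.
        if (isSet(headerStrategy)) {
            return "Session".equalsIgnoreCase(headerStrategy);
        }
        // 3. Consistency level carried in the request headers.
        if (headerConsistencyLevel != null) {
            return "Session".equalsIgnoreCase(headerConsistencyLevel);
        }
        // 4. Account default.
        return "Session".equalsIgnoreCase(accountDefaultConsistency);
    }

    private static boolean isSet(String strategy) {
        return strategy != null && !"Default".equalsIgnoreCase(strategy);
    }
}
```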

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactoring

* Consolidate ReadConsistencyStrategy test coverage into unified spy wire test

- Combine RntbdReadConsistencyStrategyHeaderTests metadata tests (headerId, headerType, headerNotRequired) into single readConsistencyStrategyTokenMetadata test
- Rename test regions to no-contention vs contention for clarity
- Add client-level vs request-level RCS comment explaining resolution precedence
- Merge idempotency and no-op tests from deleted ConsistencyFlagContentionTest
- Delete ConsistencyFlagContentionTest (redundant with RntbdHeaderTests)
- Delete ReadConsistencyStrategyHttpSpyWireTest (replaced by unified test)
- Remove RCS test from RxGatewayStoreModelTest (covered by unified test)
- Create ReadConsistencyStrategySpyWireTest: unified spy wire test covering both Gateway V1 (HTTP headers) and Gateway V2 (RNTBD frame via RntbdRequest.decode)
- V2 tests use proper RNTBD frame parsing instead of brute-force byte scanning
- V2 routing assertion validates thin client proxy endpoint (:10250)
- Use TestConfigurations instead of inline System.getProperty/getenv
- Use lazy accessor pattern for ImplementationBridgeHelpers
- Use TestSuiteBase.createCollection for serverless-safe container creation
- Remove unnecessary CosmosItemSerializer.DEFAULT_SERIALIZER
- Use thinclient test group consistent with existing E2E tests
- Eliminate all abbreviations (RCS, CL, isV2) in favor of full names

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactoring

* Fix unified E2E test: standalone pattern without TestSuiteBase inheritance

- Remove TestSuiteBase inheritance to avoid @BeforeSuite shared container init
  which fails on serverless accounts (provisioned throughput not supported)
- Use TestSuiteBase.createCollection as static utility for no-throughput containers
- Set COSMOS.THINCLIENT_ENABLED system property in @BeforeClass for V2 detection
- Configure HTTP/2 via GatewayConnectionConfig.setHttp2ConnectionConfig for V2 builder
- Probe V2 availability at startup via diagnostics endpoint check
- All 30 tests pass (15 V1 + 15 V2) against swkrish-session

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Move ReadConsistencyStrategy test files to com.azure.cosmos.rx package

- Move ReadConsistencyStrategySpyWireTest and ReadConsistencyStrategyE2ETest to com.azure.cosmos.rx for access to TestSuiteBase protected utilities
- Keep RntbdReadConsistencyStrategyHeaderTests in directconnectivity.rntbd (requires package-private RntbdToken/RntbdTokenType)
- Fix imports for all moved files
- E2E test uses TestSuiteBase.assertThinClientEndpointUsed via same-package access

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove v2Available/gatewayV2Available gates and probe logic

- V2 routing is proven by :10250 endpoint assertion on each request, not by upfront probe
- Set COSMOS.THINCLIENT_ENABLED in @BeforeClass without save/restore (CI handles it)
- DataProvider always returns both V1 and V2 modes
- Remove hasThinClientEndpoint helper from E2E (use TestSuiteBase.assertThinClientEndpointUsed)
- Remove unused imports (CosmosDiagnosticsContext, CosmosDiagnosticsRequestInfo, Collection)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Rename E2E and spy wire tests to call out gateway scope

- ReadConsistencyStrategyE2ETest → GatewayReadConsistencyStrategyE2ETest
- ReadConsistencyStrategySpyWireTest → GatewayReadConsistencyStrategySpyWireTest

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Replace brute-force RNTBD byte scanners with RntbdRequest.decode in unit tests

- RntbdReadConsistencyStrategyHeaderTests now uses RntbdRequest.decode() for all
  RNTBD frame assertions, matching the spy wire test approach
- Eliminates false-positive risk from byte-pattern scanning in binary frames
- Removes containsRntbdHeaderWithByte, containsRntbdHeaderId, containsRntbdHeaderWithAnyValue

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactoring

* Improve ReadConsistencyStrategy javadoc for customer readability

- Document precedence: RCS > ConsistencyLevel, request-level > client-level
- Note write operations always use DEFAULT
- List all connection mode support (Direct, Gateway, Gateway V2)
- Tighten enum value descriptions
- Remove internal details (HTTP header, RNTBD header)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactoring

* Add Direct mode to unified E2E test for ReadConsistencyStrategy

- Add Direct connection mode as third transport in DataProvider
- All 45 tests pass (15 scenarios × 3 modes: GatewayV1, GatewayV2, Direct)
- Enables backend telemetry comparison across all three transports

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address review: treat DEFAULT as transparent, add SESSION E2E tests

- Fix isEffectiveSessionConsistency to treat RCS 'Default' as transparent
  (fall through to CL/account default instead of returning false)
- Fix resolveEffectiveConsistencyHeaders to strip 'Default' sentinel header
- Prevent 'Default' header from being written in query/changefeed paths
  (DocumentQueryExecutionContextBase, ChangeFeedQueryImpl)
- Add warning log for unknown ReadConsistencyStrategy values in RNTBD switch
- Add SESSION E2E tests: query, readAll, changeFeed, readMany, client-level

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactoring

* Refactoring

* Fixing tests.

* Revert ReadConsistencyStrategy.java javadoc to match Azure/main

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add SYNC NOTE linking resolveEffectiveConsistencyHeaders and isEffectiveSessionConsistency

Both methods independently implement the same RCS > CL priority chain.
Added cross-referencing Javadoc @link so future editors know the two
must stay in sync.

Items 1-4 (query/changefeed DEFAULT guards, isEffectiveSessionConsistency
DEFAULT filter, resolveEffectiveConsistencyHeaders DEFAULT stripping) were
already correctly implemented in the current code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address review: enum-switch, session-consistency tests, javadoc fixes (#48094)

- Replace string-switch with enum-based parsing in addReadConsistencyStrategy (RntbdRequestHeaders)
- Add 13 isolated unit tests for isEffectiveSessionConsistency covering all 4 priority branches
- Fix token ID javadoc typo: 0x00F0 -> 0x00FE in test classes
- Document ConsistencyLevel stripping as Java-specific behavior in resolveEffectiveConsistencyHeaders

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Extract shared resolveEffectiveReadConsistencyStrategy to eliminate dual priority logic (#48094)

Addresses review comment on dual priority logic divergence risk.
Extracts resolveEffectiveReadConsistencyStrategy() as single source of truth
for 'which RCS wins?' — consumed by both resolveEffectiveConsistencyHeaders
(header mutation) and isEffectiveSessionConsistency (session-token decision).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Validate ReadConsistencyStrategy regardless of header vs requestContext source

Move the validateReadConsistencyStrategy guard in RequestHelper out of the
header-only branch so it also runs when the strategy is set programmatically
via requestContext.readConsistencyStrategy. Without this, a non-Strong account
issuing a GLOBAL_STRONG read via the requestContext path bypasses the
client-side BadRequest and surfaces an INVALID_RESULT (500 / 20910) from the
Direct quorum reader.

Skip readItem_globalStrong_invalidAccount_throwsBadRequest at runtime when
the account default consistency is already STRONG, since GLOBAL_STRONG is
valid in that case and the BadRequest assertion does not apply. Validated
across a 9-run consistency matrix on thin-client-multi-writer-ci and
thin-client-multi-region-ci.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add LATEST_COMMITTED and GLOBAL_STRONG coverage to applySessionToken tests

Address review feedback (xinlian12, #48787): the
readConsistencyStrategySessionTokenProvider data provider only exercised
SESSION / EVENTUAL / DEFAULT strategies, leaving the quorum-read strategies
(LATEST_COMMITTED, GLOBAL_STRONG) untested. A regression that incorrectly
attached a client-maintained session token to a quorum read would have gone
unnoticed.

Adds 11 new rows covering:
- Request-level RCS=LATEST_COMMITTED against SESSION / EVENTUAL / STRONG accounts
- Request-level RCS=GLOBAL_STRONG against a STRONG account
- Client-header RCS=LatestCommitted / GlobalStrong
- Priority interactions: quorum RCS beating client-header SESSION, and the
  reverse case where request-level RCS=SESSION beats a client-header
  LATEST_COMMITTED
- Account default=STRONG without any RCS override (quorum read, no token)
- Account default=STRONG overridden by request CL=SESSION (token applied)

All rows pass: 23 / 23 in applySessionTokenWithReadConsistencyStrategy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Soften thread-safety javadoc on resolveEffectiveConsistencyHeaders

Address review feedback (xinlian12, #48787): the previous comment claimed
that concurrent idempotent HashMap mutations are 'safe', which is a
misleading generalization. HashMap is not thread-safe even for logically
idempotent operations.

Rewrite the comment to describe the actual situation: clones share the
header map by reference, but the only mutation performed here (a single
remove of CONSISTENCY_LEVEL when a non-DEFAULT ReadConsistencyStrategy is
effective) is convergent across all concurrent callers and never adds keys,
so no resize is triggered. Point to the correct fix (deep-copy in clone())
if stronger guarantees are ever required, rather than adding synchronization
on this hot path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Deep-copy headers map in RxDocumentServiceRequest.clone() to make hedged clones race-free

Availability-strategy / hedging clones previously shared the same HashMap
reference with the parent request. resolveEffectiveConsistencyHeaders mutates
the map (remove(CONSISTENCY_LEVEL) + put(READ_CONSISTENCY_STRATEGY, ...)) on
every request. When two hedged clones race through that method on the same
map at the HashMap resize threshold (~12 entries -- typical for a Cosmos
request), the concurrent put can trigger resize while the other thread is
mid-insert. Failure modes per JLS / HashMap impl: orphaned inserts in the
discarded array, null reads of unrelated keys (e.g. SESSION_TOKEN), or NPE
walking a partially-relinked chain.

Fix: defensive-copy headers at clone time, matching the pattern already used
for requestContext and faultInjectionRequestContext just below. Each clone
now mutates its own map; no synchronization needed.
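
The difference between the old and new clone behavior can be sketched as follows. `SketchRequest` and the `consistency-level` key are hypothetical stand-ins for `RxDocumentServiceRequest` and its real header names; the sketch only illustrates sharing a map reference versus copying it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for RxDocumentServiceRequest.
final class SketchRequest {
    private final Map<String, String> headers;

    SketchRequest(Map<String, String> headers) {
        this.headers = headers;
    }

    Map<String, String> getHeaders() {
        return headers;
    }

    // Old behavior: the clone shares the parent's map by reference,
    // so two hedged clones mutate the same HashMap concurrently.
    SketchRequest cloneShared() {
        return new SketchRequest(this.headers);
    }

    // New behavior: the clone gets its own copy of the map, so header
    // mutations on one clone are invisible to the parent and its siblings.
    SketchRequest cloneIsolated() {
        return new SketchRequest(new HashMap<>(this.headers));
    }
}
```

A shallow `new HashMap<>(...)` copy is sufficient in this sketch because the values are immutable strings.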

Also tightened the Thread-safety javadoc on resolveEffectiveConsistencyHeaders
to reflect the new guarantee (safe by isolation rather than 'convergent
mutation on shared map').

Verified: applySessionTokenWithReadConsistencyStrategy 23/23 PASS.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Reformat resolveEffectiveConsistencyHeaders javadoc with proper HTML structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Reorder RCS resolution methods caller->callee; tighten helper to private static

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add focused unit test for resolveEffectiveConsistencyHeaders DEFAULT strip

Covers the canonicalization safety net that protects both GW V1 (HTTP) and GW V2 (RNTBD via ThinClientStoreModel) from emitting a stale DEFAULT ReadConsistencyStrategy header on the wire. Catches regressions faster than the existing GatewayReadConsistencyStrategySpyWireTest end-to-end coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Split GLOBAL_STRONG client-side validation into Breaking Changes section in CHANGELOG

Calls out the new client-side fast-fail (BadRequestException) for GLOBAL_STRONG on non-Strong-consistency accounts as a potentially breaking change for callers that previously relied on the implicit downgrade.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert RequestHelper validation relocation to keep this PR focused on Gateway V1/V2

The Gateway V1/V2 path added in this PR uses RxGatewayStoreModel.resolveEffectiveConsistencyHeaders and does not flow through RequestHelper.getReadConsistencyStrategyToUse. The GLOBAL_STRONG fast-fail behavior the PR ships is already enforced at upstream call sites (RxDocumentClientImpl, DocumentQueryExecutionContextBase, ChangeFeedQueryImpl). The relocated validation in RequestHelper closes a narrow defense-in-depth gap in the Direct path that pre-dates this PR; track it as a separate fix to avoid risking a regression on the Direct hot path (ConsistencyReader / Writer / QuorumReader / BarrierRequestHelper) as part of this change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add unit test for RxDocumentServiceRequest.clone() header isolation

Addresses PR review comment #3341956318. Verifies that mutating the cloned
request's header map does not affect the original — guarding against the
HashMap-corruption race fixed in clone() where hedged availability-strategy
clones previously shared the parent's header map.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update ReadConsistencyStrategy Javadoc to reflect cross-mode support

The previous note claimed the strategy only worked in direct mode; with this PR it propagates over Gateway as well.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Move GLOBAL_STRONG fast-fail validation from Breaking Changes to Other Changes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Trim GLOBAL_STRONG CHANGELOG entry wording

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Cosmetic: drop redundant toString and relocate Logger field

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR #48787 review: CHANGELOG placement, RNTBD invariant test, setter javadoc

- Move ReadConsistencyStrategy CHANGELOG entries from shipped 4.80.0 to 4.81.0-beta.1 (Unreleased)
- Add invariant test that iterates ReadConsistencyStrategy.values() and asserts every non-DEFAULT value emits a non-zero RNTBD byte; adding a new enum value will fail this test until RntbdRequestHeaders switch is updated
- Add javadoc to 5 setReadConsistencyStrategy setters (CosmosItemRequestOptions, CosmosQueryRequestOptions, CosmosChangeFeedRequestOptions, CosmosReadManyRequestOptions, CosmosReadManyByPartitionKeysRequestOptions) describing cross-mode support and client-side GLOBAL_STRONG fast-fail
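
The invariant test in the second bullet might look roughly like this. The enum mirrors the strategy names mentioned in this PR, but the byte values and the `toWireByte` helper are hypothetical; the real RNTBD encoding is not given here.

```java
// Hypothetical stand-in for ReadConsistencyStrategy.
enum SketchStrategy { DEFAULT, EVENTUAL, SESSION, LATEST_COMMITTED, GLOBAL_STRONG }

final class SketchRntbdEncoding {
    // Hypothetical mapping; 0x00 means "no value emitted on the wire".
    static byte toWireByte(SketchStrategy strategy) {
        switch (strategy) {
            case EVENTUAL:         return 0x01;
            case SESSION:          return 0x02;
            case LATEST_COMMITTED: return 0x03;
            case GLOBAL_STRONG:    return 0x04;
            default:               return 0x00; // DEFAULT or an unmapped new value
        }
    }

    // Iterates values() so that a newly added enum constant without a
    // switch case falls into the default branch and fails the check.
    static void assertEveryNonDefaultValueIsMapped() {
        for (SketchStrategy strategy : SketchStrategy.values()) {
            if (strategy != SketchStrategy.DEFAULT && toWireByte(strategy) == 0) {
                throw new AssertionError("No wire byte mapped for " + strategy);
            }
        }
    }
}
```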

Follow-up issue #49370 filed for spy-wire test coverage on Query/ChangeFeed/readMany paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Extend GatewayReadConsistencyStrategySpyWireTest with Query, ChangeFeed and readManyByPartitionKeys coverage

Adds spy-wire assertions on the V1 gateway and V2 thin-client paths for the three feed-style operations that route through the thin client: Query, incremental ChangeFeed, and readManyByPartitionKeys. Mirrors the existing point-read coverage with request-level ReadConsistencyStrategy, default (no-header) and ReadConsistencyStrategy-vs-ConsistencyLevel contention scenarios. ChangeFeed contention is intentionally omitted because CosmosChangeFeedRequestOptions does not expose setConsistencyLevel.

Addresses review comment #48787 (comment).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix spy-wire test setup bugs

- Initialize spy client's operationPolicies field via reflection to avoid
  NPE on ChangeFeed path (RxDocumentClientImpl bypasses Builder init).
- Pre-slice ByteBuf to exactly expectedLength bytes before RntbdRequest.decode
  so header decoder stops at the frame boundary instead of reading into payload.
- Filter query-plan precursor requests (x-ms-cosmos-is-query-plan-request: True)
  and accept GET (change-feed) in addition to POST (query/readMany) for V1
  feed-request matching.

Spy-wire suite: 11 failures -> 0. All 29 tests pass on thin-client-multi-region-ci.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Pin Http2ConnectionConfig.enabled in V1 spy to make CI deterministic

GatewayConnectionConfig's default constructor installs a non-null
Http2ConnectionConfig with enabled=null, which falls back to the
global COSMOS.HTTP2_ENABLED system property at runtime. CI sets that
property to true (sdk/cosmos/tests.yml), which silently flipped the
'V1' spy clients in GatewayReadConsistencyStrategySpyWireTest to the
thin-client path. The test's V1-shaped request assertions
(GET /docs/{id}, POST /docs without id) then could not match the
captured wire requests (POST to thin proxy :10250), producing 9-11
'Expected a document read request' and 'expected LatestCommitted but
was null' failures across all V1 test methods.

Fix: in createSpyClient(rcs, http2Enabled) explicitly call
setHttp2ConnectionConfig(new Http2ConnectionConfig().setEnabled(http2Enabled)).
This pins V1 to HTTP/1.1 and V2 to HTTP/2 regardless of the JVM-wide
property, making the test deterministic in CI and locally.
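
The tri-state fallback that caused the flakiness can be sketched as below. `SketchHttp2Config` is a hypothetical stand-in for `Http2ConnectionConfig`, and treating an unset system property as `false` is an assumption made only for this illustration.

```java
// Hypothetical stand-in for Http2ConnectionConfig.
final class SketchHttp2Config {
    // null means "not set explicitly"; the JVM-wide property decides.
    private Boolean enabled;

    SketchHttp2Config setEnabled(Boolean enabled) {
        this.enabled = enabled;
        return this;
    }

    boolean isEffectivelyEnabled() {
        if (enabled != null) {
            return enabled; // explicit pin wins over the system property
        }
        // Assumption: an absent property is treated as "false" in this sketch.
        return Boolean.parseBoolean(System.getProperty("COSMOS.HTTP2_ENABLED", "false"));
    }
}
```

With `COSMOS.HTTP2_ENABLED=true` set JVM-wide, an unpinned config resolves to `true`, while `setEnabled(false)` keeps the client on HTTP/1.1 regardless of the property.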

Verified locally: 29/29 tests pass with -DCOSMOS.HTTP2_ENABLED=true
-DCOSMOS.THINCLIENT_ENABLED=true, which is the exact CI configuration
that previously reproduced the failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jeet1995 and others added 17 commits June 9, 2026 21:31
…robe gate

Adds an EndpointOrchestrator that fans out POST /connectivity-probe to
every thin-client regional endpoint after each topology refresh. SDK
only routes data-plane traffic to thin-client (Gateway V2) when all
regional probes succeed across N consecutive refresh cycles
(configurable via COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD, default 2);
otherwise traffic falls back to Gateway V1 at the next refresh
boundary. No mid-flight fallback.
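
A minimal sketch of the threshold gate described above, under stated assumptions: the gate starts out healthy (optimistic startup), flips to unhealthy after `failureThreshold` consecutive failed cycles, and recovers on the first all-green cycle. `SketchProbeGate` and its method names are hypothetical and are not the `EndpointOrchestrator` API.

```java
// Hypothetical gate; not the EndpointOrchestrator implementation.
final class SketchProbeGate {
    private final int failureThreshold;
    private int consecutiveFailedCycles;
    private boolean healthy = true; // assumption: optimistic until proven otherwise

    SketchProbeGate(int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
    }

    // Called once per topology refresh with the combined result of all regional probes.
    synchronized void onCycleCompleted(boolean allRegionalProbesSucceeded) {
        if (allRegionalProbesSucceeded) {
            consecutiveFailedCycles = 0;
            healthy = true; // assumption: a single green cycle restores routing
            return;
        }
        consecutiveFailedCycles++;
        if (consecutiveFailedCycles >= failureThreshold) {
            healthy = false; // routing falls back at the next refresh boundary
        }
    }

    synchronized boolean isHealthy() {
        return healthy;
    }
}
```

Since the gate only changes state in `onCycleCompleted`, an in-flight request never switches endpoints halfway through, which matches the "no mid-flight fallback" note.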

Caveats:
- Probe wiring is skipped entirely for Direct mode and when HTTP/2 is
  not configured; controlled by RxDocumentClientImpl.useThinClient.
- QueryPlan, metadata reads, and AllVersionsAndDeletes change feed
  continue to route through Compute Gateway (Gateway V1).
- Probe failures are absorbed inside the orchestrator and the trigger
  is fire-and-forget on the GEM scheduler, so probe issues can never
  trip CosmosClient initialization or fail a topology refresh.
- EndpointOrchestrator implements Closeable and is closed from
  GlobalEndpointManager.close() so no further probes are issued after
  client shutdown.

THINCLIENT_ENABLED now defaults to true; opt out via
COSMOS.THINCLIENT_ENABLED=false or COSMOS.THINCLIENT_PROBE_ENABLED=false.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	sdk/cosmos/azure-cosmos/CHANGELOG.md
…in lifecycle, cancellable in-flight probes

- EndpointOrchestrator: fold body-drain into probe Mono via flatMap+then(perProbeTimeout) so a slow/hanging response body cannot leak resources outside the cycle budget (Copilot #1, deep-review #3, jeet HIGH-2 minor).

- EndpointOrchestrator: add single-flight CAS (cycleInProgress) plus monotonic cycle id; closed-check inside applyCycleResult drops late results so a post-close cycle cannot mutate health state (deep-review #1+#2, jeet HIGH-2).

- EndpointOrchestrator: re-evaluate closed/feature-flag/endpoints at subscription time via Mono.defer so GEM.close() cancellation is honored before any HTTP I/O is issued.

- GlobalEndpointManager: retain probe Disposable in AtomicReference; close() now disposes the in-flight probe subscription so probe work cannot outlive the GEM/CosmosClient (Copilot #2, deep-review #2).

- CHANGELOG: moved entry to unreleased 4.82.0-beta.1, reworded to honestly describe optimistic startup, N=2 RED-to-fallback hysteresis, and Direct-mode/metadata exclusion (Copilot #3, deep-review #4).
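
The single-flight CAS and late-result handling from the second bullet can be sketched as follows. The field names `cycleInProgress` and `applyCycleResult` come from the commit message; everything else in `SketchSingleFlight`, including the `-1` sentinel, is a hypothetical simplification.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical simplification of the orchestrator's cycle bookkeeping.
final class SketchSingleFlight {
    private final AtomicBoolean cycleInProgress = new AtomicBoolean(false);
    private final AtomicLong cycleId = new AtomicLong();
    private volatile boolean closed;
    private volatile boolean healthy = true;

    // Returns the new cycle id, or -1 if closed or another cycle is already running.
    long tryStartCycle() {
        if (closed || !cycleInProgress.compareAndSet(false, true)) {
            return -1L;
        }
        return cycleId.incrementAndGet();
    }

    // Applies a cycle's result unless it is stale or the orchestrator was closed meanwhile.
    boolean applyCycleResult(long id, boolean allProbesSucceeded) {
        if (id != cycleId.get()) {
            return false; // not the current cycle
        }
        try {
            if (closed) {
                return false; // late result after close(): drop it
            }
            healthy = allProbesSucceeded;
            return true;
        } finally {
            cycleInProgress.set(false); // release the single-flight slot
        }
    }

    void close() {
        closed = true;
    }

    boolean isHealthy() {
        return healthy;
    }
}
```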

Tests: 45 unit tests pass (EndpointOrchestratorTests + ConfigsTests + ThinClientProbeWiringTests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- LocationCache.getThinClientRegionalEndpoints now walks both read and write region endpoint maps so single-master write-region failures still flip the probe gate.
- EndpointOrchestrator.forceUnhealthy(reason) provides a non-HTTP path to flip the gate; GlobalEndpointManager calls it when topology says thin-client is eligible but no regional endpoint resolves.
- Symmetric hysteresis: new COSMOS.THINCLIENT_PROBE_RECOVERY_THRESHOLD (default 1) so operators can require N consecutive GREEN cycles before flipping back to proxy.
- Extracted RxDocumentClientImpl.useThinClientStoreModel(...) body into package-private static shouldUseThinClientStoreModel for direct unit testability; added ThinClientRoutingGateTests covering 9 routing paths.
- EndpointOrchestratorTests.stubResponse now returns Mono.empty() to avoid Unpooled.EMPTY_BUFFER refCnt underflow across multiple probe calls.
- Removed unused locals; added recoveryThresholdRequiresMultipleGreenCycles, forceUnhealthy_flipsGateToRedWithoutRunningProbe, forceUnhealthy_onClosedOrchestrator_isNoOp tests.

All 57 unit tests in the touched files pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cleanup, fix gwV2Cto and ThinClient user-agent assertions

- GlobalEndpointManager: convert thin-client probe trigger to a Mono<Void>
  chained into the topology-refresh reactor pipeline (replaces fire-and-forget
  subscribe). Removes thinClientProbeDisposable field and its close() handling
  since cancellation now propagates through the outer subscription.
- EndpointProbeClient/EndpointProbeClientTests/ThinClientProbeWiringTests:
  replace inline FQNs with imports (java.io.Closeable, java.util.List,
  java.net.ConnectException, com.azure.cosmos.implementation.http.HttpHeaders).
- ClientConfigDiagnosticsTest: compute gwV2Cto dynamically from
  Configs.isThinClientEnabled() so assertions remain valid after the default
  flip to true.
- ConfigsTests: update default-threshold assertions from 2 to 1 to match
  DEFAULT_THINCLIENT_PROBE_FAILURE_THRESHOLD=1.
- UserAgentContainerTest.UserAgentIntegration: expect '|F4' suffix because
  the ThinClient feature flag (1 << 2) is now included by default.
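
To illustrate why the expected suffix becomes `|F4`, here is a small sketch of a bit-flag-to-hex suffix. Only the ThinClient bit `1 << 2` and the `|F<hex>` shape come from the commit messages; the class, the uppercase formatting, and the empty suffix for zero flags are assumptions.

```java
import java.util.Locale;

// Hypothetical helper; not UserAgentContainer's implementation.
final class SketchUserAgentFlags {
    static final int THIN_CLIENT = 1 << 2; // bit position stated in the commit message

    // Formats the combined feature flags as a "|F<hex>" suffix.
    static String suffix(int flags) {
        if (flags == 0) {
            return ""; // assumption: no suffix when no feature is enabled
        }
        return "|F" + Integer.toHexString(flags).toUpperCase(Locale.ROOT);
    }
}
```

With only the ThinClient bit set, the value is 4, so the suffix is `|F4`; any additional flag bits would be OR-ed into the same hex value.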

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
UserAgentSuffixTest.validateUserAgentSuffix and
CosmosDiagnosticsTest.generateHttp2OptedInUserAgentIfRequired:
include UserAgentFeatureFlags.ThinClient in computed |F<hex> suffix
when COSMOS.THINCLIENT_ENABLED is true (now default after Gateway V2
default enablement). Mirrors RxDocumentClientImpl.addUserAgentSuffix +
UserAgentContainer.setFeatureEnabledFlagsAsSuffix behavior.

SinglePartitionDocumentQueryTest.querySinglePartitionDocuments:
spy on both gateway-proxy and thin-proxy and assert exactly one
invocation. Previous code only spied on the proxy implied by
useThinClient() config intent, which races with the probe-healthy
gate: routing ANDs intent with isProxyProbeHealthy(), so on the first
cycle the request may go through the gateway even when thin-client is
configured.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test installs an iptables DROP on thin-client port 10250 to verify
that Http2PingHandler closes the broken connection after consecutive
PING ACK timeouts and the recovery request uses a new connection on
the same regional endpoint.

After default Gateway V2 enablement, the connectivity probe also fires
HTTP/2 POSTs to port 10250 on every account refresh. With iptables
dropping that port, the probe trips proxyHealthy=false,
useThinClientStoreModel() returns false, and the data plane request
routes through Gateway V1 on port 443, which iptables is not dropping.
Result: the PING handler never fires, the warm-up and recovery requests
use the same gateway channel, and the assertion 'recovery channel must
differ from initial' fails (both ended up as 77af2e47 on build 6419227).

Set COSMOS.THINCLIENT_PROBE_ENABLED=false in beforeClass so the probe
short-circuits to a no-op, EndpointProbeClient.proxyHealthy stays
optimistically true, and the data plane request actually flows over
port 10250 where the iptables DROP can take effect.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Build 6424287 surfaced two new failure patterns:

  1. CosmosNotFoundTests.performBulkOnDeletedContainerWithGatewayV2 (45
     failures) - asserts substatus 1003 from the thin-client routing
     path, but observed 0 because the data plane was routed to Gateway
     V1 instead of the proxy.
  2. PerPartitionCircuitBreakerE2ETests.*Gateway (26 failures) -
     TestSuiteBase.assertThinClientEndpointUsed could not find any
     request whose endpoint contained ':10250/', i.e. nothing actually
     went to the thin-client proxy.

Both patterns trace to the same source: the new connectivity probe is
enabled by default, the proxy-side /connectivity-probe endpoint is not
deployed in every CI test account yet, and the default failure
threshold is 1. So after the first probe cycle the SDK marks the proxy
unhealthy and routes data plane traffic to Gateway V1, which breaks
tests that explicitly assert thin-client routing.

Disable the probe by default in TestSuiteBase's static initializer
(only when the property is not already set), so all E2E tests inherit
deterministic, configuration-driven routing. Tests that exercise the
probe itself (EndpointProbeClientTests, ThinClientProbeWiringTests)
set the property explicitly in @BeforeMethod and are not affected.

Also drop the now-redundant per-class override in Http2PingKeepaliveTest:
the base class disables it, and the test's @AfterClass clear would
otherwise re-enable the probe for any subsequent E2E test sharing the
JVM.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…PartitionCircuitBreakerE2ETests

Companion to the prior revert. The revert undid the global TestSuiteBase probe
disable (which masked production behaviour). This commit adds the necessary
per-class disable to the two test classes whose assertions explicitly require
thin-client routing: CosmosNotFoundTests (thinclient group) and
PerPartitionCircuitBreakerE2ETests (fi-thinclient-multi-master group). Both
clear the property in their @AfterClass. Http2PingKeepaliveTest already has
its own disable (restored by the revert). Production callers continue to get
the connectivity probe ON by default with the production failure threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… into AzCosmos_GatewayV2_QueryPlanSupport

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdConstants.java
Covers 5 new scenarios in ThinClientRoutingGateTests:
- ExecuteStoredProcedure on a StoredProcedure resource routes to thin client
- Non-execute StoredProcedure ops (Create) route to Gateway V1
- OperationType.QueryPlan routes to thin client
- QueryPlan returns false when probe is unhealthy
- ExecuteStoredProcedure returns false when probe is unhealthy
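
The five scenarios above can be read as a small decision function. The sketch below is a hypothetical reduction of `shouldUseThinClientStoreModel`: the enums are trimmed stand-ins, and routing Document operations to the thin client is an assumption added only to make the function total.

```java
// Trimmed, hypothetical stand-ins for the SDK's ResourceType / OperationType.
enum SketchResourceType { Document, StoredProcedure }
enum SketchOperationType { Read, Create, QueryPlan, ExecuteJavaScript }

final class SketchRoutingGate {
    static boolean shouldUseThinClient(boolean thinClientConfigured,
                                       boolean probeHealthy,
                                       SketchResourceType resourceType,
                                       SketchOperationType operationType) {
        // Config intent is AND-ed with probe health; either one being false
        // sends the request to Gateway V1.
        if (!thinClientConfigured || !probeHealthy) {
            return false;
        }
        if (operationType == SketchOperationType.QueryPlan) {
            return true;
        }
        if (resourceType == SketchResourceType.StoredProcedure) {
            // Only execution goes to the thin client; Create and other
            // stored procedure management operations stay on Gateway V1.
            return operationType == SketchOperationType.ExecuteJavaScript;
        }
        // Assumption for this sketch: document operations are thin-client eligible.
        return resourceType == SketchResourceType.Document;
    }
}
```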

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reverts the merge of jeet1995/thin-client-probe-flow that was brought
into this PR earlier, while keeping the branch up-to-date with
upstream/main. Net diff vs upstream/main is now only the QueryPlan and
StoredProcedure Gateway V2 routing changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
No-content merge to mark upstream/main as in this branch's history so
GitHub computes the PR diff against current upstream rather than an old
merge-base. Tree is unchanged from the previous commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 marked this pull request as ready for review June 12, 2026 22:36
Copilot AI review requested due to automatic review settings June 12, 2026 22:36
@jeet1995 jeet1995 requested review from a team and kirankumarkolli as code owners June 12, 2026 22:36
@jeet1995 jeet1995 changed the title from "[Gateway V2]: Integrate Stored Procedure and Query Plan request routing to Gateway V2 endpoint" to "[WIP]: Integrate Stored Procedure and Query Plan request routing to Gateway V2 endpoint" Jun 12, 2026
@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Contributor


Pull request overview

This PR integrates QueryPlan and ExecuteJavaScript (stored procedure) request routing into the Gateway V2 thin client proxy path, adds RNTBD header tokens needed by the proxy for query feature negotiation, and updates query-plan handling to support the proxy’s PartitionKeyInternal-form queryRanges by converting them client-side into sorted EPK ranges. It also adds a substantial thin-client E2E test suite and supporting test infrastructure.

Changes:

  • Route OperationType.QueryPlan and stored procedure execution (ResourceType.StoredProcedure + OperationType.ExecuteJavaScript) through thin client mode when enabled.
  • Support proxy query plan responses returning queryRanges as PartitionKeyInternal JSON arrays, converting them into sorted EPK hex ranges client-side.
  • Add RNTBD tokens and header mappings (SupportedQueryFeatures, QueryVersion) required for the thin client proxy to interpret query plan capability negotiation; expand thin-client E2E tests and test setup (including optional AAD auth).

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ThinClientStoreModel.java | Allow QueryPlan requests without a resolved PK range when wrapping into RNTBD HTTP request. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentServiceRequest.java | Add helper to identify stored procedure execution requests. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java | Expand thin-client routing decisions to include QueryPlan and stored procedure execution; expose useThinClient(request) via IDocumentQueryClient. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/PartitionKeyInternalHelper.java | Add convertToSortedEpkRanges to translate proxy PK-internal ranges to sorted EPK string ranges. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/QueryPlanRetriever.java | Pass DocumentCollection into query plan retrieval and construct PartitionedQueryExecutionInfo with PK definition in thin-client mode. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/QueryInfo.java | Handle thin-client responses where empty strings are used instead of null in a map field. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/PartitionedQueryExecutionInfo.java | Support dual queryRanges wire formats (EPK strings vs PK-internal arrays) and convert on demand. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/IDocumentQueryClient.java | Add useThinClient(RxDocumentServiceRequest) capability query for QueryPlanRetriever. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/HybridSearchDocumentQueryExecutionContext.java | Capture diagnostics request statistics for hybrid search aggregation flows. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/DocumentQueryExecutionContextFactory.java | Plumb DocumentCollection through to query plan retrieval. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/JsonSerializable.java | Add helper to coerce empty-string map values to null when deserializing. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequestHeaders.java | Map HTTP headers to new RNTBD tokens for query plan negotiation (SupportedQueryFeatures, QueryVersion). |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequestFrame.java | Map OperationType.QueryPlan to a corresponding RNTBD operation type. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdConstants.java | Add RNTBD operation type and request header token definitions for QueryPlan negotiation. |
| sdk/cosmos/azure-cosmos-tests/THINCLIENT_TEST_MATRIX.md | Document thin-client E2E test coverage matrix for QueryPlan support. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/ThinClientTestBase.java | Introduce shared thin-client E2E setup and endpoint assertion helpers. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/ThinClientStoredProcedureE2ETest.java | Add thin-client E2E tests for stored procedure execution. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/ThinClientQueryE2ETest.java | Add broad direct-vs-thin-client query parity E2E tests (including vector/full-text/hybrid). |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/ThinClientPointOperationE2ETest.java | Add thin-client E2E tests for CRUD/Patch, bulk, and batch operations. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/ThinClientChangeFeedE2ETest.java | Add thin-client E2E tests for change feed across ranges and partitions. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java | Add optional AAD auth plumbing and a thin-client builder data provider; apply credential consistently. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ThinClientE2ETest.java | Remove older thin client E2E sanity test in favor of the new test suite structure. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/QueryPlanRetrieverSupportedFeaturesTest.java | Add unit test validating supported query features string includes expected flags. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/PartitionKeyInternalTest.java | Add unit tests for PK-internal-to-EPK range conversion. |
| sdk/cosmos/azure-cosmos-tests/pom.xml | Ensure thin-client test profile sets required system properties (thin client + HTTP/2) and supports AAD auth toggle. |

Comment on lines 16 to +18
import com.azure.cosmos.implementation.routing.PartitionKeyInternal;
import com.azure.cosmos.implementation.routing.PartitionKeyInternalHelper;
import com.azure.cosmos.implementation.routing.Range;
Comment on lines +318 to +322
* @param queryRangesProperty the name of the JSON property containing the ranges array
* @param queryPlanJson the raw query plan JSON containing PartitionKeyInternal ranges
* @param partitionKeyDefinition the container's partition key definition
* @return sorted list of EPK hex string ranges; empty list if the property is absent
*/
Comment on lines +60 to 65
PartitionedQueryExecutionInfo(ObjectNode content, RequestTimeline queryPlanRequestTimeline, PartitionKeyDefinition partitionKeyDefinition) {
super(content);
this.queryPlanRequestTimeline = queryPlanRequestTimeline;
this.expectedQueryRangesFormat = QueryRangesFormat.PARTITION_KEY_INTERNAL_ARRAY;
this.partitionKeyDefinition = partitionKeyDefinition;
}
@jeet1995

Member Author

@sdkReviewAgent


5 participants