[WIP] Default thin-client to enabled and add proxy connectivity-probe gate#49437
[WIP] Default thin-client to enabled and add proxy connectivity-probe gate#49437jeet1995 wants to merge 55 commits into
Conversation
… QueryPlan proxy routing Add RNTBD token mappings for x-ms-cosmos-supported-query-features (0x002B) and x-ms-cosmos-query-version (0x002C) so the thin client proxy can read these values from the RNTBD body when processing QueryPlan requests. Previously these headers were only set as HTTP headers by QueryPlanRetriever and were lost when QueryPlan was routed through the proxy path, since ThinClientStoreModel serializes requests as RNTBD (not HTTP headers). IDs match server-side proxy definitions per ADO PR 1982503. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add testThinClientChangeFeedFullRange covering FeedRange.forFullRange() across multiple partition keys, and testThinClientChangeFeedPartitionKey covering FeedRange.forLogicalPartition with exact doc count + PK validation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents all 59 thin client E2E tests across query (50), point operations (3), change feed (3), and stored procedures (3) with SQL, query features covered, and known account-side blockers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… QueryPlan proxy routing Add RNTBD token mappings for x-ms-cosmos-supported-query-features (0x00F0) and x-ms-cosmos-query-version (0x00F1) so the thin client proxy can read these values from the RNTBD body when processing QueryPlan requests. IDs are provisional (0x00F0, 0x00F1) — must be coordinated with server-side proxy team. See ADO PR 1982503 for the proxy-side design. Note: The design doc listed 0x002B/0x002C but those are already assigned to PartitionKey/PartitionKeyRangeId in the Java SDK. Using 0x00F0/0x00F1 to avoid ID collision until final server-side IDs are assigned. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…BD instructions - Fix testGetCurrentDateTime: assert ISO 8601 format instead of exact match (gateway and proxy return slightly different timestamps) - Add DefaultAzureCredential support via COSMOS.USE_AAD_AUTH system property for accounts with disableLocalAuth=true - Add RNTBD class reference as .github/instructions/rntbd.instructions.md - Add pom.xml system properties for THINCLIENT_ENABLED, HTTP2_ENABLED, USE_AAD_AUTH - Add beforeSuiteReuse mode for degraded accounts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Switch baseline from Gateway V1 to Direct TCP to avoid JVM config interference (THINCLIENT_ENABLED/HTTP2_ENABLED affect Gateway V1) - Assert :10250 endpoint only on Gateway V2 results (not baseline) - Rename helpers: assertDirectAndThinClientMatch (was gateway) - Document seedTestData schema in Javadoc - Remove 'Expected to fail' comments (account has vector search enabled) - Clean up class/method Javadoc Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- LocationCache.getThinClientRegionalEndpoints now walks both read and write region endpoint maps so single-master write-region failures still flip the probe gate. - EndpointOrchestrator.forceUnhealthy(reason) provides a non-HTTP path to flip the gate; GlobalEndpointManager calls it when topology says thin-client is eligible but no regional endpoint resolves. - Symmetric hysteresis: new COSMOS.THINCLIENT_PROBE_RECOVERY_THRESHOLD (default 1) so operators can require N consecutive GREEN cycles before flipping back to proxy. - Extracted RxDocumentClientImpl.useThinClientStoreModel(...) body into package-private static shouldUseThinClientStoreModel for direct unit testability; added ThinClientRoutingGateTests covering 9 routing paths. - EndpointOrchestratorTests.stubResponse now returns Mono.empty() to avoid Unpooled.EMPTY_BUFFER refCnt underflow across multiple probe calls. - Removed unused locals; added recoveryThresholdRequiresMultipleGreenCycles, forceUnhealthy_flipsGateToRedWithoutRunningProbe, forceUnhealthy_onClosedOrchestrator_isNoOp tests. All 57 unit tests in the touched files pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Second batch of fixes pushed in 3f1b1be. Per-comment mapping:
Plus three new EndpointOrchestratorTests: recoveryThresholdRequiresMultipleGreenCycles, forceUnhealthy_flipsGateToRedWithoutRunningProbe, forceUnhealthy_onClosedOrchestrator_isNoOp. Validated: |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Third batch addressed in 66fca70 — quick summary:
|
…cleanup, fix gwV2Cto and ThinClient user-agent assertions - GlobalEndpointManager: convert thin-client probe trigger to a Mono<Void> chained into the topology-refresh reactor pipeline (replaces fire-and-forget subscribe). Removes thinClientProbeDisposable field and its close() handling since cancellation now propagates through the outer subscription. - EndpointProbeClient/EndpointProbeClientTests/ThinClientProbeWiringTests: replace inline FQNs with imports (java.io.Closeable, java.util.List, java.net.ConnectException, com.azure.cosmos.implementation.http.HttpHeaders). - ClientConfigDiagnosticsTest: compute gwV2Cto dynamically from Configs.isThinClientEnabled() so assertions remain valid after the default flip to true. - ConfigsTests: update default-threshold assertions from 2 to 1 to match DEFAULT_THINCLIENT_PROBE_FAILURE_THRESHOLD=1. - UserAgentContainerTest.UserAgentIntegration: expect '|F4' suffix because the ThinClient feature flag (1 << 2) is now included by default. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pushed Round 4 / earlier feedback (re-confirmed):
Round 5 (new): Round 6 (new): Round 7 (new):
Verification:
|
|
/azp run java - cosmos - tests |
|
/azp run java - cosmos - spark |
|
/azp run java - cosmos - kafka |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
UserAgentSuffixTest.validateUserAgentSuffix and CosmosDiagnosticsTest.generateHttp2OptedInUserAgentIfRequired: include UserAgentFeatureFlags.ThinClient in computed |F<hex> suffix when COSMOS.THINCLIENT_ENABLED is true (now default after Gateway V2 default enablement). Mirrors RxDocumentClientImpl.addUserAgentSuffix + UserAgentContainer.setFeatureEnabledFlagsAsSuffix behavior. SinglePartitionDocumentQueryTest.querySinglePartitionDocuments: spy on both gateway-proxy and thin-proxy and assert exactly one invocation. Previous code only spied on the proxy implied by useThinClient() config intent, which races with the probe-healthy gate -- routing AND's intent with isProxyProbeHealthy() so on first cycle the request may go through gateway even when thin-client is configured. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Build 6419227 triage + fixesPushed commit Distinct failure patterns identified
Fixes in
|
The test installs an iptables DROP on thin-client port 10250 to verify that Http2PingHandler closes the broken connection after consecutive PING ACK timeouts and the recovery request uses a new connection on the same regional endpoint. After default Gateway V2 enablement, the connectivity probe also fires HTTP/2 POSTs to port 10250 on every account refresh. With iptables dropping that port, the probe trips proxyHealthy=false, useThinClient StoreModel() returns false, and the data plane request routes through Gateway V1 on port 443 -- which iptables is not dropping. Result: the PING handler never fires, the warm-up and recovery requests use the same gateway channel, and the assertion 'recovery channel must differ from initial' fails (both ended up as 77af2e47 on build 6419227). Set COSMOS.THINCLIENT_PROBE_ENABLED=false in beforeClass so the probe short-circuits to a no-op, EndpointProbeClient.proxyHealthy stays optimistically true, and the data plane request actually flows over port 10250 where the iptables DROP can take effect. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow-up: Http2PingKeepaliveTest fix in
|
|
/azp run java - cosmos - tests |
|
/azp run java - cosmos - spark |
|
/azp run java - cosmos - kafka |
|
Azure Pipelines successfully started running 1 pipeline(s). |
2 similar comments
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Build 6424287 surfaced two new failure patterns:
1. CosmosNotFoundTests.performBulkOnDeletedContainerWithGatewayV2 (45
failures) - asserts substatus 1003 from the thin-client routing
path, but observed 0 because the data plane was routed to Gateway
V1 instead of the proxy.
2. PerPartitionCircuitBreakerE2ETests.*Gateway (26 failures) -
TestSuiteBase.assertThinClientEndpointUsed could not find any
request whose endpoint contained ':10250/', i.e. nothing actually
went to the thin-client proxy.
Both patterns trace to the same source: the new connectivity probe is
enabled by default, the proxy-side /connectivity-probe endpoint is not
deployed in every CI test account yet, and the default failure
threshold is 1. So after the first probe cycle the SDK marks the proxy
unhealthy and routes data plane traffic to Gateway V1, which breaks
tests that explicitly assert thin-client routing.
Disable the probe by default in TestSuiteBase's static initializer
(only when the property is not already set), so all E2E tests inherit
deterministic, configuration-driven routing. Tests that exercise the
probe itself (EndpointProbeClientTests, ThinClientProbeWiringTests)
set the property explicitly in @BeforeMethod and are not affected.
Also drop the now-redundant per-class override in Http2PingKeepaliveTest
- the base class disables it, and the test's @afterclass clear would
otherwise re-enable the probe for any subsequent E2E test sharing the
JVM.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Build 6424287 went from 49+/2 distinct test failures down to two new test-only failure patterns, both rooted in the same cause — the probe gate (enabled by default, threshold=1) is flipping data-plane routing from the thin-client proxy to Gateway V1 in test accounts whose proxy has not yet deployed the /connectivity-probe endpoint. New failures (both fixed in ad9e3df):
Fix: Disable the probe by default in No production impact — these are test-environment changes. The probe still defaults to Remaining single-shot failures (likely environmental, not probe-related):
Commit: ad9e3df |
…teBase" This reverts commit ad9e3df.
…PartitionCircuitBreakerE2ETests Companion to the prior revert. The revert undid the global TestSuiteBase probe disable (which masked production behaviour). This commit adds the necessary per-class disable to the two test classes whose assertions explicitly require thinclient routing: CosmosNotFoundTests (thinclient group) and PerPartitionCircuitBreakerE2ETests (fi-thinclient-multi-master group). Both clear the property in their @afterclass. Http2PingKeepaliveTest already has its own disable (restored by the revert). Production callers continue to get the connectivity probe ON by default with the production failure threshold. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update on the CI-failure fix for build 6424287My earlier comment proposed disabling the connectivity probe globally in What I just pushed (commits
Each This keeps the rest of the suite honest about default behaviour while preventing the three classes that need a deterministic proxy route from failing while the proxy-side |
… into AzCosmos_GatewayV2_QueryPlanSupport # Conflicts: # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdConstants.java
Covers 5 new scenarios in ThinClientRoutingGateTests: - ExecuteStoredProcedure on a StoredProcedure resource routes to thin client - Non-execute StoredProcedure ops (Create) route to Gateway V1 - OperationType.QueryPlan routes to thin client - QueryPlan returns false when probe is unhealthy - ExecuteStoredProcedure returns false when probe is unhealthy Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds an
EndpointOrchestratorthat fans outPOST /connectivity-probeto every thin-client regional endpoint after each topology refresh. The SDK only routes data-plane traffic through thin-client (Gateway V2) when all regional probes return HTTP 200 across N consecutive refresh cycles; otherwise traffic falls back to Gateway V1 at the next refresh boundary. No mid-flight fallback.COSMOS.THINCLIENT_ENABLEDnow defaults totrue. The new probe gate makes that safe by closing thin-client routing automatically if the proxy fleet is unreachable.Gating caveats
Http2ConnectionConfigis configured and effectively enabled.useThinClientStoreModelpredicate.CosmosClientinitialization or fail a topology refresh.EndpointOrchestratorimplementsCloseableand is closed fromGlobalEndpointManager.close(); no further probes are issued after client shutdown.Configuration
COSMOS.THINCLIENT_ENABLEDtrue(wasfalse)COSMOS.THINCLIENT_PROBE_ENABLEDtrueCOSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD2COSMOS.THINCLIENT_PROBE_PATH/connectivity-probeTests
EndpointOrchestrator(hysteresis, RED/GREEN flips, no-op gates).Configstests for the new properties (parse, fallback, invalid input).ThinClientProbeWiringTestsfor GEM integration (probe fires on refresh, healthy default, threshold flip, region discovery viaLocationCache).mvn -Punitonazure-cosmos-tests.ThinClientE2ETestcontinues to pass against a live multi-region thin-client account.Changelog
Single entry added under
4.81.0-beta.1-> Other Changes.