[WIP] Default thin-client to enabled and add proxy connectivity-probe gate #49437

Open
jeet1995 wants to merge 55 commits into Azure:main from jeet1995:jeet1995/thin-client-probe-flow

Conversation

@jeet1995

Member

Summary

Adds an EndpointOrchestrator that fans out POST /connectivity-probe to every thin-client regional endpoint after each topology refresh. The SDK only routes data-plane traffic through thin-client (Gateway V2) when all regional probes return HTTP 200 across N consecutive refresh cycles; otherwise traffic falls back to Gateway V1 at the next refresh boundary. No mid-flight fallback.

COSMOS.THINCLIENT_ENABLED now defaults to true. The new probe gate makes that safe by closing thin-client routing automatically if the proxy fleet is unreachable.
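The per-cycle rule described above (a cycle only counts as healthy when every regional probe returns HTTP 200) can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the class name `ProbeCycleSketch`, the `CycleResult` enum, and the map-of-status-codes input are all hypothetical, and treating an empty endpoint set as RED is an assumption.

```java
import java.util.Map;

// Hypothetical sketch: classify one probe cycle from the per-endpoint HTTP status codes.
final class ProbeCycleSketch {
    enum CycleResult { GREEN, RED }

    // statusByEndpoint maps a regional endpoint (any string key) to the HTTP status its probe returned.
    static CycleResult evaluate(Map<String, Integer> statusByEndpoint) {
        // Assumption: with nothing to probe there is no evidence the proxy is reachable.
        if (statusByEndpoint.isEmpty()) {
            return CycleResult.RED;
        }
        for (int status : statusByEndpoint.values()) {
            // A single non-200 answer makes the whole cycle RED.
            if (status != 200) {
                return CycleResult.RED;
            }
        }
        return CycleResult.GREEN;
    }
}
```

How many consecutive RED or GREEN cycles are needed before routing flips is a separate concern, governed by the threshold properties.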

Gating caveats

  • Direct mode: probe is not wired at all.
  • HTTP/2 required: probe is not wired unless Http2ConnectionConfig is configured and effectively enabled.
  • Metadata / QueryPlan / AllVersionsAndDeletes: continue to route through Compute Gateway (Gateway V1) via the existing useThinClientStoreModel predicate.
  • Init-safe: probe wiring + trigger are guarded with try/catch and fire-and-forget, so a probe issue can never trip CosmosClient initialization or fail a topology refresh.
  • Close-safe: EndpointOrchestrator implements Closeable and is closed from GlobalEndpointManager.close(); no further probes are issued after client shutdown.

Configuration

| System property | Default | Notes |
| --- | --- | --- |
| COSMOS.THINCLIENT_ENABLED | true (was false) | Master opt-out. |
| COSMOS.THINCLIENT_PROBE_ENABLED | true | Per-cycle bypass; orchestrator stays optimistic when off. |
| COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD | 2 | Consecutive RED cycles before flipping unhealthy. |
| COSMOS.THINCLIENT_PROBE_PATH | /connectivity-probe | |
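For readers who want a feel for how such properties are typically read, here is a minimal sketch of system-property parsing with a default and a fallback on invalid input. It does not reproduce the SDK's `Configs` class: `ProbeConfigSketch` and both helper methods are hypothetical, and the choice to fall back to the default for non-numeric or non-positive values is an assumption.

```java
// Hypothetical sketch of default-plus-fallback parsing for the properties listed above.
final class ProbeConfigSketch {

    // Returns the default when the property is unset or blank; otherwise parses it as a boolean.
    static boolean readBoolean(String name, boolean defaultValue) {
        String raw = System.getProperty(name);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        // Boolean.parseBoolean yields false for anything other than "true" (case-insensitive).
        return Boolean.parseBoolean(raw.trim());
    }

    // Returns the default when the property is unset, not a number, or not positive (assumption).
    static int readPositiveInt(String name, int defaultValue) {
        String raw = System.getProperty(name);
        if (raw == null) {
            return defaultValue;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : defaultValue;
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
```

A caller would pass the documented default explicitly, for example `ProbeConfigSketch.readPositiveInt("COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD", 2)`.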

Tests

  • 8 unit tests for EndpointOrchestrator (hysteresis, RED/GREEN flips, no-op gates).
  • 9 Configs tests for the new properties (parse, fallback, invalid input).
  • 5 new ThinClientProbeWiringTests for GEM integration (probe fires on refresh, healthy default, threshold flip, region discovery via LocationCache).
  • All 44 unit tests pass with mvn -Punit on azure-cosmos-tests.
  • Existing ThinClientE2ETest continues to pass against a live multi-region thin-client account.

Changelog

Single entry added under 4.81.0-beta.1 -> Other Changes.

jeet1995 and others added 30 commits January 20, 2026 18:20
… QueryPlan proxy routing

Add RNTBD token mappings for x-ms-cosmos-supported-query-features (0x002B)
and x-ms-cosmos-query-version (0x002C) so the thin client proxy can read
these values from the RNTBD body when processing QueryPlan requests.

Previously these headers were only set as HTTP headers by QueryPlanRetriever
and were lost when QueryPlan was routed through the proxy path, since
ThinClientStoreModel serializes requests as RNTBD (not HTTP headers).

IDs match server-side proxy definitions per ADO PR 1982503.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add testThinClientChangeFeedFullRange covering FeedRange.forFullRange()
across multiple partition keys, and testThinClientChangeFeedPartitionKey
covering FeedRange.forLogicalPartition with exact doc count + PK validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents all 59 thin client E2E tests across query (50), point operations (3),
change feed (3), and stored procedures (3) with SQL, query features covered,
and known account-side blockers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… QueryPlan proxy routing

Add RNTBD token mappings for x-ms-cosmos-supported-query-features (0x00F0)
and x-ms-cosmos-query-version (0x00F1) so the thin client proxy can read
these values from the RNTBD body when processing QueryPlan requests.

IDs are provisional (0x00F0, 0x00F1) — must be coordinated with server-side
proxy team. See ADO PR 1982503 for the proxy-side design.

Note: The design doc listed 0x002B/0x002C but those are already assigned to
PartitionKey/PartitionKeyRangeId in the Java SDK. Using 0x00F0/0x00F1 to
avoid ID collision until final server-side IDs are assigned.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…BD instructions

- Fix testGetCurrentDateTime: assert ISO 8601 format instead of exact match
  (gateway and proxy return slightly different timestamps)
- Add DefaultAzureCredential support via COSMOS.USE_AAD_AUTH system property
  for accounts with disableLocalAuth=true
- Add RNTBD class reference as .github/instructions/rntbd.instructions.md
- Add pom.xml system properties for THINCLIENT_ENABLED, HTTP2_ENABLED, USE_AAD_AUTH
- Add beforeSuiteReuse mode for degraded accounts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Switch baseline from Gateway V1 to Direct TCP to avoid JVM config
  interference (THINCLIENT_ENABLED/HTTP2_ENABLED affect Gateway V1)
- Assert :10250 endpoint only on Gateway V2 results (not baseline)
- Rename helpers: assertDirectAndThinClientMatch (was gateway)
- Document seedTestData schema in Javadoc
- Remove 'Expected to fail' comments (account has vector search enabled)
- Clean up class/method Javadoc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- LocationCache.getThinClientRegionalEndpoints now walks both read and write region endpoint maps so single-master write-region failures still flip the probe gate.
- EndpointOrchestrator.forceUnhealthy(reason) provides a non-HTTP path to flip the gate; GlobalEndpointManager calls it when topology says thin-client is eligible but no regional endpoint resolves.
- Symmetric hysteresis: new COSMOS.THINCLIENT_PROBE_RECOVERY_THRESHOLD (default 1) so operators can require N consecutive GREEN cycles before flipping back to proxy.
- Extracted RxDocumentClientImpl.useThinClientStoreModel(...) body into package-private static shouldUseThinClientStoreModel for direct unit testability; added ThinClientRoutingGateTests covering 9 routing paths.
- EndpointOrchestratorTests.stubResponse now returns Mono.empty() to avoid Unpooled.EMPTY_BUFFER refCnt underflow across multiple probe calls.
- Removed unused locals; added recoveryThresholdRequiresMultipleGreenCycles, forceUnhealthy_flipsGateToRedWithoutRunningProbe, forceUnhealthy_onClosedOrchestrator_isNoOp tests.

All 57 unit tests in the touched files pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Second batch of fixes pushed in 3f1b1be. Per-comment mapping:

| Reviewer | Comment | Fix |
| --- | --- | --- |
| @xinlian12 | LocationCache.java:150 - getThinClientRegionalEndpoints only walked read regions, so single-master write-region failures wouldn't flip the probe gate | Fix A: collectThinClientEndpoints helper now walks both read+write maps |
| @xinlian12 | GlobalEndpointManager.java:463 - silent return when thin-client eligible but no endpoint resolves | Fix B: new EndpointOrchestrator.forceUnhealthy(reason) public method; GEM calls it instead of bailing |
| @xinlian12 | EndpointOrchestrator.java:299 - asymmetric recovery (single GREEN flips back after N REDs) | Fix C: new COSMOS.THINCLIENT_PROBE_RECOVERY_THRESHOLD (default 1, symmetric with failure threshold) |
| @xinlian12 | EndpointOrchestratorTests.java:239 - stubResponse used Unpooled.EMPTY_BUFFER singleton causing refCnt underflow across cycles | Fix D: body now returns Mono.empty() |
| @xinlian12 | RxDocumentClientImpl.java:9006 - routing-gate logic untestable from outside | Fix F: extracted package-private static shouldUseThinClientStoreModel(boolean,boolean,boolean,RxDocumentServiceRequest); added ThinClientRoutingGateTests with 9 tests covering all-true, probe-unhealthy fallback, flag-off, no-read-locations, non-Document, query, batch, AllVersionsAndDeletes CF (→ gateway), incremental CF (→ proxy) |
| Copilot | EndpointOrchestratorTests.java:112 - unused locals | Fix E: removed greenByEndpoint/greenOrchestrator/redOnly |

Plus three new EndpointOrchestratorTests: recoveryThresholdRequiresMultipleGreenCycles, forceUnhealthy_flipsGateToRedWithoutRunningProbe, forceUnhealthy_onClosedOrchestrator_isNoOp.

Validated: mvn -Punit verify against the four touched test classes — 57/57 pass.
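The symmetric hysteresis and the non-HTTP `forceUnhealthy` path from this batch can be illustrated with a small state machine. This is a sketch under assumptions, not the orchestrator's code: the class `HysteresisGateSketch` and its method shapes are hypothetical, the gate starts optimistic (healthy), and a closed gate ignores further input, mirroring the test name `forceUnhealthy_onClosedOrchestrator_isNoOp`.

```java
// Hypothetical sketch of a health gate with separate failure and recovery thresholds.
final class HysteresisGateSketch {
    private final int failureThreshold;   // consecutive RED cycles needed to flip to unhealthy
    private final int recoveryThreshold;  // consecutive GREEN cycles needed to flip back
    private boolean healthy = true;       // optimistic until probes say otherwise
    private int consecutiveRed;
    private int consecutiveGreen;
    private boolean closed;

    HysteresisGateSketch(int failureThreshold, int recoveryThreshold) {
        this.failureThreshold = failureThreshold;
        this.recoveryThreshold = recoveryThreshold;
    }

    // Record the outcome of one probe cycle and flip the gate if a threshold is reached.
    synchronized void onCycle(boolean green) {
        if (closed) {
            return;
        }
        if (green) {
            consecutiveRed = 0;
            consecutiveGreen++;
            if (!healthy && consecutiveGreen >= recoveryThreshold) {
                healthy = true;
            }
        } else {
            consecutiveGreen = 0;
            consecutiveRed++;
            if (healthy && consecutiveRed >= failureThreshold) {
                healthy = false;
            }
        }
    }

    // Flip to unhealthy without running a probe, e.g. when no regional endpoint resolves.
    synchronized void forceUnhealthy(String reason) {
        if (closed) {
            return;
        }
        healthy = false;
        consecutiveGreen = 0;
    }

    synchronized void close() {
        closed = true;
    }

    synchronized boolean isHealthy() {
        return healthy;
    }
}
```

With `new HysteresisGateSketch(2, 2)`, two RED cycles in a row are needed to close the gate and two GREEN cycles in a row to reopen it; a recovery threshold of 1 reopens it on the first GREEN cycle.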

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Third batch addressed in 66fca70 — quick summary:

| Comment | File | Status |
| --- | --- | --- |
| 3385316064 | EndpointOrchestrator.java body-drain | Already addressed in 3f1b1be. Current code (lines 258-271) drains via response.body().doOnNext(...).then(Mono.just(result)).timeout(...).doFinally(...).onErrorResume(...), fully chained into the returned Mono with no dangling .subscribe(). The suggested .flatMap(b -> b.ignoreElement().thenReturn(result)) is functionally equivalent; sticking with the current shape to minimize churn. |
| 3385316067 | CHANGELOG wording | Fixed in 66fca70. Clarified that the recovery threshold is configurable (COSMOS.THINCLIENT_PROBE_RECOVERY_THRESHOLD, default 1) and pointed out the symmetric-hysteresis tuning knob. |
| 3385316069 | GlobalEndpointManager probe Disposable | Already addressed in 3f1b1be. Disposable is stored in thinClientProbeDisposable (AtomicReference<Disposable>, line 54), swapped with prior-disposable cleanup on every trigger (line 490), and disposed in close() (line 206) so it cannot outlive the client. |

…cleanup, fix gwV2Cto and ThinClient user-agent assertions

- GlobalEndpointManager: convert thin-client probe trigger to a Mono<Void>
  chained into the topology-refresh reactor pipeline (replaces fire-and-forget
  subscribe). Removes thinClientProbeDisposable field and its close() handling
  since cancellation now propagates through the outer subscription.
- EndpointProbeClient/EndpointProbeClientTests/ThinClientProbeWiringTests:
  replace inline FQNs with imports (java.io.Closeable, java.util.List,
  java.net.ConnectException, com.azure.cosmos.implementation.http.HttpHeaders).
- ClientConfigDiagnosticsTest: compute gwV2Cto dynamically from
  Configs.isThinClientEnabled() so assertions remain valid after the default
  flip to true.
- ConfigsTests: update default-threshold assertions from 2 to 1 to match
  DEFAULT_THINCLIENT_PROBE_FAILURE_THRESHOLD=1.
- UserAgentContainerTest.UserAgentIntegration: expect '|F4' suffix because
  the ThinClient feature flag (1 << 2) is now included by default.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Pushed 8a602fbc04b addressing the latest review batch:

Round 4 / earlier feedback (re-confirmed):

  1. ✅ CHANGELOG: concise, focused on default Gateway V2 enablement.
  2. ✅ LocationCache: ALL-or-NOTHING — probe only when every thin-client endpoint resolves; partial maps flip the gate red.
  3. DEFAULT_THINCLIENT_PROBE_FAILURE_THRESHOLD = 1.
  4. ✅ Renamed EndpointOrchestrator → EndpointProbeClient; lean DiagnosticsSnapshot (last-state + lastUpdatedAt only).
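The all-or-nothing rule in item 2, combined with the earlier change to walk both read and write regions, might look roughly like the sketch below. The data shape is an assumption made for illustration: `ThinClientEndpointCollectorSketch` is hypothetical, regions are modelled as plain `Map<String, String>` from region name to thin-client endpoint, and a null or empty value stands for an endpoint that did not resolve.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: gather thin-client endpoints from read and write regions, all-or-nothing.
final class ThinClientEndpointCollectorSketch {

    // Returns the distinct endpoints, or Optional.empty() when any region lacks one (or none exist).
    static Optional<Set<String>> collect(Map<String, String> readRegions,
                                         Map<String, String> writeRegions) {
        Set<String> endpoints = new LinkedHashSet<>();
        for (Map<String, String> regions : List.of(readRegions, writeRegions)) {
            for (String endpoint : regions.values()) {
                // A partial map means the probe cannot cover every region, so give up entirely.
                if (endpoint == null || endpoint.isEmpty()) {
                    return Optional.empty();
                }
                endpoints.add(endpoint);
            }
        }
        return endpoints.isEmpty() ? Optional.empty() : Optional.of(endpoints);
    }
}
```

In this sketch an empty result is the signal on which a caller would flip the gate red instead of probing.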

Round 5 (new):
5. ✅ ClientConfigDiagnosticsTest: assertions for gwV2Cto now derive the value dynamically from Configs.isThinClientEnabled() so they remain valid after the default flip. UserAgentContainerTest.UserAgentIntegration: updated to expect the |F4 suffix because the ThinClient user-agent flag is now included by default.

Round 6 (new):
6. ✅ GlobalEndpointManager: probe is now part of the reactor chain. triggerThinClientProbeCycle() (fire-and-forget .subscribe(...)) replaced with runThinClientProbeCycleMono(): Mono<Void>; the three trigger sites (forceRefresh path, refreshLocationPrivateAsync prefix, and the inner shouldRefreshEndpoints branch) all chain via .flatMap(...)/.then(Mono.defer(...)). thinClientProbeDisposable field and its close() handling removed — cancellation propagates through the outer subscription disposed by backgroundRefreshDisposable.dispose(). runProbeCycle already absorbs per-probe errors and has a per-probe timeout, so chaining is safe.
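To illustrate the shape of the Round 6 change (probe chained into the refresh pipeline rather than subscribed fire-and-forget), here is an analogous sketch using `java.util.concurrent.CompletableFuture` instead of Reactor, so it runs without extra dependencies. It is not the PR's `Mono`-based code: `ChainedProbeSketch`, the two-second timeout, and mapping a failed probe to "unhealthy" are assumptions for the sake of the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

// Hypothetical sketch: run the probe as a stage of the refresh future, absorbing probe errors.
final class ChainedProbeSketch {

    // The returned future completes after both the refresh and the probe stage have finished.
    static CompletableFuture<Void> refreshThenProbe(
            CompletableFuture<String> refresh,
            Function<String, CompletableFuture<Boolean>> probe,
            AtomicBoolean proxyHealthy) {
        return refresh.thenCompose(topology ->
            probe.apply(topology)
                .orTimeout(2, TimeUnit.SECONDS)      // bound the probe (timeout value is an assumption)
                .exceptionally(error -> false)       // a failed or timed-out probe counts as unhealthy
                .thenAccept(proxyHealthy::set));     // record the outcome; the refresh itself still succeeds
    }
}
```

Because the probe is a stage of the returned future, the caller holds a single handle for the whole pipeline, which is the property the Round 6 change relies on when it drops the separate `thinClientProbeDisposable` field.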

Round 7 (new):
7. ✅ Removed inline FQNs in favor of imports:

  • EndpointProbeClient: java.io.Closeable, java.util.List
  • EndpointProbeClientTests: java.net.ConnectException
  • ThinClientProbeWiringTests: com.azure.cosmos.implementation.http.HttpHeaders

Verification:

  • mvn -o install -pl :azure-cosmos -am -DskipTests=true → SUCCESS
  • mvn -o verify -Punit -Dit.test=EndpointProbeClientTests,ConfigsTests,ThinClientProbeWiringTests,ThinClientRoutingGateTests,ClientConfigDiagnosticsTest,UserAgentContainerTest → BUILD SUCCESS (all tests green)

@jeet1995

Member Author

/azp run java - cosmos - tests

@jeet1995

Member Author

/azp run java - cosmos - spark

@jeet1995

Member Author

/azp run java - cosmos - kafka

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

UserAgentSuffixTest.validateUserAgentSuffix and
CosmosDiagnosticsTest.generateHttp2OptedInUserAgentIfRequired:
include UserAgentFeatureFlags.ThinClient in computed |F<hex> suffix
when COSMOS.THINCLIENT_ENABLED is true (now default after Gateway V2
default enablement). Mirrors RxDocumentClientImpl.addUserAgentSuffix +
UserAgentContainer.setFeatureEnabledFlagsAsSuffix behavior.

SinglePartitionDocumentQueryTest.querySinglePartitionDocuments:
spy on both gateway-proxy and thin-proxy and assert exactly one
invocation. Previous code only spied on the proxy implied by
useThinClient() config intent, which races with the probe-healthy
gate -- routing AND's intent with isProxyProbeHealthy() so on first
cycle the request may go through gateway even when thin-client is
configured.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Build 6419227 triage + fixes

Pushed commit 4a31ea0c addressing the dominant CI/live test failures.

Distinct failure patterns identified

| # | Pattern | Count | Stages | Root cause |
| --- | --- | --- | --- | --- |
| 1 | UserAgentSuffixTest::validateUserAgentSuffix ends-with mismatch (`... F4` appended after expected suffix) | 12 | Public_LiveTest (StrongSession), Public_Cosmos_Live_Test_ThinClient | |
| 2 | CosmosDiagnosticsTest same `... F4` mismatch | 37 | Public_LiveTest (StrongSession), Public_Cosmos_Live_Test_ThinClient | |
| 3 | SinglePartitionDocumentQueryTest::querySinglePartitionDocuments Mockito thinClientStoreModel.processMessage wanted but not invoked | 1 | Http2 Query | Race with probe-healthy gate. Test spied on the proxy implied by useThinClient() (config intent), but actual routing AND's intent with globalEndpointManager.isProxyProbeHealthy() → on first refresh cycle request may legitimately go through gateway. |
| 4 | Http2PingKeepaliveTest::inFlightReadRetriesInSameRegionAfterPingClose: recovery channel ID 77af2e47 equal to initial channel ID | 1 | Http2 NetworkFault | Not addressed in this commit. Recovery request reused the same connection after PING timeout. Independent fault-injection test; unclear if probe is keeping connection alive via HTTP/2 multiplexing or pre-existing flake. Will investigate separately. |

Fixes in 4a31ea0c

  1. UserAgentSuffixTest.validateUserAgentSuffix — when Configs.isThinClientEnabled() is true, OR UserAgentFeatureFlags.ThinClient.getValue() into the computed featureValue mask before appending the single |F<hex> suffix. Mirrors production RxDocumentClientImpl.addUserAgentSuffix + UserAgentContainer.setFeatureEnabledFlagsAsSuffix.
  2. CosmosDiagnosticsTest.generateHttp2OptedInUserAgentIfRequired — same fix in symmetric helper.
  3. SinglePartitionDocumentQueryTest.querySinglePartitionDocuments — spy on BOTH the gateway proxy AND the thin proxy (when thin-client is configured), and assert gatewayInvocations + thinInvocations == 1. Robust to probe-gate routing.

Why not reorder UserAgentContainer to put |F<hex> before the user suffix

Existing unit test UserAgentContainerTest::userAgentContainerSetSuffixWithFeatureEnablementFlags already pins the format <base> <userSuffix>|F<hex>. PPCB / PPAF feature flags have shipped under that contract for releases. The right fix is to update test helpers, which had only partial flag-awareness (Http2 only).
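As a small illustration of the pinned format `<base> <userSuffix>|F<hex>`, the sketch below ORs feature flags into one mask and appends a single suffix. Only the ThinClient flag value (`1 << 2`) comes from this PR; `UserAgentSuffixSketch`, the second flag, omitting the suffix for an empty mask, and upper-case hex are assumptions.

```java
import java.util.Locale;

// Hypothetical sketch of building "<base> <userSuffix>|F<hex>" from a feature-flag bitmask.
final class UserAgentSuffixSketch {
    static final int THIN_CLIENT_FLAG = 1 << 2;        // value stated in the PR discussion
    static final int HYPOTHETICAL_OTHER_FLAG = 1 << 0; // placeholder for any other feature flag

    static String withFeatureFlags(String base, String userSuffix, int featureMask) {
        StringBuilder userAgent = new StringBuilder(base);
        if (userSuffix != null && !userSuffix.isEmpty()) {
            userAgent.append(' ').append(userSuffix);
        }
        // Assumption: no flags set means no |F suffix at all.
        if (featureMask != 0) {
            userAgent.append("|F")
                     .append(Integer.toHexString(featureMask).toUpperCase(Locale.ROOT));
        }
        return userAgent.toString();
    }
}
```

With only the ThinClient flag set, `withFeatureFlags("base", "mySuffix", THIN_CLIENT_FLAG)` yields `base mySuffix|F4`, which matches the `|F4` suffix the updated tests expect.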

Remaining

  • Investigate Http2PingKeepaliveTest::inFlightReadRetriesInSameRegionAfterPingClose (separate commit if it turns out to be a probe-side regression).

The test installs an iptables DROP on thin-client port 10250 to verify
that Http2PingHandler closes the broken connection after consecutive
PING ACK timeouts and the recovery request uses a new connection on
the same regional endpoint.

After default Gateway V2 enablement, the connectivity probe also fires
HTTP/2 POSTs to port 10250 on every account refresh. With iptables
dropping that port, the probe trips proxyHealthy=false, useThinClient
StoreModel() returns false, and the data plane request routes through
Gateway V1 on port 443 -- which iptables is not dropping. Result:
the PING handler never fires, the warm-up and recovery requests use
the same gateway channel, and the assertion 'recovery channel must
differ from initial' fails (both ended up as 77af2e47 on build 6419227).

Set COSMOS.THINCLIENT_PROBE_ENABLED=false in beforeClass so the probe
short-circuits to a no-op, EndpointProbeClient.proxyHealthy stays
optimistically true, and the data plane request actually flows over
port 10250 where the iptables DROP can take effect.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Follow-up: Http2PingKeepaliveTest fix in ccc7c39d

Root cause

Http2PingKeepaliveTest::inFlightReadRetriesInSameRegionAfterPingClose installs an iptables DROP rule on port 10250 (the thin-client proxy port) to blackhole PING ACKs. It expects Http2PingHandler to close the broken connection after 2 consecutive PING timeouts, and the recovery request to land on a NEW TCP connection on the SAME regional endpoint.

After default Gateway V2 enablement, the connectivity probe also fires HTTP/2 POSTs to port 10250 on every account refresh. Sequence of events on build 6419227:

  1. Test installs iptables -A OUTPUT -p tcp --dport 10250 -j DROP.
  2. Next probe cycle's POST to port 10250 times out → EndpointProbeClient.proxyHealthy → false (default failure threshold = 1).
  3. useThinClientStoreModel() ANDs intent with isProxyProbeHealthy() → returns false.
  4. Data-plane request routes through Gateway V1 on port 443 instead — which iptables is not dropping.
  5. Warm-up and recovery requests reuse the same gateway connection (77af2e47).
  6. PING handler never fires; assertion recoveryChannelId != initialChannelId fails.

Fix

Set COSMOS.THINCLIENT_PROBE_ENABLED=false in @BeforeClass so the probe short-circuits to a no-op, EndpointProbeClient.proxyHealthy stays optimistically true, and the data-plane request actually flows over port 10250 where the iptables DROP can take effect.

Cleared in @AfterClass.

Why this isn't a production regression

In production, the same scenario — port 10250 blackholed but port 443 reachable — should fall back to Gateway V1. That's the designed probe behavior. The test is exercising a specific code path (Http2PingHandler close + ClientRetryPolicy H3 branch + same-region recovery) that requires the request to actually attempt port 10250. Disabling the probe in the test isolates the path under test without changing production semantics.
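The interaction described in this comment (config intent ANDed with probe health, and the probe short-circuiting to a no-op when disabled) can be sketched as below. This is a simplified model, not the SDK's routing code: `RoutingGateSketch` is hypothetical, it assumes a failure threshold of 1, and it reduces routing to a choice between the two ports mentioned above.

```java
// Hypothetical sketch: route to the thin-client port only when configured AND the probe says healthy.
final class RoutingGateSketch {
    static final int THIN_CLIENT_PORT = 10250;
    static final int GATEWAY_V1_PORT = 443;

    private final boolean probeEnabled;
    private boolean proxyHealthy = true; // optimistic until a probe cycle says otherwise

    RoutingGateSketch(boolean probeEnabled) {
        this.probeEnabled = probeEnabled;
    }

    // With the probe disabled this is a no-op, so proxyHealthy stays at its optimistic default.
    void onProbeCycle(boolean allEndpointsReachable) {
        if (!probeEnabled) {
            return;
        }
        proxyHealthy = allEndpointsReachable; // assumes failure and recovery thresholds of 1
    }

    int selectPort(boolean thinClientConfigured) {
        return (thinClientConfigured && proxyHealthy) ? THIN_CLIENT_PORT : GATEWAY_V1_PORT;
    }
}
```

With the probe enabled, one unreachable cycle moves traffic to port 443, which is the situation the failing test hit; with the probe disabled, requests keep targeting port 10250, so the iptables rule can exercise the PING handler.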

@jeet1995

Member Author

/azp run java - cosmos - tests

@jeet1995

Member Author

/azp run java - cosmos - spark

@jeet1995

Member Author

/azp run java - cosmos - kafka

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Build 6424287 surfaced two new failure patterns:

  1. CosmosNotFoundTests.performBulkOnDeletedContainerWithGatewayV2 (45
     failures) - asserts substatus 1003 from the thin-client routing
     path, but observed 0 because the data plane was routed to Gateway
     V1 instead of the proxy.
  2. PerPartitionCircuitBreakerE2ETests.*Gateway (26 failures) -
     TestSuiteBase.assertThinClientEndpointUsed could not find any
     request whose endpoint contained ':10250/', i.e. nothing actually
     went to the thin-client proxy.

Both patterns trace to the same source: the new connectivity probe is
enabled by default, the proxy-side /connectivity-probe endpoint is not
deployed in every CI test account yet, and the default failure
threshold is 1. So after the first probe cycle the SDK marks the proxy
unhealthy and routes data plane traffic to Gateway V1, which breaks
tests that explicitly assert thin-client routing.

Disable the probe by default in TestSuiteBase's static initializer
(only when the property is not already set), so all E2E tests inherit
deterministic, configuration-driven routing. Tests that exercise the
probe itself (EndpointProbeClientTests, ThinClientProbeWiringTests)
set the property explicitly in @BeforeMethod and are not affected.

Also drop the now-redundant per-class override in Http2PingKeepaliveTest:
the base class disables the probe, and the test's @AfterClass clear would
otherwise re-enable the probe for any subsequent E2E test sharing the
JVM.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Build 6424287 went from 49+/2 distinct test failures down to two new test-only failure patterns, both rooted in the same cause — the probe gate (enabled by default, threshold=1) is flipping data-plane routing from the thin-client proxy to Gateway V1 in test accounts whose proxy has not yet deployed the /connectivity-probe endpoint.

New failures (both fixed in ad9e3df):

  1. CosmosNotFoundTests.performBulkOnDeletedContainerWithGatewayV2 (45 failures, log 1986) — asserts response substatus 1003 from the thin-client routing path, observed 0 because requests went to Gateway V1.
  2. PerPartitionCircuitBreakerE2ETests.*Gateway (26 failures, log 2002) — TestSuiteBase.assertThinClientEndpointUsed could not find any request whose endpoint contained :10250/.

Fix: Disable the probe by default in TestSuiteBase's static initializer (only when the property is not already set), so all E2E tests inherit deterministic configuration-driven routing. Dedicated probe tests (EndpointProbeClientTests, ThinClientProbeWiringTests) set the property explicitly in @BeforeMethod and are unaffected. The per-class override in Http2PingKeepaliveTest is now redundant and was removed (its @AfterClass clear would have re-enabled the probe for any later E2E test sharing the JVM).

No production impact — these are test-environment changes. The probe still defaults to true in production via the DEFAULT_THINCLIENT_PROBE_ENABLED config; customers running with the property unset will get the probe enabled.

Remaining single-shot failures (likely environmental, not probe-related):

  • CosmosTracerTest.cosmosAsyncDatabase ThreadTimeoutException at 40s (log 4742) — Direct TCP mode, probe doesn't gate Direct routing. Watching next run.
  • DocumentQuerySpyWireContentTest.before_DocumentQuerySpyWireContentTest 429 RequestRateTooLarge (log 2251) — @BeforeClass setup throttling on shared test account; user-agent in the error correctly shows |F4 suffix confirming user-agent helper fix is working.
  • OrderbyDocumentQueryTest.before_OrderbyDocumentQueryTest 404 "Collection is not yet available for read" (log 2696) — proxy-side propagation race on freshly created collection.

Commit: ad9e3df

jeet1995 and others added 2 commits June 11, 2026 18:22
…PartitionCircuitBreakerE2ETests

Companion to the prior revert. The revert undid the global TestSuiteBase probe
disable (which masked production behaviour). This commit adds the necessary
per-class disable to the two test classes whose assertions explicitly require
thin-client routing: CosmosNotFoundTests (thinclient group) and
PerPartitionCircuitBreakerE2ETests (fi-thinclient-multi-master group). Both
clear the property in their @AfterClass. Http2PingKeepaliveTest already has
its own disable (restored by the revert). Production callers continue to get
the connectivity probe ON by default with the production failure threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Update on the CI-failure fix for build 6424287

My earlier comment proposed disabling the connectivity probe globally in TestSuiteBase. That was the wrong call --- the probe is ON by default in production (default failure threshold = 1), and tests should reflect production behaviour rather than mask it.

What I just pushed (commits e381f3a4a26 revert + da6c6983b80):

  1. Reverted the global probe-disable in TestSuiteBase so production-equivalent defaults are restored for the broad test population.
  2. Added a per-class probe-disable only to the three test classes whose assertions explicitly require the data plane to land on the proxy:
    • CosmosNotFoundTests (thinclient group) --- asserts OWNER_RESOURCE_NOT_EXISTS / assertThinClientEndpointUsed.
    • PerPartitionCircuitBreakerE2ETests (fi-thinclient-multi-master group) --- asserts assertThinClientEndpointUsed for gateway-mode traffic.
    • Http2PingKeepaliveTest --- installs iptables DROP on port 10250, which would also defeat the probe before the PING handler fires. (Disable restored by the revert.)

Each @BeforeClass sets COSMOS.THINCLIENT_PROBE_ENABLED=false before client construction and the corresponding @AfterClass clears it so the JVM's other test classes are not polluted. Every other test class continues to run with the production default (probe ON, threshold = 1).

This keeps the rest of the suite honest about default behaviour while preventing the three classes that need a deterministic proxy route from failing while the proxy-side /connectivity-probe endpoint is still rolling out across CI accounts.
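The set-in-@BeforeClass, clear-in-@AfterClass pattern described above can be expressed without a test framework as a scoped override. The sketch below is a variation rather than what the test classes do: `SystemPropertyScopeSketch` is a hypothetical helper, and it restores the previous value on close instead of always clearing, so an override set by an outer scope survives.

```java
// Hypothetical sketch: override a system property for a scope and restore the prior value afterwards.
final class SystemPropertyScopeSketch implements AutoCloseable {
    private final String name;
    private final String previous; // null when the property was unset before the override

    SystemPropertyScopeSketch(String name, String value) {
        this.name = name;
        this.previous = System.getProperty(name);
        System.setProperty(name, value);
    }

    @Override
    public void close() {
        if (previous == null) {
            System.clearProperty(name);
        } else {
            System.setProperty(name, previous);
        }
    }
}
```

A test class could create the scope with `new SystemPropertyScopeSketch("COSMOS.THINCLIENT_PROBE_ENABLED", "false")` before constructing its client and close it during teardown; as in the comment above, the override has to be in place before client construction to have any effect.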

jeet1995 and others added 2 commits June 11, 2026 22:47
… into AzCosmos_GatewayV2_QueryPlanSupport

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdConstants.java
Covers 5 new scenarios in ThinClientRoutingGateTests:
- ExecuteStoredProcedure on a StoredProcedure resource routes to thin client
- Non-execute StoredProcedure ops (Create) route to Gateway V1
- OperationType.QueryPlan routes to thin client
- QueryPlan returns false when probe is unhealthy
- ExecuteStoredProcedure returns false when probe is unhealthy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

3 participants