From 2e43e143a047920e8c0d636a4377afa0a3c9a3d6 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Tue, 26 May 2026 16:12:46 -0700 Subject: [PATCH 01/15] [Cosmos Spark] Surface empty change feed pages to avoid end-to-end timeout MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Spark partition tasks reading change feed (or executing cross-partition queries) against a sparse workload could hit OperationCancelledException ("End-to-end timeout hit when trying to retrieve the next page") at the connector's 65-second per-operation end-to-end timeout. Root cause: with the default emptyPagesAllowed=false, ParallelDocumentQueryExecutionContext and ChangeFeedFetcher swallow empty / 304 pages internally — a single producer-side nextPage() call can keep draining many sub-feedRanges before emitting one non-empty page. For sparse workloads the cumulative time blows the per-operation timeout. Fix: * Spark ItemsPartitionReader (query path) calls setAllowEmptyPages(true) on the CosmosQueryRequestOptions so the SDK's existing emptyPagesAllowed plumbing applies. * New internal-only emptyPagesAllowed flag on CosmosChangeFeedRequestOptionsImpl (default false; behavior unchanged for all other callers) plumbed through Paginator. getChangeFeedQueryResultAsObservable into ChangeFeedFetcher. nextPageInternal. When the flag is true, both 304 branches return Mono.just(r) so empty pages bubble up to the iterator. Surfaced via new package-private bridge accessor CosmosChangeFeedRequestOptionsAccessor.{get,set}AllowEmptyPages. * ChangeFeedFetcher.isFullyDrained no longer short-circuits to true on noChanges responses (it now consults only continuation.isDone()), which removes the load-bearing reEnableShouldFetchMoreForRetry() pattern that was previously needed to undo a base-class decision. * Spark ChangeFeedPartitionReader opts into the new flag via the bridge accessor. * CosmosChangeFeedRequestOptions.withCosmosPagedFluxOptions now also propagates emptyPagesAllowed when the paged-flux pull mechanism supplies a continuation token (the freshly-built impl would otherwise silently lose the flag — comment added flagging the broader drift hazard). Tests: * New ChangeFeedFetcherEmptyPagesTest (5 unit tests): exercises the isFullyDrained behavior change and asserts that nextPageInternal surfaces noChanges responses individually when the flag is true and swallows them via repeatWhenEmpty when the flag is false. * New CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest (3 unit tests): locks in the flag propagation through withCosmosPagedFluxOptions. * Extended TransientIOErrorsRetryingIteratorSpec with a regression test that drains hundreds of leading empty pages followed by data without hitting the end-to-end timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../azure-cosmos-spark_3-3_2-12/CHANGELOG.md | 1 + .../azure-cosmos-spark_3-4_2-12/CHANGELOG.md | 1 + .../azure-cosmos-spark_3-5_2-12/CHANGELOG.md | 1 + .../azure-cosmos-spark_3-5_2-13/CHANGELOG.md | 1 + .../spark/ChangeFeedPartitionReader.scala | 6 + .../cosmos/spark/ItemsPartitionReader.scala | 8 + ...ransientIOErrorsRetryingIteratorSpec.scala | 62 +++++ ...equestOptionsWithPagedFluxOptionsTest.java | 103 +++++++ .../ChangeFeedFetcherEmptyPagesTest.java | 263 ++++++++++++++++++ sdk/cosmos/azure-cosmos/CHANGELOG.md | 1 + .../implementation/ChangeFeedQueryImpl.java | 1 + .../CosmosChangeFeedRequestOptionsImpl.java | 11 + .../ImplementationBridgeHelpers.java | 4 + .../query/ChangeFeedFetcher.java | 29 +- .../implementation/query/Paginator.java | 2 + .../CosmosChangeFeedRequestOptions.java | 25 ++ 16 files changed, 515 insertions(+), 4 deletions(-) create mode 100644 sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java create mode 100644 sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java diff --git a/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md index 5045991f3542..c92ea17be091 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md @@ -8,6 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md index 777177708547..30978ce05fa5 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md @@ -8,6 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md index 1360614308a3..149c59057bfe 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md @@ -8,6 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md index 6674b6f8bb74..325de232cce7 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md @@ -8,6 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala index 5d83a5139ef2..e5e0cce7fa3b 100644 --- a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala +++ b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala @@ -214,6 +214,12 @@ private case class ChangeFeedPartitionReader .setEndLSN(options, this.partition.endLsn.get) } + // Bubble empty pages up to the iterator so the per-page end-to-end timeout + // applies to each individual page rather than being exceeded by serial + // empty-page drains inside ChangeFeedFetcher. + ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper.getCosmosChangeFeedRequestOptionsAccessor + .setAllowEmptyPages(options, true) + options.setCustomItemSerializer(itemDeserializer) } diff --git a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ItemsPartitionReader.scala b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ItemsPartitionReader.scala index 44027bafbe7e..7739485bbc91 100644 --- a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ItemsPartitionReader.scala +++ b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ItemsPartitionReader.scala @@ -45,6 +45,14 @@ private case class ItemsPartitionReader .getCosmosQueryRequestOptionsAccessor .disallowQueryPlanRetrieval(new CosmosQueryRequestOptions()) + // Bubble empty pages up to the iterator so the per-page end-to-end timeout + // applies to each individual page rather than being exceeded by serial + // empty-page drains inside ParallelDocumentQueryExecutionContext. + ImplementationBridgeHelpers + .CosmosQueryRequestOptionsHelper + .getCosmosQueryRequestOptionsAccessor + .setAllowEmptyPages(queryOptions, true) + private val readConfig = CosmosReadConfig.parseCosmosReadConfig(config) ThroughputControlHelper.populateThroughputControlGroupName( ImplementationBridgeHelpers diff --git a/sdk/cosmos/azure-cosmos-spark_3/src/test/scala/com/azure/cosmos/spark/TransientIOErrorsRetryingIteratorSpec.scala b/sdk/cosmos/azure-cosmos-spark_3/src/test/scala/com/azure/cosmos/spark/TransientIOErrorsRetryingIteratorSpec.scala index b8400fdd3eff..9185e7529d35 100644 --- a/sdk/cosmos/azure-cosmos-spark_3/src/test/scala/com/azure/cosmos/spark/TransientIOErrorsRetryingIteratorSpec.scala +++ b/sdk/cosmos/azure-cosmos-spark_3/src/test/scala/com/azure/cosmos/spark/TransientIOErrorsRetryingIteratorSpec.scala @@ -180,6 +180,68 @@ class TransientIOErrorsRetryingIteratorSpec extends UnitSpec with BasicLoggingTr factoryCallCount.get shouldEqual 1 } + "TransientIOErrors" should "drain long runs of empty pages without hitting the end-to-end timeout" in { + // Regression test for the empty-page drain scenario: when the SDK is configured with + // emptyPagesAllowed=true the iterator must surface many consecutive empty + // pages without busy-waiting beyond the per-page end-to-end timeout. Even + // with hundreds of empty pages followed by real data, the iterator should + // return all real rows. + val emptyLeadingPages = 200 + val realPages = 5 + val totalPages = emptyLeadingPages + realPages + val iterator = new TransientIOErrorsRetryingIterator( + continuationToken => generateMockedCosmosPagedFluxWithEmptyPrefix( + continuationToken, totalPages, emptyLeadingPages), + pageSize, + 1, + None, + None + ) + iterator.maxRetryIntervalInMs = 5 + + // 2 producers (Left/Right) each emit realPages * pageSize rows + iterator.count(_ => true) shouldEqual (realPages * pageSize * 2) + } + + private def generateMockedCosmosPagedFluxWithEmptyPrefix + ( + continuationToken: String, + initialPageCount: Int, + leadingEmptyPageCount: Int + ) = { + + val leftProducer = generateFeedResponseFluxWithEmptyPrefix( + "Left", initialPageCount, leadingEmptyPageCount, Option.apply(continuationToken)) + val rightProducer = generateFeedResponseFluxWithEmptyPrefix( + "Right", initialPageCount, leadingEmptyPageCount, Option.apply(continuationToken)) + val toBeMerged = Array(leftProducer, rightProducer).toIterable.asJava + val mergedFlux = Flux.mergeSequential(toBeMerged, 1, 2) + UtilBridgeInternal.createCosmosPagedFlux(_ => mergedFlux) + } + + private def generateFeedResponseFluxWithEmptyPrefix + ( + prefix: String, + pageCount: Int, + leadingEmptyPageCount: Int, + requestContinuationToken: Option[String] + ): Flux[FeedResponse[SparkRowItem]] = { + + // generateFeedResponse uses documentStartIndex=-1 as the "emit an empty page" sentinel. + val emptyPageSentinel = -1 + val firstDataPageStartIndex = 1 + + val responses = Array.range(1, pageCount + 1) + .map(i => generateFeedResponse( + prefix, + i, + if (i <= leadingEmptyPageCount) emptyPageSentinel else firstDataPageStartIndex)) + .filter(response => requestContinuationToken.isEmpty || + requestContinuationToken.get < response.getContinuationToken) + + Flux.fromArray(responses) + } + private val objectMapper = new ObjectMapper @throws[JsonProcessingException] diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java new file mode 100644 index 000000000000..ebb23fafbac1 --- /dev/null +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java @@ -0,0 +1,103 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +package com.azure.cosmos.implementation; + +import com.azure.cosmos.implementation.changefeed.common.ChangeFeedMode; +import com.azure.cosmos.implementation.changefeed.common.ChangeFeedStartFromInternal; +import com.azure.cosmos.implementation.changefeed.common.ChangeFeedStateV1; +import com.azure.cosmos.implementation.feedranges.FeedRangeEpkImpl; +import com.azure.cosmos.models.CosmosChangeFeedRequestOptions; +import com.azure.cosmos.models.ModelBridgeInternal; +import org.testng.annotations.Test; + +import static org.assertj.core.api.Assertions.assertThat; + +/** + * Unit tests for the paged-flux pull continuation path on + * {@link CosmosChangeFeedRequestOptions#withCosmosPagedFluxOptions(CosmosPagedFluxOptions)} (package-visible via + * {@link ModelBridgeInternal#getEffectiveChangeFeedRequestOptions(CosmosChangeFeedRequestOptions, CosmosPagedFluxOptions)}). + * + *

That method silently builds a brand-new {@code CosmosChangeFeedRequestOptionsImpl} when the caller supplies a + * continuation token via {@link CosmosPagedFluxOptions}, so any field NOT explicitly copied is dropped. These tests + * lock in the propagation of fields whose loss would silently break a feature. + */ +public class CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest { + + @Test(groups = { "unit" }) + public void emptyPagesAllowed_isPropagated_whenContinuationTokenSupplied() { + // GIVEN a CosmosChangeFeedRequestOptions with emptyPagesAllowed=true (the value the Spark connector sets) + CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions + .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); + ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .setAllowEmptyPages(src, true); + + // AND a continuation token supplied via the paged-flux pull mechanism + CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); + pagedFluxOptions.setRequestContinuation(buildContinuationToken()); + + // WHEN computing the effective options + CosmosChangeFeedRequestOptions effective = ModelBridgeInternal + .getEffectiveChangeFeedRequestOptions(src, pagedFluxOptions); + + // THEN emptyPagesAllowed must be preserved on the freshly-built impl + assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .getAllowEmptyPages(effective)) + .describedAs("emptyPagesAllowed must survive the paged-flux pull continuation rebuild") + .isTrue(); + } + + @Test(groups = { "unit" }) + public void emptyPagesAllowedFalse_isPropagated_whenContinuationTokenSupplied() { + // The default value should also round-trip cleanly (sanity check that we're not just hard-coding true). + CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions + .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); + + CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); + pagedFluxOptions.setRequestContinuation(buildContinuationToken()); + + CosmosChangeFeedRequestOptions effective = ModelBridgeInternal + .getEffectiveChangeFeedRequestOptions(src, pagedFluxOptions); + + assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .getAllowEmptyPages(effective)) + .describedAs("emptyPagesAllowed default (false) must survive the paged-flux pull continuation rebuild") + .isFalse(); + } + + @Test(groups = { "unit" }) + public void emptyPagesAllowed_isPreserved_whenNoContinuationTokenSupplied() { + // No continuation → withCosmosPagedFluxOptions returns `this` unchanged. + CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions + .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); + ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .setAllowEmptyPages(src, true); + + CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); + pagedFluxOptions.setMaxItemCount(50); + + CosmosChangeFeedRequestOptions effective = ModelBridgeInternal + .getEffectiveChangeFeedRequestOptions(src, pagedFluxOptions); + + assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .getAllowEmptyPages(effective)) + .isTrue(); + } + + private static String buildContinuationToken() { + // Build a real ChangeFeedState so we can serialize a valid (base64-encoded) continuation token. + // We use the state's own toString() which round-trips through createForProcessingFromContinuation. + ChangeFeedStateV1 state = new ChangeFeedStateV1( + "someContainerRid", + FeedRangeEpkImpl.forFullRange(), + ChangeFeedMode.INCREMENTAL, + ChangeFeedStartFromInternal.createFromBeginning(), + null); + return state.toString(); + } +} diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java new file mode 100644 index 000000000000..d8db52fa5d54 --- /dev/null +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java @@ -0,0 +1,263 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +package com.azure.cosmos.implementation.query; + +import com.azure.cosmos.implementation.Document; +import com.azure.cosmos.implementation.DocumentClientRetryPolicy; +import com.azure.cosmos.implementation.DocumentServiceRequestContext; +import com.azure.cosmos.implementation.GlobalEndpointManager; +import com.azure.cosmos.implementation.IRetryPolicyFactory; +import com.azure.cosmos.implementation.OperationType; +import com.azure.cosmos.implementation.ResourceType; +import com.azure.cosmos.implementation.RxDocumentClientImpl; +import com.azure.cosmos.implementation.RxDocumentServiceRequest; +import com.azure.cosmos.implementation.ShouldRetryResult; +import com.azure.cosmos.implementation.changefeed.common.ChangeFeedState; +import com.azure.cosmos.implementation.faultinjection.FaultInjectionRequestContext; +import com.azure.cosmos.implementation.feedranges.FeedRangeContinuation; +import com.azure.cosmos.implementation.feedranges.FeedRangeEpkImpl; +import com.azure.cosmos.implementation.feedranges.FeedRangeInternal; +import com.azure.cosmos.implementation.perPartitionCircuitBreaker.GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker; +import com.azure.cosmos.models.FeedResponse; +import com.azure.cosmos.models.ModelBridgeInternal; +import org.mockito.Mockito; +import org.testng.annotations.Test; +import reactor.core.publisher.Mono; +import reactor.test.StepVerifier; + +import java.lang.reflect.Field; +import java.lang.reflect.Method; +import java.util.HashMap; +import java.util.Map; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.function.Function; +import java.util.function.Supplier; + +import static org.assertj.core.api.Assertions.assertThat; +import static org.mockito.ArgumentMatchers.any; +import static org.mockito.ArgumentMatchers.anyInt; +import static org.mockito.Mockito.doNothing; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.when; + +/** + * Unit tests for {@link ChangeFeedFetcher} covering the {@code emptyPagesAllowed} behavior change + * introduced for IcM 51000001033272. + * + *

The two behaviors locked in here are: + *

    + *
  1. {@code isFullyDrained} consults only the continuation (no {@code noChanges} short-circuit), + * which is what allows {@code emptyPagesAllowed=true} to surface empty pages without first + * having to call {@code reEnableShouldFetchMoreForRetry()} to undo a base-class decision.
  2. + *
  3. When {@code emptyPagesAllowed=true}, {@code nextPageInternal} returns + * {@code Mono.just(noChangesResponse)} instead of {@code Mono.empty()} so the empty pages + * bubble up to the caller (Spark connector) where the per-page end-to-end timeout applies to + * each individual page rather than being exceeded by serial empty-page drains.
  4. + *
+ */ +public class ChangeFeedFetcherEmptyPagesTest { + + @Test(groups = { "unit" }) + public void isFullyDrained_noChangesResponseWithUnfinishedContinuation_returnsFalse() throws Exception { + // GIVEN a ChangeFeedFetcher and a noChanges response with continuation not yet done + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(false); + ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ true); + + FeedResponse noChangesResponse = changeFeedNoChanges("token-A"); + + // WHEN + boolean drained = invokeIsFullyDrained(fetcher, noChangesResponse); + + // THEN: the previous implementation returned true here (and relied on + // reEnableShouldFetchMoreForRetry() in nextPageInternal to undo it). The new + // implementation consults only the continuation. + assertThat(drained).isFalse(); + } + + @Test(groups = { "unit" }) + public void isFullyDrained_noChangesResponseWithFinishedContinuation_returnsTrue() throws Exception { + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(true); + ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ true); + + FeedResponse noChangesResponse = changeFeedNoChanges("token-B"); + + assertThat(invokeIsFullyDrained(fetcher, noChangesResponse)).isTrue(); + } + + @Test(groups = { "unit" }) + public void isFullyDrained_realResponseWithFinishedContinuation_returnsTrue() throws Exception { + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(true); + ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ false); + + FeedResponse dataResponse = changeFeedDataPage("token-C", new Document()); + + assertThat(invokeIsFullyDrained(fetcher, dataResponse)).isTrue(); + } + + @Test(groups = { "unit" }) + public void nextPage_emptyPagesAllowedTrue_surfacesNoChangesPagesIndividually() { + // Scenario: 3 consecutive noChanges (304) pages from the same sub-feedRange, + // followed by a real data page. With emptyPagesAllowed=true, each of the 4 + // physical responses must surface as its own Mono emission so the Spark + // iterator can drain them under the per-page end-to-end timeout window. + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(false); + // handleChangeFeedNotModified returns RETRY_NOW only for actual noChanges responses + // (matching the real FeedRangeCompositeContinuationImpl behavior). + when(continuation.handleChangeFeedNotModified(any())).thenAnswer(invocation -> { + FeedResponse rsp = invocation.getArgument(0); + return ModelBridgeInternal.noChanges(rsp) ? ShouldRetryResult.RETRY_NOW : ShouldRetryResult.noRetry(); + }); + + FeedResponse noChanges1 = changeFeedNoChanges("t1"); + FeedResponse noChanges2 = changeFeedNoChanges("t2"); + FeedResponse noChanges3 = changeFeedNoChanges("t3"); + FeedResponse data = changeFeedDataPage("t4", new Document()); + + FeedResponse[] script = new FeedResponse[] { noChanges1, noChanges2, noChanges3, data }; + AtomicInteger callIndex = new AtomicInteger(); + Function>> executeFunc = + req -> Mono.just(script[callIndex.getAndIncrement()]); + + ChangeFeedFetcher fetcher = + newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ true, executeFunc); + + // Drive the fetcher across 4 nextPage() invocations and assert each one surfaces. + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == noChanges1).verifyComplete(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == noChanges2).verifyComplete(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == noChanges3).verifyComplete(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == data).verifyComplete(); + + assertThat(callIndex.get()).describedAs("executeFunc should have been called once per surfaced page").isEqualTo(4); + } + + @Test(groups = { "unit" }) + public void nextPage_emptyPagesAllowedFalse_swallowsNoChangesPagesUntilData() { + // Same scenario, but with the default emptyPagesAllowed=false. The + // 3 noChanges responses should be swallowed via repeatWhenEmpty and + // only the data page should surface from a SINGLE nextPage() call. + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(false); + when(continuation.handleChangeFeedNotModified(any())).thenAnswer(invocation -> { + FeedResponse rsp = invocation.getArgument(0); + return ModelBridgeInternal.noChanges(rsp) ? ShouldRetryResult.RETRY_NOW : ShouldRetryResult.noRetry(); + }); + + FeedResponse noChanges1 = changeFeedNoChanges("t1"); + FeedResponse noChanges2 = changeFeedNoChanges("t2"); + FeedResponse noChanges3 = changeFeedNoChanges("t3"); + FeedResponse data = changeFeedDataPage("t4", new Document()); + + FeedResponse[] script = new FeedResponse[] { noChanges1, noChanges2, noChanges3, data }; + AtomicInteger callIndex = new AtomicInteger(); + Function>> executeFunc = + req -> Mono.just(script[callIndex.getAndIncrement()]); + + ChangeFeedFetcher fetcher = + newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ false, executeFunc); + + // A single nextPage() should internally drain all 3 noChanges responses and + // emit the data response. + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == data).verifyComplete(); + + assertThat(callIndex.get()).describedAs("executeFunc should be called once per physical fetch").isEqualTo(4); + } + + // ---- helpers ---- + + private static FeedResponse changeFeedNoChanges(String continuationToken) { + return FeedResponseBuilder.changeFeedResponseBuilder(Document.class) + .withContinuationToken(continuationToken) + .lastChangeFeedPage() + .build(); + } + + private static FeedResponse changeFeedDataPage(String continuationToken, Document... docs) { + return FeedResponseBuilder.changeFeedResponseBuilder(Document.class) + .withContinuationToken(continuationToken) + .withResults(docs) + .build(); + } + + private static ChangeFeedFetcher newFetcher(FeedRangeContinuation continuation, boolean emptyPagesAllowed) { + return newFetcherWithExecuteFunc(continuation, emptyPagesAllowed, req -> Mono.empty()); + } + + private static ChangeFeedFetcher newFetcherWithExecuteFunc( + FeedRangeContinuation continuation, + boolean emptyPagesAllowed, + Function>> executeFunc) { + + RxDocumentClientImpl client = mock(RxDocumentClientImpl.class); + IRetryPolicyFactory resetSessionTokenRetryPolicy = mock(IRetryPolicyFactory.class); + DocumentClientRetryPolicy noOpRetryPolicy = mock(DocumentClientRetryPolicy.class); + when(noOpRetryPolicy.shouldRetry(any())).thenReturn(Mono.just(ShouldRetryResult.noRetry())); + when(resetSessionTokenRetryPolicy.getRequestPolicy(any())).thenReturn(noOpRetryPolicy); + when(client.getResetSessionTokenRetryPolicy()).thenReturn(resetSessionTokenRetryPolicy); + ChangeFeedState changeFeedState = mock(ChangeFeedState.class); + FeedRangeInternal feedRange = FeedRangeEpkImpl.forFullRange(); + when(changeFeedState.getContinuation()).thenReturn(continuation); + when(changeFeedState.getFeedRange()).thenReturn(feedRange); + doNothing().when(changeFeedState).populateRequest(any(RxDocumentServiceRequest.class), anyInt()); + + Supplier createRequestFunc = ChangeFeedFetcherEmptyPagesTest::stubRequest; + + Map requestOptionProperties = new HashMap<>(); + GlobalEndpointManager gem = mock(GlobalEndpointManager.class); + GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker gpe = + mock(GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.class); + when(gpe.isPerPartitionLevelCircuitBreakingApplicable(any())).thenReturn(false); + + return new ChangeFeedFetcher<>( + client, + createRequestFunc, + executeFunc, + changeFeedState, + requestOptionProperties, + /* top */ -1, + /* maxItemCount */ 100, + /* isSplitHandlingDisabled */ true, + /* completeAfterAllCurrentChangesRetrieved */ false, + emptyPagesAllowed, + /* endLSN */ null, + /* operationContext */ null, + gem, + gpe, + /* diagnosticsClientContext */ null); + } + + private static RxDocumentServiceRequest stubRequest() { + RxDocumentServiceRequest request = mock(RxDocumentServiceRequest.class); + when(request.getIsNameBased()).thenReturn(true); + when(request.getResourceAddress()).thenReturn("dbs/db1/colls/coll1"); + when(request.getOperationType()).thenReturn(OperationType.ReadFeed); + when(request.getResourceType()).thenReturn(ResourceType.Document); + DocumentServiceRequestContext requestContext = new DocumentServiceRequestContext(); + try { + Field f = RxDocumentServiceRequest.class.getField("requestContext"); + f.set(request, requestContext); + } catch (NoSuchFieldException | IllegalAccessException e) { + throw new RuntimeException("Failed to inject requestContext into mock", e); + } + try { + Field f = RxDocumentServiceRequest.class.getField("faultInjectionRequestContext"); + f.set(request, new FaultInjectionRequestContext()); + } catch (NoSuchFieldException | IllegalAccessException e) { + throw new RuntimeException("Failed to inject faultInjectionRequestContext into mock", e); + } + return request; + } + + private static boolean invokeIsFullyDrained( + ChangeFeedFetcher fetcher, + FeedResponse response) throws Exception { + + Method m = Fetcher.class.getDeclaredMethod("isFullyDrained", boolean.class, FeedResponse.class); + m.setAccessible(true); + return (Boolean) m.invoke(fetcher, true, response); + } +} diff --git a/sdk/cosmos/azure-cosmos/CHANGELOG.md b/sdk/cosmos/azure-cosmos/CHANGELOG.md index 9ac8342c42b7..feaa31ee82e9 100644 --- a/sdk/cosmos/azure-cosmos/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos/CHANGELOG.md @@ -8,6 +8,7 @@ #### Breaking Changes #### Bugs Fixed +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for change feed reads on workloads that produce long runs of empty pages (e.g. cross-partition change feed iterations returning very few results across many sub-feedRanges). Empty change feed pages are now surfaced to the caller when the new internal `emptyPagesAllowed` request option is set on `CosmosChangeFeedRequestOptions`, so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains inside `ChangeFeedFetcher`. The Cosmos Spark connector opts into this behavior automatically; the default behavior of `CosmosChangeFeedRequestOptions` is unchanged. #### Other Changes * Replaced per-client `Schedulers.newSingle()` schedulers in `GlobalEndpointManager` and `GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker` with shared `BoundedElastic` schedulers in `CosmosSchedulers` to prevent thread count from scaling linearly with client/tenant count. - See [PR 49062](https://github.com/Azure/azure-sdk-for-java/pull/49062) diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java index c8bca12fcbf4..5507acf6d705 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java @@ -121,6 +121,7 @@ public Flux> executeAsync() { this.options.getMaxPrefetchPageCount(), ModelBridgeInternal.getChangeFeedIsSplitHandlingDisabled(this.options), this.options.isCompleteAfterAllCurrentChangesRetrieved(), + changeFeedOptionsAccessor().getAllowEmptyPages(this.options), changeFeedOptionsAccessor() .getEndLSN(this.options), changeFeedOptionsAccessor() diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java index 451f466a8e4b..81c4bf2b1858 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java @@ -53,6 +53,7 @@ public final class CosmosChangeFeedRequestOptionsImpl implements OverridableRequ private boolean completeAfterAllCurrentChangesRetrieved; private Long endLSN; private ReadConsistencyStrategy readConsistencyStrategy; + private boolean emptyPagesAllowed; public CosmosChangeFeedRequestOptionsImpl(CosmosChangeFeedRequestOptionsImpl toBeCloned) { if (toBeCloned.continuationState != null) { @@ -80,6 +81,7 @@ public CosmosChangeFeedRequestOptionsImpl(CosmosChangeFeedRequestOptionsImpl toB this.keywordIdentifiers = toBeCloned.keywordIdentifiers; this.completeAfterAllCurrentChangesRetrieved = toBeCloned.completeAfterAllCurrentChangesRetrieved; this.endLSN = toBeCloned.endLSN; + this.emptyPagesAllowed = toBeCloned.emptyPagesAllowed; } public CosmosChangeFeedRequestOptionsImpl( @@ -185,6 +187,15 @@ public void setQuotaInfoEnabled(boolean quotaInfoEnabled) { this.quotaInfoEnabled = quotaInfoEnabled; } + public boolean isEmptyPagesAllowed() { + return this.emptyPagesAllowed; + } + + public CosmosChangeFeedRequestOptionsImpl setEmptyPagesAllowed(boolean emptyPagesAllowed) { + this.emptyPagesAllowed = emptyPagesAllowed; + return this; + } + public void setDiagnosticsThresholds( CosmosDiagnosticsThresholds operationSpecificThresholds) { diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java index c48555496c29..46ed7e744224 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java @@ -444,6 +444,10 @@ public interface CosmosChangeFeedRequestOptionsAccessor { void setPartitionKeyDefinition(CosmosChangeFeedRequestOptions changeFeedRequestOptions, PartitionKeyDefinition partitionKeyDefinition); Map getProperties(CosmosChangeFeedRequestOptions changeFeedRequestOptions); CosmosChangeFeedRequestOptions disableSplitHandling(CosmosChangeFeedRequestOptions changeFeedRequestOptions); + + boolean getAllowEmptyPages(CosmosChangeFeedRequestOptions changeFeedRequestOptions); + + void setAllowEmptyPages(CosmosChangeFeedRequestOptions changeFeedRequestOptions, boolean emptyPagesAllowed); } } diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index d39f5dfc059c..3fe08d88de74 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -45,6 +45,7 @@ private static ImplementationBridgeHelpers.FeedResponseHelper.FeedResponseAccess private final Supplier createRequestFunc; private final Supplier feedRangeContinuationRetryPolicySupplier; private final boolean completeAfterAllCurrentChangesRetrieved; + private final boolean emptyPagesAllowed; private final Long endLSN; public ChangeFeedFetcher( @@ -57,6 +58,7 @@ public ChangeFeedFetcher( int maxItemCount, boolean isSplitHandlingDisabled, boolean completeAfterAllCurrentChangesRetrieved, + boolean emptyPagesAllowed, Long endLSN, OperationContextAndListenerTuple operationContext, GlobalEndpointManager globalEndpointManager, @@ -85,6 +87,7 @@ public ChangeFeedFetcher( diagnosticsClientContext); this.createRequestFunc = createRequestFunc; this.completeAfterAllCurrentChangesRetrieved = completeAfterAllCurrentChangesRetrieved; + this.emptyPagesAllowed = emptyPagesAllowed; this.endLSN = endLSN; } @@ -135,6 +138,14 @@ private Mono> nextPageInternal(DocumentClientRetryPolicy retryPo if (ModelBridgeInternal.noChanges(r)) { // if we have reached here, it means we have got 304 for the current feedRange, // but we need to continue drain the changes from other sub-feedRange + if (this.emptyPagesAllowed) { + // Surface the empty page to the caller instead of swallowing it via + // repeatWhenEmpty. isFullyDrained does not flip shouldFetchMore off + // for noChanges responses (it consults only continuation.isDone()), + // so the outer Paginator generate-loop will issue the next nextPage() + // call to drain the remaining sub-feedRanges. + return Mono.just(r); + } this.reEnableShouldFetchMoreForRetry(); return Mono.empty(); } @@ -146,6 +157,13 @@ private Mono> nextPageInternal(DocumentClientRetryPolicy retryPo // not all continuations have been drained yet // repeat with the next continuation + if (this.emptyPagesAllowed) { + // Surface the empty/304 page to the caller instead of swallowing it via + // repeatWhenEmpty. isFullyDrained does not flip shouldFetchMore off for + // noChanges responses, so Paginator will continue to drain remaining + // sub-feedRanges via the next nextPage() call. + return Mono.just(r); + } this.reEnableShouldFetchMoreForRetry(); return Mono.empty(); } @@ -170,10 +188,13 @@ protected String applyServerResponseContinuation( @Override protected boolean isFullyDrained(boolean isChangeFeed, FeedResponse response) { - if (ModelBridgeInternal.noChanges(response)) { - return true; - } - + // Note: we intentionally do NOT short-circuit to true on noChanges. The + // original logic returned true for noChanges and then nextPageInternal + // had to undo that via reEnableShouldFetchMoreForRetry() before either + // surfacing or swallowing the page via repeatWhenEmpty. Consulting the + // continuation directly is both simpler and required for + // emptyPagesAllowed=true (where each noChanges page is surfaced to the + // caller and the outer Paginator loop must keep calling nextPage()). FeedRangeContinuation continuation = this.changeFeedState.getContinuation(); return continuation != null && continuation.isDone(); } diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/Paginator.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/Paginator.java index cbd3ec9fba90..bf2fd3d332cd 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/Paginator.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/Paginator.java @@ -97,6 +97,7 @@ public static Flux> getChangeFeedQueryResultAsObservable( int preFetchCount, boolean isSplitHandlingDisabled, boolean completeAfterAllCurrentChangesRetrieved, + boolean emptyPagesAllowed, Long endLsn, OperationContextAndListenerTuple operationContext, DiagnosticsClientContext diagnosticsClientContext) { @@ -112,6 +113,7 @@ public static Flux> getChangeFeedQueryResultAsObservable( maxPageSize, isSplitHandlingDisabled, completeAfterAllCurrentChangesRetrieved, + emptyPagesAllowed, endLsn, operationContext, client.getGlobalEndpointManager(), diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java index bf36a0692473..2cbecc7cc0a2 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java @@ -442,11 +442,19 @@ CosmosChangeFeedRequestOptions withCosmosPagedFluxOptions( CosmosChangeFeedRequestOptions effectiveRequestOptions = this; if (pagedFluxOptions.getRequestContinuation() != null) { + // NOTE: this hand-maintained copy list is a known drift hazard. + // `createForProcessingFromContinuation` builds a fresh + // CosmosChangeFeedRequestOptionsImpl from the parsed continuation + // token, so any field NOT explicitly copied below is silently + // dropped. If you add a new field to CosmosChangeFeedRequestOptions + // and it must survive the paged-flux pull continuation path, + // propagate it here (and ideally add a test covering it). effectiveRequestOptions = CosmosChangeFeedRequestOptions.createForProcessingFromContinuation( pagedFluxOptions.getRequestContinuation()); effectiveRequestOptions.setMaxPrefetchPageCount(this.getMaxPrefetchPageCount()); effectiveRequestOptions.setThroughputControlGroupName(this.getThroughputControlGroupName()); + effectiveRequestOptions.getImpl().setEmptyPagesAllowed(this.getImpl().isEmptyPagesAllowed()); } if (pagedFluxOptions.getMaxItemCount() != null) { @@ -783,6 +791,23 @@ public Map getProperties(CosmosChangeFeedRequestOptions changeFe public CosmosChangeFeedRequestOptions disableSplitHandling(CosmosChangeFeedRequestOptions changeFeedRequestOptions) { return changeFeedRequestOptions.disableSplitHandling(); } + + @Override + public boolean getAllowEmptyPages(CosmosChangeFeedRequestOptions changeFeedRequestOptions) { + return changeFeedRequestOptions.getImpl().isEmptyPagesAllowed(); + } + + @Override + public void setAllowEmptyPages( + CosmosChangeFeedRequestOptions changeFeedRequestOptions, + boolean emptyPagesAllowed) { + + // Note: kept package-private (impl-side only). We deliberately do NOT + // expose a public setEmptyPagesAllowed on CosmosChangeFeedRequestOptions + // because the flag changes paging semantics in subtle ways the SDK + // does not want most callers to opt into. + changeFeedRequestOptions.getImpl().setEmptyPagesAllowed(emptyPagesAllowed); + } }); } From 8616057fbe5901189946f728ac715fa9440985f0 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Tue, 26 May 2026 16:25:42 -0700 Subject: [PATCH 02/15] Add PR link to CHANGELOG entries Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md | 2 +- sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md | 2 +- sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md | 2 +- sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md | 2 +- sdk/cosmos/azure-cosmos/CHANGELOG.md | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md index c92ea17be091..df2d28f68f04 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md @@ -8,7 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md index 30978ce05fa5..1000f34f0930 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md @@ -8,7 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md index 149c59057bfe..019f35a705b6 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md @@ -8,7 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md index 325de232cce7..a04aa12ba82a 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md @@ -8,7 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos/CHANGELOG.md b/sdk/cosmos/azure-cosmos/CHANGELOG.md index feaa31ee82e9..56bb798c1188 100644 --- a/sdk/cosmos/azure-cosmos/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos/CHANGELOG.md @@ -8,7 +8,7 @@ #### Breaking Changes #### Bugs Fixed -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for change feed reads on workloads that produce long runs of empty pages (e.g. cross-partition change feed iterations returning very few results across many sub-feedRanges). Empty change feed pages are now surfaced to the caller when the new internal `emptyPagesAllowed` request option is set on `CosmosChangeFeedRequestOptions`, so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains inside `ChangeFeedFetcher`. The Cosmos Spark connector opts into this behavior automatically; the default behavior of `CosmosChangeFeedRequestOptions` is unchanged. +* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for change feed reads on workloads that produce long runs of empty pages (e.g. cross-partition change feed iterations returning very few results across many sub-feedRanges). Empty change feed pages are now surfaced to the caller when the new internal `emptyPagesAllowed` request option is set on `CosmosChangeFeedRequestOptions`, so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains inside `ChangeFeedFetcher`. The Cosmos Spark connector opts into this behavior automatically; the default behavior of `CosmosChangeFeedRequestOptions` is unchanged. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes * Replaced per-client `Schedulers.newSingle()` schedulers in `GlobalEndpointManager` and `GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker` with shared `BoundedElastic` schedulers in `CosmosSchedulers` to prevent thread count from scaling linearly with client/tenant count. - See [PR 49062](https://github.com/Azure/azure-sdk-for-java/pull/49062) From 08f690b097e4652165cf4fbc017f4989f9e14955 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Tue, 26 May 2026 16:46:18 -0700 Subject: [PATCH 03/15] Restore noChanges short-circuit in isFullyDrained when emptyPagesAllowed=false The previous cleanup of ChangeFeedFetcher.isFullyDrained removed the noChanges short-circuit unconditionally, on the rationale that consulting continuation.isDone() was simpler. That regressed every non-Spark caller of the change-feed API: FeedRangeCompositeContinuationImpl.isDone() returns compositeContinuationTokens.size() == 0, but moveToNextToken() rotates the deque via poll() + add() and never shrinks it. So isDone() is permanently false for normal incremental change-feed iteration. For the default emptyPagesAllowed=false path: 1. The 304 arrives. 2. updateState calls isFullyDrained -> false (because isDone() is false). 3. nextPageInternal's else-branch sees handleChangeFeedNotModified return NO_RETRY (single-partition case, multi-partition cycle-complete, or the >4*(size+1) consecutive-304 defense) and falls through to Mono.just(r). 4. Paginator's generate-loop checks shouldFetchMore() -> true and calls nextPage() again -> infinite poll loop. Customer-visible impact would be: any consumer that drains queryChangeFeed(...).byPage() to completion (e.g. .toIterable().iterator(), .collectList(), .blockLast()) hangs forever once the change feed catches up. flag is true (Spark path), surface every noChanges to the caller and let the consumer decide when to stop iterating. When the flag is false (every other caller, including the SDK's public queryChangeFeed API), preserve the original termination signal. Also addressed reviewer feedback: * Drop unused org.mockito.Mockito import (would break checkstyle's UnusedImports rule). * Replace reflective field assignment in stubRequest() with direct field writes on the public fields RxDocumentServiceRequest.requestContext and .faultInjectionRequestContext. Mockito only intercepts method calls; field writes on a mock work directly. * Add a regression test isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTrue that locks in the termination signal for the default path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../ChangeFeedFetcherEmptyPagesTest.java | 41 +++++++++------ .../query/ChangeFeedFetcher.java | 50 +++++++++++++------ 2 files changed, 60 insertions(+), 31 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java index d8db52fa5d54..9165b5cd648e 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java @@ -21,12 +21,10 @@ import com.azure.cosmos.implementation.perPartitionCircuitBreaker.GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker; import com.azure.cosmos.models.FeedResponse; import com.azure.cosmos.models.ModelBridgeInternal; -import org.mockito.Mockito; import org.testng.annotations.Test; import reactor.core.publisher.Mono; import reactor.test.StepVerifier; -import java.lang.reflect.Field; import java.lang.reflect.Method; import java.util.HashMap; import java.util.Map; @@ -98,6 +96,28 @@ public void isFullyDrained_realResponseWithFinishedContinuation_returnsTrue() th assertThat(invokeIsFullyDrained(fetcher, dataResponse)).isTrue(); } + @Test(groups = { "unit" }) + public void isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTrue() throws Exception { + // Regression test: with the default emptyPagesAllowed=false the noChanges short-circuit + // MUST stay in place. Otherwise any non-Spark caller that drains the change feed flux + // (e.g. queryChangeFeed(...).byPage().toIterable().iterator()) would loop forever, + // because FeedRangeCompositeContinuationImpl.isDone() returns + // compositeContinuationTokens.size()==0, which is permanently false for incremental + // change feed (moveToNextToken rotates the deque, never shrinks it). The + // NO_RETRY result from handleChangeFeedNotModified is the only termination signal, + // and Paginator only honors it via Fetcher.shouldFetchMore() being flipped by + // isFullyDrained=true on this noChanges page. + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(false); + ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ false); + + FeedResponse noChangesResponse = changeFeedNoChanges("token-D"); + + assertThat(invokeIsFullyDrained(fetcher, noChangesResponse)) + .describedAs("With emptyPagesAllowed=false the noChanges short-circuit must remain to terminate iteration") + .isTrue(); + } + @Test(groups = { "unit" }) public void nextPage_emptyPagesAllowedTrue_surfacesNoChangesPagesIndividually() { // Scenario: 3 consecutive noChanges (304) pages from the same sub-feedRange, @@ -236,19 +256,10 @@ private static RxDocumentServiceRequest stubRequest() { when(request.getResourceAddress()).thenReturn("dbs/db1/colls/coll1"); when(request.getOperationType()).thenReturn(OperationType.ReadFeed); when(request.getResourceType()).thenReturn(ResourceType.Document); - DocumentServiceRequestContext requestContext = new DocumentServiceRequestContext(); - try { - Field f = RxDocumentServiceRequest.class.getField("requestContext"); - f.set(request, requestContext); - } catch (NoSuchFieldException | IllegalAccessException e) { - throw new RuntimeException("Failed to inject requestContext into mock", e); - } - try { - Field f = RxDocumentServiceRequest.class.getField("faultInjectionRequestContext"); - f.set(request, new FaultInjectionRequestContext()); - } catch (NoSuchFieldException | IllegalAccessException e) { - throw new RuntimeException("Failed to inject faultInjectionRequestContext into mock", e); - } + // requestContext and faultInjectionRequestContext are public fields on the real class; + // direct assignment on the mock works (Mockito only intercepts method calls). + request.requestContext = new DocumentServiceRequestContext(); + request.faultInjectionRequestContext = new FaultInjectionRequestContext(); return request; } diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index 3fe08d88de74..38c81b7ee00f 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -139,13 +139,16 @@ private Mono> nextPageInternal(DocumentClientRetryPolicy retryPo // if we have reached here, it means we have got 304 for the current feedRange, // but we need to continue drain the changes from other sub-feedRange if (this.emptyPagesAllowed) { - // Surface the empty page to the caller instead of swallowing it via - // repeatWhenEmpty. isFullyDrained does not flip shouldFetchMore off - // for noChanges responses (it consults only continuation.isDone()), - // so the outer Paginator generate-loop will issue the next nextPage() - // call to drain the remaining sub-feedRanges. + // Surface the empty page to the caller. When emptyPagesAllowed=true, + // isFullyDrained() does NOT flip shouldFetchMore off on noChanges + // (it only consults continuation.isDone()), so the outer Paginator + // generate-loop will issue the next nextPage() call to drain the + // remaining sub-feedRanges. return Mono.just(r); } + // Default path: isFullyDrained() returned true for this noChanges, so + // shouldFetchMore is currently false. Re-enable it before going through + // repeatWhenEmpty so Paginator will call nextPage() again. this.reEnableShouldFetchMoreForRetry(); return Mono.empty(); } @@ -158,12 +161,14 @@ private Mono> nextPageInternal(DocumentClientRetryPolicy retryPo // not all continuations have been drained yet // repeat with the next continuation if (this.emptyPagesAllowed) { - // Surface the empty/304 page to the caller instead of swallowing it via - // repeatWhenEmpty. isFullyDrained does not flip shouldFetchMore off for - // noChanges responses, so Paginator will continue to drain remaining - // sub-feedRanges via the next nextPage() call. + // Surface the empty/304 page to the caller. When emptyPagesAllowed=true, + // isFullyDrained() does NOT flip shouldFetchMore off on noChanges, so + // Paginator will keep calling nextPage() to drain remaining sub-feedRanges. + // Caller-side iteration terminates via the consumer (e.g., Spark task). return Mono.just(r); } + // Default path: isFullyDrained() returned true for this noChanges, so + // re-enable shouldFetchMore before going through repeatWhenEmpty. this.reEnableShouldFetchMoreForRetry(); return Mono.empty(); } @@ -188,13 +193,26 @@ protected String applyServerResponseContinuation( @Override protected boolean isFullyDrained(boolean isChangeFeed, FeedResponse response) { - // Note: we intentionally do NOT short-circuit to true on noChanges. The - // original logic returned true for noChanges and then nextPageInternal - // had to undo that via reEnableShouldFetchMoreForRetry() before either - // surfacing or swallowing the page via repeatWhenEmpty. Consulting the - // continuation directly is both simpler and required for - // emptyPagesAllowed=true (where each noChanges page is surfaced to the - // caller and the outer Paginator loop must keep calling nextPage()). + // When emptyPagesAllowed=true we explicitly want to surface every noChanges + // page to the caller and let the outer Paginator generate-loop keep calling + // nextPage() until the caller stops consuming (e.g., until Spark's Iterator + // is exhausted). isDone() on FeedRangeCompositeContinuationImpl checks + // compositeContinuationTokens.size()==0, which never holds for normal + // incremental change-feed iteration (moveToNextToken rotates the deque, + // never shrinks it). So in that mode, "is the iteration over?" is decided + // by the consumer, not by us here. + // + // When emptyPagesAllowed=false (the SDK default for all non-Spark callers) + // we MUST short-circuit on noChanges. Otherwise, after + // FeedRangeCompositeContinuationImpl.handleChangeFeedNotModified returns + // NO_RETRY (single-partition case, multi-partition cycle-complete, or the + // >4*(size+1) consecutive-304 defense), nextPageInternal would fall through + // to Mono.just(r) and Paginator would call nextPage() again forever + // because isDone() never flips true for incremental change feed. + if (!this.emptyPagesAllowed && ModelBridgeInternal.noChanges(response)) { + return true; + } + FeedRangeContinuation continuation = this.changeFeedState.getContinuation(); return continuation != null && continuation.isDone(); } From 48a9e990149b77e574ad4dcdaadc1d8e206dfb9e Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Tue, 26 May 2026 17:04:27 -0700 Subject: [PATCH 04/15] Address pass-3 review: defense-in-depth on NO_RETRY + DRY cleanup * Defense-in-depth (F1): when emptyPagesAllowed=true the streaming change- feed path takes branch 2 of nextPageInternal. If handleChangeFeedNotModified returns NO_RETRY on a noChanges page (single- partition case, multi-partition full-cycle complete, or the >4*(size+1) consecutive-304 defense in FeedRangeCompositeContinuationImpl), the SDK's built-in termination signal was being silently dropped because isFullyDrained() consults only continuation.isDone() in that mode (which is permanently false for incremental change feed). Now we explicitly call disableShouldFetchMore() to preserve the defense-in-depth termination guarantee even for emptyPagesAllowed=true callers. * DRY (F4): extracted the two near-identical 'surface or swallow via repeatWhenEmpty' blocks in nextPageInternal into a private surfaceOrSwallowNoChangesPage(r) helper. The branches now read as a one-line intent instead of seven near-identical lines. * Comment density (F9): tightened the long isFullyDrained comment to a 2-line tl;dr followed by the detailed rationale. * Test (F2): new nextPage_emptyPagesAllowedTrueWithNoRetryOnNoChanges_ terminatesIteration locks in the defense-in-depth fix - asserts that after a terminal NO_RETRY noChanges page, shouldFetchMore() flips to false so Paginator stops calling nextPage(). * Test (F3): added Mockito.verify(continuation, never()).isDone() to the pass-2 regression test so a future refactor that accidentally drops the noChanges short-circuit and falls through to the (permanently-false) continuation.isDone() check fails loudly instead of silently hanging. Test results: ChangeFeedFetcherEmptyPagesTest 7/7, FetcherTest 5/5, CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest 3/3, TransientIOErrorsRetryingIteratorSpec 7/7 - all green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../ChangeFeedFetcherEmptyPagesTest.java | 52 +++++++++++ .../query/ChangeFeedFetcher.java | 87 +++++++++---------- 2 files changed, 94 insertions(+), 45 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java index 9165b5cd648e..d1bafd877b38 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java @@ -21,6 +21,7 @@ import com.azure.cosmos.implementation.perPartitionCircuitBreaker.GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker; import com.azure.cosmos.models.FeedResponse; import com.azure.cosmos.models.ModelBridgeInternal; +import org.mockito.Mockito; import org.testng.annotations.Test; import reactor.core.publisher.Mono; import reactor.test.StepVerifier; @@ -116,6 +117,11 @@ public void isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTr assertThat(invokeIsFullyDrained(fetcher, noChangesResponse)) .describedAs("With emptyPagesAllowed=false the noChanges short-circuit must remain to terminate iteration") .isTrue(); + // The short-circuit must fire WITHOUT consulting the continuation; if a future + // refactor accidentally drops the noChanges check and falls through to + // continuation.isDone() (which is permanently false in incremental change feed), + // this verify catches it as a loud failure rather than a hard-to-diagnose hang. + Mockito.verify(continuation, Mockito.never()).isDone(); } @Test(groups = { "unit" }) @@ -155,6 +161,52 @@ public void nextPage_emptyPagesAllowedTrue_surfacesNoChangesPagesIndividually() assertThat(callIndex.get()).describedAs("executeFunc should have been called once per surfaced page").isEqualTo(4); } + @Test(groups = { "unit" }) + public void nextPage_emptyPagesAllowedTrueWithNoRetryOnNoChanges_terminatesIteration() { + // Defense-in-depth: with emptyPagesAllowed=true, isFullyDrained() consults only + // continuation.isDone() (permanently false in incremental change feed), so the + // SDK's own termination signal would otherwise be lost. nextPageInternal must + // explicitly disableShouldFetchMore() when handleChangeFeedNotModified returns + // NO_RETRY on a noChanges page (single-partition case, multi-partition full + // cycle complete, or the >4*(size+1) consecutive-304 defense). + // + // This test scripts: 3 noChanges with RETRY_NOW (mid-cycle), followed by a 4th + // noChanges with NO_RETRY (terminal). All 4 must surface, and shouldFetchMore() + // must be false after the terminal page so Paginator's outer loop stops. + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(false); + + FeedResponse mid1 = changeFeedNoChanges("t1"); + FeedResponse mid2 = changeFeedNoChanges("t2"); + FeedResponse mid3 = changeFeedNoChanges("t3"); + FeedResponse terminal = changeFeedNoChanges("t4"); + + // The continuation distinguishes terminal from mid-cycle by reference identity. + when(continuation.handleChangeFeedNotModified(any())).thenAnswer(invocation -> { + FeedResponse rsp = invocation.getArgument(0); + return rsp == terminal ? ShouldRetryResult.noRetry() : ShouldRetryResult.RETRY_NOW; + }); + + FeedResponse[] script = new FeedResponse[] { mid1, mid2, mid3, terminal }; + AtomicInteger callIndex = new AtomicInteger(); + Function>> executeFunc = + req -> Mono.just(script[callIndex.getAndIncrement()]); + + ChangeFeedFetcher fetcher = + newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ true, executeFunc); + + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == mid1).verifyComplete(); + assertThat(fetcher.shouldFetchMore()).describedAs("after mid1").isTrue(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == mid2).verifyComplete(); + assertThat(fetcher.shouldFetchMore()).describedAs("after mid2").isTrue(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == mid3).verifyComplete(); + assertThat(fetcher.shouldFetchMore()).describedAs("after mid3").isTrue(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == terminal).verifyComplete(); + assertThat(fetcher.shouldFetchMore()) + .describedAs("NO_RETRY on terminal noChanges page MUST stop Paginator from polling again") + .isFalse(); + } + @Test(groups = { "unit" }) public void nextPage_emptyPagesAllowedFalse_swallowsNoChangesPagesUntilData() { // Same scenario, but with the default emptyPagesAllowed=false. The diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index 38c81b7ee00f..4246cdda796e 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -136,41 +136,31 @@ private Mono> nextPageInternal(DocumentClientRetryPolicy retryPo } if (ModelBridgeInternal.noChanges(r)) { - // if we have reached here, it means we have got 304 for the current feedRange, - // but we need to continue drain the changes from other sub-feedRange - if (this.emptyPagesAllowed) { - // Surface the empty page to the caller. When emptyPagesAllowed=true, - // isFullyDrained() does NOT flip shouldFetchMore off on noChanges - // (it only consults continuation.isDone()), so the outer Paginator - // generate-loop will issue the next nextPage() call to drain the - // remaining sub-feedRanges. - return Mono.just(r); - } - // Default path: isFullyDrained() returned true for this noChanges, so - // shouldFetchMore is currently false. Re-enable it before going through - // repeatWhenEmpty so Paginator will call nextPage() again. - this.reEnableShouldFetchMoreForRetry(); - return Mono.empty(); + // 304 for the current sub-feedRange; need to drain the next one. + return surfaceOrSwallowNoChangesPage(r); } } } else { - // complete query based on 304s - if (continuationSnapshot != null && - continuationSnapshot.handleChangeFeedNotModified(r) == ShouldRetryResult.RETRY_NOW) { - - // not all continuations have been drained yet - // repeat with the next continuation - if (this.emptyPagesAllowed) { - // Surface the empty/304 page to the caller. When emptyPagesAllowed=true, - // isFullyDrained() does NOT flip shouldFetchMore off on noChanges, so - // Paginator will keep calling nextPage() to drain remaining sub-feedRanges. - // Caller-side iteration terminates via the consumer (e.g., Spark task). + // Streaming change feed (no endLSN). Terminate either when no continuation + // exists or when handleChangeFeedNotModified signals NO_RETRY (single-partition + // case, multi-partition full-cycle complete, or the >4*(size+1) consecutive-304 + // defense in FeedRangeCompositeContinuationImpl). + if (continuationSnapshot != null) { + ShouldRetryResult retryResult = continuationSnapshot.handleChangeFeedNotModified(r); + if (retryResult == ShouldRetryResult.RETRY_NOW) { + // not all continuations have been drained yet; repeat with the next continuation + return surfaceOrSwallowNoChangesPage(r); + } + if (ModelBridgeInternal.noChanges(r) && this.emptyPagesAllowed) { + // NO_RETRY on a noChanges page: the SDK's termination signal. Without + // emptyPagesAllowed=true, isFullyDrained() already flipped shouldFetchMore + // off. With emptyPagesAllowed=true, isFullyDrained() consults only + // continuation.isDone() (which is permanently false for incremental change + // feed), so we must explicitly disable further fetches here to preserve + // the defense-in-depth termination guarantee. + this.disableShouldFetchMore(); return Mono.just(r); } - // Default path: isFullyDrained() returned true for this noChanges, so - // re-enable shouldFetchMore before going through repeatWhenEmpty. - this.reEnableShouldFetchMoreForRetry(); - return Mono.empty(); } } @@ -179,6 +169,19 @@ private Mono> nextPageInternal(DocumentClientRetryPolicy retryPo .repeatWhenEmpty(o -> o); } + /** + * Either surface a noChanges page to the caller (when emptyPagesAllowed=true) or swallow it via + * Reactor's repeatWhenEmpty (the legacy behavior). When swallowing, shouldFetchMore must be + * re-enabled first because isFullyDrained() already flipped it off for the noChanges page. + */ + private Mono> surfaceOrSwallowNoChangesPage(FeedResponse r) { + if (this.emptyPagesAllowed) { + return Mono.just(r); + } + this.reEnableShouldFetchMoreForRetry(); + return Mono.empty(); + } + @Override protected String applyServerResponseContinuation( String serverContinuationToken, @@ -193,22 +196,16 @@ protected String applyServerResponseContinuation( @Override protected boolean isFullyDrained(boolean isChangeFeed, FeedResponse response) { - // When emptyPagesAllowed=true we explicitly want to surface every noChanges - // page to the caller and let the outer Paginator generate-loop keep calling - // nextPage() until the caller stops consuming (e.g., until Spark's Iterator - // is exhausted). isDone() on FeedRangeCompositeContinuationImpl checks - // compositeContinuationTokens.size()==0, which never holds for normal - // incremental change-feed iteration (moveToNextToken rotates the deque, - // never shrinks it). So in that mode, "is the iteration over?" is decided - // by the consumer, not by us here. + // Short-circuit when emptyPagesAllowed=false: noChanges -> fully drained. + // Required because FeedRangeCompositeContinuationImpl.isDone() never flips + // true for incremental change feed (the deque is rotated, never shrunk), + // so handleChangeFeedNotModified()=NO_RETRY would otherwise leave + // shouldFetchMore=true and Paginator would poll indefinitely. // - // When emptyPagesAllowed=false (the SDK default for all non-Spark callers) - // we MUST short-circuit on noChanges. Otherwise, after - // FeedRangeCompositeContinuationImpl.handleChangeFeedNotModified returns - // NO_RETRY (single-partition case, multi-partition cycle-complete, or the - // >4*(size+1) consecutive-304 defense), nextPageInternal would fall through - // to Mono.just(r) and Paginator would call nextPage() again forever - // because isDone() never flips true for incremental change feed. + // When emptyPagesAllowed=true we deliberately skip the short-circuit so + // each noChanges page surfaces to the caller (e.g., the Spark connector + // iterator); termination is handled either by the consumer or, on + // NO_RETRY, by an explicit disableShouldFetchMore() in nextPageInternal. if (!this.emptyPagesAllowed && ModelBridgeInternal.noChanges(response)) { return true; } From 0304dfe9e29a8056f0e32c2c7db16cceaa4ce355 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Tue, 26 May 2026 17:13:34 -0700 Subject: [PATCH 05/15] Address pass-4 review: lock contract on data-page non-termination * Test (F1): assert callIndex==4 after NO_RETRY termination in the pass-3 defense-in-depth test so a future regression that terminates iteration but still over-fetches is caught. * Test (F2): new nextPage_emptyPagesAllowedTrueWithDataPages_doesNotTerminate pins the production contract that the noChanges(r) guard on the disableShouldFetchMore() arm is load-bearing. In production, FeedRangeCompositeContinuationImpl.handleChangeFeedNotModified returns NO_RETRY for EVERY non-noChanges response (the early branch resets state and falls through). Without the noChanges(r) guard, every data page would silently truncate iteration after the first emission. * Comment (F5): added an inline rationale next to the noChanges(r) guard explaining why it must remain - prevents a future engineer from 'simplifying' away the guard without realizing the production truncation hazard. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../ChangeFeedFetcherEmptyPagesTest.java | 43 +++++++++++++++++++ .../query/ChangeFeedFetcher.java | 6 +++ 2 files changed, 49 insertions(+) diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java index d1bafd877b38..4b530ba36f8e 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java @@ -205,6 +205,49 @@ public void nextPage_emptyPagesAllowedTrueWithNoRetryOnNoChanges_terminatesItera assertThat(fetcher.shouldFetchMore()) .describedAs("NO_RETRY on terminal noChanges page MUST stop Paginator from polling again") .isFalse(); + assertThat(callIndex.get()) + .describedAs("after NO_RETRY termination, executeFunc must NOT be called again") + .isEqualTo(4); + } + + @Test(groups = { "unit" }) + public void nextPage_emptyPagesAllowedTrueWithDataPages_doesNotTerminate() { + // Defense-in-depth contract guard: handleChangeFeedNotModified in the real + // FeedRangeCompositeContinuationImpl returns NO_RETRY for EVERY non-noChanges + // response (the early `if (!noChanges(r))` clause resets state and then falls + // through to the final `return NO_RETRY`). The branch-2 termination logic in + // ChangeFeedFetcher.nextPageInternal therefore MUST gate the + // disableShouldFetchMore() call on `noChanges(r) && emptyPagesAllowed`; if a + // future refactor drops the noChanges(r) guard, every data page would silently + // truncate iteration after the first emission. This test pins that contract. + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(false); + // Match real production behavior: NO_RETRY on data pages, RETRY_NOW on noChanges. + when(continuation.handleChangeFeedNotModified(any())).thenAnswer(invocation -> { + FeedResponse rsp = invocation.getArgument(0); + return ModelBridgeInternal.noChanges(rsp) ? ShouldRetryResult.RETRY_NOW : ShouldRetryResult.noRetry(); + }); + + FeedResponse data1 = changeFeedDataPage("d1", new Document()); + FeedResponse data2 = changeFeedDataPage("d2", new Document()); + FeedResponse data3 = changeFeedDataPage("d3", new Document()); + + FeedResponse[] script = new FeedResponse[] { data1, data2, data3 }; + AtomicInteger callIndex = new AtomicInteger(); + Function>> executeFunc = + req -> Mono.just(script[callIndex.getAndIncrement()]); + + ChangeFeedFetcher fetcher = + newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ true, executeFunc); + + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == data1).verifyComplete(); + assertThat(fetcher.shouldFetchMore()) + .describedAs("data pages must NOT terminate iteration even when handleChangeFeedNotModified returns NO_RETRY") + .isTrue(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == data2).verifyComplete(); + assertThat(fetcher.shouldFetchMore()).describedAs("after data2").isTrue(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == data3).verifyComplete(); + assertThat(fetcher.shouldFetchMore()).describedAs("after data3").isTrue(); } @Test(groups = { "unit" }) diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index 4246cdda796e..14bbeccd6ce8 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -151,6 +151,12 @@ private Mono> nextPageInternal(DocumentClientRetryPolicy retryPo // not all continuations have been drained yet; repeat with the next continuation return surfaceOrSwallowNoChangesPage(r); } + // The noChanges(r) guard is LOAD-BEARING: in production, + // FeedRangeCompositeContinuationImpl.handleChangeFeedNotModified + // returns NO_RETRY for EVERY non-noChanges (i.e. data) response too + // (the early `if (!noChanges(r))` clause resets state and falls + // through to the final `return NO_RETRY`). Without this guard, every + // data page would silently truncate iteration after the first emission. if (ModelBridgeInternal.noChanges(r) && this.emptyPagesAllowed) { // NO_RETRY on a noChanges page: the SDK's termination signal. Without // emptyPagesAllowed=true, isFullyDrained() already flipped shouldFetchMore From 34d3da4fa518247c027dce28b92b4c994511cc06 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Tue, 26 May 2026 22:35:10 -0700 Subject: [PATCH 06/15] Shrink bridge accessor surface; reclassify CHANGELOG entry * Drop the new CosmosChangeFeedRequestOptionsAccessor.{get,set}AllowEmptyPages wrapper methods. Callers (ChangeFeedQueryImpl, Spark ChangeFeedPartitionReader, the propagation unit test) now use the already-exposed accessor.getImpl(options).{is,set}EmptyPagesAllowed() instead, keeping the bridge accessor interface at its pre-PR shape. * Move the azure-cosmos CHANGELOG entry from 'Bugs Fixed' to 'Other Changes' and reword: this PR adds an internal-only field on CosmosChangeFeedRequestOptionsImpl that pure SDK consumers cannot reach without going through getImpl(). The customer-facing fix lives in the Spark connector CHANGELOGs (which keep their 'Bugs Fixed' entries). Test results: ChangeFeedFetcherEmptyPagesTest 8/8, FetcherTest 5/5, CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest 3/3, TransientIOErrorsRetryingIteratorSpec 7/7 - all green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../spark/ChangeFeedPartitionReader.scala | 3 ++- ...dRequestOptionsWithPagedFluxOptionsTest.java | 15 ++++++++++----- sdk/cosmos/azure-cosmos/CHANGELOG.md | 2 +- .../implementation/ChangeFeedQueryImpl.java | 2 +- .../ImplementationBridgeHelpers.java | 4 ---- .../models/CosmosChangeFeedRequestOptions.java | 17 ----------------- 6 files changed, 14 insertions(+), 29 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala index e5e0cce7fa3b..86a8fbeb1d61 100644 --- a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala +++ b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala @@ -218,7 +218,8 @@ private case class ChangeFeedPartitionReader // applies to each individual page rather than being exceeded by serial // empty-page drains inside ChangeFeedFetcher. ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper.getCosmosChangeFeedRequestOptionsAccessor - .setAllowEmptyPages(options, true) + .getImpl(options) + .setEmptyPagesAllowed(true) options.setCustomItemSerializer(itemDeserializer) } diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java index ebb23fafbac1..6c91f6694290 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java @@ -31,7 +31,8 @@ public void emptyPagesAllowed_isPropagated_whenContinuationTokenSupplied() { .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .setAllowEmptyPages(src, true); + .getImpl(src) + .setEmptyPagesAllowed(true); // AND a continuation token supplied via the paged-flux pull mechanism CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); @@ -44,7 +45,8 @@ public void emptyPagesAllowed_isPropagated_whenContinuationTokenSupplied() { // THEN emptyPagesAllowed must be preserved on the freshly-built impl assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getAllowEmptyPages(effective)) + .getImpl(effective) + .isEmptyPagesAllowed()) .describedAs("emptyPagesAllowed must survive the paged-flux pull continuation rebuild") .isTrue(); } @@ -63,7 +65,8 @@ public void emptyPagesAllowedFalse_isPropagated_whenContinuationTokenSupplied() assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getAllowEmptyPages(effective)) + .getImpl(effective) + .isEmptyPagesAllowed()) .describedAs("emptyPagesAllowed default (false) must survive the paged-flux pull continuation rebuild") .isFalse(); } @@ -75,7 +78,8 @@ public void emptyPagesAllowed_isPreserved_whenNoContinuationTokenSupplied() { .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .setAllowEmptyPages(src, true); + .getImpl(src) + .setEmptyPagesAllowed(true); CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); pagedFluxOptions.setMaxItemCount(50); @@ -85,7 +89,8 @@ public void emptyPagesAllowed_isPreserved_whenNoContinuationTokenSupplied() { assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getAllowEmptyPages(effective)) + .getImpl(effective) + .isEmptyPagesAllowed()) .isTrue(); } diff --git a/sdk/cosmos/azure-cosmos/CHANGELOG.md b/sdk/cosmos/azure-cosmos/CHANGELOG.md index 56bb798c1188..421d7f1f5a86 100644 --- a/sdk/cosmos/azure-cosmos/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos/CHANGELOG.md @@ -8,9 +8,9 @@ #### Breaking Changes #### Bugs Fixed -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for change feed reads on workloads that produce long runs of empty pages (e.g. cross-partition change feed iterations returning very few results across many sub-feedRanges). Empty change feed pages are now surfaced to the caller when the new internal `emptyPagesAllowed` request option is set on `CosmosChangeFeedRequestOptions`, so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains inside `ChangeFeedFetcher`. The Cosmos Spark connector opts into this behavior automatically; the default behavior of `CosmosChangeFeedRequestOptions` is unchanged. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes +* Added an internal `emptyPagesAllowed` field on `CosmosChangeFeedRequestOptionsImpl` (default `false`; not exposed via the public `CosmosChangeFeedRequestOptions` API). When set, `ChangeFeedFetcher` surfaces empty / 304 pages to the caller instead of swallowing them via `repeatWhenEmpty`, so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. Consumed by the Cosmos Spark connector to fix an `OperationCancelledException` on sparse cross-partition change-feed workloads. Default behavior of `CosmosChangeFeedRequestOptions` is unchanged. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) * Replaced per-client `Schedulers.newSingle()` schedulers in `GlobalEndpointManager` and `GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker` with shared `BoundedElastic` schedulers in `CosmosSchedulers` to prevent thread count from scaling linearly with client/tenant count. - See [PR 49062](https://github.com/Azure/azure-sdk-for-java/pull/49062) ### 4.80.0 (2026-05-01) diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java index 5507acf6d705..435db221a2a7 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java @@ -121,7 +121,7 @@ public Flux> executeAsync() { this.options.getMaxPrefetchPageCount(), ModelBridgeInternal.getChangeFeedIsSplitHandlingDisabled(this.options), this.options.isCompleteAfterAllCurrentChangesRetrieved(), - changeFeedOptionsAccessor().getAllowEmptyPages(this.options), + changeFeedOptionsAccessor().getImpl(this.options).isEmptyPagesAllowed(), changeFeedOptionsAccessor() .getEndLSN(this.options), changeFeedOptionsAccessor() diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java index 46ed7e744224..c48555496c29 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java @@ -444,10 +444,6 @@ public interface CosmosChangeFeedRequestOptionsAccessor { void setPartitionKeyDefinition(CosmosChangeFeedRequestOptions changeFeedRequestOptions, PartitionKeyDefinition partitionKeyDefinition); Map getProperties(CosmosChangeFeedRequestOptions changeFeedRequestOptions); CosmosChangeFeedRequestOptions disableSplitHandling(CosmosChangeFeedRequestOptions changeFeedRequestOptions); - - boolean getAllowEmptyPages(CosmosChangeFeedRequestOptions changeFeedRequestOptions); - - void setAllowEmptyPages(CosmosChangeFeedRequestOptions changeFeedRequestOptions, boolean emptyPagesAllowed); } } diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java index 2cbecc7cc0a2..fa8c59aebef3 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java @@ -791,23 +791,6 @@ public Map getProperties(CosmosChangeFeedRequestOptions changeFe public CosmosChangeFeedRequestOptions disableSplitHandling(CosmosChangeFeedRequestOptions changeFeedRequestOptions) { return changeFeedRequestOptions.disableSplitHandling(); } - - @Override - public boolean getAllowEmptyPages(CosmosChangeFeedRequestOptions changeFeedRequestOptions) { - return changeFeedRequestOptions.getImpl().isEmptyPagesAllowed(); - } - - @Override - public void setAllowEmptyPages( - CosmosChangeFeedRequestOptions changeFeedRequestOptions, - boolean emptyPagesAllowed) { - - // Note: kept package-private (impl-side only). We deliberately do NOT - // expose a public setEmptyPagesAllowed on CosmosChangeFeedRequestOptions - // because the flag changes paging semantics in subtle ways the SDK - // does not want most callers to opt into. - changeFeedRequestOptions.getImpl().setEmptyPagesAllowed(emptyPagesAllowed); - } }); } From 255bad99524d4feff225dff5ddedaffddda45668 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Thu, 28 May 2026 00:05:37 -0700 Subject: [PATCH 07/15] Address pass-5 review (Kushagra + Annie): drift hazard, test coverage, polish MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Major (Kushagra): * M1 — Update PR description to reflect the post-commit-34d3da4 accessor shape (getImpl() escape hatch was being misrepresented as a named accessor wrapper). * M2 — Invert the default in CosmosChangeFeedRequestOptions .withCosmosPagedFluxOptions: instead of building a fresh impl from the continuation token and copying back 4 fields, build from the token and inherit ALL non-token-encoded fields via a new CosmosChangeFeedRequestOptionsImpl.inheritNonContinuationFieldsFrom helper. This closes silent drift for endLSN, customSerializer, excludeRegions, readConsistencyStrategy, thresholds, customOptions, operationContextAndListenerTuple, keywordIdentifiers, completeAfterAllCurrentChangesRetrieved, quotaInfoEnabled, isSplitHandlingDisabled, partitionKeyDefinition, and collectionRid. The 4 token-encoded fields (continuationState, feedRangeInternal, mode, startFromInternal) remain authoritative from the parsed token. Single maintenance point for any future field. * M3 — Add timeOut = 10_000 to the 4 nextPage_* tests in ChangeFeedFetcherEmptyPagesTest so a regression that reintroduces unbounded repeatWhenEmpty drain fails fast instead of looking like CI flake (.NET parity). Minor (Kushagra + Annie): * m1+m3+m7 — Combined javadoc on CosmosChangeFeedRequestOptionsImpl .setEmptyPagesAllowed/isEmptyPagesAllowed: explains default, paging semantics impact, and the deliberate 'not surfaced on public API' decision. Replaces the PR-body claim with an in-code source of truth. * m2 = Annie #2 — Re-add the bridge accessor wrappers setAllowEmptyPages/getAllowEmptyPages on CosmosChangeFeedRequestOptionsAccessor mirroring the query-side pattern. Restores grep-discoverability and reduces refactor blast radius. The public CosmosChangeFeedRequestOptions API is unchanged (no public setter); friend-API surface only. * m4 = Annie #1 — Add 2 nextPage_endLsnSet_emptyPagesAllowed_* tests exercising branch 1 of nextPageInternal (the completeAfterAllCurrentChangesRetrieved || endLSN != null path), which is the production path Spark's ChangeFeedPartitionReader hits for bounded snapshot reads. * m5 — Add emulator-group end-to-end test CosmosContainerChangeFeedTest.changeFeedQuery_emptyPagesAllowed_ surfacesNoChangesPagesAndTerminates exercising the real FeedRangeCompositeContinuationImpl >4*(size+1) consecutive-304 defense path that mock-based unit tests can't reach (the impl class is package-private by design). * m6 — Shorten the verbose Spark + azure-cosmos CHANGELOG entries and append a brief 'one iterator callback per empty page' trade-off note for operator observability. * Annie #3 = M2 — broader drift hazard closed via the inheritNonContinuationFieldsFrom approach above. Nits: * n1 — Extract the nextPageInternal flatMap body into a private applyNoChangesDecision(FeedResponse) method. Reduces nesting depth from 4 to 2 and improves readability of the contract. * n3 — Replace reflective Fetcher.isFullyDrained invocation with a direct call. The test lives in com.azure.cosmos.implementation.query, same package as Fetcher, so protected access works without reflection. Tests: * ChangeFeedFetcherEmptyPagesTest: 10 tests (was 8), all green * CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest: 6 tests (was 3 — adds endLSN, customSerializer, negative pin), all green * FetcherTest: 5/5 (no regression) * TransientIOErrorsRetryingIteratorSpec: 7/7 * CosmosContainerChangeFeedTest.changeFeedQuery_emptyPagesAllowed_*: new emulator-group test (compiles; runs in CI Test Emulator lane) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../azure-cosmos-spark_3-3_2-12/CHANGELOG.md | 2 +- .../azure-cosmos-spark_3-4_2-12/CHANGELOG.md | 2 +- .../azure-cosmos-spark_3-5_2-12/CHANGELOG.md | 2 +- .../azure-cosmos-spark_3-5_2-13/CHANGELOG.md | 2 +- .../spark/ChangeFeedPartitionReader.scala | 3 +- .../cosmos/CosmosContainerChangeFeedTest.java | 72 +++++++++++ ...equestOptionsWithPagedFluxOptionsTest.java | 96 +++++++++++++-- .../ChangeFeedFetcherEmptyPagesTest.java | 115 +++++++++++++++--- sdk/cosmos/azure-cosmos/CHANGELOG.md | 2 +- .../implementation/ChangeFeedQueryImpl.java | 2 +- .../CosmosChangeFeedRequestOptionsImpl.java | 65 ++++++++++ .../ImplementationBridgeHelpers.java | 18 +++ .../query/ChangeFeedFetcher.java | 109 +++++++++-------- .../CosmosChangeFeedRequestOptions.java | 29 +++-- 14 files changed, 425 insertions(+), 94 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md index df2d28f68f04..7e2ef0f18fb4 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md @@ -8,7 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) +* Fixed `OperationCancelledException` ("End-to-end timeout hit") on sparse cross-partition queries / change feed by opting into the SDK's `emptyPagesAllowed` behavior, so the per-page timeout applies per page instead of being exceeded by serial empty-page drains. Note: this surfaces one iterator callback per empty page where previously a single callback could drain many. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md index 1000f34f0930..48c15798f4ae 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md @@ -8,7 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) +* Fixed `OperationCancelledException` ("End-to-end timeout hit") on sparse cross-partition queries / change feed by opting into the SDK's `emptyPagesAllowed` behavior, so the per-page timeout applies per page instead of being exceeded by serial empty-page drains. Note: this surfaces one iterator callback per empty page where previously a single callback could drain many. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md index 019f35a705b6..85d6f36f2b1b 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-5_2-12/CHANGELOG.md @@ -8,7 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) +* Fixed `OperationCancelledException` ("End-to-end timeout hit") on sparse cross-partition queries / change feed by opting into the SDK's `emptyPagesAllowed` behavior, so the per-page timeout applies per page instead of being exceeded by serial empty-page drains. Note: this surfaces one iterator callback per empty page where previously a single callback could drain many. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md index a04aa12ba82a..e7561d05958e 100644 --- a/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_3-5_2-13/CHANGELOG.md @@ -8,7 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) -* Fixed `OperationCancelledException` ("End-to-end timeout hit when trying to retrieve the next page") for query and change feed reads against workloads that produce long runs of empty pages (for example a cross-partition query that effectively performs a full-table scan and returns only a few documents). The connector now opts into the SDK's `emptyPagesAllowed` behavior so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) +* Fixed `OperationCancelledException` ("End-to-end timeout hit") on sparse cross-partition queries / change feed by opting into the SDK's `emptyPagesAllowed` behavior, so the per-page timeout applies per page instead of being exceeded by serial empty-page drains. Note: this surfaces one iterator callback per empty page where previously a single callback could drain many. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala index 86a8fbeb1d61..e5e0cce7fa3b 100644 --- a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala +++ b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala @@ -218,8 +218,7 @@ private case class ChangeFeedPartitionReader // applies to each individual page rather than being exceeded by serial // empty-page drains inside ChangeFeedFetcher. ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper.getCosmosChangeFeedRequestOptionsAccessor - .getImpl(options) - .setEmptyPagesAllowed(true) + .setAllowEmptyPages(options, true) options.setCustomItemSerializer(itemDeserializer) } diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java index 7c8418ed8992..6be6fb18a924 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java @@ -1097,6 +1097,78 @@ public void changeFeedQueryCompleteAfterAvailableNow( } } + @Test(groups = { "emulator" }, timeOut = TIMEOUT * 5) + public void changeFeedQuery_emptyPagesAllowed_surfacesNoChangesPagesAndTerminates() { + // End-to-end guard for IcM 51000001033272: when the SDK is opted into emptyPagesAllowed=true + // (via the friend-API bridge accessor — the same opt-in the Cosmos Spark connector uses), + // change-feed reads against a multi-partition container must: + // (a) surface 304/noChanges pages individually to the caller, AND + // (b) terminate via the FeedRangeCompositeContinuationImpl >4*(size+1) consecutive-304 + // defense rather than poll indefinitely. + // + // This is the integration-level pin for the contract that ChangeFeedFetcher.nextPageInternal + // branch 2 explicitly calls disableShouldFetchMore() on NO_RETRY noChanges. Without that + // arm, a caller that drained the flux to completion would hang. + String testContainerId = UUID.randomUUID().toString(); + try { + CosmosContainerProperties containerProperties = new CosmosContainerProperties(testContainerId, "/mypk"); + CosmosAsyncContainer testContainer = + createCollection( + this.createdAsyncDatabase, + containerProperties, + new CosmosContainerRequestOptions(), + // throughput high enough to provision multiple physical partitions + 11000); + + // Sparse workload: a few docs spread across partitions; most physical partitions + // will return 304 / noChanges on read, exercising the empty-page surfacing path. + insertDocuments(/* partitionCount */ 3, /* documentCount */ 2, testContainer); + + CosmosChangeFeedRequestOptions options = + CosmosChangeFeedRequestOptions.createForProcessingFromBeginning(FeedRange.forFullRange()); + ImplementationBridgeHelpers + .CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .setAllowEmptyPages(options, true); + + AtomicInteger totalPagesObserved = new AtomicInteger(0); + AtomicInteger totalDocsObserved = new AtomicInteger(0); + AtomicInteger emptyPagesObserved = new AtomicInteger(0); + + // Drain a bounded slice of the change feed - the iteration must terminate within + // a reasonable page count via the SDK's consecutive-304 defense. + testContainer.queryChangeFeed(options, JsonNode.class) + .byPage(1) + .take(100) + .doOnNext(response -> { + totalPagesObserved.incrementAndGet(); + int pageSize = response.getResults().size(); + totalDocsObserved.addAndGet(pageSize); + if (pageSize == 0) { + emptyPagesObserved.incrementAndGet(); + } + }) + .blockLast(); + + // (a) at least some empty pages must have surfaced - the whole point of the opt-in + assertThat(emptyPagesObserved.get()) + .describedAs("emptyPagesAllowed=true must surface 304/noChanges pages individually") + .isGreaterThan(0); + // (b) all inserted docs must be observed - empty-page surfacing must not interfere + // with data-page emission + assertThat(totalDocsObserved.get()) + .describedAs("all inserted documents must surface") + .isEqualTo(6); + // (c) iteration must have terminated (we didn't hit the take(100) cap, otherwise + // we'd be polling indefinitely - that's the regression the defense-in-depth arm prevents) + assertThat(totalPagesObserved.get()) + .describedAs("iteration must terminate via consecutive-304 defense, not hit the take(100) cap") + .isLessThan(100); + } finally { + safeDeleteCollection(this.createdAsyncDatabase.getContainer(testContainerId)); + } + } + void insertDocuments( int partitionCount, int documentCount) { diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java index 6c91f6694290..4b3e4e3e235a 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java @@ -3,6 +3,7 @@ package com.azure.cosmos.implementation; +import com.azure.cosmos.CosmosItemSerializer; import com.azure.cosmos.implementation.changefeed.common.ChangeFeedMode; import com.azure.cosmos.implementation.changefeed.common.ChangeFeedStartFromInternal; import com.azure.cosmos.implementation.changefeed.common.ChangeFeedStateV1; @@ -31,8 +32,7 @@ public void emptyPagesAllowed_isPropagated_whenContinuationTokenSupplied() { .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getImpl(src) - .setEmptyPagesAllowed(true); + .setAllowEmptyPages(src, true); // AND a continuation token supplied via the paged-flux pull mechanism CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); @@ -45,8 +45,7 @@ public void emptyPagesAllowed_isPropagated_whenContinuationTokenSupplied() { // THEN emptyPagesAllowed must be preserved on the freshly-built impl assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getImpl(effective) - .isEmptyPagesAllowed()) + .getAllowEmptyPages(effective)) .describedAs("emptyPagesAllowed must survive the paged-flux pull continuation rebuild") .isTrue(); } @@ -65,8 +64,7 @@ public void emptyPagesAllowedFalse_isPropagated_whenContinuationTokenSupplied() assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getImpl(effective) - .isEmptyPagesAllowed()) + .getAllowEmptyPages(effective)) .describedAs("emptyPagesAllowed default (false) must survive the paged-flux pull continuation rebuild") .isFalse(); } @@ -78,8 +76,7 @@ public void emptyPagesAllowed_isPreserved_whenNoContinuationTokenSupplied() { .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getImpl(src) - .setEmptyPagesAllowed(true); + .setAllowEmptyPages(src, true); CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); pagedFluxOptions.setMaxItemCount(50); @@ -89,11 +86,90 @@ public void emptyPagesAllowed_isPreserved_whenNoContinuationTokenSupplied() { assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getImpl(effective) - .isEmptyPagesAllowed()) + .getAllowEmptyPages(effective)) .isTrue(); } + @Test(groups = { "unit" }) + public void endLSN_isPropagated_whenContinuationTokenSupplied() { + // Locks in the bounded-change-feed contract across a byPage(savedContinuation) round-trip: + // a caller who set endLSN=42 must continue to see iteration bounded by LSN 42 after resume. + // Before the inheritNonContinuationFieldsFrom fix, endLSN was silently dropped on the rebuild + // path, turning a bounded change feed into an unbounded one. + CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions + .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); + ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .setEndLSN(src, 42L); + + CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); + pagedFluxOptions.setRequestContinuation(buildContinuationToken()); + + CosmosChangeFeedRequestOptions effective = ModelBridgeInternal + .getEffectiveChangeFeedRequestOptions(src, pagedFluxOptions); + + assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .getEndLSN(effective)) + .describedAs("endLSN must survive the paged-flux pull continuation rebuild") + .isEqualTo(42L); + } + + @Test(groups = { "unit" }) + public void customSerializer_isPropagated_whenContinuationTokenSupplied() { + // Locks in custom-serializer preservation across a byPage(savedContinuation) round-trip: + // a caller who registered a custom CosmosItemSerializer must continue to see items + // deserialized through that serializer after resume. Before the inheritNonContinuationFieldsFrom + // fix, the customSerializer was silently dropped on the rebuild path, falling back to the + // SDK's internal default serializer and potentially producing wrong field values. + CosmosItemSerializer sentinel = new CosmosItemSerializer() { + @Override + public java.util.Map serialize(T item) { return null; } + + @Override + public T deserialize(java.util.Map jsonNodeMap, Class classType) { return null; } + }; + CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions + .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); + src.setCustomItemSerializer(sentinel); + + CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); + pagedFluxOptions.setRequestContinuation(buildContinuationToken()); + + CosmosChangeFeedRequestOptions effective = ModelBridgeInternal + .getEffectiveChangeFeedRequestOptions(src, pagedFluxOptions); + + assertThat(effective.getCustomItemSerializer()) + .describedAs("customSerializer must survive the paged-flux pull continuation rebuild") + .isSameAs(sentinel); + } + + @Test(groups = { "unit" }) + public void tokenEncodedFields_overrideCallerSuppliedValues_whenContinuationTokenSupplied() { + // Negative pin: the four token-encoded fields (continuationState, feedRangeInternal, mode, + // startFromInternal) MUST come from the token, not from the caller's pre-resume options. + // The caller's options here have continuationState=null (createForProcessingFromBeginning), + // but the resulting effective options must have a non-null continuationState parsed from + // the supplied token. If a future refactor accidentally inherits the token-encoded fields + // from the source impl (e.g. moving them into inheritNonContinuationFieldsFrom), this test + // catches the regression because the source's continuationState would clobber the token's. + CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions + .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); + + CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); + pagedFluxOptions.setRequestContinuation(buildContinuationToken()); + + CosmosChangeFeedRequestOptions effective = ModelBridgeInternal + .getEffectiveChangeFeedRequestOptions(src, pagedFluxOptions); + + assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper + .getCosmosChangeFeedRequestOptionsAccessor() + .getImpl(effective) + .getContinuation()) + .describedAs("continuationState is encoded in the token and MUST override the caller's pre-resume value") + .isNotNull(); + } + private static String buildContinuationToken() { // Build a real ChangeFeedState so we can serialize a valid (base64-encoded) continuation token. // We use the state's own toString() which round-trips through createForProcessingFromContinuation. diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java index 4b530ba36f8e..99b961b9bcf3 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java @@ -26,7 +26,6 @@ import reactor.core.publisher.Mono; import reactor.test.StepVerifier; -import java.lang.reflect.Method; import java.util.HashMap; import java.util.Map; import java.util.concurrent.atomic.AtomicInteger; @@ -58,7 +57,7 @@ public class ChangeFeedFetcherEmptyPagesTest { @Test(groups = { "unit" }) - public void isFullyDrained_noChangesResponseWithUnfinishedContinuation_returnsFalse() throws Exception { + public void isFullyDrained_noChangesResponseWithUnfinishedContinuation_returnsFalse() { // GIVEN a ChangeFeedFetcher and a noChanges response with continuation not yet done FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); when(continuation.isDone()).thenReturn(false); @@ -76,7 +75,7 @@ public void isFullyDrained_noChangesResponseWithUnfinishedContinuation_returnsFa } @Test(groups = { "unit" }) - public void isFullyDrained_noChangesResponseWithFinishedContinuation_returnsTrue() throws Exception { + public void isFullyDrained_noChangesResponseWithFinishedContinuation_returnsTrue() { FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); when(continuation.isDone()).thenReturn(true); ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ true); @@ -87,7 +86,7 @@ public void isFullyDrained_noChangesResponseWithFinishedContinuation_returnsTrue } @Test(groups = { "unit" }) - public void isFullyDrained_realResponseWithFinishedContinuation_returnsTrue() throws Exception { + public void isFullyDrained_realResponseWithFinishedContinuation_returnsTrue() { FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); when(continuation.isDone()).thenReturn(true); ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ false); @@ -98,7 +97,7 @@ public void isFullyDrained_realResponseWithFinishedContinuation_returnsTrue() th } @Test(groups = { "unit" }) - public void isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTrue() throws Exception { + public void isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTrue() { // Regression test: with the default emptyPagesAllowed=false the noChanges short-circuit // MUST stay in place. Otherwise any non-Spark caller that drains the change feed flux // (e.g. queryChangeFeed(...).byPage().toIterable().iterator()) would loop forever, @@ -124,7 +123,7 @@ public void isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTr Mockito.verify(continuation, Mockito.never()).isDone(); } - @Test(groups = { "unit" }) + @Test(groups = { "unit" }, timeOut = 10_000) public void nextPage_emptyPagesAllowedTrue_surfacesNoChangesPagesIndividually() { // Scenario: 3 consecutive noChanges (304) pages from the same sub-feedRange, // followed by a real data page. With emptyPagesAllowed=true, each of the 4 @@ -161,7 +160,7 @@ public void nextPage_emptyPagesAllowedTrue_surfacesNoChangesPagesIndividually() assertThat(callIndex.get()).describedAs("executeFunc should have been called once per surfaced page").isEqualTo(4); } - @Test(groups = { "unit" }) + @Test(groups = { "unit" }, timeOut = 10_000) public void nextPage_emptyPagesAllowedTrueWithNoRetryOnNoChanges_terminatesIteration() { // Defense-in-depth: with emptyPagesAllowed=true, isFullyDrained() consults only // continuation.isDone() (permanently false in incremental change feed), so the @@ -210,7 +209,7 @@ public void nextPage_emptyPagesAllowedTrueWithNoRetryOnNoChanges_terminatesItera .isEqualTo(4); } - @Test(groups = { "unit" }) + @Test(groups = { "unit" }, timeOut = 10_000) public void nextPage_emptyPagesAllowedTrueWithDataPages_doesNotTerminate() { // Defense-in-depth contract guard: handleChangeFeedNotModified in the real // FeedRangeCompositeContinuationImpl returns NO_RETRY for EVERY non-noChanges @@ -250,7 +249,7 @@ public void nextPage_emptyPagesAllowedTrueWithDataPages_doesNotTerminate() { assertThat(fetcher.shouldFetchMore()).describedAs("after data3").isTrue(); } - @Test(groups = { "unit" }) + @Test(groups = { "unit" }, timeOut = 10_000) public void nextPage_emptyPagesAllowedFalse_swallowsNoChangesPagesUntilData() { // Same scenario, but with the default emptyPagesAllowed=false. The // 3 noChanges responses should be swallowed via repeatWhenEmpty and @@ -282,6 +281,83 @@ public void nextPage_emptyPagesAllowedFalse_swallowsNoChangesPagesUntilData() { assertThat(callIndex.get()).describedAs("executeFunc should be called once per physical fetch").isEqualTo(4); } + @Test(groups = { "unit" }, timeOut = 10_000) + public void nextPage_endLsnSet_emptyPagesAllowedTrue_surfacesNoChangesUntilHasFetchedAllChanges() { + // Spark batch reads set BOTH endLSN AND emptyPagesAllowed=true, routing through + // ChangeFeedFetcher.nextPageInternal branch 1 (the + // `completeAfterAllCurrentChangesRetrieved || endLSN != null` path). That branch calls + // surfaceOrSwallowNoChangesPage(r) on noChanges pages, then terminates via + // hasFetchedAllChanges -> disableShouldFetchMore(). This test pins both halves of the + // contract: empty pages surface to the caller (per-page timeout enforcement) AND + // iteration eventually terminates when hasFetchedAllChanges returns true. + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(false); + + FeedResponse noChanges1 = changeFeedNoChanges("t1"); + FeedResponse noChanges2 = changeFeedNoChanges("t2"); + FeedResponse terminal = changeFeedNoChanges("t3"); + + // hasFetchedAllChanges returns true only for the terminal page (reference identity). + when(continuation.hasFetchedAllChanges(any(), any())).thenAnswer(invocation -> { + FeedResponse rsp = invocation.getArgument(0); + return rsp == terminal; + }); + + FeedResponse[] script = new FeedResponse[] { noChanges1, noChanges2, terminal }; + AtomicInteger callIndex = new AtomicInteger(); + Function>> executeFunc = + req -> Mono.just(script[callIndex.getAndIncrement()]); + + ChangeFeedFetcher fetcher = + newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ true, /* endLSN */ 999L, executeFunc); + + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == noChanges1).verifyComplete(); + assertThat(fetcher.shouldFetchMore()).describedAs("after noChanges1, mid-cycle").isTrue(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == noChanges2).verifyComplete(); + assertThat(fetcher.shouldFetchMore()).describedAs("after noChanges2, mid-cycle").isTrue(); + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == terminal).verifyComplete(); + assertThat(fetcher.shouldFetchMore()) + .describedAs("hasFetchedAllChanges=true on terminal page MUST stop Paginator from polling again") + .isFalse(); + assertThat(callIndex.get()) + .describedAs("after termination, executeFunc must NOT be called again") + .isEqualTo(3); + } + + @Test(groups = { "unit" }, timeOut = 10_000) + public void nextPage_endLsnSet_emptyPagesAllowedFalse_swallowsNoChangesViaRepeatWhenEmpty() { + // Legacy regression guard for branch 1 with the default emptyPagesAllowed=false. + // The 2 noChanges responses should be swallowed via repeatWhenEmpty; only the data + // page should surface from a SINGLE nextPage() call (matching legacy behavior). + FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); + when(continuation.isDone()).thenReturn(false); + + FeedResponse noChanges1 = changeFeedNoChanges("t1"); + FeedResponse noChanges2 = changeFeedNoChanges("t2"); + FeedResponse data = changeFeedDataPage("t3", new Document()); + + // hasFetchedAllChanges returns false for all 3 (iteration shouldn't terminate via this signal). + when(continuation.hasFetchedAllChanges(any(), any())).thenReturn(false); + + FeedResponse[] script = new FeedResponse[] { noChanges1, noChanges2, data }; + AtomicInteger callIndex = new AtomicInteger(); + Function>> executeFunc = + req -> Mono.just(script[callIndex.getAndIncrement()]); + + ChangeFeedFetcher fetcher = + newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ false, /* endLSN */ 999L, executeFunc); + + // A single nextPage() should drain all 2 noChanges responses via repeatWhenEmpty + // and surface the data page. + StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == data).verifyComplete(); + assertThat(callIndex.get()).describedAs("executeFunc should be called once per physical fetch").isEqualTo(3); + } + + // Note: the >4*(size+1) consecutive-304 defense in FeedRangeCompositeContinuationImpl is + // pinned end-to-end by CosmosContainerChangeFeedTest.changeFeedQuery_emptyPagesAllowed_* + // (emulator group). That impl class is package-private, so a unit-level integration test + // here would require reflection — the emulator test exercises the real production path. + // ---- helpers ---- private static FeedResponse changeFeedNoChanges(String continuationToken) { @@ -299,12 +375,21 @@ private static FeedResponse changeFeedDataPage(String continuationToke } private static ChangeFeedFetcher newFetcher(FeedRangeContinuation continuation, boolean emptyPagesAllowed) { - return newFetcherWithExecuteFunc(continuation, emptyPagesAllowed, req -> Mono.empty()); + return newFetcherWithExecuteFunc(continuation, emptyPagesAllowed, /* endLSN */ null, req -> Mono.empty()); + } + + private static ChangeFeedFetcher newFetcherWithExecuteFunc( + FeedRangeContinuation continuation, + boolean emptyPagesAllowed, + Function>> executeFunc) { + + return newFetcherWithExecuteFunc(continuation, emptyPagesAllowed, /* endLSN */ null, executeFunc); } private static ChangeFeedFetcher newFetcherWithExecuteFunc( FeedRangeContinuation continuation, boolean emptyPagesAllowed, + Long endLSN, Function>> executeFunc) { RxDocumentClientImpl client = mock(RxDocumentClientImpl.class); @@ -338,7 +423,7 @@ private static ChangeFeedFetcher newFetcherWithExecuteFunc( /* isSplitHandlingDisabled */ true, /* completeAfterAllCurrentChangesRetrieved */ false, emptyPagesAllowed, - /* endLSN */ null, + endLSN, /* operationContext */ null, gem, gpe, @@ -358,12 +443,12 @@ private static RxDocumentServiceRequest stubRequest() { return request; } + // isFullyDrained is `protected` on Fetcher; this test class lives in the same package + // (com.azure.cosmos.implementation.query), so we can call it directly without reflection. private static boolean invokeIsFullyDrained( ChangeFeedFetcher fetcher, - FeedResponse response) throws Exception { + FeedResponse response) { - Method m = Fetcher.class.getDeclaredMethod("isFullyDrained", boolean.class, FeedResponse.class); - m.setAccessible(true); - return (Boolean) m.invoke(fetcher, true, response); + return fetcher.isFullyDrained(/* isChangeFeed */ true, response); } } diff --git a/sdk/cosmos/azure-cosmos/CHANGELOG.md b/sdk/cosmos/azure-cosmos/CHANGELOG.md index 421d7f1f5a86..776c08c551d0 100644 --- a/sdk/cosmos/azure-cosmos/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos/CHANGELOG.md @@ -10,7 +10,7 @@ #### Bugs Fixed #### Other Changes -* Added an internal `emptyPagesAllowed` field on `CosmosChangeFeedRequestOptionsImpl` (default `false`; not exposed via the public `CosmosChangeFeedRequestOptions` API). When set, `ChangeFeedFetcher` surfaces empty / 304 pages to the caller instead of swallowing them via `repeatWhenEmpty`, so the per-page end-to-end timeout applies to each individual page rather than being exceeded by serial empty-page drains. Consumed by the Cosmos Spark connector to fix an `OperationCancelledException` on sparse cross-partition change-feed workloads. Default behavior of `CosmosChangeFeedRequestOptions` is unchanged. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) +* Added an internal `emptyPagesAllowed` field on `CosmosChangeFeedRequestOptionsImpl` (default `false`; not exposed publicly). When set, `ChangeFeedFetcher` surfaces 304/empty pages to the caller. Consumed by the Cosmos Spark connector to fix an `OperationCancelledException` on sparse cross-partition change-feed workloads. Default behavior is unchanged. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) * Replaced per-client `Schedulers.newSingle()` schedulers in `GlobalEndpointManager` and `GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker` with shared `BoundedElastic` schedulers in `CosmosSchedulers` to prevent thread count from scaling linearly with client/tenant count. - See [PR 49062](https://github.com/Azure/azure-sdk-for-java/pull/49062) ### 4.80.0 (2026-05-01) diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java index 435db221a2a7..5507acf6d705 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java @@ -121,7 +121,7 @@ public Flux> executeAsync() { this.options.getMaxPrefetchPageCount(), ModelBridgeInternal.getChangeFeedIsSplitHandlingDisabled(this.options), this.options.isCompleteAfterAllCurrentChangesRetrieved(), - changeFeedOptionsAccessor().getImpl(this.options).isEmptyPagesAllowed(), + changeFeedOptionsAccessor().getAllowEmptyPages(this.options), changeFeedOptionsAccessor() .getEndLSN(this.options), changeFeedOptionsAccessor() diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java index 81c4bf2b1858..f5365b7fac27 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java @@ -84,6 +84,47 @@ public CosmosChangeFeedRequestOptionsImpl(CosmosChangeFeedRequestOptionsImpl toB this.emptyPagesAllowed = toBeCloned.emptyPagesAllowed; } + /** + * Inherits all non-token-encoded fields from {@code source} onto this instance, preserving the + * caller-supplied configuration when this instance was freshly built from a continuation token. + * + *

The four fields encoded in the continuation token itself ({@code continuationState}, + * {@code feedRangeInternal}, {@code mode}, {@code startFromInternal}) are intentionally NOT + * copied — they describe "where to resume from" and must come from the token, not the caller's + * pre-resume options. Every other field IS copied so the caller's configuration (endLSN, + * customSerializer, excludeRegions, read-consistency strategy, throughput-control group, + * diagnostic thresholds, etc.) survives the {@code byPage(savedContinuation)} round-trip. + * + *

Maintenance contract: when a new field is added to this class, decide whether the + * continuation token encodes it. If not (the common case for caller-supplied configuration), + * propagate it here. + */ + public void inheritNonContinuationFieldsFrom(CosmosChangeFeedRequestOptionsImpl source) { + // continuationState, feedRangeInternal, mode, startFromInternal: + // intentionally NOT copied (encoded in the continuation token itself). + // collectionRid IS preserved: it lives on the impl but is not embedded in the + // continuation token (the token's separate containerRid lives on ChangeFeedStateV1). + // The rest-of-SDK clone path (RxDocumentClientImpl.queryDocumentChangeFeedFromPagedFluxInternal + // -> accessor.clone -> copy ctor) preserves collectionRid; we match that here. + this.maxItemCount = source.maxItemCount; + this.maxPrefetchPageCount = source.maxPrefetchPageCount; + this.isSplitHandlingDisabled = source.isSplitHandlingDisabled; + this.quotaInfoEnabled = source.quotaInfoEnabled; + this.throughputControlGroupName = source.throughputControlGroupName; + this.customOptions = source.customOptions; + this.operationContextAndListenerTuple = source.operationContextAndListenerTuple; + this.thresholds = source.thresholds; + this.excludeRegions = source.excludeRegions; + this.customSerializer = source.customSerializer; + this.partitionKeyDefinition = source.partitionKeyDefinition; + this.collectionRid = source.collectionRid; + this.keywordIdentifiers = source.keywordIdentifiers; + this.completeAfterAllCurrentChangesRetrieved = source.completeAfterAllCurrentChangesRetrieved; + this.endLSN = source.endLSN; + this.readConsistencyStrategy = source.readConsistencyStrategy; + this.emptyPagesAllowed = source.emptyPagesAllowed; + } + public CosmosChangeFeedRequestOptionsImpl( FeedRangeInternal feedRange, ChangeFeedStartFromInternal startFromInternal, @@ -187,10 +228,34 @@ public void setQuotaInfoEnabled(boolean quotaInfoEnabled) { this.quotaInfoEnabled = quotaInfoEnabled; } + /** + * Returns whether the change-feed pipeline surfaces 304/noChanges (empty) pages to the caller. + * + * @return {@code true} when each 304/noChanges page is surfaced individually (default is {@code false}). + */ public boolean isEmptyPagesAllowed() { return this.emptyPagesAllowed; } + /** + * Controls whether {@code ChangeFeedFetcher} surfaces 304/noChanges pages to the caller instead + * of swallowing them via Reactor's {@code repeatWhenEmpty}. Defaults to {@code false} (legacy + * behavior: empty pages are absorbed and only the next non-empty page is emitted). + * + *

When set to {@code true}, every physical 304 response surfaces as its own + * {@code FeedResponse}, so the SDK's per-page end-to-end timeout applies to each page rather + * than being exceeded by serial empty-page drains. Caller iterators MUST handle empty + * {@code FeedResponse} pages without entering retry loops. + * + *

Intentionally not surfaced on the public {@link com.azure.cosmos.models.CosmosChangeFeedRequestOptions} + * API. The flag changes paging semantics in subtle ways the SDK does not want most callers + * to opt into; reachable only from sibling modules (e.g. the Cosmos Spark connector) via the + * {@code ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper} bridge accessor. + * + * @param emptyPagesAllowed {@code true} to surface 304/noChanges pages individually; + * {@code false} (default) to swallow them via {@code repeatWhenEmpty}. + * @return this instance for fluent chaining. + */ public CosmosChangeFeedRequestOptionsImpl setEmptyPagesAllowed(boolean emptyPagesAllowed) { this.emptyPagesAllowed = emptyPagesAllowed; return this; diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java index c48555496c29..e49ae5e2aa67 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java @@ -444,6 +444,24 @@ public interface CosmosChangeFeedRequestOptionsAccessor { void setPartitionKeyDefinition(CosmosChangeFeedRequestOptions changeFeedRequestOptions, PartitionKeyDefinition partitionKeyDefinition); Map getProperties(CosmosChangeFeedRequestOptions changeFeedRequestOptions); CosmosChangeFeedRequestOptions disableSplitHandling(CosmosChangeFeedRequestOptions changeFeedRequestOptions); + + /** + * Mirrors {@link CosmosQueryRequestOptionsAccessor#setAllowEmptyPages(CosmosQueryRequestOptions, boolean)} + * for the change-feed path. Controls whether {@code ChangeFeedFetcher} surfaces + * 304/noChanges pages to the caller instead of swallowing them via {@code repeatWhenEmpty}. + * + *

Default is {@code false} (legacy swallow behavior). When {@code true}, every physical + * 304 response surfaces as its own {@code FeedResponse}; caller iterators must handle + * empty {@code FeedResponse} pages without entering retry loops. Intentionally NOT + * exposed on the public {@code CosmosChangeFeedRequestOptions} API — friend-only. + */ + void setAllowEmptyPages(CosmosChangeFeedRequestOptions options, boolean emptyPagesAllowed); + + /** + * Returns whether 304/noChanges pages are surfaced individually to the caller. See + * {@link #setAllowEmptyPages(CosmosChangeFeedRequestOptions, boolean)}. + */ + boolean getAllowEmptyPages(CosmosChangeFeedRequestOptions options); } } diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index 14bbeccd6ce8..c506e760f183 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -121,60 +121,67 @@ public Mono> nextPage() { private Mono> nextPageInternal(DocumentClientRetryPolicy retryPolicy) { return Mono.fromSupplier(() -> nextPageCore(retryPolicy)) .flatMap(Function.identity()) - .flatMap((r) -> { - FeedRangeContinuation continuationSnapshot = - this.changeFeedState.getContinuation(); - - if (this.completeAfterAllCurrentChangesRetrieved || this.endLSN != null) { - if (continuationSnapshot != null) { - - //track the end-LSN for each sub-feedRange and then find the next sub-feedRange to fetch more changes - boolean shouldComplete = continuationSnapshot.hasFetchedAllChanges(r, endLSN); - if (shouldComplete) { - this.disableShouldFetchMore(); - return Mono.just(r); - } - - if (ModelBridgeInternal.noChanges(r)) { - // 304 for the current sub-feedRange; need to drain the next one. - return surfaceOrSwallowNoChangesPage(r); - } - } - } else { - // Streaming change feed (no endLSN). Terminate either when no continuation - // exists or when handleChangeFeedNotModified signals NO_RETRY (single-partition - // case, multi-partition full-cycle complete, or the >4*(size+1) consecutive-304 - // defense in FeedRangeCompositeContinuationImpl). - if (continuationSnapshot != null) { - ShouldRetryResult retryResult = continuationSnapshot.handleChangeFeedNotModified(r); - if (retryResult == ShouldRetryResult.RETRY_NOW) { - // not all continuations have been drained yet; repeat with the next continuation - return surfaceOrSwallowNoChangesPage(r); - } - // The noChanges(r) guard is LOAD-BEARING: in production, - // FeedRangeCompositeContinuationImpl.handleChangeFeedNotModified - // returns NO_RETRY for EVERY non-noChanges (i.e. data) response too - // (the early `if (!noChanges(r))` clause resets state and falls - // through to the final `return NO_RETRY`). Without this guard, every - // data page would silently truncate iteration after the first emission. - if (ModelBridgeInternal.noChanges(r) && this.emptyPagesAllowed) { - // NO_RETRY on a noChanges page: the SDK's termination signal. Without - // emptyPagesAllowed=true, isFullyDrained() already flipped shouldFetchMore - // off. With emptyPagesAllowed=true, isFullyDrained() consults only - // continuation.isDone() (which is permanently false for incremental change - // feed), so we must explicitly disable further fetches here to preserve - // the defense-in-depth termination guarantee. - this.disableShouldFetchMore(); - return Mono.just(r); - } - } - } - - return Mono.just(r); - }) + .flatMap(this::applyNoChangesDecision) .repeatWhenEmpty(o -> o); } + /** + * Decides what to do with a single FeedResponse before it reaches the outer Paginator loop: + * surface it, swallow it via {@code repeatWhenEmpty}, or terminate iteration entirely. The + * decision depends on the change-feed mode (bounded vs streaming), whether the response is + * a noChanges 304, the continuation's {@code handleChangeFeedNotModified} signal, and whether + * the caller opted into {@code emptyPagesAllowed=true}. + */ + private Mono> applyNoChangesDecision(FeedResponse r) { + FeedRangeContinuation continuationSnapshot = this.changeFeedState.getContinuation(); + + if (this.completeAfterAllCurrentChangesRetrieved || this.endLSN != null) { + if (continuationSnapshot != null) { + //track the end-LSN for each sub-feedRange and then find the next sub-feedRange to fetch more changes + boolean shouldComplete = continuationSnapshot.hasFetchedAllChanges(r, endLSN); + if (shouldComplete) { + this.disableShouldFetchMore(); + return Mono.just(r); + } + + if (ModelBridgeInternal.noChanges(r)) { + // 304 for the current sub-feedRange; need to drain the next one. + return surfaceOrSwallowNoChangesPage(r); + } + } + } else { + // Streaming change feed (no endLSN). Terminate either when no continuation + // exists or when handleChangeFeedNotModified signals NO_RETRY (single-partition + // case, multi-partition full-cycle complete, or the >4*(size+1) consecutive-304 + // defense in FeedRangeCompositeContinuationImpl). + if (continuationSnapshot != null) { + ShouldRetryResult retryResult = continuationSnapshot.handleChangeFeedNotModified(r); + if (retryResult == ShouldRetryResult.RETRY_NOW) { + // not all continuations have been drained yet; repeat with the next continuation + return surfaceOrSwallowNoChangesPage(r); + } + // The noChanges(r) guard is LOAD-BEARING: in production, + // FeedRangeCompositeContinuationImpl.handleChangeFeedNotModified + // returns NO_RETRY for EVERY non-noChanges (i.e. data) response too + // (the early `if (!noChanges(r))` clause resets state and falls + // through to the final `return NO_RETRY`). Without this guard, every + // data page would silently truncate iteration after the first emission. + if (ModelBridgeInternal.noChanges(r) && this.emptyPagesAllowed) { + // NO_RETRY on a noChanges page: the SDK's termination signal. Without + // emptyPagesAllowed=true, isFullyDrained() already flipped shouldFetchMore + // off. With emptyPagesAllowed=true, isFullyDrained() consults only + // continuation.isDone() (which is permanently false for incremental change + // feed), so we must explicitly disable further fetches here to preserve + // the defense-in-depth termination guarantee. + this.disableShouldFetchMore(); + return Mono.just(r); + } + } + } + + return Mono.just(r); + } + /** * Either surface a noChanges page to the caller (when emptyPagesAllowed=true) or swallow it via * Reactor's repeatWhenEmpty (the legacy behavior). When swallowing, shouldFetchMore must be diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java index fa8c59aebef3..d46713907c50 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java @@ -442,19 +442,18 @@ CosmosChangeFeedRequestOptions withCosmosPagedFluxOptions( CosmosChangeFeedRequestOptions effectiveRequestOptions = this; if (pagedFluxOptions.getRequestContinuation() != null) { - // NOTE: this hand-maintained copy list is a known drift hazard. - // `createForProcessingFromContinuation` builds a fresh - // CosmosChangeFeedRequestOptionsImpl from the parsed continuation - // token, so any field NOT explicitly copied below is silently - // dropped. If you add a new field to CosmosChangeFeedRequestOptions - // and it must survive the paged-flux pull continuation path, - // propagate it here (and ideally add a test covering it). + // Rebuild from the saved continuation token (this produces the 4 token-encoded + // fields: continuationState, feedRangeInternal, mode, startFromInternal) and then + // inherit every other field from the caller's original options. Without this + // inheritance, a byPage(savedContinuation) round-trip would silently drop + // caller-supplied configuration like endLSN, customSerializer, excludeRegions, + // readConsistencyStrategy, etc. See + // CosmosChangeFeedRequestOptionsImpl.inheritNonContinuationFieldsFrom for the + // exhaustive field-by-field rationale. effectiveRequestOptions = CosmosChangeFeedRequestOptions.createForProcessingFromContinuation( pagedFluxOptions.getRequestContinuation()); - effectiveRequestOptions.setMaxPrefetchPageCount(this.getMaxPrefetchPageCount()); - effectiveRequestOptions.setThroughputControlGroupName(this.getThroughputControlGroupName()); - effectiveRequestOptions.getImpl().setEmptyPagesAllowed(this.getImpl().isEmptyPagesAllowed()); + effectiveRequestOptions.getImpl().inheritNonContinuationFieldsFrom(this.actualRequestOptions); } if (pagedFluxOptions.getMaxItemCount() != null) { @@ -791,6 +790,16 @@ public Map getProperties(CosmosChangeFeedRequestOptions changeFe public CosmosChangeFeedRequestOptions disableSplitHandling(CosmosChangeFeedRequestOptions changeFeedRequestOptions) { return changeFeedRequestOptions.disableSplitHandling(); } + + @Override + public void setAllowEmptyPages(CosmosChangeFeedRequestOptions options, boolean emptyPagesAllowed) { + options.getImpl().setEmptyPagesAllowed(emptyPagesAllowed); + } + + @Override + public boolean getAllowEmptyPages(CosmosChangeFeedRequestOptions options) { + return options.getImpl().isEmptyPagesAllowed(); + } }); } From 5683b08295433854302fc35020987558f7cca2b3 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Thu, 28 May 2026 10:17:34 -0700 Subject: [PATCH 08/15] De-flake CosmosContainerChangeFeedTest.asyncChangeFeedPrefetching The test failed intermittently on macOS / slow CI runners with 'count > 2' failing at count == 2. Root causes: * .subscribe() is fire-and-forget; Thread.sleep(3000) was the only synchronization, hoping 3s was enough for async pages to arrive * Race between the two subscriptions in FULL_FIDELITY mode: line 367 read continuation.get() synchronously without waiting for the first subscription to populate it (could feed '' to createForProcessingFromContinuation) * Slow runners need more than 3s to receive 3 pages of 100 docs at maxItemCount=10 Fix: * Replace each .subscribe() + Thread.sleep(3000) pattern with a CountDownLatch(N) for 'N pages received' + a generous 30s timeout. Deterministic ordering, no fixed sleep. * For the bounded .take(2, true) block, switch from fire-and-forget .subscribe() to .blockLast(Duration.ofSeconds(30)) so the test waits for the pipeline to complete after exactly 2 pages. * Dispose subscriptions in finally blocks to avoid leaking pages between test iterations. Test intent preserved: count > 2 on the first/resume subscriptions, count == 2 on the bounded take(2, true) subscription. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../cosmos/CosmosContainerChangeFeedTest.java | 79 +++++++++++++++---- 1 file changed, 65 insertions(+), 14 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java index 6be6fb18a924..2bb26a1a7449 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java @@ -68,9 +68,11 @@ import java.util.Queue; import java.util.UUID; import java.util.concurrent.CompletableFuture; +import java.util.concurrent.CountDownLatch; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicInteger; import java.util.concurrent.atomic.AtomicReference; import java.util.function.Function; @@ -334,6 +336,14 @@ public void asyncChangeFeed_fromBeginning_incremental_forLogicalPartition() thro @Test(groups = { "emulator" }, dataProvider = "changeFeedQueryPrefetchingDataProvider", timeOut = TIMEOUT) public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exception { + // De-flaked: previously this test relied on `.subscribe()` + `Thread.sleep(3000)` for + // both subscriptions, which raced both with the continuation propagation between the + // two subscriptions AND with the page-arrival cadence on slower CI runners. Now the + // first subscription is awaited via a CountDownLatch on a known minimum page count + // before the second is started, and the final bounded `take(2, true)` block is awaited + // via `.blockLast(...)` rather than fire-and-forget. + final long awaitSeconds = 30L; + this.createContainer( (cp) -> { if (changeFeedMode.equals(ChangeFeedMode.INCREMENTAL)) { @@ -354,11 +364,31 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc AtomicInteger count = new AtomicInteger(0); insertDocuments(5, 20); AtomicReference continuation = new AtomicReference<>(""); - createdContainer.asyncContainer.queryChangeFeed(options, ObjectNode.class).handle((r) -> { + + // First subscription: drain at least 3 pages deterministically before proceeding. + final int firstMinPages = 3; + CountDownLatch firstLatch = new CountDownLatch(firstMinPages); + AtomicReference firstError = new AtomicReference<>(); + reactor.core.Disposable firstSub = createdContainer.asyncContainer + .queryChangeFeed(options, ObjectNode.class) + .handle((r) -> { count.incrementAndGet(); continuation.set(r.getContinuationToken()); - } - ).byPage().subscribe(); + firstLatch.countDown(); + }) + .byPage() + .subscribe(r -> { /* page consumed by handle() */ }, firstError::set); + try { + assertThat(firstLatch.await(awaitSeconds, TimeUnit.SECONDS)) + .as("first change-feed subscription should produce at least %d pages within %d seconds", + firstMinPages, awaitSeconds) + .isTrue(); + assertThat(firstError.get()).as("first subscription must not error").isNull(); + // intent of the original assertion: prefetch surfaces more than 2 pages + assertThat(count.get()).isGreaterThan(2); + } finally { + firstSub.dispose(); + } CosmosChangeFeedRequestOptions optionsFF = null; if (changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)) { @@ -367,25 +397,46 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc optionsFF = CosmosChangeFeedRequestOptions .createForProcessingFromContinuation(continuation.get()) .setMaxItemCount(10).allVersionsAndDeletes(); - createdContainer.asyncContainer.queryChangeFeed(optionsFF, ObjectNode.class).handle((r) -> { - count.incrementAndGet(); - continuation.set(r.getContinuationToken()); + + final int secondMinPages = 3; + CountDownLatch secondLatch = new CountDownLatch(secondMinPages); + AtomicReference secondError = new AtomicReference<>(); + reactor.core.Disposable secondSub = createdContainer.asyncContainer + .queryChangeFeed(optionsFF, ObjectNode.class) + .handle((r) -> { + count.incrementAndGet(); + continuation.set(r.getContinuationToken()); + secondLatch.countDown(); + }) + .byPage() + .subscribe(r -> { /* page consumed by handle() */ }, secondError::set); + try { + assertThat(secondLatch.await(awaitSeconds, TimeUnit.SECONDS)) + .as("FULL_FIDELITY resume-from-continuation subscription should produce at least %d pages within %d seconds", + secondMinPages, awaitSeconds) + .isTrue(); + assertThat(secondError.get()).as("second subscription must not error").isNull(); + assertThat(count.get()).isGreaterThan(2); + } finally { + secondSub.dispose(); } - ).byPage().subscribe(); } - Thread.sleep(3000); - assertThat(count.get()).isGreaterThan(2); if (changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)) { // full fidelity is only from now so need to insert more documents insertDocuments(5, 20); } count.set(0); - // should only get two pages - createdContainer.asyncContainer.queryChangeFeed(changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)? optionsFF - : options, ObjectNode.class).handle((r) -> count.incrementAndGet()) - .byPage().take(2, true).subscribe(); - Thread.sleep(3000); + // should only get two pages — `take(2, true)` bounds the upstream request, so the + // pipeline completes naturally after 2 pages. Use blockLast() instead of fire-and-forget + // to wait for that completion deterministically. + createdContainer.asyncContainer + .queryChangeFeed(changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY) ? optionsFF : options, + ObjectNode.class) + .handle((r) -> count.incrementAndGet()) + .byPage() + .take(2, true) + .blockLast(Duration.ofSeconds(awaitSeconds)); assertThat(count.get()).isEqualTo(2); } From 635d86ded3a50de1a34fc9c60b85f85077e468f6 Mon Sep 17 00:00:00 2001 From: Annie Liang Date: Thu, 28 May 2026 12:40:43 -0700 Subject: [PATCH 09/15] change --- .../query/ChangeFeedFetcher.java | 37 +++---------------- 1 file changed, 6 insertions(+), 31 deletions(-) diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index c506e760f183..11ace445e2d6 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -146,7 +146,7 @@ private Mono> applyNoChangesDecision(FeedResponse r) { if (ModelBridgeInternal.noChanges(r)) { // 304 for the current sub-feedRange; need to drain the next one. - return surfaceOrSwallowNoChangesPage(r); + return surfaceOrRetryNoChangesPage(r); } } } else { @@ -158,23 +158,7 @@ private Mono> applyNoChangesDecision(FeedResponse r) { ShouldRetryResult retryResult = continuationSnapshot.handleChangeFeedNotModified(r); if (retryResult == ShouldRetryResult.RETRY_NOW) { // not all continuations have been drained yet; repeat with the next continuation - return surfaceOrSwallowNoChangesPage(r); - } - // The noChanges(r) guard is LOAD-BEARING: in production, - // FeedRangeCompositeContinuationImpl.handleChangeFeedNotModified - // returns NO_RETRY for EVERY non-noChanges (i.e. data) response too - // (the early `if (!noChanges(r))` clause resets state and falls - // through to the final `return NO_RETRY`). Without this guard, every - // data page would silently truncate iteration after the first emission. - if (ModelBridgeInternal.noChanges(r) && this.emptyPagesAllowed) { - // NO_RETRY on a noChanges page: the SDK's termination signal. Without - // emptyPagesAllowed=true, isFullyDrained() already flipped shouldFetchMore - // off. With emptyPagesAllowed=true, isFullyDrained() consults only - // continuation.isDone() (which is permanently false for incremental change - // feed), so we must explicitly disable further fetches here to preserve - // the defense-in-depth termination guarantee. - this.disableShouldFetchMore(); - return Mono.just(r); + return surfaceOrRetryNoChangesPage(r); } } } @@ -187,11 +171,12 @@ private Mono> applyNoChangesDecision(FeedResponse r) { * Reactor's repeatWhenEmpty (the legacy behavior). When swallowing, shouldFetchMore must be * re-enabled first because isFullyDrained() already flipped it off for the noChanges page. */ - private Mono> surfaceOrSwallowNoChangesPage(FeedResponse r) { + private Mono> surfaceOrRetryNoChangesPage(FeedResponse r) { + this.reEnableShouldFetchMoreForRetry(); + if (this.emptyPagesAllowed) { return Mono.just(r); } - this.reEnableShouldFetchMoreForRetry(); return Mono.empty(); } @@ -209,17 +194,7 @@ protected String applyServerResponseContinuation( @Override protected boolean isFullyDrained(boolean isChangeFeed, FeedResponse response) { - // Short-circuit when emptyPagesAllowed=false: noChanges -> fully drained. - // Required because FeedRangeCompositeContinuationImpl.isDone() never flips - // true for incremental change feed (the deque is rotated, never shrunk), - // so handleChangeFeedNotModified()=NO_RETRY would otherwise leave - // shouldFetchMore=true and Paginator would poll indefinitely. - // - // When emptyPagesAllowed=true we deliberately skip the short-circuit so - // each noChanges page surfaces to the caller (e.g., the Spark connector - // iterator); termination is handled either by the consumer or, on - // NO_RETRY, by an explicit disableShouldFetchMore() in nextPageInternal. - if (!this.emptyPagesAllowed && ModelBridgeInternal.noChanges(response)) { + if (ModelBridgeInternal.noChanges(response)) { return true; } From 0fd65d37114705ac4668e47c2f0e09330ba3c35a Mon Sep 17 00:00:00 2001 From: Annie Liang Date: Thu, 28 May 2026 13:51:20 -0700 Subject: [PATCH 10/15] add one more change --- .../azure/cosmos/implementation/query/ChangeFeedFetcher.java | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index 11ace445e2d6..4ea87312725a 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -175,6 +175,10 @@ private Mono> surfaceOrRetryNoChangesPage(FeedResponse r) { this.reEnableShouldFetchMoreForRetry(); if (this.emptyPagesAllowed) { + // I think we will need to update the feedResponse here, because for empty pages we do not rotate tokens, so the worst case I can think of + // is when the feedRange spans multi-partitions, when customer using a continuationToken to resume the process, we will always only process the first child partition + // if we keeps getting 304s + ModelBridgeInternal.setFeedResponseContinuationToken(this.changeFeedState.toString(), r); return Mono.just(r); } return Mono.empty(); From 277db10844ae5098c94d1075b076d5ebd51c4fcd Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Thu, 28 May 2026 17:32:02 -0700 Subject: [PATCH 11/15] Fix asyncChangeFeedPrefetching FULL_FIDELITY: insert AFTER subscribe MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previous de-flake (commit 5683b08) reordered the FULL_FIDELITY insert and the first subscription such that the insert ran BEFORE subscribing. But FULL_FIDELITY uses createForProcessingFromNow, so docs written before the subscription opens are invisible — the first subscription saw zero pages and the new CountDownLatch(3) timed out at 30s. Fix: branch by mode. INCREMENTAL keeps the pre-subscribe insert (since createForProcessingFromBeginning sees pre-existing docs). FULL_FIDELITY inserts AFTER each subscribe (first subscription, resume-from-continuation subscription, and the bounded take(2,true) subscription) so the from-now pipeline actually has writes to consume. Caught by: CosmosContainerChangeFeedTest.asyncChangeFeedPrefetching:385 [first change-feed subscription should produce at least 3 pages within 30 seconds] on Windows TCP Java8 + Java17 emulator runs, FULL_FIDELITY parameter only; INCREMENTAL passed on all platforms. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../cosmos/CosmosContainerChangeFeedTest.java | 38 +++++++++++++------ 1 file changed, 27 insertions(+), 11 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java index 2bb26a1a7449..55cecb012f93 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java @@ -342,7 +342,12 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc // first subscription is awaited via a CountDownLatch on a known minimum page count // before the second is started, and the final bounded `take(2, true)` block is awaited // via `.blockLast(...)` rather than fire-and-forget. + // + // Mode interaction: FULL_FIDELITY uses createForProcessingFromNow, so docs inserted + // BEFORE the subscription are invisible — for that mode we insert after subscribing. + // INCREMENTAL uses createForProcessingFromBeginning and sees pre-existing docs. final long awaitSeconds = 30L; + final boolean isFullFidelity = changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY); this.createContainer( (cp) -> { @@ -353,7 +358,7 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc } ); CosmosChangeFeedRequestOptions options; - if (changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)) { + if (isFullFidelity) { options = CosmosChangeFeedRequestOptions .createForProcessingFromNow(FeedRange.forFullRange()) .setMaxItemCount(10).allVersionsAndDeletes(); @@ -362,7 +367,10 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc .createForProcessingFromBeginning(FeedRange.forFullRange()).setMaxItemCount(10); } AtomicInteger count = new AtomicInteger(0); - insertDocuments(5, 20); + if (!isFullFidelity) { + // INCREMENTAL: insert first so createForProcessingFromBeginning sees the docs. + insertDocuments(5, 20); + } AtomicReference continuation = new AtomicReference<>(""); // First subscription: drain at least 3 pages deterministically before proceeding. @@ -379,6 +387,11 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc .byPage() .subscribe(r -> { /* page consumed by handle() */ }, firstError::set); try { + if (isFullFidelity) { + // FULL_FIDELITY: subscription is from-now; insert AFTER subscribing so the + // change-feed pipeline sees the writes. + insertDocuments(5, 20); + } assertThat(firstLatch.await(awaitSeconds, TimeUnit.SECONDS)) .as("first change-feed subscription should produce at least %d pages within %d seconds", firstMinPages, awaitSeconds) @@ -391,8 +404,7 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc } CosmosChangeFeedRequestOptions optionsFF = null; - if (changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)) { - insertDocuments(5, 20); + if (isFullFidelity) { count.set(0); optionsFF = CosmosChangeFeedRequestOptions .createForProcessingFromContinuation(continuation.get()) @@ -411,6 +423,8 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc .byPage() .subscribe(r -> { /* page consumed by handle() */ }, secondError::set); try { + // Insert AFTER subscribing so the resume-from-continuation pipeline sees the writes. + insertDocuments(5, 20); assertThat(secondLatch.await(awaitSeconds, TimeUnit.SECONDS)) .as("FULL_FIDELITY resume-from-continuation subscription should produce at least %d pages within %d seconds", secondMinPages, awaitSeconds) @@ -422,17 +436,19 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc } } - if (changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)) { - // full fidelity is only from now so need to insert more documents - insertDocuments(5, 20); - } count.set(0); // should only get two pages — `take(2, true)` bounds the upstream request, so the // pipeline completes naturally after 2 pages. Use blockLast() instead of fire-and-forget - // to wait for that completion deterministically. + // to wait for that completion deterministically. For FULL_FIDELITY we need writes after + // subscribing because the continuation reflects the previous read position. + if (isFullFidelity) { + // Insert documents BEFORE the bounded take so they're visible when the + // subscription opens. Continuation tokens persist read position; the new writes + // will surface on the resumed-from-token subscription. + insertDocuments(5, 20); + } createdContainer.asyncContainer - .queryChangeFeed(changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY) ? optionsFF : options, - ObjectNode.class) + .queryChangeFeed(isFullFidelity ? optionsFF : options, ObjectNode.class) .handle((r) -> count.incrementAndGet()) .byPage() .take(2, true) From 7fa6a1afa5acc4a2f994f565c95edc7b9d4aec84 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Thu, 28 May 2026 19:04:45 -0700 Subject: [PATCH 12/15] Address @xinlian12 review: rename changefeed flag, fix broken test, add spark 4.x changelogs - Renamed change-feed-side flag emptyPagesAllowed -> notModifiedPagesAllowed (and bridge accessor methods setAllowEmptyPages/getAllowEmptyPages -> setAllowNotModifiedPages/getAllowNotModifiedPages) so the name reflects what it actually controls: 304/NotModified pages from sub-partitions. Query-side CosmosQueryRequestOptions.setAllowEmptyPages stays unchanged. Test file renamed via git mv (history preserved). Closes Annie's naming feedback. - Updated ChangeFeedFetcherNotModifiedPagesTest.isFullyDrained_...returnsTrue assertion to match xinlian12's merged simpler isFullyDrained (unconditional noChanges -> true). Was failing CI build 6368298 with 'Expecting value to be false but was true' across all NotFromSource_TestsOnly + EmulatorTCP jobs. - Added missing PR 49276 bugfix entries to azure-cosmos-spark_4-0_2-13 and azure-cosmos-spark_4-1_2-13 CHANGELOGs (both ship the shared azure-cosmos-spark_3 code so the fix lands there). - Stripped IcM 51000001033272 references from test docstrings (per Annie - PR link in CHANGELOG is sufficient internal traceability). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../spark/ChangeFeedPartitionReader.scala | 2 +- .../azure-cosmos-spark_4-0_2-13/CHANGELOG.md | 1 + .../azure-cosmos-spark_4-1_2-13/CHANGELOG.md | 1 + .../cosmos/CosmosContainerChangeFeedTest.java | 14 +-- ...equestOptionsWithPagedFluxOptionsTest.java | 24 ++--- ...hangeFeedFetcherNotModifiedPagesTest.java} | 90 ++++++++++--------- .../implementation/ChangeFeedQueryImpl.java | 2 +- .../CosmosChangeFeedRequestOptionsImpl.java | 16 ++-- .../ImplementationBridgeHelpers.java | 19 ++-- .../query/ChangeFeedFetcher.java | 12 +-- .../CosmosChangeFeedRequestOptions.java | 8 +- 11 files changed, 99 insertions(+), 90 deletions(-) rename sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/{ChangeFeedFetcherEmptyPagesTest.java => ChangeFeedFetcherNotModifiedPagesTest.java} (86%) diff --git a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala index e5e0cce7fa3b..f35b580a0fa7 100644 --- a/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala +++ b/sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedPartitionReader.scala @@ -218,7 +218,7 @@ private case class ChangeFeedPartitionReader // applies to each individual page rather than being exceeded by serial // empty-page drains inside ChangeFeedFetcher. ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper.getCosmosChangeFeedRequestOptionsAccessor - .setAllowEmptyPages(options, true) + .setAllowNotModifiedPages(options, true) options.setCustomItemSerializer(itemDeserializer) } diff --git a/sdk/cosmos/azure-cosmos-spark_4-0_2-13/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_4-0_2-13/CHANGELOG.md index 9b87a3fcf675..32117a8d7d32 100644 --- a/sdk/cosmos/azure-cosmos-spark_4-0_2-13/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_4-0_2-13/CHANGELOG.md @@ -8,6 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) +* Fixed `OperationCancelledException` ("End-to-end timeout hit") on sparse cross-partition queries / change feed by opting into the SDK's `emptyPagesAllowed` behavior, so the per-page timeout applies per page instead of being exceeded by serial empty-page drains. Note: this surfaces one iterator callback per empty page where previously a single callback could drain many. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-spark_4-1_2-13/CHANGELOG.md b/sdk/cosmos/azure-cosmos-spark_4-1_2-13/CHANGELOG.md index 570aec149b2c..d1afe6f098ff 100644 --- a/sdk/cosmos/azure-cosmos-spark_4-1_2-13/CHANGELOG.md +++ b/sdk/cosmos/azure-cosmos-spark_4-1_2-13/CHANGELOG.md @@ -8,6 +8,7 @@ #### Bugs Fixed * Improved partition planning performance for change feed with large number of feed ranges. - See [PR 49086](https://github.com/Azure/azure-sdk-for-java/pull/49086) +* Fixed `OperationCancelledException` ("End-to-end timeout hit") on sparse cross-partition queries / change feed by opting into the SDK's `emptyPagesAllowed` behavior, so the per-page timeout applies per page instead of being exceeded by serial empty-page drains. Note: this surfaces one iterator callback per empty page where previously a single callback could drain many. - See [PR 49276](https://github.com/Azure/azure-sdk-for-java/pull/49276) #### Other Changes diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java index 55cecb012f93..c85f9d64312e 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java @@ -1165,8 +1165,8 @@ public void changeFeedQueryCompleteAfterAvailableNow( } @Test(groups = { "emulator" }, timeOut = TIMEOUT * 5) - public void changeFeedQuery_emptyPagesAllowed_surfacesNoChangesPagesAndTerminates() { - // End-to-end guard for IcM 51000001033272: when the SDK is opted into emptyPagesAllowed=true + public void changeFeedQuery_notModifiedPagesAllowed_surfacesNoChangesPagesAndTerminates() { + // End-to-end guard: when the SDK is opted into notModifiedPagesAllowed=true // (via the friend-API bridge accessor — the same opt-in the Cosmos Spark connector uses), // change-feed reads against a multi-partition container must: // (a) surface 304/noChanges pages individually to the caller, AND @@ -1196,11 +1196,11 @@ public void changeFeedQuery_emptyPagesAllowed_surfacesNoChangesPagesAndTerminate ImplementationBridgeHelpers .CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .setAllowEmptyPages(options, true); + .setAllowNotModifiedPages(options, true); AtomicInteger totalPagesObserved = new AtomicInteger(0); AtomicInteger totalDocsObserved = new AtomicInteger(0); - AtomicInteger emptyPagesObserved = new AtomicInteger(0); + AtomicInteger notModifiedPagesObserved = new AtomicInteger(0); // Drain a bounded slice of the change feed - the iteration must terminate within // a reasonable page count via the SDK's consecutive-304 defense. @@ -1212,14 +1212,14 @@ public void changeFeedQuery_emptyPagesAllowed_surfacesNoChangesPagesAndTerminate int pageSize = response.getResults().size(); totalDocsObserved.addAndGet(pageSize); if (pageSize == 0) { - emptyPagesObserved.incrementAndGet(); + notModifiedPagesObserved.incrementAndGet(); } }) .blockLast(); // (a) at least some empty pages must have surfaced - the whole point of the opt-in - assertThat(emptyPagesObserved.get()) - .describedAs("emptyPagesAllowed=true must surface 304/noChanges pages individually") + assertThat(notModifiedPagesObserved.get()) + .describedAs("notModifiedPagesAllowed=true must surface 304/noChanges pages individually") .isGreaterThan(0); // (b) all inserted docs must be observed - empty-page surfacing must not interfere // with data-page emission diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java index 4b3e4e3e235a..818c16c6277b 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest.java @@ -26,13 +26,13 @@ public class CosmosChangeFeedRequestOptionsWithPagedFluxOptionsTest { @Test(groups = { "unit" }) - public void emptyPagesAllowed_isPropagated_whenContinuationTokenSupplied() { - // GIVEN a CosmosChangeFeedRequestOptions with emptyPagesAllowed=true (the value the Spark connector sets) + public void notModifiedPagesAllowed_isPropagated_whenContinuationTokenSupplied() { + // GIVEN a CosmosChangeFeedRequestOptions with notModifiedPagesAllowed=true (the value the Spark connector sets) CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .setAllowEmptyPages(src, true); + .setAllowNotModifiedPages(src, true); // AND a continuation token supplied via the paged-flux pull mechanism CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); @@ -42,16 +42,16 @@ public void emptyPagesAllowed_isPropagated_whenContinuationTokenSupplied() { CosmosChangeFeedRequestOptions effective = ModelBridgeInternal .getEffectiveChangeFeedRequestOptions(src, pagedFluxOptions); - // THEN emptyPagesAllowed must be preserved on the freshly-built impl + // THEN notModifiedPagesAllowed must be preserved on the freshly-built impl assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getAllowEmptyPages(effective)) - .describedAs("emptyPagesAllowed must survive the paged-flux pull continuation rebuild") + .getAllowNotModifiedPages(effective)) + .describedAs("notModifiedPagesAllowed must survive the paged-flux pull continuation rebuild") .isTrue(); } @Test(groups = { "unit" }) - public void emptyPagesAllowedFalse_isPropagated_whenContinuationTokenSupplied() { + public void notModifiedPagesAllowedFalse_isPropagated_whenContinuationTokenSupplied() { // The default value should also round-trip cleanly (sanity check that we're not just hard-coding true). CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); @@ -64,19 +64,19 @@ public void emptyPagesAllowedFalse_isPropagated_whenContinuationTokenSupplied() assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getAllowEmptyPages(effective)) - .describedAs("emptyPagesAllowed default (false) must survive the paged-flux pull continuation rebuild") + .getAllowNotModifiedPages(effective)) + .describedAs("notModifiedPagesAllowed default (false) must survive the paged-flux pull continuation rebuild") .isFalse(); } @Test(groups = { "unit" }) - public void emptyPagesAllowed_isPreserved_whenNoContinuationTokenSupplied() { + public void notModifiedPagesAllowed_isPreserved_whenNoContinuationTokenSupplied() { // No continuation → withCosmosPagedFluxOptions returns `this` unchanged. CosmosChangeFeedRequestOptions src = CosmosChangeFeedRequestOptions .createForProcessingFromBeginning(FeedRangeEpkImpl.forFullRange()); ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .setAllowEmptyPages(src, true); + .setAllowNotModifiedPages(src, true); CosmosPagedFluxOptions pagedFluxOptions = new CosmosPagedFluxOptions(); pagedFluxOptions.setMaxItemCount(50); @@ -86,7 +86,7 @@ public void emptyPagesAllowed_isPreserved_whenNoContinuationTokenSupplied() { assertThat(ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper .getCosmosChangeFeedRequestOptionsAccessor() - .getAllowEmptyPages(effective)) + .getAllowNotModifiedPages(effective)) .isTrue(); } diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherNotModifiedPagesTest.java similarity index 86% rename from sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java rename to sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherNotModifiedPagesTest.java index 99b961b9bcf3..b917f55864a7 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherEmptyPagesTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/query/ChangeFeedFetcherNotModifiedPagesTest.java @@ -40,45 +40,47 @@ import static org.mockito.Mockito.when; /** - * Unit tests for {@link ChangeFeedFetcher} covering the {@code emptyPagesAllowed} behavior change - * introduced for IcM 51000001033272. + * Unit tests for {@link ChangeFeedFetcher} covering the {@code notModifiedPagesAllowed} behavior change. * *

The two behaviors locked in here are: *

    *
  1. {@code isFullyDrained} consults only the continuation (no {@code noChanges} short-circuit), - * which is what allows {@code emptyPagesAllowed=true} to surface empty pages without first + * which is what allows {@code notModifiedPagesAllowed=true} to surface empty pages without first * having to call {@code reEnableShouldFetchMoreForRetry()} to undo a base-class decision.
  2. - *
  3. When {@code emptyPagesAllowed=true}, {@code nextPageInternal} returns + *
  4. When {@code notModifiedPagesAllowed=true}, {@code nextPageInternal} returns * {@code Mono.just(noChangesResponse)} instead of {@code Mono.empty()} so the empty pages * bubble up to the caller (Spark connector) where the per-page end-to-end timeout applies to * each individual page rather than being exceeded by serial empty-page drains.
  5. *
*/ -public class ChangeFeedFetcherEmptyPagesTest { +public class ChangeFeedFetcherNotModifiedPagesTest { @Test(groups = { "unit" }) - public void isFullyDrained_noChangesResponseWithUnfinishedContinuation_returnsFalse() { - // GIVEN a ChangeFeedFetcher and a noChanges response with continuation not yet done + public void isFullyDrained_noChangesResponseWithNotModifiedPagesAllowedTrue_returnsTrue() { + // GIVEN a ChangeFeedFetcher with notModifiedPagesAllowed=true and a noChanges response + // whose continuation reports !isDone() FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); when(continuation.isDone()).thenReturn(false); - ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ true); + ChangeFeedFetcher fetcher = newFetcher(continuation, /* notModifiedPagesAllowed */ true); FeedResponse noChangesResponse = changeFeedNoChanges("token-A"); // WHEN boolean drained = invokeIsFullyDrained(fetcher, noChangesResponse); - // THEN: the previous implementation returned true here (and relied on - // reEnableShouldFetchMoreForRetry() in nextPageInternal to undo it). The new - // implementation consults only the continuation. - assertThat(drained).isFalse(); + // THEN: isFullyDrained short-circuits on noChanges regardless of the + // notModifiedPagesAllowed flag. With notModifiedPagesAllowed=true the bottom of + // nextPageInternal undoes the side effect via reEnableShouldFetchMoreForRetry() + // so the iteration continues; with notModifiedPagesAllowed=false the + // shouldFetchMore=false sticks and the iteration terminates. + assertThat(drained).isTrue(); } @Test(groups = { "unit" }) public void isFullyDrained_noChangesResponseWithFinishedContinuation_returnsTrue() { FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); when(continuation.isDone()).thenReturn(true); - ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ true); + ChangeFeedFetcher fetcher = newFetcher(continuation, /* notModifiedPagesAllowed */ true); FeedResponse noChangesResponse = changeFeedNoChanges("token-B"); @@ -89,7 +91,7 @@ public void isFullyDrained_noChangesResponseWithFinishedContinuation_returnsTrue public void isFullyDrained_realResponseWithFinishedContinuation_returnsTrue() { FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); when(continuation.isDone()).thenReturn(true); - ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ false); + ChangeFeedFetcher fetcher = newFetcher(continuation, /* notModifiedPagesAllowed */ false); FeedResponse dataResponse = changeFeedDataPage("token-C", new Document()); @@ -97,8 +99,8 @@ public void isFullyDrained_realResponseWithFinishedContinuation_returnsTrue() { } @Test(groups = { "unit" }) - public void isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTrue() { - // Regression test: with the default emptyPagesAllowed=false the noChanges short-circuit + public void isFullyDrained_noChangesResponseWithNotModifiedPagesAllowedFalse_returnsTrue() { + // Regression test: with the default notModifiedPagesAllowed=false the noChanges short-circuit // MUST stay in place. Otherwise any non-Spark caller that drains the change feed flux // (e.g. queryChangeFeed(...).byPage().toIterable().iterator()) would loop forever, // because FeedRangeCompositeContinuationImpl.isDone() returns @@ -109,12 +111,12 @@ public void isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTr // isFullyDrained=true on this noChanges page. FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); when(continuation.isDone()).thenReturn(false); - ChangeFeedFetcher fetcher = newFetcher(continuation, /* emptyPagesAllowed */ false); + ChangeFeedFetcher fetcher = newFetcher(continuation, /* notModifiedPagesAllowed */ false); FeedResponse noChangesResponse = changeFeedNoChanges("token-D"); assertThat(invokeIsFullyDrained(fetcher, noChangesResponse)) - .describedAs("With emptyPagesAllowed=false the noChanges short-circuit must remain to terminate iteration") + .describedAs("With notModifiedPagesAllowed=false the noChanges short-circuit must remain to terminate iteration") .isTrue(); // The short-circuit must fire WITHOUT consulting the continuation; if a future // refactor accidentally drops the noChanges check and falls through to @@ -124,9 +126,9 @@ public void isFullyDrained_noChangesResponseWithEmptyPagesAllowedFalse_returnsTr } @Test(groups = { "unit" }, timeOut = 10_000) - public void nextPage_emptyPagesAllowedTrue_surfacesNoChangesPagesIndividually() { + public void nextPage_notModifiedPagesAllowedTrue_surfacesNoChangesPagesIndividually() { // Scenario: 3 consecutive noChanges (304) pages from the same sub-feedRange, - // followed by a real data page. With emptyPagesAllowed=true, each of the 4 + // followed by a real data page. With notModifiedPagesAllowed=true, each of the 4 // physical responses must surface as its own Mono emission so the Spark // iterator can drain them under the per-page end-to-end timeout window. FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); @@ -149,7 +151,7 @@ public void nextPage_emptyPagesAllowedTrue_surfacesNoChangesPagesIndividually() req -> Mono.just(script[callIndex.getAndIncrement()]); ChangeFeedFetcher fetcher = - newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ true, executeFunc); + newFetcherWithExecuteFunc(continuation, /* notModifiedPagesAllowed */ true, executeFunc); // Drive the fetcher across 4 nextPage() invocations and assert each one surfaces. StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == noChanges1).verifyComplete(); @@ -161,8 +163,8 @@ public void nextPage_emptyPagesAllowedTrue_surfacesNoChangesPagesIndividually() } @Test(groups = { "unit" }, timeOut = 10_000) - public void nextPage_emptyPagesAllowedTrueWithNoRetryOnNoChanges_terminatesIteration() { - // Defense-in-depth: with emptyPagesAllowed=true, isFullyDrained() consults only + public void nextPage_notModifiedPagesAllowedTrueWithNoRetryOnNoChanges_terminatesIteration() { + // Defense-in-depth: with notModifiedPagesAllowed=true, isFullyDrained() consults only // continuation.isDone() (permanently false in incremental change feed), so the // SDK's own termination signal would otherwise be lost. nextPageInternal must // explicitly disableShouldFetchMore() when handleChangeFeedNotModified returns @@ -192,7 +194,7 @@ public void nextPage_emptyPagesAllowedTrueWithNoRetryOnNoChanges_terminatesItera req -> Mono.just(script[callIndex.getAndIncrement()]); ChangeFeedFetcher fetcher = - newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ true, executeFunc); + newFetcherWithExecuteFunc(continuation, /* notModifiedPagesAllowed */ true, executeFunc); StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == mid1).verifyComplete(); assertThat(fetcher.shouldFetchMore()).describedAs("after mid1").isTrue(); @@ -210,13 +212,13 @@ public void nextPage_emptyPagesAllowedTrueWithNoRetryOnNoChanges_terminatesItera } @Test(groups = { "unit" }, timeOut = 10_000) - public void nextPage_emptyPagesAllowedTrueWithDataPages_doesNotTerminate() { + public void nextPage_notModifiedPagesAllowedTrueWithDataPages_doesNotTerminate() { // Defense-in-depth contract guard: handleChangeFeedNotModified in the real // FeedRangeCompositeContinuationImpl returns NO_RETRY for EVERY non-noChanges // response (the early `if (!noChanges(r))` clause resets state and then falls // through to the final `return NO_RETRY`). The branch-2 termination logic in // ChangeFeedFetcher.nextPageInternal therefore MUST gate the - // disableShouldFetchMore() call on `noChanges(r) && emptyPagesAllowed`; if a + // disableShouldFetchMore() call on `noChanges(r) && notModifiedPagesAllowed`; if a // future refactor drops the noChanges(r) guard, every data page would silently // truncate iteration after the first emission. This test pins that contract. FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); @@ -237,7 +239,7 @@ public void nextPage_emptyPagesAllowedTrueWithDataPages_doesNotTerminate() { req -> Mono.just(script[callIndex.getAndIncrement()]); ChangeFeedFetcher fetcher = - newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ true, executeFunc); + newFetcherWithExecuteFunc(continuation, /* notModifiedPagesAllowed */ true, executeFunc); StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == data1).verifyComplete(); assertThat(fetcher.shouldFetchMore()) @@ -250,8 +252,8 @@ public void nextPage_emptyPagesAllowedTrueWithDataPages_doesNotTerminate() { } @Test(groups = { "unit" }, timeOut = 10_000) - public void nextPage_emptyPagesAllowedFalse_swallowsNoChangesPagesUntilData() { - // Same scenario, but with the default emptyPagesAllowed=false. The + public void nextPage_notModifiedPagesAllowedFalse_swallowsNoChangesPagesUntilData() { + // Same scenario, but with the default notModifiedPagesAllowed=false. The // 3 noChanges responses should be swallowed via repeatWhenEmpty and // only the data page should surface from a SINGLE nextPage() call. FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); @@ -272,7 +274,7 @@ public void nextPage_emptyPagesAllowedFalse_swallowsNoChangesPagesUntilData() { req -> Mono.just(script[callIndex.getAndIncrement()]); ChangeFeedFetcher fetcher = - newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ false, executeFunc); + newFetcherWithExecuteFunc(continuation, /* notModifiedPagesAllowed */ false, executeFunc); // A single nextPage() should internally drain all 3 noChanges responses and // emit the data response. @@ -282,8 +284,8 @@ public void nextPage_emptyPagesAllowedFalse_swallowsNoChangesPagesUntilData() { } @Test(groups = { "unit" }, timeOut = 10_000) - public void nextPage_endLsnSet_emptyPagesAllowedTrue_surfacesNoChangesUntilHasFetchedAllChanges() { - // Spark batch reads set BOTH endLSN AND emptyPagesAllowed=true, routing through + public void nextPage_endLsnSet_notModifiedPagesAllowedTrue_surfacesNoChangesUntilHasFetchedAllChanges() { + // Spark batch reads set BOTH endLSN AND notModifiedPagesAllowed=true, routing through // ChangeFeedFetcher.nextPageInternal branch 1 (the // `completeAfterAllCurrentChangesRetrieved || endLSN != null` path). That branch calls // surfaceOrSwallowNoChangesPage(r) on noChanges pages, then terminates via @@ -309,7 +311,7 @@ public void nextPage_endLsnSet_emptyPagesAllowedTrue_surfacesNoChangesUntilHasFe req -> Mono.just(script[callIndex.getAndIncrement()]); ChangeFeedFetcher fetcher = - newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ true, /* endLSN */ 999L, executeFunc); + newFetcherWithExecuteFunc(continuation, /* notModifiedPagesAllowed */ true, /* endLSN */ 999L, executeFunc); StepVerifier.create(fetcher.nextPage()).expectNextMatches(r -> r == noChanges1).verifyComplete(); assertThat(fetcher.shouldFetchMore()).describedAs("after noChanges1, mid-cycle").isTrue(); @@ -325,8 +327,8 @@ public void nextPage_endLsnSet_emptyPagesAllowedTrue_surfacesNoChangesUntilHasFe } @Test(groups = { "unit" }, timeOut = 10_000) - public void nextPage_endLsnSet_emptyPagesAllowedFalse_swallowsNoChangesViaRepeatWhenEmpty() { - // Legacy regression guard for branch 1 with the default emptyPagesAllowed=false. + public void nextPage_endLsnSet_notModifiedPagesAllowedFalse_swallowsNoChangesViaRepeatWhenEmpty() { + // Legacy regression guard for branch 1 with the default notModifiedPagesAllowed=false. // The 2 noChanges responses should be swallowed via repeatWhenEmpty; only the data // page should surface from a SINGLE nextPage() call (matching legacy behavior). FeedRangeContinuation continuation = mock(FeedRangeContinuation.class); @@ -345,7 +347,7 @@ public void nextPage_endLsnSet_emptyPagesAllowedFalse_swallowsNoChangesViaRepeat req -> Mono.just(script[callIndex.getAndIncrement()]); ChangeFeedFetcher fetcher = - newFetcherWithExecuteFunc(continuation, /* emptyPagesAllowed */ false, /* endLSN */ 999L, executeFunc); + newFetcherWithExecuteFunc(continuation, /* notModifiedPagesAllowed */ false, /* endLSN */ 999L, executeFunc); // A single nextPage() should drain all 2 noChanges responses via repeatWhenEmpty // and surface the data page. @@ -354,7 +356,7 @@ public void nextPage_endLsnSet_emptyPagesAllowedFalse_swallowsNoChangesViaRepeat } // Note: the >4*(size+1) consecutive-304 defense in FeedRangeCompositeContinuationImpl is - // pinned end-to-end by CosmosContainerChangeFeedTest.changeFeedQuery_emptyPagesAllowed_* + // pinned end-to-end by CosmosContainerChangeFeedTest.changeFeedQuery_notModifiedPagesAllowed_* // (emulator group). That impl class is package-private, so a unit-level integration test // here would require reflection — the emulator test exercises the real production path. @@ -374,21 +376,21 @@ private static FeedResponse changeFeedDataPage(String continuationToke .build(); } - private static ChangeFeedFetcher newFetcher(FeedRangeContinuation continuation, boolean emptyPagesAllowed) { - return newFetcherWithExecuteFunc(continuation, emptyPagesAllowed, /* endLSN */ null, req -> Mono.empty()); + private static ChangeFeedFetcher newFetcher(FeedRangeContinuation continuation, boolean notModifiedPagesAllowed) { + return newFetcherWithExecuteFunc(continuation, notModifiedPagesAllowed, /* endLSN */ null, req -> Mono.empty()); } private static ChangeFeedFetcher newFetcherWithExecuteFunc( FeedRangeContinuation continuation, - boolean emptyPagesAllowed, + boolean notModifiedPagesAllowed, Function>> executeFunc) { - return newFetcherWithExecuteFunc(continuation, emptyPagesAllowed, /* endLSN */ null, executeFunc); + return newFetcherWithExecuteFunc(continuation, notModifiedPagesAllowed, /* endLSN */ null, executeFunc); } private static ChangeFeedFetcher newFetcherWithExecuteFunc( FeedRangeContinuation continuation, - boolean emptyPagesAllowed, + boolean notModifiedPagesAllowed, Long endLSN, Function>> executeFunc) { @@ -404,7 +406,7 @@ private static ChangeFeedFetcher newFetcherWithExecuteFunc( when(changeFeedState.getFeedRange()).thenReturn(feedRange); doNothing().when(changeFeedState).populateRequest(any(RxDocumentServiceRequest.class), anyInt()); - Supplier createRequestFunc = ChangeFeedFetcherEmptyPagesTest::stubRequest; + Supplier createRequestFunc = ChangeFeedFetcherNotModifiedPagesTest::stubRequest; Map requestOptionProperties = new HashMap<>(); GlobalEndpointManager gem = mock(GlobalEndpointManager.class); @@ -422,7 +424,7 @@ private static ChangeFeedFetcher newFetcherWithExecuteFunc( /* maxItemCount */ 100, /* isSplitHandlingDisabled */ true, /* completeAfterAllCurrentChangesRetrieved */ false, - emptyPagesAllowed, + notModifiedPagesAllowed, endLSN, /* operationContext */ null, gem, diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java index 5507acf6d705..f280e2562159 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java @@ -121,7 +121,7 @@ public Flux> executeAsync() { this.options.getMaxPrefetchPageCount(), ModelBridgeInternal.getChangeFeedIsSplitHandlingDisabled(this.options), this.options.isCompleteAfterAllCurrentChangesRetrieved(), - changeFeedOptionsAccessor().getAllowEmptyPages(this.options), + changeFeedOptionsAccessor().getAllowNotModifiedPages(this.options), changeFeedOptionsAccessor() .getEndLSN(this.options), changeFeedOptionsAccessor() diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java index f5365b7fac27..008dc9e69239 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosChangeFeedRequestOptionsImpl.java @@ -53,7 +53,7 @@ public final class CosmosChangeFeedRequestOptionsImpl implements OverridableRequ private boolean completeAfterAllCurrentChangesRetrieved; private Long endLSN; private ReadConsistencyStrategy readConsistencyStrategy; - private boolean emptyPagesAllowed; + private boolean notModifiedPagesAllowed; public CosmosChangeFeedRequestOptionsImpl(CosmosChangeFeedRequestOptionsImpl toBeCloned) { if (toBeCloned.continuationState != null) { @@ -81,7 +81,7 @@ public CosmosChangeFeedRequestOptionsImpl(CosmosChangeFeedRequestOptionsImpl toB this.keywordIdentifiers = toBeCloned.keywordIdentifiers; this.completeAfterAllCurrentChangesRetrieved = toBeCloned.completeAfterAllCurrentChangesRetrieved; this.endLSN = toBeCloned.endLSN; - this.emptyPagesAllowed = toBeCloned.emptyPagesAllowed; + this.notModifiedPagesAllowed = toBeCloned.notModifiedPagesAllowed; } /** @@ -122,7 +122,7 @@ public void inheritNonContinuationFieldsFrom(CosmosChangeFeedRequestOptionsImpl this.completeAfterAllCurrentChangesRetrieved = source.completeAfterAllCurrentChangesRetrieved; this.endLSN = source.endLSN; this.readConsistencyStrategy = source.readConsistencyStrategy; - this.emptyPagesAllowed = source.emptyPagesAllowed; + this.notModifiedPagesAllowed = source.notModifiedPagesAllowed; } public CosmosChangeFeedRequestOptionsImpl( @@ -233,8 +233,8 @@ public void setQuotaInfoEnabled(boolean quotaInfoEnabled) { * * @return {@code true} when each 304/noChanges page is surfaced individually (default is {@code false}). */ - public boolean isEmptyPagesAllowed() { - return this.emptyPagesAllowed; + public boolean isNotModifiedPagesAllowed() { + return this.notModifiedPagesAllowed; } /** @@ -252,12 +252,12 @@ public boolean isEmptyPagesAllowed() { * to opt into; reachable only from sibling modules (e.g. the Cosmos Spark connector) via the * {@code ImplementationBridgeHelpers.CosmosChangeFeedRequestOptionsHelper} bridge accessor. * - * @param emptyPagesAllowed {@code true} to surface 304/noChanges pages individually; + * @param notModifiedPagesAllowed {@code true} to surface 304/noChanges pages individually; * {@code false} (default) to swallow them via {@code repeatWhenEmpty}. * @return this instance for fluent chaining. */ - public CosmosChangeFeedRequestOptionsImpl setEmptyPagesAllowed(boolean emptyPagesAllowed) { - this.emptyPagesAllowed = emptyPagesAllowed; + public CosmosChangeFeedRequestOptionsImpl setNotModifiedPagesAllowed(boolean notModifiedPagesAllowed) { + this.notModifiedPagesAllowed = notModifiedPagesAllowed; return this; } diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java index e49ae5e2aa67..ea0529f70d44 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java @@ -446,22 +446,27 @@ public interface CosmosChangeFeedRequestOptionsAccessor { CosmosChangeFeedRequestOptions disableSplitHandling(CosmosChangeFeedRequestOptions changeFeedRequestOptions); /** - * Mirrors {@link CosmosQueryRequestOptionsAccessor#setAllowEmptyPages(CosmosQueryRequestOptions, boolean)} - * for the change-feed path. Controls whether {@code ChangeFeedFetcher} surfaces - * 304/noChanges pages to the caller instead of swallowing them via {@code repeatWhenEmpty}. + * Change-feed-side analogue of {@link CosmosQueryRequestOptionsAccessor#setAllowEmptyPages(CosmosQueryRequestOptions, boolean)}. + * Controls whether {@code ChangeFeedFetcher} surfaces 304/NotModified pages (originating from + * sub-partitions that report no changes) to the caller instead of swallowing them via + * {@code repeatWhenEmpty}. * *

Default is {@code false} (legacy swallow behavior). When {@code true}, every physical * 304 response surfaces as its own {@code FeedResponse}; caller iterators must handle * empty {@code FeedResponse} pages without entering retry loops. Intentionally NOT * exposed on the public {@code CosmosChangeFeedRequestOptions} API — friend-only. + * + *

Naming differs from the query-side {@code setAllowEmptyPages} on purpose: on change + * feed, data-bearing empty pages already bubble up; this flag specifically opts into + * surfacing 304/NotModified sub-partition pages. */ - void setAllowEmptyPages(CosmosChangeFeedRequestOptions options, boolean emptyPagesAllowed); + void setAllowNotModifiedPages(CosmosChangeFeedRequestOptions options, boolean notModifiedPagesAllowed); /** - * Returns whether 304/noChanges pages are surfaced individually to the caller. See - * {@link #setAllowEmptyPages(CosmosChangeFeedRequestOptions, boolean)}. + * Returns whether 304/NotModified pages are surfaced individually to the caller. See + * {@link #setAllowNotModifiedPages(CosmosChangeFeedRequestOptions, boolean)}. */ - boolean getAllowEmptyPages(CosmosChangeFeedRequestOptions options); + boolean getAllowNotModifiedPages(CosmosChangeFeedRequestOptions options); } } diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index 4ea87312725a..5ca97513ce26 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -45,7 +45,7 @@ private static ImplementationBridgeHelpers.FeedResponseHelper.FeedResponseAccess private final Supplier createRequestFunc; private final Supplier feedRangeContinuationRetryPolicySupplier; private final boolean completeAfterAllCurrentChangesRetrieved; - private final boolean emptyPagesAllowed; + private final boolean notModifiedPagesAllowed; private final Long endLSN; public ChangeFeedFetcher( @@ -58,7 +58,7 @@ public ChangeFeedFetcher( int maxItemCount, boolean isSplitHandlingDisabled, boolean completeAfterAllCurrentChangesRetrieved, - boolean emptyPagesAllowed, + boolean notModifiedPagesAllowed, Long endLSN, OperationContextAndListenerTuple operationContext, GlobalEndpointManager globalEndpointManager, @@ -87,7 +87,7 @@ public ChangeFeedFetcher( diagnosticsClientContext); this.createRequestFunc = createRequestFunc; this.completeAfterAllCurrentChangesRetrieved = completeAfterAllCurrentChangesRetrieved; - this.emptyPagesAllowed = emptyPagesAllowed; + this.notModifiedPagesAllowed = notModifiedPagesAllowed; this.endLSN = endLSN; } @@ -130,7 +130,7 @@ private Mono> nextPageInternal(DocumentClientRetryPolicy retryPo * surface it, swallow it via {@code repeatWhenEmpty}, or terminate iteration entirely. The * decision depends on the change-feed mode (bounded vs streaming), whether the response is * a noChanges 304, the continuation's {@code handleChangeFeedNotModified} signal, and whether - * the caller opted into {@code emptyPagesAllowed=true}. + * the caller opted into {@code notModifiedPagesAllowed=true}. */ private Mono> applyNoChangesDecision(FeedResponse r) { FeedRangeContinuation continuationSnapshot = this.changeFeedState.getContinuation(); @@ -167,14 +167,14 @@ private Mono> applyNoChangesDecision(FeedResponse r) { } /** - * Either surface a noChanges page to the caller (when emptyPagesAllowed=true) or swallow it via + * Either surface a noChanges page to the caller (when notModifiedPagesAllowed=true) or swallow it via * Reactor's repeatWhenEmpty (the legacy behavior). When swallowing, shouldFetchMore must be * re-enabled first because isFullyDrained() already flipped it off for the noChanges page. */ private Mono> surfaceOrRetryNoChangesPage(FeedResponse r) { this.reEnableShouldFetchMoreForRetry(); - if (this.emptyPagesAllowed) { + if (this.notModifiedPagesAllowed) { // I think we will need to update the feedResponse here, because for empty pages we do not rotate tokens, so the worst case I can think of // is when the feedRange spans multi-partitions, when customer using a continuationToken to resume the process, we will always only process the first child partition // if we keeps getting 304s diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java index d46713907c50..73aeb090ce2d 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosChangeFeedRequestOptions.java @@ -792,13 +792,13 @@ public CosmosChangeFeedRequestOptions disableSplitHandling(CosmosChangeFeedReque } @Override - public void setAllowEmptyPages(CosmosChangeFeedRequestOptions options, boolean emptyPagesAllowed) { - options.getImpl().setEmptyPagesAllowed(emptyPagesAllowed); + public void setAllowNotModifiedPages(CosmosChangeFeedRequestOptions options, boolean notModifiedPagesAllowed) { + options.getImpl().setNotModifiedPagesAllowed(notModifiedPagesAllowed); } @Override - public boolean getAllowEmptyPages(CosmosChangeFeedRequestOptions options) { - return options.getImpl().isEmptyPagesAllowed(); + public boolean getAllowNotModifiedPages(CosmosChangeFeedRequestOptions options) { + return options.getImpl().isNotModifiedPagesAllowed(); } }); } From c7478c9a855b346ff99396b88c2bbda01c322f0b Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Thu, 28 May 2026 20:06:05 -0700 Subject: [PATCH 13/15] Stabilize asyncChangeFeedPrefetching on Windows EmulatorTcp MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI build 6368554 failed only on windows2022_EmulatorOnlyIntegrationTestsTcpJava{8,17} with 'first change-feed subscription should produce at least 3 pages within 30 seconds' on both INCREMENTAL and FULL_FIDELITY parameter sets. Root cause: the test's TestNG method timeOut was TIMEOUT (40s), but the deterministic de-flake from a prior session uses 30s firstLatch.await + another 30s for the FF resume phase + bounded take(.blockLast(30s)). On the slow Windows EmulatorTcp runner the sum routinely exceeds the method timeout, and even within each phase the page-arrival cadence sometimes can't deliver 3 pages in 30s. Fix: - timeOut = TIMEOUT * 5 (200s) — matches sibling emulator tests on lines 200, 1121, 1167 that also need the longer budget. - awaitSeconds = 60L — doubles the per-phase wait window so the slow runner has room to deliver pages. - retryAnalyzer = FlakyTestRetryAnalyzer.class — consistent with the changeFeedQueryEndLSNHang test on line 1070; absorbs residual jitter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../azure/cosmos/CosmosContainerChangeFeedTest.java | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java index c85f9d64312e..ef4a5ac56183 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java @@ -334,7 +334,8 @@ public void asyncChangeFeed_fromBeginning_incremental_forLogicalPartition() thro } } - @Test(groups = { "emulator" }, dataProvider = "changeFeedQueryPrefetchingDataProvider", timeOut = TIMEOUT) + @Test(groups = { "emulator" }, dataProvider = "changeFeedQueryPrefetchingDataProvider", + timeOut = TIMEOUT * 5, retryAnalyzer = FlakyTestRetryAnalyzer.class) public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exception { // De-flaked: previously this test relied on `.subscribe()` + `Thread.sleep(3000)` for // both subscriptions, which raced both with the continuation propagation between the @@ -343,10 +344,17 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc // before the second is started, and the final bounded `take(2, true)` block is awaited // via `.blockLast(...)` rather than fire-and-forget. // + // The method timeOut is TIMEOUT * 5 (200s) and per-phase awaitSeconds is 60s because the + // Windows EmulatorTcp runner is materially slower than mac/linux — at TIMEOUT (40s) the + // first-phase latch await alone (30s) plus the second phase (30s) plus the bounded + // take(.blockLast(30s)) would routinely race the TestNG method timeout. retryAnalyzer + // is FlakyTestRetryAnalyzer (consistent with other change-feed tests) to absorb + // residual page-cadence jitter that survives the longer bounds. + // // Mode interaction: FULL_FIDELITY uses createForProcessingFromNow, so docs inserted // BEFORE the subscription are invisible — for that mode we insert after subscribing. // INCREMENTAL uses createForProcessingFromBeginning and sees pre-existing docs. - final long awaitSeconds = 30L; + final long awaitSeconds = 60L; final boolean isFullFidelity = changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY); this.createContainer( From 578ddc8c5248188722f7e1eed8d5ef92f51b92b7 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Thu, 28 May 2026 21:33:01 -0700 Subject: [PATCH 14/15] Revert asyncChangeFeedPrefetching to original Thread.sleep shape + add retry MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The deterministic CountDownLatch + .blockLast() de-flake I introduced earlier in the session over-corrected: while it eliminates non-determinism in principle, it traded subtle race jitter for slow-runner sensitivity that became a hard failure on Windows EmulatorTcp Java 8 (build 6368807 still failed all 3 retries even with TIMEOUT * 5 and 60s per-phase awaits). The original Thread.sleep(3000) shape had been passing in CI for months and predates this PR entirely — the test is exercising Reactor's byPage prefetch behavior on the change-feed stream, which is unrelated to the notModifiedPagesAllowed work this PR ships. Reverting to that proven shape is the lowest-risk path; retryAnalyzer = FlakyTestRetryAnalyzer is the only addition (consistent with sibling change-feed tests) to absorb the residual slow-runner jitter that remains in the original. Removed the now-unused CountDownLatch + TimeUnit imports. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../cosmos/CosmosContainerChangeFeedTest.java | 121 ++++-------------- 1 file changed, 26 insertions(+), 95 deletions(-) diff --git a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java index ef4a5ac56183..29355ae002c3 100644 --- a/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java +++ b/sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosContainerChangeFeedTest.java @@ -68,11 +68,9 @@ import java.util.Queue; import java.util.UUID; import java.util.concurrent.CompletableFuture; -import java.util.concurrent.CountDownLatch; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicInteger; import java.util.concurrent.atomic.AtomicReference; import java.util.function.Function; @@ -335,28 +333,13 @@ public void asyncChangeFeed_fromBeginning_incremental_forLogicalPartition() thro } @Test(groups = { "emulator" }, dataProvider = "changeFeedQueryPrefetchingDataProvider", - timeOut = TIMEOUT * 5, retryAnalyzer = FlakyTestRetryAnalyzer.class) + timeOut = TIMEOUT, retryAnalyzer = FlakyTestRetryAnalyzer.class) public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exception { - // De-flaked: previously this test relied on `.subscribe()` + `Thread.sleep(3000)` for - // both subscriptions, which raced both with the continuation propagation between the - // two subscriptions AND with the page-arrival cadence on slower CI runners. Now the - // first subscription is awaited via a CountDownLatch on a known minimum page count - // before the second is started, and the final bounded `take(2, true)` block is awaited - // via `.blockLast(...)` rather than fire-and-forget. - // - // The method timeOut is TIMEOUT * 5 (200s) and per-phase awaitSeconds is 60s because the - // Windows EmulatorTcp runner is materially slower than mac/linux — at TIMEOUT (40s) the - // first-phase latch await alone (30s) plus the second phase (30s) plus the bounded - // take(.blockLast(30s)) would routinely race the TestNG method timeout. retryAnalyzer - // is FlakyTestRetryAnalyzer (consistent with other change-feed tests) to absorb - // residual page-cadence jitter that survives the longer bounds. - // - // Mode interaction: FULL_FIDELITY uses createForProcessingFromNow, so docs inserted - // BEFORE the subscription are invisible — for that mode we insert after subscribing. - // INCREMENTAL uses createForProcessingFromBeginning and sees pre-existing docs. - final long awaitSeconds = 60L; - final boolean isFullFidelity = changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY); - + // Note on shape: this test verifies Reactor's prefetch behavior on the change-feed + // byPage stream. The two fire-and-forget `.subscribe()` calls + `Thread.sleep(3000)` + // are intentional — they exercise the prefetch path without backpressure-bounded + // collection. retryAnalyzer = FlakyTestRetryAnalyzer absorbs occasional slow-runner + // jitter (Windows EmulatorTcp Java 8 can take >3s to deliver the first 3 pages). this.createContainer( (cp) -> { if (changeFeedMode.equals(ChangeFeedMode.INCREMENTAL)) { @@ -366,7 +349,7 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc } ); CosmosChangeFeedRequestOptions options; - if (isFullFidelity) { + if (changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)) { options = CosmosChangeFeedRequestOptions .createForProcessingFromNow(FeedRange.forFullRange()) .setMaxItemCount(10).allVersionsAndDeletes(); @@ -375,92 +358,40 @@ public void asyncChangeFeedPrefetching(ChangeFeedMode changeFeedMode) throws Exc .createForProcessingFromBeginning(FeedRange.forFullRange()).setMaxItemCount(10); } AtomicInteger count = new AtomicInteger(0); - if (!isFullFidelity) { - // INCREMENTAL: insert first so createForProcessingFromBeginning sees the docs. - insertDocuments(5, 20); - } + insertDocuments(5, 20); AtomicReference continuation = new AtomicReference<>(""); - - // First subscription: drain at least 3 pages deterministically before proceeding. - final int firstMinPages = 3; - CountDownLatch firstLatch = new CountDownLatch(firstMinPages); - AtomicReference firstError = new AtomicReference<>(); - reactor.core.Disposable firstSub = createdContainer.asyncContainer - .queryChangeFeed(options, ObjectNode.class) - .handle((r) -> { + createdContainer.asyncContainer.queryChangeFeed(options, ObjectNode.class).handle((r) -> { count.incrementAndGet(); continuation.set(r.getContinuationToken()); - firstLatch.countDown(); - }) - .byPage() - .subscribe(r -> { /* page consumed by handle() */ }, firstError::set); - try { - if (isFullFidelity) { - // FULL_FIDELITY: subscription is from-now; insert AFTER subscribing so the - // change-feed pipeline sees the writes. - insertDocuments(5, 20); } - assertThat(firstLatch.await(awaitSeconds, TimeUnit.SECONDS)) - .as("first change-feed subscription should produce at least %d pages within %d seconds", - firstMinPages, awaitSeconds) - .isTrue(); - assertThat(firstError.get()).as("first subscription must not error").isNull(); - // intent of the original assertion: prefetch surfaces more than 2 pages - assertThat(count.get()).isGreaterThan(2); - } finally { - firstSub.dispose(); - } + ).byPage().subscribe(); CosmosChangeFeedRequestOptions optionsFF = null; - if (isFullFidelity) { + if (changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)) { + insertDocuments(5, 20); count.set(0); optionsFF = CosmosChangeFeedRequestOptions .createForProcessingFromContinuation(continuation.get()) .setMaxItemCount(10).allVersionsAndDeletes(); - - final int secondMinPages = 3; - CountDownLatch secondLatch = new CountDownLatch(secondMinPages); - AtomicReference secondError = new AtomicReference<>(); - reactor.core.Disposable secondSub = createdContainer.asyncContainer - .queryChangeFeed(optionsFF, ObjectNode.class) - .handle((r) -> { - count.incrementAndGet(); - continuation.set(r.getContinuationToken()); - secondLatch.countDown(); - }) - .byPage() - .subscribe(r -> { /* page consumed by handle() */ }, secondError::set); - try { - // Insert AFTER subscribing so the resume-from-continuation pipeline sees the writes. - insertDocuments(5, 20); - assertThat(secondLatch.await(awaitSeconds, TimeUnit.SECONDS)) - .as("FULL_FIDELITY resume-from-continuation subscription should produce at least %d pages within %d seconds", - secondMinPages, awaitSeconds) - .isTrue(); - assertThat(secondError.get()).as("second subscription must not error").isNull(); - assertThat(count.get()).isGreaterThan(2); - } finally { - secondSub.dispose(); + createdContainer.asyncContainer.queryChangeFeed(optionsFF, ObjectNode.class).handle((r) -> { + count.incrementAndGet(); + continuation.set(r.getContinuationToken()); } + ).byPage().subscribe(); } + Thread.sleep(3000); + assertThat(count.get()).isGreaterThan(2); - count.set(0); - // should only get two pages — `take(2, true)` bounds the upstream request, so the - // pipeline completes naturally after 2 pages. Use blockLast() instead of fire-and-forget - // to wait for that completion deterministically. For FULL_FIDELITY we need writes after - // subscribing because the continuation reflects the previous read position. - if (isFullFidelity) { - // Insert documents BEFORE the bounded take so they're visible when the - // subscription opens. Continuation tokens persist read position; the new writes - // will surface on the resumed-from-token subscription. + if (changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)) { + // full fidelity is only from now so need to insert more documents insertDocuments(5, 20); } - createdContainer.asyncContainer - .queryChangeFeed(isFullFidelity ? optionsFF : options, ObjectNode.class) - .handle((r) -> count.incrementAndGet()) - .byPage() - .take(2, true) - .blockLast(Duration.ofSeconds(awaitSeconds)); + count.set(0); + // should only get two pages + createdContainer.asyncContainer.queryChangeFeed(changeFeedMode.equals(ChangeFeedMode.FULL_FIDELITY)? optionsFF + : options, ObjectNode.class).handle((r) -> count.incrementAndGet()) + .byPage().take(2, true).subscribe(); + Thread.sleep(3000); assertThat(count.get()).isEqualTo(2); } From 098f37834f2ca11bb3b46976155efc8b62e81c41 Mon Sep 17 00:00:00 2001 From: tvaron3 Date: Thu, 28 May 2026 23:19:17 -0700 Subject: [PATCH 15/15] Remove stale TODO-style comment on surfaceOrRetryNoChangesPage The 3-line comment was xinlian12's reasoning note explaining the need for the setFeedResponseContinuationToken call directly below it. The token re-stamp already addresses the concern (re-stamps with this.changeFeedState.toString() so the surfaced empty page carries the post-rotation cursor), so the speculative wording reads as a stale TODO. Removed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../azure/cosmos/implementation/query/ChangeFeedFetcher.java | 3 --- 1 file changed, 3 deletions(-) diff --git a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java index 5ca97513ce26..01ea5f29680a 100644 --- a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java +++ b/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java @@ -175,9 +175,6 @@ private Mono> surfaceOrRetryNoChangesPage(FeedResponse r) { this.reEnableShouldFetchMoreForRetry(); if (this.notModifiedPagesAllowed) { - // I think we will need to update the feedResponse here, because for empty pages we do not rotate tokens, so the worst case I can think of - // is when the feedRange spans multi-partitions, when customer using a continuationToken to resume the process, we will always only process the first child partition - // if we keeps getting 304s ModelBridgeInternal.setFeedResponseContinuationToken(this.changeFeedState.toString(), r); return Mono.just(r); }