Leverage x-ms-cosmos-hub-region-processing-only for 404 Read Session Not Available cross-region retry scenarios.
#47631
FabianMeiswinkel
left a comment
LGTM except for one question whether one change is intended - and if so why?
…ion-processing-only` header is set.
… regions for new hub.
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
xinlian12
left a comment
LGTM, thanks
xinlian12
left a comment
LGTM, thanks
/check-enforcer override
Motivation
This pull request is the first iteration of integrating the `x-ms-cosmos-hub-region-processing-only` header. Setting this header to `true` allows a Cosmos DB backend node to return a 403:3 when the node belongs to a non-hub physical partition.

Using this setup, the `CosmosClient` instance can determine the hub region at partition-set level. In this first iteration, that helps pick the right region for 404 `Read Session Not Available` cross-region retries on Single-Writer accounts. This is needed in particular when failover happens in a rolling manner, partition-set by partition-set, and in Per-Partition Automatic Failover cases, where the hub is a partition-set-granular notion. Simply relying on `LocationCache` to provide an account-level hub region is incorrect in these cases.

Scope
In this pull request, the focus is on how 404 `Read Session Not Available` cross-region retry handling happens for Single-Writer accounts.

Critical Changes
The approach taken here is to pin the `x-ms-cosmos-hub-region-processing-only` header once a request hits a 404 `Read Session Not Available`. This keeps an operation (a construct that encapsulates several I/O calls) sticky to the hub region.

IMPORTANT: An extra retry cycle is executed on the hub region for 404 `Read Session Not Available` prior to pinning the `x-ms-cosmos-hub-region-processing-only` header. This is done to avoid a 403 `Read/Write Forbidden` retry loop for Single-Writer accounts without PPAF enabled until `LocationCache` reports the updated hub region.

Testing
Per-Partition Automatic Failover
The approach was to set a naming configuration (`simulateRevokeLocalWriteStatusOfPartition`) consumed by the service fabric process mapped to the original hub region (say `North Central US`) for a particular physical partition.

After that, a 404 `Read Session Not Available` is injected into the same partition for which the write privilege was revoked (`North Central US`).

Using a "pure-read" workload, the goal is to assert that the read (a `readItem` operation) gets a 200 status code from the partition-set-specific hub region.

Because "reads" can get a 403 `Write Forbidden` status code, these "reads" can update the partition-set-level hub, which future reads and writes can then use.

Single-Writer accounts with no PPAF enabled
Pending item: The expected test setup is to execute a write region change on an account with a physical partition-set count on the order of ~2000 (typical in our DR drills), subject the account to a "read-only" workload, and see how hub-region stickiness holds up.
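
The retry-and-pin behavior described under Critical Changes can be illustrated with a minimal sketch. It is not the SDK's `ClientRetryPolicy`: the class `SessionNotAvailableRetrySketch` and its methods are hypothetical, and it assumes the counter is 0-based and incremented after each decision, which is one reading of "3 retries max" with the header set when `sessionTokenRetryCount == 2`.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the Single-Writer 404/1002 retry decision.
// These are NOT the SDK's real classes or method names.
final class SessionNotAvailableRetrySketch {

    // Header name taken from the PR title.
    static final String HUB_REGION_PROCESSING_ONLY = "x-ms-cosmos-hub-region-processing-only";

    // Per-operation state shared by every I/O call of one operation.
    private int sessionTokenRetryCount = 0; // assumption: 0-based, incremented after each decision
    private final AtomicBoolean shouldAddHubRegionProcessingOnlyHeader = new AtomicBoolean(false);

    /**
     * Called when a read fails with 404 Read Session Not Available.
     * Returns the back-off for the retry, or empty when no retry is allowed.
     */
    Optional<Duration> onReadSessionNotAvailable() {
        if (sessionTokenRetryCount > 2) {
            return Optional.empty(); // retry budget exhausted
        }
        if (sessionTokenRetryCount == 2) {
            // Third retry: pin the operation to the hub region from now on.
            shouldAddHubRegionProcessingOnlyHeader.set(true);
        }
        sessionTokenRetryCount++;
        return Optional.of(Duration.ZERO);
    }

    /** Models the header step before a request is sent: headers for the next attempt. */
    Map<String, String> headersForNextAttempt() {
        Map<String, String> headers = new HashMap<>();
        if (shouldAddHubRegionProcessingOnlyHeader.get()) {
            headers.put(HUB_REGION_PROCESSING_ONLY, "true");
        }
        return headers;
    }

    public static void main(String[] args) {
        SessionNotAvailableRetrySketch policy = new SessionNotAvailableRetrySketch();
        for (int failure = 1; failure <= 4; failure++) {
            Optional<Duration> backOff = policy.onReadSessionNotAvailable();
            System.out.println("failure " + failure
                + ": retry=" + backOff.isPresent()
                + ", headers=" + policy.headersForNextAttempt());
        }
    }
}
```

In this model the first two retries go out without the header, which corresponds to the extra retry cycle on the hub region, and only the third retry carries the header. Once set, the flag stays set for the rest of the operation, which is what makes the operation sticky.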

Key Changes
1. `ClientRetryPolicy.java`

Changes to `shouldRetry()` method:

Changes to `shouldRetryOnSessionNotAvailable()` method, `Azure/main` (Before) versus `jeet1995:AzCosmos_AddHubRegionProcessingOnlyHeader` (After):
- Single-Writer retry check: `> 1` (2 retries max) before; `> 2` (3 retries max) after.
- PPAF override for reads: after, set through `setPerPartitionAutomaticFailoverOverrideForReads()`.
- Hub region header: after, on the 3rd retry (`sessionTokenRetryCount == 2`), sets `shouldAddHubRegionProcessingOnlyHeader = true`.

New method `setPerPartitionAutomaticFailoverOverrideForReads()`:
- Sets `resolvedPartitionKeyRangeForPerPartitionAutomaticFailover` from `resolvedPartitionKeyRange`.

Changes to `onBeforeSendRequest()` method:
- Checks `CrossRegionAvailabilityContext.shouldAddHubRegionProcessingOnlyHeader()`.
- Sets the `HUB_REGION_PROCESSING_ONLY` HTTP header to `"true"` when the flag is set.

2. `GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java`
Renamed class: `PartitionLevelFailoverInfo` → `PartitionLevelAutomaticFailoverInfo`

Method signature changes:

| Method | `Azure/main` | `jeet1995:AzCosmos_AddHubRegionProcessingOnlyHeader` |
|---|---|---|
| `tryMarkEndpointAsUnavailableForPartitionKeyRange` | `(request, isEndToEndTimeoutHit)` | `(request, isEndToEndTimeoutHit)` + `(request, isEndToEndTimeoutHit, forceFailoverThroughReads)` |
| `isPerPartitionAutomaticFailoverApplicable` | `(request)` | `(request)` + `(request, forceFailoverThroughReads)` |

Changes to `isPerPartitionAutomaticFailoverApplicable()`:
- `Azure/main`: returns `false` for all read requests.
- After: checks the `forceFailoverThroughReads` flag; if `true`, allows PPAF for reads.
- After: reads otherwise go through `shouldUsePerPartitionAutomaticFailoverOverrideForReadsIfApplicable()` unless `forceFailoverThroughReads` is set.

3. `CrossRegionAvailabilityContextForRxDocumentServiceRequest.java`
New field: `shouldAddHubRegionProcessingOnlyHeader` (`AtomicBoolean`)

New methods:
- `shouldAddHubRegionProcessingOnlyHeader()` - getter
- `setShouldAddHubRegionProcessingOnlyHeader(boolean)` - setter

Flow Diagrams
READ_SESSION_NOT_AVAILABLE (404/1002) Retry Flow - Before (`Azure/main`)

flowchart TD
    A[Read Request Fails with 404/1002] --> B{Endpoint Discovery Enabled?}
    B -->|No| C[No Retry]
    B -->|Yes| D{canUseMultipleWriteLocations?}
    D -->|Yes| E{sessionTokenRetryCount >= endpoints.size?}
    E -->|Yes| C
    E -->|No| F[Retry on next preferred location]
    D -->|No| G{sessionTokenRetryCount > 1?}
    G -->|Yes| C
    G -->|No| H{PPAF Enabled?}
    H -->|Yes| I[Set PPAF override for reads]
    H -->|No| J[Standard retry]
    I --> K[Retry with Duration.ZERO]
    J --> K
    style C fill:#f66
    style K fill:#6f6

READ_SESSION_NOT_AVAILABLE (404/1002) Retry Flow - After (`jeet1995:AzCosmos_AddHubRegionProcessingOnlyHeader`)

flowchart TD
    A[Read Request Fails with 404/1002] --> B{Endpoint Discovery Enabled?}
    B -->|No| C[No Retry]
    B -->|Yes| D{canUseMultipleWriteLocations?}
    D -->|Yes| E{sessionTokenRetryCount >= endpoints.size?}
    E -->|Yes| C
    E -->|No| F[Retry on next preferred location]
    D -->|No| G{sessionTokenRetryCount > 2?}
    G -->|Yes| C
    G -->|No| H[Set retryContext]
    H --> I{PPAF Enabled?}
    I -->|Yes| J[Set PPAF override for reads via helper method]
    I -->|No| K[Continue]
    J --> K
    K --> L{sessionTokenRetryCount == 2?}
    L -->|Yes| M[Set shouldAddHubRegionProcessingOnlyHeader = true]
    L -->|No| N[Retry with Duration.ZERO]
    M --> N
    style C fill:#f66
    style N fill:#6f6
    style M fill:#ff9

Hub Region Header Application in `onBeforeSendRequest()`

flowchart TD
    A[onBeforeSendRequest called] --> B[Set request metadata]
    B --> C[Clear previous routing directive]
    C --> D{retryContext != null?}
    D -->|Yes| E[Route based on retry context]
    D -->|No| F[Continue]
    E --> F
    F --> G{CrossRegionAvailabilityContext exists?}
    G -->|Yes| H{shouldAddHubRegionProcessingOnlyHeader?}
    G -->|No| I[Resolve endpoint]
    H -->|Yes| J[Add HUB_REGION_PROCESSING_ONLY header = true]
    H -->|No| I
    J --> I
    I --> K[Route to resolved endpoint]
    K --> L[Apply PPAF location override if applicable]
    style J fill:#ff9

PPAF Applicability Check for Reads - After

flowchart TD
    A[isPerPartitionAutomaticFailoverApplicable] --> B{PPAF Enabled?}
    B -->|No| C[Return false]
    B -->|Yes| D{Request is read-only?}
    D -->|Yes| E{forceFailoverThroughReads?}
    D -->|No| F[Continue validation]
    E -->|Yes| F
    E -->|No| G{PPAF override flag set?}
    G -->|Yes| F
    G -->|No| C
    F --> H{Multiple regions available?}
    H -->|No| C
    H -->|Yes| I{Valid resource/operation type?}
    I -->|No| C
    I -->|Yes| J{Single-master account?}
    J -->|Yes| K[Return true]
    J -->|No| C
    style C fill:#f66
    style K fill:#6f6

Behavior Summary
`HUB_REGION_PROCESSING_ONLY` header

Testing Considerations
- The `HUB_REGION_PROCESSING_ONLY` header is added only on the 3rd retry attempt (`sessionTokenRetryCount == 2`)

All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines
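
As a reading aid for the "PPAF Applicability Check for Reads - After" flow, the decision can be written as a single boolean function. This is a sketch under stated assumptions, not the SDK method: the class `PpafApplicabilitySketch` and all parameter names are hypothetical, and each parameter stands in for one decision node of the flowchart rather than for the real request and account state.

```java
// Hypothetical model of the flowchart's decision nodes; not the SDK implementation.
final class PpafApplicabilitySketch {

    static boolean isApplicable(
            boolean ppafEnabled,
            boolean isReadOnlyRequest,
            boolean forceFailoverThroughReads,
            boolean ppafOverrideForReadsSet,
            boolean multipleRegionsAvailable,
            boolean validResourceAndOperationType,
            boolean singleMasterAccount) {

        if (!ppafEnabled) {
            return false;
        }
        // Reads pass only when forced through reads or when the override flag is set.
        if (isReadOnlyRequest && !forceFailoverThroughReads && !ppafOverrideForReadsSet) {
            return false;
        }
        // Remaining validation is shared by reads and writes.
        return multipleRegionsAvailable && validResourceAndOperationType && singleMasterAccount;
    }

    public static void main(String[] args) {
        // A plain read with neither flag set is not eligible.
        System.out.println(isApplicable(true, true, false, false, true, true, true));
        // The same read becomes eligible once forceFailoverThroughReads is set.
        System.out.println(isApplicable(true, true, true, false, true, true, true));
    }
}
```

Written this way, `forceFailoverThroughReads` and the override flag are two independent ways for a read to reach the shared validation; neither bypasses the multi-region, resource-type, or single-master checks.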