
[Cosmos] Port hub region caching per partition level (#48788) #48789

Draft
jeet1995 wants to merge 1 commit into Azure:main from jeet1995:squad/48788-hub-region-caching

Conversation

@jeet1995
Member

Issue

Fixes #48788

Description

Ports the per-partition hub region caching feature from the .NET SDK (PR Azure/azure-cosmos-dotnet-v3#5648) to the Java SDK.

Problem

On single-master accounts, repeated 404/1002 (ReadSessionNotAvailable) errors cause unnecessary retries without discovering the hub region. No caching exists, so every request repeats the full discovery chain.

Solution

  1. After two consecutive 404/1002 responses on a single-master account, the SDK sets the x-ms-cosmos-hub-region-processing-only header
  2. Non-hub regions return 403/3 (WriteForbidden), and the SDK retries against the next region (the discovery chain)
  3. The hub region responds with 200 OK, and the SDK caches the hub URI for that partition
  4. Future requests for that partition route directly to the cached hub (the warm path skips discovery)
  5. The feature works for both PPAF and non-PPAF accounts
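
To make the flow above easier to review, here is a minimal, self-contained sketch of the per-request decision logic. It is an illustration only and not the code in ClientRetryPolicy: the class name HubDiscoverySketch, the Action enum, and the single `eligible` flag (standing in for "feature flag on and account is single-master") are all hypothetical.

```java
/** Hypothetical sketch of the discovery steps; not the SDK's ClientRetryPolicy. */
final class HubDiscoverySketch {
    static final String HUB_HEADER = "x-ms-cosmos-hub-region-processing-only";
    static final int THRESHOLD = 2; // consecutive 404/1002 responses before the header is set

    enum Action { RETRY, RETRY_WITH_HUB_HEADER, RETRY_NEXT_REGION, CACHE_HUB_AND_RETURN, RETURN }

    private final boolean eligible; // assumption: feature enabled AND single-master account
    private int consecutiveSessionNotAvailable = 0;
    private boolean hubHeaderSet = false;

    HubDiscoverySketch(boolean eligible) {
        this.eligible = eligible;
    }

    Action onResponse(int statusCode, int subStatusCode) {
        if (statusCode == 404 && subStatusCode == 1002) {
            consecutiveSessionNotAvailable++;
            if (eligible && consecutiveSessionNotAvailable >= THRESHOLD) {
                hubHeaderSet = true; // step 1: start hub discovery
                return Action.RETRY_WITH_HUB_HEADER;
            }
            return Action.RETRY;
        }
        consecutiveSessionNotAvailable = 0; // the streak is broken by any other response
        if (hubHeaderSet && statusCode == 403 && subStatusCode == 3) {
            return Action.RETRY_NEXT_REGION; // step 2: this region is not the hub
        }
        if (hubHeaderSet && statusCode == 200) {
            return Action.CACHE_HUB_AND_RETURN; // step 3: the hub answered, remember it
        }
        return Action.RETURN;
    }
}
```

In this sketch the counter resets on any response other than 404/1002, which is one reading of "consecutive"; the actual policy may track the streak differently.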

Feature Flag

Gated behind COSMOS.HUB_REGION_PROCESSING_ENABLED (env var: COSMOS_HUB_REGION_PROCESSING_ENABLED). Disabled by default per Debdatta Kunda's guidance.
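
For reviewers who want to see how such a gate typically resolves, the following is a rough sketch. It assumes that COSMOS.HUB_REGION_PROCESSING_ENABLED is read as a JVM system property and that it takes precedence over the environment variable; both points are assumptions, and HubRegionFlagSketch is a hypothetical name rather than the getter added to Configs.

```java
/** Hypothetical sketch of a default-off feature gate; not the actual Configs getter. */
final class HubRegionFlagSketch {
    static final String PROPERTY = "COSMOS.HUB_REGION_PROCESSING_ENABLED";
    static final String ENV_VAR = "COSMOS_HUB_REGION_PROCESSING_ENABLED";

    /** Pure resolution step: the property wins over the env var (assumed), and unset means false. */
    static boolean resolve(String propertyValue, String envValue) {
        String raw = propertyValue != null ? propertyValue : envValue;
        return raw != null && Boolean.parseBoolean(raw.trim());
    }

    static boolean isEnabled() {
        return resolve(System.getProperty(PROPERTY), System.getenv(ENV_VAR));
    }
}
```

Under these assumptions, starting the JVM with `-DCOSMOS.HUB_REGION_PROCESSING_ENABLED=true` or exporting `COSMOS_HUB_REGION_PROCESSING_ENABLED=true` would turn the feature on; leaving both unset keeps it off.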

Key Changes

  • New GlobalPartitionEndpointManagerForHubRegionRouting — per-partition hub region cache
  • ClientRetryPolicy — hub header trigger after 2x 404/1002, 403/3 handling on read path, cache lookup
  • Configs — COSMOS.HUB_REGION_PROCESSING_ENABLED feature flag
  • HttpConstants — x-ms-cosmos-hub-region-processing-only header constant
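
As a rough picture of what a per-partition hub region cache looks like, here is a simplified sketch built on ConcurrentHashMap. It is not the GlobalPartitionEndpointManagerForHubRegionRouting class from this PR: the name HubRegionCacheSketch, the use of a plain partition key range ID string as the key, and the method names are assumptions made for illustration.

```java
import java.net.URI;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of a thread-safe per-partition hub cache; not the class added in this PR. */
final class HubRegionCacheSketch {
    // Key: partition key range ID (assumed); value: endpoint of the discovered hub region.
    private final ConcurrentMap<String, URI> hubByPartition = new ConcurrentHashMap<>();

    /** Cold path result: remember the hub once a request with the hub header succeeds. */
    void cacheHub(String partitionKeyRangeId, URI hubEndpoint) {
        hubByPartition.put(partitionKeyRangeId, hubEndpoint);
    }

    /** Warm path: return the cached hub, if any, so the request can skip discovery. */
    Optional<URI> lookup(String partitionKeyRangeId) {
        return Optional.ofNullable(hubByPartition.get(partitionKeyRangeId));
    }

    /** Drop the entry only if it still points at the stale endpoint, so a newer entry is not lost. */
    boolean invalidate(String partitionKeyRangeId, URI staleEndpoint) {
        return hubByPartition.remove(partitionKeyRangeId, staleEndpoint);
    }
}
```

The conditional `remove(key, value)` is a design choice in this sketch: if another thread has already cached a fresh hub for the same partition, a late invalidation of the old endpoint becomes a no-op.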

Testing

  • 13 new unit tests covering cold cache discovery, warm cache routing, feature flag gating, non-single-master bypass
  • Build compiles cleanly

Port hub region caching from the .NET SDK (PR Azure/azure-cosmos-dotnet-v3#5648) to the Java SDK.

Feature summary:
- After 2 consecutive 404/1002 (ReadSessionNotAvailable) on single-master
  accounts, SDK sets x-ms-cosmos-hub-region-processing-only header
- Non-hub regions return 403/3 (WriteForbidden); SDK retries to next region
- Hub region responds with 200 OK; SDK caches hub URI for that partition
- Future requests route directly to cached hub (warm path)
- Works for both PPAF and non-PPAF accounts

Implementation details:
- Feature flag: COSMOS.HUB_REGION_PROCESSING_ENABLED (default: false)
- New class: GlobalPartitionEndpointManagerForHubRegionRouting
  - Per-partition ConcurrentHashMap cache for hub region URIs
  - Warm/cold path routing, cache invalidation, thread-safe
- ClientRetryPolicy: 403/3 handling on read path for hub discovery
- ClientRetryPolicy: Hub header gated behind feature flag
- ClientRetryPolicy.onBeforeSendRequest: Warm path cache check
- RxDocumentClientImpl: Cache hub on successful response
- 13 unit tests covering all cache operations and eligibility

Files changed:
- Configs.java: Add feature flag constants and getter
- ClientRetryPolicy.java: Hub header gating, 403/3 read path, warm path
- RetryPolicy.java: Wire hub region manager
- RxDocumentClientImpl.java: Instantiate manager, cache on success
- GlobalPartitionEndpointManagerForHubRegionRouting.java (new)
- 5 test files updated for new constructor parameter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[FEATURE REQ] Port hub region caching per partition level from .NET SDK