[cosmos] Hedging Detection API — public accessors on response wrappers and exception types #46899

@NaluTripician

Summary

Add a public Hedging Detection API to the Cosmos Python SDK so that customers can determine, after the fact, whether a successful or failed Cosmos point or feed operation went through cross-region hedging, which regions requests were dispatched to, and which regions responded.

Python has no first-class CosmosDiagnostics object today. It has only response wrappers (CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged) that carry get_response_headers(), plus opt-in CosmosHttpLoggingPolicy log strings. After the cross-SDK gate review, the chosen Python shape is three new accessor methods added directly to each of the seven wrapper / exception types (four response wrappers and three exception types), backed by a shared private _HedgingDetectionState instance. This matches the get_response_headers() precedent on the wrappers and avoids a response-wrapper refactor.

This is part of a cross-SDK feature being implemented in parallel across .NET, Java, and Python (with a spec-only deliverable for Rust).

Public API additions

# azure/cosmos/__init__.py (new exports)
from azure.cosmos._diagnostics_types import RequestedRegion, RequestedRegionReason

# azure/cosmos/_diagnostics_types.py (NEW)
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True, slots=True)                   # note: slots=True requires Python 3.10+
class RequestedRegion:
    region_name: str
    reason: "RequestedRegionReason"

class RequestedRegionReason(Enum):
    INITIAL              = "initial"
    OPERATION_RETRY      = "operation_retry"
    TRANSPORT_RETRY      = "transport_retry"          # reserved — not populated today
    HEDGING              = "hedging"
    REGION_FAILOVER      = "region_failover"
    CIRCUIT_BREAKER_PROBE = "circuit_breaker_probe"   # reserved — not populated in v1
    UNKNOWN              = "unknown"

    @classmethod
    def _missing_(cls, value):                        # forward-compat for reasons added in future versions
        return cls.UNKNOWN
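The forward-compatibility fallback can be checked in isolation. The sketch below redefines a trimmed copy of the enum (three members only, for brevity) so that it runs standalone; it is an illustration, not the module as it would ship.

```python
from enum import Enum


class RequestedRegionReason(Enum):
    # Trimmed copy of the enum above, reduced to three members for brevity.
    INITIAL = "initial"
    HEDGING = "hedging"
    UNKNOWN = "unknown"

    @classmethod
    def _missing_(cls, value):
        # Enum calls this hook when a lookup by value finds no member.
        return cls.UNKNOWN


known = RequestedRegionReason("hedging")        # normal lookup by value
future = RequestedRegionReason("future_value")  # unrecognized value falls back
print(known, future)
```

Because `_missing_` returns a member instead of raising `ValueError`, callers that deserialize a reason written by a newer SDK version get `UNKNOWN` rather than an exception.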

Three new methods are added directly to each of the seven wrapper / exception types (CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged, CosmosHttpResponseError, CosmosBatchOperationError, CosmosClientTimeoutError):

def is_hedging_started(self) -> bool: ...
def get_requested_regions(self) -> tuple[RequestedRegion, ...]: ...
def get_responded_regions(self) -> tuple[str, ...]: ...

Backing state lives on a private _HedgingDetectionState class (in a new azure/cosmos/_diagnostics.py module) and is shared via a private ._hedging_state attribute on each type. The state object holds a threading.Lock, a list[RequestedRegion], a list[str], and a bool. Methods on each type forward to the shared state.

Critical: closure-argument passing (NOT request_params)

The hedging state is passed to execute_with_hedging as a separate closure argument, not stored on request_params. Rationale: copy.deepcopy(request_params) at _availability_strategy_handler.py:96 would give each child request its own copy of the state, so appends made by child requests would silently never reach the caller's instance (SE-002; an explicit deepcopy-hazard regression test is required).
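The hazard is easy to reproduce with plain containers. In the sketch below, a dict and a list stand in for request_params and the hedging state; the `requested_regions` key and both function names are hypothetical and exist only to show where the append lands.

```python
import copy


def dispatch_on_params(request_params: dict, region: str) -> None:
    # Hazard: the handler works on a deep copy, so the append lands on
    # the copy's list and the caller's list never sees it.
    child_params = copy.deepcopy(request_params)
    child_params["requested_regions"].append(region)


def dispatch_with_closure_arg(request_params: dict, requested_regions: list, region: str) -> None:
    # The params are still deep-copied, but the record is passed separately,
    # so the append reaches the caller's object.
    child_params = copy.deepcopy(request_params)
    assert child_params is not request_params
    requested_regions.append(region)


params = {"requested_regions": []}
dispatch_on_params(params, "eastus")
lost = list(params["requested_regions"])   # the append was swallowed

shared = []
dispatch_with_closure_arg(params, shared, "eastus")
kept = list(shared)                        # the append is visible
```

A regression test in the spirit of AC8 could assert exactly this difference on the real dispatch path.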

Critical: sync↔async parity

Every code path added in azure/cosmos/ has a matching change in azure/cosmos/aio/. A CI script enforces parity: every sync test file has an async twin file with the same name plus an _async suffix. Missing sync/async parity is the #1 historical Python Cosmos bug pattern (SE-004).

The HEDGING append fires inside the hedge-arm coroutine body, after the threshold await:

import asyncio

# Correct: append on actual dispatch, not at task-creation time
async def hedge_arm(region, threshold, diagnostics):
    # If the primary completes first, this task is cancelled while sleeping and never resumes past this line.
    await asyncio.sleep(threshold)
    # Reached only after the delay has elapsed without cancellation.
    diagnostics._record_request(region, RequestedRegionReason.HEDGING)
    return await issue_request(region)
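The behavior can be exercised end to end with a small asyncio harness. This is a sketch under stated assumptions: `issue_request` is a stand-in that only sleeps, the record is a plain list of `(region, reason)` tuples rather than the real state object, and `run_with_hedging` with its "centralus"/"eastus" pairing is hypothetical.

```python
import asyncio


async def issue_request(region: str, latency: float) -> str:
    # Hypothetical stand-in for the real transport call.
    await asyncio.sleep(latency)
    return region


async def hedge_arm(region: str, threshold: float, latency: float, requested: list) -> str:
    await asyncio.sleep(threshold)          # cancelled here if the primary wins
    requested.append((region, "hedging"))   # reached only on actual dispatch
    return await issue_request(region, latency)


async def run_with_hedging(primary_latency: float, threshold: float):
    requested = [("centralus", "initial")]
    primary = asyncio.create_task(issue_request("centralus", primary_latency))
    hedge = asyncio.create_task(hedge_arm("eastus", threshold, 0.0, requested))
    done, pending = await asyncio.wait({primary, hedge}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks finish unwinding before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    winner = next(iter(done)).result()
    return winner, requested


fast_winner, fast_requested = asyncio.run(run_with_hedging(0.0, 0.2))
slow_winner, slow_requested = asyncio.run(run_with_hedging(0.5, 0.01))
```

When the primary answers under the threshold, the hedge task is cancelled inside its sleep and no HEDGING entry is recorded, which is the situation AC2 describes. Had the append been placed before the sleep, or at task creation, the fast case would show a phantom HEDGING entry.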

Acceptance criteria (testable)

  • AC1 Single-region client, read_item success → response.is_hedging_started() == False; response.get_requested_regions() == (RequestedRegion("centralus", INITIAL),); response.get_responded_regions() == ("centralus",).
  • AC2 Multi-region client, hedging enabled, primary responds under threshold → is_hedging_started() == False; get_requested_regions() has exactly one INITIAL entry; no phantom HEDGING for the cancelled hedge task.
  • AC3 Multi-region client, hedging enabled, primary slow, hedge arm wins → is_hedging_started() == True; get_requested_regions() has ≥2 entries including (hedge_region, HEDGING); get_responded_regions() has ≥1 entry.
  • AC4 410 Gone retry on same region → get_requested_regions() includes consecutive entries (region, INITIAL) then (region, OPERATION_RETRY).
  • AC5 Region failover via _TimeoutFailoverRetryPolicy / _endpoint_discovery_retry_policy → get_requested_regions() includes (originalRegion, INITIAL) then (secondaryRegion, REGION_FAILOVER).
  • AC6 Unknown reason from a future-version SDK → RequestedRegionReason("future_value") returns RequestedRegionReason.UNKNOWN via _missing_.
  • AC7 All-regions-down error → CosmosHttpResponseError.get_requested_regions() non-empty; get_responded_regions() may be empty.
  • AC8 Deepcopy regression — explicit test that walks the dispatch path with copy.deepcopy(request_params) in place and confirms appends still reach the final state (SE-002).
  • AC9 Sync↔async parity: tests/test_hedging_detection.py has a twin, tests/test_hedging_detection_async.py, that exercises the same scenarios via azure.cosmos.aio.
  • AC10 Existing CosmosHttpLoggingPolicy log format and client.last_response_headers behavior unchanged.
  • AC11 Type-stub test — mypy --strict azure.cosmos (or equivalent existing infrastructure) passes.
  • AC12 APIView snapshot regenerated; reviewers consulted.
  • AC13 Live multi-region smoke test (≥1) — runs against the team's multi-region test account with hedging enabled, injects primary-slow latency, asserts on a wrapper (e.g., CosmosDict) and an exception (e.g., CosmosHttpResponseError) call site: is_hedging_started() == True, get_requested_regions() includes both regions, get_responded_regions() includes the secondary region. Both sync + async twin.
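The parity requirement in AC9 lends itself to a mechanical check. The function below is one possible shape for it, not the existing CI script: the name `find_missing_async_twins` and the `test_hedging` prefix filter are assumptions chosen so that tests with no async counterpart by design (such as a pure type test) are not flagged.

```python
from pathlib import Path
from typing import List


def find_missing_async_twins(tests_dir: Path, prefix: str = "test_hedging") -> List[str]:
    """Return sync test files under tests_dir that lack a '<name>_async.py' twin."""
    missing = []
    for sync_file in sorted(tests_dir.glob(f"{prefix}*.py")):
        if sync_file.stem.endswith("_async"):
            continue  # this file is itself a twin
        twin = sync_file.with_name(f"{sync_file.stem}_async.py")
        if not twin.exists():
            missing.append(sync_file.name)
    return missing
```

A CI step could call this on the tests directory and fail the build when the returned list is non-empty.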

Files in scope

  • New: sdk/cosmos/azure-cosmos/azure/cosmos/_diagnostics_types.py (dataclass + enum), sdk/cosmos/azure-cosmos/azure/cosmos/_diagnostics.py (private _HedgingDetectionState + helpers)
  • Modify: sdk/cosmos/azure-cosmos/azure/cosmos/__init__.py (exports), the seven wrapper / exception types (CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged, CosmosHttpResponseError, CosmosBatchOperationError, CosmosClientTimeoutError), _availability_strategy_handler.py:116, aio/_asynchronous_availability_strategy_handler.py:126, _retry_utility.Execute:59, aio/_retry_utility_async.ExecuteAsync:63, __init__.pyi type stubs, sdk/cosmos/azure-cosmos/CHANGELOG.md
  • Tests: new tests/test_hedging_detection.py + tests/test_hedging_detection_async.py + tests/test_diagnostics_types.py, live-account multi-region test
  • Samples: sdk/cosmos/azure-cosmos/samples/ — usage examples for if response.is_hedging_started(): ...
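As a rough idea of what a caller-side sample might look like, the sketch below summarizes the three proposed accessors for logging. Since the API does not exist yet, it runs against a stub built from `SimpleNamespace`; the `summarize_hedging` helper and the stub's region values are hypothetical.

```python
from types import SimpleNamespace


def summarize_hedging(response) -> str:
    """Build a one-line summary from the three proposed accessors."""
    requested = ", ".join(
        f"{r.region_name} ({r.reason})" for r in response.get_requested_regions()
    )
    responded = ", ".join(response.get_responded_regions()) or "none"
    if response.is_hedging_started():
        return f"hedged; requested: {requested}; responded: {responded}"
    return f"not hedged; requested: {requested}; responded: {responded}"


# Stub standing in for a wrapper such as CosmosDict (hypothetical values).
stub = SimpleNamespace(
    is_hedging_started=lambda: True,
    get_requested_regions=lambda: (
        SimpleNamespace(region_name="centralus", reason="initial"),
        SimpleNamespace(region_name="eastus", reason="hedging"),
    ),
    get_responded_regions=lambda: ("eastus",),
)
summary = summarize_hedging(stub)
print(summary)
```

Because the same three methods are proposed for the exception types, the helper could equally be called on a caught CosmosHttpResponseError.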

Out of scope

  • Restoring the deprecated azure/cosmos/diagnostics.py (_RecordDiagnostics) — stays deprecated.
  • Changing CosmosHttpLoggingPolicy log format.
  • Adding a new public CosmosDiagnostics class (explicitly rejected at the cross-SDK gate in favor of three methods per type).
  • Wiring into OpenTelemetry — separate work item.

Cross-SDK companion issues

Notes for the implementer

  • Full internal spec, landscape research, plan, risk register (side-effects.json), and questions+answers are available from the workflow author (@NaluTripician) on request — they are team-only and not linked here.
  • Phase 1 review gate completed on 2026-05-14. The chosen Python shape is "three methods per type backed by shared private _HedgingDetectionState" — no new public CosmosDiagnostics class. Earlier closed PR Cosmos Diagnostics #25678 is historical context only; it proposed an Option A shape that the gate rejected.
  • This issue is being dispatched to the Coding Agent Harness for end-to-end implementation; reviewers may receive a draft PR shortly.
