.NET: [.Net][ADR] OTel BCL-native emission for ChatClientAgent (ADR 0027) + implementation #5971

Open
rogerbarreto wants to merge 3 commits into microsoft:main from rogerbarreto:adr/agent-otel-auto-instrumentation
Conversation

rogerbarreto (Member) commented May 20, 2026

What

This PR now contains two ADRs and the implementation for the second one:

  • ADR 0026 (docs/decisions/0026-agent-otel-auto-instrumentation.md) — original proposal for a convention-aligned AddAgentFrameworkInstrumentation() extension on TracerProviderBuilder / MeterProviderBuilder. Kept as historical context for the discussion that led to ADR 0027.
  • ADR 0027 (docs/decisions/0027-agent-otel-bcl-native-emission.md) — the proposal we landed on, plus its implementation: bare ChatClientAgent emits OpenTelemetry natively when a tracer subscribes to Experimental.Microsoft.Agents.AI, with no other configuration change.

Why ADR 0027

ADR 0027 matches the BCL pattern (HttpClient, System.Net.Http): subscribing to a known source name via AddSource(...) is enough; the library emits natively. No DI container required, no extra extension method, no opt-in beyond the existing source name.
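
As a rough illustration of the subscription model described above, the sketch below uses only System.Diagnostics: a raw ActivityListener filters by source name, which is approximately what TracerProviderBuilder.AddSource(...) arranges through the OpenTelemetry SDK. The class and method names (SourceSubscriptionSketch, Subscribe, TryEmit) are hypothetical and not part of this PR; TryEmit merely stands in for the library side.

```csharp
#nullable enable
using System;
using System.Diagnostics;

public static class SourceSubscriptionSketch
{
    // Source name taken from the PR description.
    public const string SourceName = "Experimental.Microsoft.Agents.AI";

    // Registers a raw BCL listener for exactly one source name.
    public static ActivityListener Subscribe(string sourceName)
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == sourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }

    // Stand-in for the emitting library: returns true when an activity was created.
    public static bool TryEmit(ActivitySource source)
    {
        using Activity? activity = source.StartActivity("invoke_agent");
        return activity is not null;
    }
}
```

With no listener registered for the name, StartActivity returns null and TryEmit reports false; once Subscribe has been called for that name, the same call produces an activity. No DI container or extra extension method is involved.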

The bare path and the explicit new OpenTelemetryAgent(agent) decorator path produce identical 2-span output (invoke_agent + chat). Today's dashboards keep working unchanged.

How it works

ChatClientAgent lazily holds an OpenTelemetryAgent(this, defaultSource) in a _selfTelemetryWrap field. RunCoreAsync (and the streaming counterpart) checks SuppressSelfTelemetryWrap() and either:

  1. Calls RunChatClientCoreAsync directly (suppression path), or
  2. Delegates to the self-wrap, which re-enters RunCoreAsync exactly once with a per-instance marker stamped on the outer invoke_agent activity.

Suppression triggers (any one is enough):

  • ChatClientAgentOptions.UseProvidedChatClientAsIs == true (existing per-agent opt-out).
  • OpenTelemetryConsts.AgentActivitySource.HasListeners() == false (no tracer subscribed to the default source). This fast path is required: without it, when no listeners exist, no Activity is created, the marker cannot be stamped, and the re-entrant call recurses indefinitely.
  • The Activity parent chain contains the per-instance marker (OwnedInvokeAgentScopeMarker custom property) whose value is reference-equal to this. This handles both the bare path's own re-entry and the explicit decorator path (where the user wraps the agent themselves).

The marker is set in exactly one place: OpenTelemetryAgent.UpdateCurrentActivity resolves this.InnerAgent.GetService<ChatClientAgent>() and stamps that reference as a custom property. Custom properties are process-local; they are not exported as OTLP span attributes.
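
To make the distinction between custom properties and exported attributes concrete, here is a minimal sketch using only the BCL Activity type. The key name and the helper class are hypothetical; the real key is the OwnedInvokeAgentScopeMarker constant in OpenTelemetryConsts, and the real stamping happens in OpenTelemetryAgent.UpdateCurrentActivity.

```csharp
#nullable enable
using System.Diagnostics;
using System.Linq;

public static class CustomPropertySketch
{
    // Hypothetical key, for illustration only.
    public const string MarkerKey = "sketch.owned_invoke_agent_scope";

    // Creates an activity carrying one tag and one custom property.
    public static Activity Stamp(object owner)
    {
        var activity = new Activity("invoke_agent");
        activity.SetTag("sketch.tag", "visible-to-exporters"); // tags become span attributes
        activity.SetCustomProperty(MarkerKey, owner);          // in-process object reference only
        return activity;
    }

    // Reference equality, not value equality, decides ownership.
    public static bool IsOwnedBy(Activity activity, object owner) =>
        ReferenceEquals(activity.GetCustomProperty(MarkerKey), owner);

    // True when the key appears among the activity's tags.
    public static bool HasTag(Activity activity, string key) =>
        activity.TagObjects.Any(tag => tag.Key == key);
}
```

In this sketch the marker is retrievable through GetCustomProperty but never shows up in TagObjects, which is the collection exporters read span attributes from.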

Per-instance scoping via ReferenceEquals ensures nested sub-agent calls via tools emit their own invoke_agent spans (different instance, different marker, no suppression).
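
The control flow above can be sketched in a simplified, synchronous form. This is not the code in ChatClientAgent.cs: the class, source name, and marker key are hypothetical, the opt-out flag stands in for UseProvidedChatClientAsIs, and the null-activity fallback is an assumption of this sketch rather than something the PR describes.

```csharp
#nullable enable
using System.Diagnostics;

public sealed class SelfWrappingAgentSketch
{
    // Hypothetical source and key; the real ones live in OpenTelemetryConsts.
    public static readonly ActivitySource Source = new("Sketch.SelfWrap");
    private const string MarkerKey = "sketch.owner_marker";

    public int CoreCalls { get; private set; }
    public int SpansStarted { get; private set; }
    public bool OptOut { get; set; }
    public SelfWrappingAgentSketch? SubAgent { get; set; }

    public void Run()
    {
        if (ShouldSuppress())
        {
            RunCore();
            return;
        }

        using Activity? activity = Source.StartActivity("invoke_agent");
        if (activity is null)
        {
            // Sampled out: nowhere to stamp the marker, so do not re-enter.
            RunCore();
            return;
        }

        SpansStarted++;
        activity.SetCustomProperty(MarkerKey, this);
        Run(); // re-enters exactly once; the marker now suppresses the wrap
    }

    private bool ShouldSuppress()
    {
        if (OptOut || !Source.HasListeners()) return true;

        // Walk the parent chain looking for a marker that is this very instance.
        for (Activity? a = Activity.Current; a is not null; a = a.Parent)
        {
            if (ReferenceEquals(a.GetCustomProperty(MarkerKey), this)) return true;
        }
        return false;
    }

    private void RunCore()
    {
        CoreCalls++;
        SubAgent?.Run(); // different instance, so its own marker check fails and it wraps itself
    }
}
```

With a listener attached, a parent agent and its sub-agent each start one invoke_agent activity and each run their core logic once. With no listener, the HasListeners check short-circuits and no activity is created.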

What changed

| File | Change |
| --- | --- |
| dotnet/src/Microsoft.Agents.AI/ChatClient/ChatClientAgent.cs | _selfTelemetryWrap field, SuppressSelfTelemetryWrap, EnsureSelfTelemetryWrap, RunCoreAsync split into RunChatClientCoreAsync (streaming counterpart same shape). |
| dotnet/src/Microsoft.Agents.AI/OpenTelemetryAgent.cs | UpdateCurrentActivity stamps the per-instance marker on the outer invoke_agent activity. |
| dotnet/src/Microsoft.Agents.AI/OpenTelemetryConsts.cs | Added OwnedInvokeAgentScopeMarker custom-property key and AgentActivitySource static field for the HasListeners fast path. |
| dotnet/tests/Microsoft.Agents.AI.UnitTests/ChatClient/ChatClientAgentOpenTelemetryTests.cs (new, 16 tests) | Bare-path emission, sub-agent isolation, concurrent first-calls, sensitive-data default, provider-style wrapper, decorator-path no triple emission. |
| dotnet/tests/Microsoft.Agents.AI.UnitTests/ChatClient/OwnerScopedActivityCapture.cs (new) | Test helper. Raw ActivityListener plus parent-chain marker filter. Makes tests fully parallel-safe against the global ActivitySource listener registry without serialization attributes. Lives in the test project only; production users keep using TracerProvider.AddSource(...) as today. |
| dotnet/tests/Microsoft.Agents.AI.UnitTests/OpenTelemetryAgentTests.cs | One pre-existing test migrated to use OwnerScopedActivityCapture so parallel execution stays clean. |
| docs/decisions/0027-agent-otel-bcl-native-emission.md (new) | The ADR itself. |
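
For readers unfamiliar with the test-helper idea, the following is a minimal sketch of a raw ActivityListener combined with a parent-chain marker filter. It is not the OwnerScopedActivityCapture class from this PR; the type name, constructor shape, and marker key are assumptions made for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

public sealed class OwnerScopedCaptureSketch : IDisposable
{
    private readonly ActivityListener _listener;
    private readonly string _markerKey;
    private readonly object _owner;

    // Stopped activities whose parent chain carries this capture's owner.
    public ConcurrentQueue<Activity> Captured { get; } = new();

    public OwnerScopedCaptureSketch(string sourceName, string markerKey, object owner)
    {
        _markerKey = markerKey;
        _owner = owner;
        _listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == sourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = activity =>
            {
                if (BelongsToOwner(activity)) Captured.Enqueue(activity);
            },
        };
        ActivitySource.AddActivityListener(_listener);
    }

    // Checks the activity itself, then each ancestor, for a reference-equal marker.
    private bool BelongsToOwner(Activity activity)
    {
        for (Activity? a = activity; a is not null; a = a.Parent)
        {
            if (ReferenceEquals(a.GetCustomProperty(_markerKey), _owner)) return true;
        }
        return false;
    }

    public void Dispose() => _listener.Dispose();
}
```

Because the listener registry is process-global, every test's listener sees every activity on the source. Filtering by owner reference lets each test keep only its own spans, so tests need no serialization.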

Validation

| Suite | Result |
| --- | --- |
| Full Microsoft.Agents.AI.UnitTests in parallel (no [Collection]) | 1564 pass, 0 fail, 0 skipped |
| ChatClientAgentOpenTelemetryTests | 16 pass |
| OpenTelemetryAgentTests (existing + migrated) | 44 pass |
| CI-parity dotnet format --verify-no-changes (WSL2 + Docker, mcr.microsoft.com/dotnet/sdk:10.0) | clean on both changed projects |

Known trade-offs (documented in the ADR)

  • Metrics-only subscribers (no tracer) bypass the self-wrap. This follows from the HasListeners fast path, which is required to prevent infinite recursion. Users in that mode must explicitly wrap the agent with new OpenTelemetryAgent(agent).
  • Per-instance cost. One OpenTelemetryAgent allocated lazily per ChatClientAgent, held until the agent is collected. Per-instance, not per-call.
  • Custom source with no listeners edge case. If a user wraps with OpenTelemetryAgent(agent, "MyCustomSource") and only the default source has listeners, the inner self-wrap will emit on the default source. Mitigation documented in the ADR.
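
The metrics-only trade-off can be reproduced with BCL types alone. The sketch below assumes, purely for illustration, a Meter and an ActivitySource that share a name; the helper class and method names are hypothetical.

```csharp
#nullable enable
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class MetricsOnlySketch
{
    // Subscribes to instruments of one meter, and to nothing else.
    public static MeterListener SubscribeToMetrics(string meterName)
    {
        var listener = new MeterListener
        {
            InstrumentPublished = (instrument, self) =>
            {
                if (instrument.Meter.Name == meterName)
                {
                    self.EnableMeasurementEvents(instrument);
                }
            },
        };
        listener.Start();
        return listener;
    }

    // Mirrors the fast-path condition: tracing is skipped when no ActivityListener exists.
    public static bool TracingWouldBeSkipped(ActivitySource source) => !source.HasListeners();
}
```

A MeterListener enables instruments but does not register with the ActivitySource listener registry, so HasListeners() stays false and the self-wrap is bypassed. That is why metrics-only users need the explicit decorator.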

Out of scope (follow-up)

  • invoke_agent span enrichment (system instructions, conversation id, aggregated multi-turn token usage, full chat options on the invoke_agent span). Tracked as placeholder ADR 0028.
  • Agents without an inner ChatClientAgent (A2AAgent, DurableAIAgent, WorkflowHostAgent).

Refs

Copilot AI review requested due to automatic review settings May 20, 2026 10:24
@moonbox3 moonbox3 added the documentation Improvements or additions to documentation label May 20, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds ADR 0026 proposing convention-aligned OpenTelemetry “auto-instrumentation” activation for the .NET Agent Framework (provider-builder AddXxxInstrumentation() entrypoints + DI-aware AsAIAgent() wrapping + env-var kill switch). This is documentation-only and is intended to set direction before implementation work begins.

Changes:

  • Introduces a new ADR describing provider-builder activation APIs for agent + workflow telemetry.
  • Proposes DI-driven auto-wrapping behavior for AsAIAgent() factories and an env-var kill switch.
  • Outlines validation/testing expectations and open packaging questions (keep OTel primitives in-core vs move to dedicated packages).
Comments suppressed due to low confidence (1)

docs/decisions/0026-agent-otel-auto-instrumentation.md:139

  • In Consequences, “Adds optional IServiceProvider? services = null to every AsAIAgent() overload” reads as if this parameter is universally new. In practice, several providers already have a services parameter today, while a few (e.g., A2A / Copilot) do not. Suggest clarifying the impact as “adds services to the remaining AsAIAgent overloads that lack it” and optionally call out representative examples, so the scope/cost is accurately represented.
- **Bad:** Adds optional `IServiceProvider? services = null` to every
  `AsAIAgent()` overload across ~10–15 provider packages.
  Source-compatible + binary-compatible. Coordinated release needed.

Comment thread docs/decisions/0026-agent-otel-auto-instrumentation.md
@rogerbarreto rogerbarreto self-assigned this May 20, 2026
rogerbarreto and others added 3 commits May 26, 2026 13:23
OTel is a big-rock work item. The ecosystem convention is an AddXxxInstrumentation() extension on the provider builder plus an environment-variable kill switch; Agent Framework lacks both.

ADR 0026 lays out the options and the decision. Recommended: embed the extensions in the core assemblies, exposing AddAgentFrameworkInstrumentation() on TracerProviderBuilder and MeterProviderBuilder. The dependency is small: only OpenTelemetry.Api.ProviderBuilderExtensions, which adds 2 net-new transitive packages.

AsAIAgent() factories auto-wrap through an IServiceProvider parameter when AgentFrameworkInstrumentationOptions is registered. Workflow.AsAIAgent() follows the same rule. Workflow-internal spans remain opt-in via WorkflowBuilder.WithOpenTelemetry().

Kill switch: OTEL_DOTNET_AGENTFRAMEWORK_INSTRUMENTATION_ENABLED. Multiple calls: last one wins.

ADR-0003 is not superseded. Source-naming questions are deferred to a follow-on ADR.

Refs microsoft#5852

ADR 0027 proposes BCL-native emission for a bare ChatClientAgent. AddSource("Experimental.Microsoft.Agents.AI") alone is enough, matching the HttpClient pattern. Both the bare path and the explicit OpenTelemetryAgent decorator path produce the same 2-span output (invoke_agent + chat).

Design: ChatClientAgent lazily holds an OpenTelemetryAgent(this, defaultSource). RunCoreAsync delegates to the self-wrap unless suppressed. Suppression triggers: UseProvidedChatClientAsIs, no listeners on the default source, or the per-instance marker found on the parent chain. OpenTelemetryAgent.UpdateCurrentActivity stamps the marker with the inner ChatClientAgent reference. The re-entrant call finds the marker and does the chat work. Per-instance ReferenceEquals avoids over-suppressing sub-agent calls.

The HasListeners fast path is mandatory: without it, metrics-only subscribers cause infinite recursion (no Activity, so no marker, so the call re-enters). Documented trade-off: metrics-only users must wrap explicitly with OpenTelemetryAgent.

Tests: ChatClientAgentOpenTelemetryTests (16 new, red to green). The OwnerScopedActivityCapture helper filters activities by per-instance marker via a parent-chain walk. It is test-only, so production users are unaffected. It makes tests parallel-safe against contamination of the global ActivitySource listener registry.

Files:
- docs/decisions/0027-agent-otel-bcl-native-emission.md (new)
- ChatClientAgent.cs: _selfTelemetryWrap, SuppressSelfTelemetryWrap, EnsureSelfTelemetryWrap, RunChatClientCoreAsync split
- OpenTelemetryAgent.cs: UpdateCurrentActivity stamp marker
- OpenTelemetryConsts.cs: OwnedInvokeAgentScopeMarker, AgentActivitySource
- OwnerScopedActivityCapture.cs (new, test helper)
- ChatClientAgentOpenTelemetryTests.cs (new)

All 1564 unit tests pass in parallel. CI-parity dotnet format is clean.
@rogerbarreto rogerbarreto force-pushed the adr/agent-otel-auto-instrumentation branch from fd73566 to 125da26 Compare May 27, 2026 14:59
@rogerbarreto rogerbarreto changed the title [.Net][ADR] OTel auto-instrumentation proposal [.Net][ADR] OTel BCL-native emission for ChatClientAgent (ADR 0027) + implementation May 27, 2026
@moonbox3 moonbox3 added the .NET label May 27, 2026
@github-actions github-actions Bot changed the title [.Net][ADR] OTel BCL-native emission for ChatClientAgent (ADR 0027) + implementation .NET: [.Net][ADR] OTel BCL-native emission for ChatClientAgent (ADR 0027) + implementation May 27, 2026