.NET: [.Net][ADR] OTel BCL-native emission for ChatClientAgent (ADR 0027) + implementation#5971
Open
rogerbarreto wants to merge 3 commits into
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Adds ADR 0026 proposing convention-aligned OpenTelemetry “auto-instrumentation” activation for the .NET Agent Framework (provider-builder AddXxxInstrumentation() entrypoints + DI-aware AsAIAgent() wrapping + env-var kill switch). This is documentation-only and is intended to set direction before implementation work begins.
Changes:
- Introduces a new ADR describing provider-builder activation APIs for agent + workflow telemetry.
- Proposes DI-driven auto-wrapping behavior for
AsAIAgent()factories and an env-var kill switch. - Outlines validation/testing expectations and open packaging questions (keep OTel primitives in-core vs move to dedicated packages).
Comments suppressed due to low confidence (1)
docs/decisions/0026-agent-otel-auto-instrumentation.md:139
- In Consequences, “Adds optional
IServiceProvider? services = nullto everyAsAIAgent()overload” reads as if this parameter is universally new. In practice, several providers already have aservicesparameter today, while a few (e.g., A2A / Copilot) do not. Suggest clarifying the impact as “addsservicesto the remainingAsAIAgentoverloads that lack it” and optionally call out representative examples, so the scope/cost is accurately represented.
- **Bad:** Adds optional `IServiceProvider? services = null` to every
`AsAIAgent()` overload across ~10–15 provider packages.
Source-compatible + binary-compatible. Coordinated release needed.
Big OTel rock. Tribe pattern say AddXxxInstrumentation() on provider, kill switch env var. Agent Framework miss. ADR 0026 land options + decision. Recommended: embed extensions in core assemblies. AddAgentFrameworkInstrumentation() on TracerProviderBuilder + MeterProviderBuilder. Dep small: OpenTelemetry.Api.ProviderBuilderExtensions only, 2 net-new transitive packages. AsAIAgent() factories auto-wrap via IServiceProvider param when AgentFrameworkInstrumentationOptions registered. Workflow.AsAIAgent() join same. Workflow internal spans still opt-in via WorkflowBuilder.WithOpenTelemetry(). Kill switch: OTEL_DOTNET_AGENTFRAMEWORK_INSTRUMENTATION_ENABLED. Multi-call: last-wins. ADR-0003 not superseded. Source-naming questions deferred to follow-on ADR. Refs microsoft#5852
ADR 0027 propose BCL-native emission for bare ChatClientAgent. AddSource("Experimental.Microsoft.Agents.AI") alone enough — match HttpClient pattern. Symmetric 2-span (invoke_agent + chat) for bare AND explicit OpenTelemetryAgent decorator path.
Design: ChatClientAgent lazily hold OpenTelemetryAgent(this, defaultSource). RunCoreAsync delegate to self-wrap unless suppressed. Suppression: UseProvidedChatClientAsIs, no listeners on default source, or per-instance marker found on parent chain. OpenTelemetryAgent.UpdateCurrentActivity stamp marker = inner ChatClientAgent ref. Re-entry find marker, do chat work. Per-instance ReferenceEquals avoid over-suppress sub-agent calls.
HasListeners fast path mandatory — without it metrics-only subscribers cause infinite recursion (no Activity → no marker → re-enter). Documented trade-off: metrics-only users must explicit-wrap with OpenTelemetryAgent.
Tests: ChatClientAgentOpenTelemetryTests (16 new, RED→GREEN). OwnerScopedActivityCapture helper filter activities by per-instance marker via parent-chain walk. Test-only — production users unaffected. Make tests parallel-safe vs global ActivitySource listener registry contamination.
Files:
- docs/decisions/0027-agent-otel-bcl-native-emission.md (new)
- ChatClientAgent.cs: _selfTelemetryWrap, SuppressSelfTelemetryWrap, EnsureSelfTelemetryWrap, RunChatClientCoreAsync split
- OpenTelemetryAgent.cs: UpdateCurrentActivity stamp marker
- OpenTelemetryConsts.cs: OwnedInvokeAgentScopeMarker, AgentActivitySource
- OwnerScopedActivityCapture.cs (new, test helper)
- ChatClientAgentOpenTelemetryTests.cs (new)
All 1564 unit tests pass parallel. CI-parity dotnet format clean.
fd73566 to
125da26
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
This PR now contains two ADRs and the implementation for the second one:
docs/decisions/0026-agent-otel-auto-instrumentation.md) — original proposal for a convention-alignedAddAgentFrameworkInstrumentation()extension onTracerProviderBuilder/MeterProviderBuilder. Kept as historical context for the discussion that led to ADR 0027.docs/decisions/0027-agent-otel-bcl-native-emission.md) — the proposal we landed on, plus its implementation: bareChatClientAgentemits OpenTelemetry natively when a tracer subscribes toExperimental.Microsoft.Agents.AI, with no other configuration change.Why ADR 0027
ADR 0027 matches the BCL pattern (HttpClient,
System.Net.Http): subscribing to a known source name viaAddSource(...)is enough; the library emits natively. No DI container required, no extra extension method, no opt-in beyond the existing source name.The bare path and the explicit
new OpenTelemetryAgent(agent)decorator path produce identical 2-span output (invoke_agent+chat). Today's dashboards keep working unchanged.How it works
ChatClientAgentlazily holds anOpenTelemetryAgent(this, defaultSource)in a_selfTelemetryWrapfield.RunCoreAsync(and the streaming counterpart) checksSuppressSelfTelemetryWrap()and either:RunChatClientCoreAsyncdirectly (suppression path), orRunCoreAsyncexactly once with a per-instance marker stamped on the outerinvoke_agentactivity.Suppression triggers (any one is enough):
ChatClientAgentOptions.UseProvidedChatClientAsIs == true(existing per-agent opt-out).OpenTelemetryConsts.AgentActivitySource.HasListeners() == false(no tracer subscribed to the default source). This fast path is required: without it, when no listeners exist, no Activity is created, the marker cannot be stamped, and the re-entrant call recurses indefinitely.OwnedInvokeAgentScopeMarkercustom property) whose value is reference-equal tothis. This handles both the bare path's own re-entry and the explicit decorator path (where the user wraps the agent themselves).The marker is set in exactly one place:
OpenTelemetryAgent.UpdateCurrentActivityresolvesthis.InnerAgent.GetService<ChatClientAgent>()and stamps that reference as a custom property. Custom properties are process-local; they are not exported as OTLP span attributes.Per-instance scoping via
ReferenceEqualsensures nested sub-agent calls via tools emit their owninvoke_agentspans (different instance, different marker, no suppression).What changed
dotnet/src/Microsoft.Agents.AI/ChatClient/ChatClientAgent.cs_selfTelemetryWrapfield,SuppressSelfTelemetryWrap,EnsureSelfTelemetryWrap,RunCoreAsyncsplit intoRunChatClientCoreAsync(streaming counterpart same shape).dotnet/src/Microsoft.Agents.AI/OpenTelemetryAgent.csUpdateCurrentActivitystamps the per-instance marker on the outerinvoke_agentactivity.dotnet/src/Microsoft.Agents.AI/OpenTelemetryConsts.csOwnedInvokeAgentScopeMarkercustom-property key andAgentActivitySourcestatic field for theHasListenersfast path.dotnet/tests/Microsoft.Agents.AI.UnitTests/ChatClient/ChatClientAgentOpenTelemetryTests.cs(new, 16 tests)dotnet/tests/Microsoft.Agents.AI.UnitTests/ChatClient/OwnerScopedActivityCapture.cs(new)ActivityListenerplus parent-chain marker filter. Makes tests fully parallel-safe against the globalActivitySourcelistener registry without serialization attributes. Lives in the test project only; production users keep usingTracerProvider.AddSource(...)as today.dotnet/tests/Microsoft.Agents.AI.UnitTests/OpenTelemetryAgentTests.csOwnerScopedActivityCaptureso parallel execution stays clean.docs/decisions/0027-agent-otel-bcl-native-emission.md(new)Validation
Microsoft.Agents.AI.UnitTestsin parallel (no[Collection])ChatClientAgentOpenTelemetryTestsOpenTelemetryAgentTests(existing + migrated)dotnet format --verify-no-changes(WSL2 + Docker,mcr.microsoft.com/dotnet/sdk:10.0)Known trade-offs (documented in the ADR)
HasListenersfast path to prevent infinite recursion. Users in that mode must explicitly wrap withnew OpenTelemetryAgent(agent).OpenTelemetryAgentallocated lazily perChatClientAgent, held until the agent is collected. Per-instance, not per-call.OpenTelemetryAgent(agent, "MyCustomSource")and only the default source has listeners, the inner self-wrap will emit on the default source. Mitigation documented in the ADR.Out of scope (follow-up)
invoke_agentspan enrichment (system instructions, conversation id, aggregated multi-turn token usage, full chat options on theinvoke_agentspan). Tracked as placeholder ADR 0028.ChatClientAgent(A2AAgent,DurableAIAgent,WorkflowHostAgent).Refs