feat(streaming): emit OTel metrics for ttft, tps, token counts (#347)
Open
Adds six metrics to TemporalStreamingModel.get_response so applications that configure an OTel MeterProvider can see streaming-call behavior without per-app instrumentation:

- agentex.llm.ttft (histogram, ms): time from request start to first content delta. Captured on the first ResponseTextDeltaEvent / ResponseReasoningTextDeltaEvent / ResponseReasoningSummaryTextDeltaEvent.
- agentex.llm.tps (histogram, tokens/s): output_tokens / stream_duration. Use 1/tps for time-per-output-token (tpot).
- agentex.llm.input_tokens / output_tokens / cached_input_tokens / reasoning_tokens (counters): pulled from the captured Responses API Usage at end-of-stream. Cache hit rate is computed at query time as rate(cached_input_tokens) / rate(input_tokens).

Why

- The data was already captured (line 854: captured_usage = response.usage) but never emitted as metrics. Apps could only see total LLM call duration, not the meaningful breakdowns.
- Doing this in the SDK rather than in each app means every consumer of TemporalStreamingModel gets the metrics for free.
- Cardinality is bounded — only `model` is a metric attribute. Resource attributes (service.name, k8s.*, etc.) come from the application's configured OTel resource, so cross-app comparisons work cleanly in Mimir/Prometheus.

The meter is a no-op when no MeterProvider is configured, so this is safe for apps that don't run with OTel; the app-side opt-in is sketched below.
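For context, the opt-in on the application side is just standard OTel SDK bootstrap. A minimal sketch — the service name and the console exporter are placeholders; a real deployment would point an OTLP exporter at its collector:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.resources import Resource

# Export every 10s; swap ConsoleMetricExporter for an OTLP exporter in practice.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=10_000
)
provider = MeterProvider(
    # Resource attributes (service.name, k8s.*, ...) ride along on every metric.
    resource=Resource.create({"service.name": "my-agent-app"}),
    metric_readers=[reader],
)
metrics.set_meter_provider(provider)
# From here on, agentex.llm.* metrics flow; without this, the SDK's meter is a no-op.
```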
Three changes from PR #347 review:

1. Move stream_start_perf above responses.create() so ttft captures the full request-to-first-token latency (HTTP round-trip + model TTFB), not just post-connect event-loop delay. Drop the duplicate assignment that was being overwritten after the await.
2. Track last_token_at alongside first_token_at and use the generation window (first→last delta) as the tps denominator. Total stream duration was inflated by tool-call argument deltas, reasoning events, and stream_update awaits — making tps under-report the model's actual generation speed for agentic responses.
3. Lazy-create the OTel instruments inside _StreamingMetrics rather than at import time, so a MeterProvider configured *after* this module is imported still binds correctly. Apps that bootstrap OTel in a startup hook (common with lazy-init patterns) would otherwise silently send to the no-op provider.
- Bookmark first/last token timestamps on ResponseFunctionCallArgumentsDeltaEvent too, so the tps generation window covers all event types whose tokens land in usage.output_tokens. Previously the numerator counted argument tokens but the denominator excluded their generation time, inflating tps for tool-heavy responses.
- Lifted the bookmarking out of the text-delta branch into a single up-front check covering all four token-producing event types — cleaner than duplicating across branches.
- Documented the single-token skip case (window collapses to 0) inline at the guard. TPS is undefined for a one-token response so emitting nothing is correct; the comment makes the intent visible to future readers. The resulting loop structure is sketched below.
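A condensed sketch of the loop shape these three commits converge on. The event classes are the ones named in the commit text (import path assumed from the openai SDK); the function, the `metrics` object, and the variable names are illustrative, not the actual TemporalStreamingModel code:

```python
import time

# Event classes per the commit text; import path assumed from the openai SDK.
from openai.types.responses import (
    ResponseFunctionCallArgumentsDeltaEvent,
    ResponseReasoningSummaryTextDeltaEvent,
    ResponseReasoningTextDeltaEvent,
    ResponseTextDeltaEvent,
)

# All four token-producing event types, checked once up front.
TOKEN_EVENTS = (
    ResponseTextDeltaEvent,
    ResponseReasoningTextDeltaEvent,
    ResponseReasoningSummaryTextDeltaEvent,
    ResponseFunctionCallArgumentsDeltaEvent,
)

async def stream_with_metrics(client, metrics, model: str, **kwargs):
    # Bookmarked before the request so ttft includes the HTTP round-trip.
    stream_start_perf = time.perf_counter()
    first_token_at = last_token_at = None
    output_tokens = 0  # in the real code this comes from the captured Usage

    stream = await client.responses.create(model=model, stream=True, **kwargs)
    async for event in stream:
        if isinstance(event, TOKEN_EVENTS):
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now
                metrics.ttft.record((now - stream_start_perf) * 1000, {"model": model})
            last_token_at = now
        ...  # existing event handling; usage captured at end-of-stream

    # Generation window: first delta -> last delta, not total stream duration.
    window = (last_token_at - first_token_at) if first_token_at else 0.0
    # A one-token response collapses the window to 0; tps is undefined, emit nothing.
    if window > 0 and output_tokens:
        metrics.tps.record(output_tokens / window, {"model": model})
```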
Two follow-up changes from the PR review:
1. Move the LLM metric instruments from _StreamingMetrics in
temporal_streaming_model.py to a new module:
agentex.lib.core.observability.llm_metrics
Public API: get_llm_metrics() returns a singleton LLMMetrics with
the same six instruments (ttft, tps, input_tokens, output_tokens,
cached_input_tokens, reasoning_tokens) plus a new requests counter.
This makes the temporal+openai_agents plugin one of several future
call sites — the sync ACP path and the Claude SDK plugin can
record to the same instruments without redefining names, units,
or descriptions. Keeps cross-provider naming consistent.
2. Add agentex.llm.requests counter with a status label so 429s,
5xxs, timeouts, and other failures are observable on the SDK
side without scraping logs. classify_status() maps exception
types to a small fixed set (success / rate_limit / server_error
/ client_error / timeout / network_error / other_error) by class
name, so it works across OpenAI, Anthropic, and other provider
SDKs that use similar exception naming.
Recorded in two places: success path (alongside token counters)
and the existing get_response except handler (so terminal
failures emit a counter event before re-raising).
Cardinality remains bounded — model + status (7 values) on the
counter; all other metrics keep just `model`.
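A minimal sketch of the module shape this describes, against the standard OTel metrics API. The instrument names, the requests counter, and the status set come from the commit text; the singleton wiring and the class-name heuristics inside classify_status are illustrative:

```python
from opentelemetry import metrics

_instance = None

class LLMMetrics:
    def __init__(self) -> None:
        # Instruments are created lazily at first use, so a MeterProvider
        # configured after import still binds (no-op meter otherwise).
        meter = metrics.get_meter("agentex.llm")
        self.ttft = meter.create_histogram("agentex.llm.ttft", unit="ms")
        self.tps = meter.create_histogram("agentex.llm.tps", unit="tokens/s")
        self.requests = meter.create_counter("agentex.llm.requests")
        self.input_tokens = meter.create_counter("agentex.llm.input_tokens")
        self.output_tokens = meter.create_counter("agentex.llm.output_tokens")
        self.cached_input_tokens = meter.create_counter("agentex.llm.cached_input_tokens")
        self.reasoning_tokens = meter.create_counter("agentex.llm.reasoning_tokens")

def get_llm_metrics() -> LLMMetrics:
    """Singleton so every call site shares names, units, and descriptions."""
    global _instance
    if _instance is None:
        _instance = LLMMetrics()
    return _instance

def classify_status(exc: BaseException | None) -> str:
    """Map exceptions to a fixed status set by class name, so OpenAI,
    Anthropic, and similar SDKs classify the same way. Heuristics illustrative."""
    if exc is None:
        return "success"
    name = type(exc).__name__.lower()
    if "ratelimit" in name:
        return "rate_limit"
    if "timeout" in name:
        return "timeout"
    if "connection" in name:
        return "network_error"
    if "internalserver" in name or "serviceunavailable" in name:
        return "server_error"
    if "badrequest" in name or "authentication" in name or "notfound" in name:
        return "client_error"
    return "other_error"
```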
joshua0chen approved these changes (May 7, 2026)
ttft fires on the first content delta of any kind, which for reasoning models means the first reasoning chunk — it arrives quickly even when the user-perceived latency is much longer. ttat fires only on the first user-visible answer token (text delta or tool-call arguments delta), excluding reasoning chunks. For non-reasoning models the two are equal; for gpt-5-class / o-series models they differ by the reasoning duration. This pairs with ttft to separate "did the model start thinking quickly?" from "how long did the user wait for an answer?" — both are valuable signals that mean different things on reasoning workloads.

Implementation: a third bookmark variable (`first_answer_at`) set inside the same up-front event-type check, restricted to ResponseTextDeltaEvent / ResponseFunctionCallArgumentsDeltaEvent. Adds one new histogram (`agentex.llm.ttat`) with the same labels and units as ttft; the relevant branch is sketched below.
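Extending the per-event branch from the earlier loop sketch — a fragment, with `metrics.ttat` assumed to be the new histogram:

```python
# Answer tokens only: text deltas and tool-call argument deltas, not reasoning.
ANSWER_EVENTS = (ResponseTextDeltaEvent, ResponseFunctionCallArgumentsDeltaEvent)

# Inside the single up-front event-type check:
if isinstance(event, TOKEN_EVENTS):
    now = time.perf_counter()
    if first_token_at is None:
        first_token_at = now  # ttft: first delta of any kind, reasoning included
    if first_answer_at is None and isinstance(event, ANSWER_EVENTS):
        first_answer_at = now  # ttat: first user-visible answer token
        metrics.ttat.record((now - stream_start_perf) * 1000, {"model": model})
    last_token_at = now
```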
If get_llm_metrics().requests.add() raises (misbehaving exporter, OTel SDK bug, network blip mid-export), the original LLM exception would be shadowed by the metric error. Callers — retry logic, circuit breakers, the OpenAI Agents Temporal plugin's retryable/non-retryable classifier — inspect the typed exception (RateLimitError, APITimeoutError, etc.) and would silently break with an unexpected OTel exception in its place. Wrap the .add() call in a bare try/except so the metric is best-effort and the typed LLM exception always propagates.
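A sketch of that best-effort shape — record_llm_failure is the helper name that appears in the next commit; the body here is illustrative:

```python
def record_llm_failure(model: str, exc: BaseException) -> None:
    try:
        get_llm_metrics().requests.add(
            1, {"model": model, "status": classify_status(exc)}
        )
    except Exception:
        # Metrics are best-effort: an exporter/OTel failure must never shadow
        # the typed LLM exception (RateLimitError, APITimeoutError, ...) that
        # retry logic and circuit breakers classify on.
        pass

# In get_response's except handler:
#     record_llm_failure(model_name, exc)
#     raise
```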
Splits metric emission by what each layer can see:

- agentex.lib.core.observability.llm_metrics_hooks.LLMMetricsHooks (RunHooks subclass) emits agentex.llm.requests + the four token counters in on_llm_end. Works for any RunHooks-aware path.
- TemporalStreamingHooks now inherits from LLMMetricsHooks so the async path picks up the same metrics automatically.
- TemporalStreamingModel keeps only the streaming-only metrics (ttft, ttat, tps) — those need per-chunk visibility hooks can't provide. Failure path uses the new record_llm_failure helper.

This makes adding the sync ACP path trivial later: pass LLMMetricsHooks() to Runner.run from services/adk/providers/openai.py and it'll emit the same metrics with no double-counting. The layering is sketched below.

Tests cover:

- classify_status branches (rate_limit / timeout / server_error / network_error / client_error / other_error / success)
- get_llm_metrics singleton + instrument presence
- LLMMetricsHooks.on_llm_end emits requests + token counters with the right model attribute
- Both the hooks path and record_llm_failure swallow exporter exceptions so callers don't break when metrics fail
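A layering sketch — class and module names come from this comment, the import path and on_llm_end signature are assumed from the openai-agents SDK, and the bodies are elided:

```python
from agents import RunHooks, Runner  # openai-agents SDK; import path assumed

class LLMMetricsHooks(RunHooks):
    """Emits agentex.llm.requests + the four token counters on LLM completion."""

    async def on_llm_end(self, context, agent, response) -> None:
        ...  # counters; a defensive version is sketched after the next commit note

class TemporalStreamingHooks(LLMMetricsHooks):
    """Existing Temporal hook logic layers on top; counters come for free."""

# Sync ACP path opts in explicitly — same metrics, no double-counting:
# result = await Runner.run(agent, input_items, hooks=LLMMetricsHooks())
```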
If a provider returns a ModelResponse with a Usage shape the OpenAI Agents SDK didn't fully normalize — missing input_tokens_details, missing usage entirely, None token values — we want to record what we can and skip the rest, never crash the caller.

- Move requests.add outside the usage-extraction try block so the success counter still fires when usage access raises (e.g., None).
- Add three tests covering: a response with a raising .usage property, a Usage missing input_tokens_details, and a Usage with all-None token values.
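A sketch of the defensive on_llm_end body this describes. Attribute names follow the OpenAI Agents SDK Usage model (input_tokens_details.cached_tokens, output_tokens_details.reasoning_tokens); the structure mirrors the commit text, the rest is illustrative:

```python
async def on_llm_end(self, context, agent, response) -> None:
    m = get_llm_metrics()
    attrs = {"model": str(agent.model)}
    # Outside the usage try-block: success is counted even if usage is odd.
    m.requests.add(1, {**attrs, "status": "success"})
    try:
        usage = response.usage
        m.input_tokens.add(usage.input_tokens or 0, attrs)    # `or 0` handles None
        m.output_tokens.add(usage.output_tokens or 0, attrs)
        details = getattr(usage, "input_tokens_details", None)
        if details is not None:
            m.cached_input_tokens.add(details.cached_tokens or 0, attrs)
        details = getattr(usage, "output_tokens_details", None)
        if details is not None:
            m.reasoning_tokens.add(details.reasoning_tokens or 0, attrs)
    except Exception:
        # Partial/odd Usage shapes (litellm, Anthropic, ...): record what we
        # can, skip the rest, never crash the caller.
        pass
```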
smoreinis approved these changes (May 8, 2026)
Metrics
Eight OTel metrics in agentex.lib.core.observability. Histograms are streaming-only (need per-chunk visibility); counters are provider-agnostic and fire from RunHooks.on_llm_end.

| Metric | Instrument | Notes | Emitted by |
|---|---|---|---|
| agentex.llm.ttft | histogram (ms) | first content delta of any kind | TemporalStreamingModel |
| agentex.llm.ttat | histogram (ms) | first answer token (text / tool-args) | TemporalStreamingModel |
| agentex.llm.tps | histogram (tokens/s) | window: first_token_at → last_token_at | TemporalStreamingModel |
| agentex.llm.requests | counter | labeled with status | LLMMetricsHooks |
| agentex.llm.input_tokens | counter | | LLMMetricsHooks |
| agentex.llm.output_tokens | counter | | LLMMetricsHooks |
| agentex.llm.cached_input_tokens | counter | | LLMMetricsHooks |
| agentex.llm.reasoning_tokens | counter | | LLMMetricsHooks |

status values: success / rate_limit / server_error / client_error / timeout / network_error / other_error.

Why both layers
Hooks fire on every Runner.run / Runner.run_streamed call — async-Temporal and sync ACP both pick them up. TemporalStreamingHooks inherits LLMMetricsHooks so existing async users get the counters automatically. Sync ACP opts in: Runner.run_streamed(..., hooks=LLMMetricsHooks()). The streaming model owns ttft/ttat/tps because hooks can't see individual chunks.
Cardinality
model on histograms; model × status on the request counter. No task-id, no per-call labels.

Robustness
on_llm_end reaches into nested optional fields some providers don't populate (litellm/Anthropic). Defensive structure:

- requests emits outside the usage-extraction try — success counter fires even if the response shape is unusual.
- `... or 0` handles None.
- record_llm_failure (used in the error path) has its own try/except — exporter failures never shadow the original LLM exception.

Tests cover: missing .usage, missing input_tokens_details, all-None token values.

Tested
- core/observability/tests/ — all passing.
- /v5/ (litellm-equivalent) via OpenAIChatCompletionsModel — 3 messages, no crashes.
- -k 2 eval pinned to this commit — 2/2 examples, metrics visible in Mimir.

Notes
- stream_start_perf is bookmarked before responses.create() so ttft/ttat include network RTT.
- If the upstream OpenAI Agents SDK ever adds metrics hooks on Model, this is a clean swap-out.

Greptile Summary
- Adds eight OTel metrics (ttft, ttat, tps histograms; requests, input_tokens, output_tokens, cached_input_tokens, reasoning_tokens counters) via a lazy singleton (get_llm_metrics()), wired into LLMMetricsHooks (new RunHooks subclass) and TemporalStreamingModel for the streaming-only histograms.
- TemporalStreamingHooks now inherits LLMMetricsHooks so existing Temporal users pick up counters automatically.
- stream_start_perf is correctly placed before responses.create() in the updated code, and instruments are lazily created so a MeterProvider configured after import still binds.
- Remaining concerns: the single usage-extraction try/except causing silent cascade drops, the str(agent.model) fallback can produce memory-address cardinality when agent.model is a Model instance, and the streaming-metrics emit block is unwrapped — an OTel exporter exception there would be swallowed by the outer except handler.

Confidence Score: 4/5
Safe to merge with awareness that several P1s from the prior review round are still open; no new blocking issues introduced.
The prior review surfaced multiple P1 findings (single try/except cascade drop, str(agent.model) cardinality, unwrapped OTel emit block). Those conversations remain unresolved in the thread history, holding the score at 4.
Files to watch: src/agentex/lib/core/observability/llm_metrics_hooks.py (token counter cascade + cardinality), src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py (unwrapped OTel emit and record_llm_failure ordering).
Sequence Diagram
```mermaid
sequenceDiagram
    participant SDK as OpenAI Agents SDK
    participant TSM as TemporalStreamingModel
    participant OTel as OTel (get_llm_metrics)
    participant Hooks as LLMMetricsHooks.on_llm_end
    SDK->>TSM: get_response()
    TSM->>TSM: stream_start_perf = perf_counter()
    TSM->>TSM: await responses.create()
    loop Each stream event
        TSM->>TSM: Record first_token_at / last_token_at / first_answer_at
    end
    TSM->>OTel: ttft_ms.record()
    TSM->>OTel: ttat_ms.record()
    TSM->>OTel: tps.record()
    TSM-->>SDK: ModelResponse
    SDK->>Hooks: on_llm_end(agent, response)
    Hooks->>OTel: requests.add(1, status=success)
    Hooks->>OTel: input_tokens.add()
    Hooks->>OTel: output_tokens.add()
    Hooks->>OTel: cached_input_tokens.add()
    Hooks->>OTel: reasoning_tokens.add()
    note over TSM,OTel: Error path
    TSM-xTSM: Exception in streaming loop
    TSM->>OTel: record_llm_failure(model, exc)
    TSM-->>SDK: raise typed LLM exception
```