feat(telemetry): latency histograms for LLM request duration and TTFB (#463) #782
ajbozarth wants to merge 1 commit into generative-computing:main
Conversation
…generative-computing#463) Adds request duration and time-to-first-token (TTFB) latency histograms via the plugin pattern established in generative-computing#653. Includes custom OTel bucket views sized for LLM latencies, backend telemetry field assertions across all backends, and updated dev/published docs. Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
The PR description has been updated. Please fill out the template for your PR to be reviewed.
I'll be OOTO tomorrow (Fri April 3) through Monday (April 6), so I will address any review feedback on Tuesday. If this gets two committer approvals with no actionable feedback while I'm out, the second approver can feel free to add it to the merge queue. There's no rush on this, so it's fine to wait until I'm back to address feedback.
jakelorocco left a comment
Question: for the "bucket" based metrics are the buckets inclusive of each other? Should the 10 second bucket include all results from the 5 second bucket?
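For context on this bucket question, here is a minimal sketch of how OpenTelemetry explicit-bucket histograms bin values. The boundary list is the one from this PR; the `bin_counts` helper is illustrative, not Mellea code:

```python
import bisect

# OTel explicit-bucket histograms keep one count per bin: boundaries
# [b0, b1, ...] define bins (-inf, b0], (b0, b1], ..., (b_last, +inf).
# A 7s request therefore lands only in the (5, 10] bin; the 10-second
# bucket does NOT also contain the 5-second bucket's results.
# (Cumulative "le"-style counts only appear later, in Prometheus-style
# exposition of the same data.)
BOUNDARIES = [0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120]

def bin_counts(samples: list[float], boundaries: list[float]) -> list[int]:
    counts = [0] * (len(boundaries) + 1)  # last bin is the +inf overflow
    for s in samples:
        counts[bisect.bisect_left(boundaries, s)] += 1
    return counts

counts = bin_counts([0.07, 3.0, 7.0, 7.5, 45.0], BOUNDARIES)
# counts[6] is the (5, 10] bin and holds exactly the two ~7s samples
```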
do_set_computed = False
self.streaming = bool((self._model_options or {}).get("@@@stream@@@", False))
I assume we are using the string explicitly here to avoid importing non-core things into core?
self.ttfb_ms: float | None = None
"""Time to first token in milliseconds (streaming only).

Set when the first chunk is received from the backend.
None for non-streaming requests or when not measured.
"""

self.streaming: bool = False
"""Whether this generation used streaming mode.

Set from model options at the start of astream().
"""
Commenting this as a note to others (not as something to address in this review): I feel that ModelOutputThunk's fields have grown very complicated / numerous. We should investigate slimming these down / partitioning them into larger sub-structures with similar functionality contained in a single substructure.
views = [
    View(  # type: ignore
        instrument_name="mellea.llm.request.duration",
        aggregation=ExplicitBucketHistogramAggregation(  # type: ignore
            [0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120]
        ),
    ),
    View(  # type: ignore
        instrument_name="mellea.llm.ttfb",
        aggregation=ExplicitBucketHistogramAggregation(  # type: ignore
            [0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10]
        ),
    ),
]

provider = MeterProvider(resource=resource, metric_readers=readers, views=views)  # type: ignore
Is this standard? Or should we allow users to set their own values here?
warnings.warn(
    f"TokenMetricsPlugin already registered: {e}", UserWarning, stacklevel=2
)
for _plugin_cls in (TokenMetricsPlugin, LatencyMetricsPlugin):
Feel free to disagree on this one, but I feel like we should extract this to a constant and then give that constant a docstring.
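Concretely, the suggestion could look like this sketch. The constant name is hypothetical, and the stand-in plugin classes are here only so the snippet is self-contained:

```python
# Stand-in plugin classes so the sketch runs on its own; in the real
# code these come from Mellea's telemetry module.
class TokenMetricsPlugin: ...
class LatencyMetricsPlugin: ...

DEFAULT_METRICS_PLUGINS: tuple[type, ...] = (
    TokenMetricsPlugin,
    LatencyMetricsPlugin,
)
"""Plugin classes registered by default when telemetry is enabled.

Module-level so the registration loop and the tests share a single
source of truth for which plugins auto-register.
"""

for _plugin_cls in DEFAULT_METRICS_PLUGINS:
    ...  # register each plugin, warning on duplicates as before
```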
Misc PR
Type of PR
Description
Adds latency histograms for LLM request duration and time-to-first-token (TTFB)
as part of the metrics telemetry epic (#443).
- LatencyMetricsPlugin hooks generation_post_call (FIRE_AND_FORGET) and records mellea.llm.request.duration (every request) and mellea.llm.ttfb (streaming only)
- View + ExplicitBucketHistogramAggregation bucket boundaries sized for LLM latencies; both plugins auto-register alongside TokenMetricsPlugin
- ModelOutputThunk gains streaming: bool and ttfb_ms: float | None; TTFB is captured on first chunk in astream(), gated on self.streaming
- Updated docs: AGENTS.md, metrics.md, telemetry.md, and metrics_example.py
Testing
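The backend telemetry field assertions mentioned above could be sketched like this. The `check_latency_fields` helper and `FakeThunk` are hypothetical, shown only to illustrate the invariant the tests would check:

```python
# Invariant from the PR description: streaming generations carry a TTFB
# measurement; non-streaming requests leave ttfb_ms as None.
def check_latency_fields(thunk) -> None:
    if thunk.streaming:
        assert thunk.ttfb_ms is not None
        assert thunk.ttfb_ms >= 0
    else:
        assert thunk.ttfb_ms is None

class FakeThunk:  # minimal stand-in for a ModelOutputThunk
    streaming = True
    ttfb_ms = 87.3

check_latency_fields(FakeThunk())
```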