feat(telemetry): latency histograms for LLM request duration and TTFB (#463) #782
ajbozarth wants to merge 1 commit into generative-computing:main
Conversation
…generative-computing#463) Adds request duration and time-to-first-token (TTFB) latency histograms via the plugin pattern established in generative-computing#653. Includes custom OTel bucket views sized for LLM latencies, backend telemetry field assertions across all backends, and updated dev/published docs. Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
The PR description has been updated. Please fill out the template for your PR to be reviewed.
I'll be OOTO tomorrow (Fri April 3) through Monday (April 6), so I will address any review feedback on Tuesday. If this gets two committer approvals with no actionable feedback while I'm out, the second approver can feel free to add it to the merge queue. There's no rush on this, so it's fine to wait until I'm back to address feedback.
jakelorocco left a comment
Question: for the "bucket" based metrics are the buckets inclusive of each other? Should the 10 second bucket include all results from the 5 second bucket?
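For context on this bucket question, here is a minimal sketch of how OpenTelemetry explicit-bucket histograms bin values. The boundary list is the one from this PR; the `bin_counts` helper is illustrative, not Mellea code:

```python
import bisect

# OTel explicit-bucket histograms keep one count per bin: boundaries
# [b0, b1, ...] define bins (-inf, b0], (b0, b1], ..., (b_last, +inf).
# A 7s request therefore lands only in the (5, 10] bin; the 10-second
# bucket does NOT also contain the 5-second bucket's results.
# (Cumulative "le"-style counts only appear later, in Prometheus-style
# exposition of the same data.)
BOUNDARIES = [0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120]

def bin_counts(samples: list[float], boundaries: list[float]) -> list[int]:
    counts = [0] * (len(boundaries) + 1)  # last bin is the +inf overflow
    for s in samples:
        counts[bisect.bisect_left(boundaries, s)] += 1
    return counts

counts = bin_counts([0.07, 3.0, 7.0, 7.5, 45.0], BOUNDARIES)
# counts[6] is the (5, 10] bin and holds exactly the two ~7s samples
```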
do_set_computed = False
self.streaming = bool((self._model_options or {}).get("@@@stream@@@", False))
I assume we are using the string explicitly here to avoid importing non-core things into core?
self.ttfb_ms: float | None = None
"""Time to first token in milliseconds (streaming only).

Set when the first chunk is received from the backend.
None for non-streaming requests or when not measured.
"""

self.streaming: bool = False
"""Whether this generation used streaming mode.

Set from model options at the start of astream().
"""
Commenting this as a note to others (not as something to address in this review): I feel that ModelOutputThunk's fields have grown very complicated / numerous. We should investigate slimming these down / partitioning them into larger sub-structures with similar functionality contained in a single substructure.
views = [
    View(  # type: ignore
        instrument_name="mellea.llm.request.duration",
        aggregation=ExplicitBucketHistogramAggregation(  # type: ignore
            [0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120]
        ),
    ),
    View(  # type: ignore
        instrument_name="mellea.llm.ttfb",
        aggregation=ExplicitBucketHistogramAggregation(  # type: ignore
            [0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10]
        ),
    ),
]

provider = MeterProvider(resource=resource, metric_readers=readers, views=views)  # type: ignore
Is this standard? Or should we allow users to set their own values here?
warnings.warn(
    f"TokenMetricsPlugin already registered: {e}", UserWarning, stacklevel=2
)
for _plugin_cls in (TokenMetricsPlugin, LatencyMetricsPlugin):
Feel free to disagree on this one, but I feel like we should extract this to a constant and then give that constant a docstring.
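Concretely, the suggestion could look like this sketch. The constant name is hypothetical, and the stand-in plugin classes are here only so the snippet is self-contained:

```python
# Stand-in plugin classes so the sketch runs on its own; in the real
# code these come from Mellea's telemetry module.
class TokenMetricsPlugin: ...
class LatencyMetricsPlugin: ...

DEFAULT_METRICS_PLUGINS: tuple[type, ...] = (
    TokenMetricsPlugin,
    LatencyMetricsPlugin,
)
"""Plugin classes registered by default when telemetry is enabled.

Module-level so the registration loop and the tests share a single
source of truth for which plugins auto-register.
"""

for _plugin_cls in DEFAULT_METRICS_PLUGINS:
    ...  # register each plugin, warning on duplicates as before
```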
Misc PR
Type of PR
Description
Adds latency histograms for LLM request duration and time-to-first-token (TTFB)
as part of the metrics telemetry epic (#443).
- LatencyMetricsPlugin hooks generation_post_call (FIRE_AND_FORGET) and records mellea.llm.request.duration (every request) and mellea.llm.ttfb (streaming only)
- View + ExplicitBucketHistogramAggregation bucket boundaries sized for LLM latencies; both plugins auto-register alongside TokenMetricsPlugin
- ModelOutputThunk gains streaming: bool and ttfb_ms: float | None; TTFB is captured on first chunk in astream(), gated on self.streaming
- Updated docs: AGENTS.md, metrics.md, telemetry.md, and metrics_example.py
Testing
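The backend telemetry field assertions mentioned above could be sketched like this. The `check_latency_fields` helper and `FakeThunk` are hypothetical, shown only to illustrate the invariant the tests would check:

```python
# Invariant from the PR description: streaming generations carry a TTFB
# measurement; non-streaming requests leave ttfb_ms as None.
def check_latency_fields(thunk) -> None:
    if thunk.streaming:
        assert thunk.ttfb_ms is not None
        assert thunk.ttfb_ms >= 0
    else:
        assert thunk.ttfb_ms is None

class FakeThunk:  # minimal stand-in for a ModelOutputThunk
    streaming = True
    ttfb_ms = 87.3

check_latency_fields(FakeThunk())
```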