feat: Add vLLM instrumentation with server-side TTFT/TPOT metrics #4419
Nik-Reddy wants to merge 1 commit into open-telemetry:main
Conversation
MikeGoldsmith left a comment
Looks like a good start - thanks @Nik-Reddy.
I've left some queries and suggestions. We also need to verify that semconv includes `vllm`, plus add tox files, CI tests, and a changelog entry.
Force-pushed bc55265 to 4acfd9a
Hi @MikeGoldsmith, thank you for the thorough review! I've addressed all four comments. Will open a semconv issue/PR for the `gen_ai.system` value separately. Ready for re-review!
Force-pushed 4acfd9a to 208bc51
Rebased on latest main. All 4 review threads from @MikeGoldsmith have been addressed and resolved. Ready for re-review.
Thanks for the updates @Nik-Reddy. There are a few things we still need to address. On the client/server duration comment, I think there was a misunderstanding: my question was about why both a client and a server duration are recorded for the same local call.
Force-pushed 208bc51 to 264bd87
@Nik-Reddy can you please discuss in https://cloud-native.slack.com/archives/C06KR7ARS3X before adding new instrumentations?
Adding new instrumentation requires finding component owners that are ready to maintain it in the long run.
I'm not an expert on vLLM side, but I don't think this instrumentation belongs in this repo.
If you asked an AI to review this and to question its usefulness, it would give you something like this:
vLLM already ships both natively:
- Prometheus /metrics endpoint with vllm:time_to_first_token_seconds, TPOT, iteration tokens, etc. (vllm/v1/metrics/loggers.py)
- OpenTelemetry tracing in vllm/tracing/utils.py with GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN — emitted from vllm/v1/engine/output_processor.py
- Docs: https://github.com/vllm-project/vllm/blob/main/docs/usage/metrics.md
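Since vLLM already exposes these on its Prometheus `/metrics` endpoint, a consumer can scrape and filter them directly rather than re-instrumenting. A minimal sketch (the helper function and the sample payload are illustrative; only the metric name `vllm:time_to_first_token_seconds` comes from vLLM's docs):

```python
def extract_vllm_metrics(scrape_text):
    """Keep only vLLM's own metric samples from a Prometheus text-format scrape."""
    return {
        line.rsplit(" ", 1)[0]: float(line.rsplit(" ", 1)[1])
        for line in scrape_text.splitlines()
        # Samples start with the metric name; comment lines start with '#'.
        if line.startswith("vllm:")
    }

# Abbreviated, illustrative scrape payload:
sample = """\
# HELP vllm:time_to_first_token_seconds Histogram of TTFT in seconds.
vllm:time_to_first_token_seconds_sum 1.25
vllm:time_to_first_token_seconds_count 10
"""

metrics = extract_vllm_metrics(sample)
print(metrics)
# {'vllm:time_to_first_token_seconds_sum': 1.25,
#  'vllm:time_to_first_token_seconds_count': 10.0}
```

In production the `scrape_text` would come from an HTTP GET of the server's `/metrics` endpoint.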
What the PR actually instruments:
- vllm.LLM.generate() and vllm.LLM.chat() — these are the offline/batch Python API, not the serving path (vllm serve → OpenAI-compatible HTTP server). So the "server-side instrumentation — the first in this repo" framing is misleading; it's still an in-process library wrap, just of a different entry point.
Resulting issues:
- Duplicates metrics vLLM already emits, with different names/semantics (risk of drift from upstream).
- Misses the serving code path entirely — where TTFT/TPOT actually matter in production.
- Reviewer already flagged gen_ai.system=vllm isn't registered in semconv, missing tox/CI wiring, and duplicate client/server duration recording.
- Upstream is the natural home for this: improving vLLM's built-in OTel emitter (which already exists) would benefit everyone without a monkey-patch layer.
The right path is probably: contribute missing semconv attributes upstream to vLLM, not add a wrapper here. Worth raising on the PR/issue #3932 before more work goes in.
@MikeGoldsmith All three points addressed:
- Removed server_request_duration. It was recording the same value as client_operation_duration for local vLLM inference, so it was purely redundant.
- Added semconv links in the module docstring and next to the gen_ai.system TODO, so the rationale for each attribute is traceable.
- Wired tox.ini following the same pattern used by the other GenAI instrumentations, with oldest and latest requirement files. Tests pass in both configurations.
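The tox pattern described above generally pairs oldest and latest requirement files per environment. A rough, illustrative sketch (env names and file paths are assumptions, not the repo's actual entries):

```ini
; Illustrative sketch only; real entries follow the repo's tox.ini conventions.
[testenv:py312-test-instrumentation-vllm-{oldest,latest}]
deps =
  oldest: -r instrumentation-genai/opentelemetry-instrumentation-vllm/test-requirements.oldest.txt
  latest: -r instrumentation-genai/opentelemetry-instrumentation-vllm/test-requirements.latest.txt
commands =
  pytest {toxinidir}/instrumentation-genai/opentelemetry-instrumentation-vllm/tests {posargs}
```

The factor-conditional `oldest:`/`latest:` lines let one env definition cover both dependency pins.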
Description
Implements OpenTelemetry instrumentation for vLLM, the most widely used open-source LLM inference engine. This is a server-side instrumentation — the first in this repo — addressing the gap identified in #3932 where the community requested TTFT support for inference servers like vLLM and SGLang.
Unlike existing GenAI instrumentations in this repo (which are all client-side), this instruments the server/inference side, recording true server-side metrics that reflect actual model inference performance without network latency.
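The core mechanism, wrapping the generate call and timing token production, can be sketched as follows. This is a self-contained illustration only: `DummyLLM` and the plain `metrics` dict stand in for `vllm.LLM` and the OpenTelemetry SDK instruments, and streaming is simulated for the sake of the timing logic.

```python
import time

class DummyLLM:
    """Stand-in for vllm.LLM: streams tokens with a small artificial delay."""
    def generate(self, prompt):
        for tok in ["Hello", ",", " world"]:
            time.sleep(0.001)
            yield tok

def instrumented_generate(llm, prompt, metrics):
    """Wrap generate() to record TTFT and time-per-output-token (TPOT)."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for tok in llm.generate(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
            # Time from request start until the first token is produced.
            metrics["gen_ai.server.time_to_first_token"] = now - start
        tokens += 1
        yield tok
    if tokens > 1:
        # Average inter-token latency over the tokens after the first.
        metrics["gen_ai.server.time_per_output_token"] = (
            time.perf_counter() - first_token_at
        ) / (tokens - 1)

metrics = {}
out = list(instrumented_generate(DummyLLM(), "hi", metrics))
```

Because the wrapper runs in the same process as the model, the recorded durations exclude network latency, which is the property the PR description emphasizes.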
Ref #3932
Changes
New package: `instrumentation-genai/opentelemetry-instrumentation-vllm/`

Instrumented methods: `vllm.LLM.generate()`, `vllm.LLM.chat()`

Server-side metrics (the key differentiator):
- `gen_ai.server.time_to_first_token`
- `gen_ai.server.time_per_output_token`
- `gen_ai.server.request.duration`
- `gen_ai.client.operation.duration`
- `gen_ai.client.token.usage`

Spans use `SpanKind.SERVER` with full GenAI semantic convention attributes: `gen_ai.system=vllm`, `gen_ai.operation.name=chat/generate`, `gen_ai.request.model`, `gen_ai.request.max_tokens`, `gen_ai.request.temperature`, `gen_ai.response.finish_reasons`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`

Design decisions:
- Graceful no-op at `instrument()` time if vLLM is not available

Type of change
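The graceful no-op design decision listed under Changes can be sketched roughly as below (a simplified stand-alone illustration; a real instrumentor would subclass the OpenTelemetry `BaseInstrumentor` instead of being a bare function):

```python
import importlib.util

def instrument():
    """Sketch of a graceful no-op: skip patching when vLLM is absent."""
    if importlib.util.find_spec("vllm") is None:
        # vLLM is not installed; leave the application untouched.
        return False
    # ...wrap vllm.LLM.generate / vllm.LLM.chat here...
    return True

print(instrument())
```

Checking `find_spec` avoids importing vLLM (and its heavy dependencies) just to discover it is missing.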
How Has This Been Tested?
35 mock-based tests covering spans, metrics (TTFT, TPOT, token usage, duration), error handling, uninstrument, and utility functions.
Does This PR Require a Core Repo Change?
Checklist: