feat: Add vLLM instrumentation with server-side TTFT/TPOT metrics #4419
Nik-Reddy wants to merge 1 commit into open-telemetry:main
Conversation
MikeGoldsmith left a comment
Looks like a good start - thanks @Nik-Reddy.
I've left some queries and suggestions. We also need to verify that semconv includes `vllm`, plus add tox files, CI tests, and a changelog entry.
Force-pushed bc55265 to 4acfd9a
Hi @MikeGoldsmith, thank you for the thorough review! I've addressed all four comments. Will open a semconv issue/PR for the `gen_ai.system` value separately. Ready for re-review!
Force-pushed 4acfd9a to 208bc51
Rebased on latest main. All 4 review threads from @MikeGoldsmith have been addressed and resolved. Ready for re-review.
Thanks for the updates @Nik-Reddy. There are a few things we still need to address. On the client/server duration comment, I think there was a misunderstanding: my question was about why both a client and a server duration are recorded for the same local call.
Force-pushed 208bc51 to 264bd87
@Nik-Reddy can you please discuss in https://cloud-native.slack.com/archives/C06KR7ARS3X before adding new instrumentations?
Adding new instrumentation requires finding component owners that are ready to maintain it in the long run.
I'm not an expert on vLLM side, but I don't think this instrumentation belongs in this repo.
If you asked an AI to review this and to question its usefulness, it would give you something like this:
vLLM already ships both natively:
- Prometheus /metrics endpoint with vllm:time_to_first_token_seconds, TPOT, iteration tokens, etc. (vllm/v1/metrics/loggers.py)
- OpenTelemetry tracing in vllm/tracing/utils.py with GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN — emitted from vllm/v1/engine/output_processor.py
- Docs: https://github.com/vllm-project/vllm/blob/main/docs/usage/metrics.md
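Since vLLM already exposes these on its Prometheus `/metrics` endpoint, a consumer can scrape and filter them directly rather than re-instrumenting. A minimal sketch (the helper function and the sample payload are illustrative; only the metric name `vllm:time_to_first_token_seconds` comes from vLLM's docs):

```python
def extract_vllm_metrics(scrape_text):
    """Keep only vLLM's own metric samples from a Prometheus text-format scrape."""
    return {
        line.rsplit(" ", 1)[0]: float(line.rsplit(" ", 1)[1])
        for line in scrape_text.splitlines()
        # Samples start with the metric name; comment lines start with '#'.
        if line.startswith("vllm:")
    }

# Abbreviated, illustrative scrape payload:
sample = """\
# HELP vllm:time_to_first_token_seconds Histogram of TTFT in seconds.
vllm:time_to_first_token_seconds_sum 1.25
vllm:time_to_first_token_seconds_count 10
"""

metrics = extract_vllm_metrics(sample)
print(metrics)
# {'vllm:time_to_first_token_seconds_sum': 1.25,
#  'vllm:time_to_first_token_seconds_count': 10.0}
```

In production the `scrape_text` would come from an HTTP GET of the server's `/metrics` endpoint.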
What the PR actually instruments:
- vllm.LLM.generate() and vllm.LLM.chat() — these are the offline/batch Python API, not the serving path (vllm serve → OpenAI-compatible HTTP server). So the "server-side instrumentation — the first in this repo" framing is misleading; it's still an in-process library wrap, just of a different entry point.
Resulting issues:
- Duplicates metrics vLLM already emits, with different names/semantics (risk of drift from upstream).
- Misses the serving code path entirely — where TTFT/TPOT actually matter in production.
- Reviewer already flagged gen_ai.system=vllm isn't registered in semconv, missing tox/CI wiring, and duplicate client/server duration recording.
- Upstream is the natural home for this: improving vLLM's built-in OTel emitter (which already exists) would benefit everyone without a monkey-patch layer.
The right path is probably: contribute missing semconv attributes upstream to vLLM, not add a wrapper here. Worth raising on the PR/issue #3932 before more work goes in.
@MikeGoldsmith All three points addressed:
- Removed server_request_duration. It was recording the same value as client_operation_duration for local vLLM inference, so it was purely redundant.
- Added semconv links in the module docstring and next to the gen_ai.system TODO, so the rationale for each attribute is traceable.
- Wired tox.ini following the same pattern used by the other GenAI instrumentations, with oldest and latest requirement files. Tests pass in both configurations.
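The tox pattern described above generally pairs oldest and latest requirement files per environment. A rough, illustrative sketch (env names and file paths are assumptions, not the repo's actual entries):

```ini
; Illustrative sketch only; real entries follow the repo's tox.ini conventions.
[testenv:py312-test-instrumentation-vllm-{oldest,latest}]
deps =
  oldest: -r instrumentation-genai/opentelemetry-instrumentation-vllm/test-requirements.oldest.txt
  latest: -r instrumentation-genai/opentelemetry-instrumentation-vllm/test-requirements.latest.txt
commands =
  pytest {toxinidir}/instrumentation-genai/opentelemetry-instrumentation-vllm/tests {posargs}
```

The factor-conditional `oldest:`/`latest:` lines let one env definition cover both dependency pins.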
Description
Implements OpenTelemetry instrumentation for vLLM, the most widely used open-source LLM inference engine. This is a server-side instrumentation — the first in this repo — addressing the gap identified in #3932 where the community requested TTFT support for inference servers like vLLM and SGLang.
Unlike existing GenAI instrumentations in this repo (which are all client-side), this instruments the server/inference side, recording true server-side metrics that reflect actual model inference performance without network latency.
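The core mechanism, wrapping the generate call and timing token production, can be sketched as follows. This is a self-contained illustration only: `DummyLLM` and the plain `metrics` dict stand in for `vllm.LLM` and the OpenTelemetry SDK instruments, and streaming is simulated for the sake of the timing logic.

```python
import time

class DummyLLM:
    """Stand-in for vllm.LLM: streams tokens with a small artificial delay."""
    def generate(self, prompt):
        for tok in ["Hello", ",", " world"]:
            time.sleep(0.001)
            yield tok

def instrumented_generate(llm, prompt, metrics):
    """Wrap generate() to record TTFT and time-per-output-token (TPOT)."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for tok in llm.generate(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
            # Time from request start until the first token is produced.
            metrics["gen_ai.server.time_to_first_token"] = now - start
        tokens += 1
        yield tok
    if tokens > 1:
        # Average inter-token latency over the tokens after the first.
        metrics["gen_ai.server.time_per_output_token"] = (
            time.perf_counter() - first_token_at
        ) / (tokens - 1)

metrics = {}
out = list(instrumented_generate(DummyLLM(), "hi", metrics))
```

Because the wrapper runs in the same process as the model, the recorded durations exclude network latency, which is the property the PR description emphasizes.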
Ref #3932
Changes
New package: `instrumentation-genai/opentelemetry-instrumentation-vllm/`

Instrumented methods: `vllm.LLM.generate()`, `vllm.LLM.chat()`

Server-side metrics (the key differentiator):
- `gen_ai.server.time_to_first_token`
- `gen_ai.server.time_per_output_token`
- `gen_ai.server.request.duration`
- `gen_ai.client.operation.duration`
- `gen_ai.client.token.usage`

Spans use `SpanKind.SERVER` with full GenAI semantic convention attributes: `gen_ai.system=vllm`, `gen_ai.operation.name=chat/generate`, `gen_ai.request.model`, `gen_ai.request.max_tokens`, `gen_ai.request.temperature`, `gen_ai.response.finish_reasons`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`

Design decisions:
- Graceful no-op at `instrument()` time if vLLM is not available

Type of change
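The graceful no-op design decision listed under Changes can be sketched roughly as below (a simplified stand-alone illustration; a real instrumentor would subclass the OpenTelemetry `BaseInstrumentor` instead of being a bare function):

```python
import importlib.util

def instrument():
    """Sketch of a graceful no-op: skip patching when vLLM is absent."""
    if importlib.util.find_spec("vllm") is None:
        # vLLM is not installed; leave the application untouched.
        return False
    # ...wrap vllm.LLM.generate / vllm.LLM.chat here...
    return True

print(instrument())
```

Checking `find_spec` avoids importing vLLM (and its heavy dependencies) just to discover it is missing.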
How Has This Been Tested?
35 mock-based tests covering spans, metrics (TTFT, TPOT, token usage, duration), error handling, uninstrument, and utility functions.
Does This PR Require a Core Repo Change?
Checklist: