
feat: Add vLLM instrumentation with server-side TTFT/TPOT metrics#4419

Open
Nik-Reddy wants to merge 1 commit into open-telemetry:main from Nik-Reddy:feat/vllm-instrumentation

Conversation

@Nik-Reddy

Description

Implements OpenTelemetry instrumentation for vLLM, the most widely used open-source LLM inference engine. This is a server-side instrumentation — the first in this repo — addressing the gap identified in #3932 where the community requested TTFT support for inference servers like vLLM and SGLang.

Unlike existing GenAI instrumentations in this repo (which are all client-side), this instruments the server/inference side, recording true server-side metrics that reflect actual model inference performance without network latency.

Ref #3932

Changes

New package: instrumentation-genai/opentelemetry-instrumentation-vllm/

Instrumented methods:

Method              | Description
vllm.LLM.generate() | Offline/batch text generation
vllm.LLM.chat()     | Chat completions
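The method wrapping can be sketched as a minimal monkey-patch; `record_metrics` and `FakeLLM` below are illustrative stand-ins for this sketch, not the PR's actual code:

```python
import time

recorded = []  # illustrative sink; the real code records to OTel histograms

def record_metrics(duration, result):
    recorded.append(duration)

def wrap_generate(original):
    """Wrap an LLM.generate-style method to time each call."""
    def wrapper(self, *args, **kwargs):
        start = time.monotonic()
        result = original(self, *args, **kwargs)
        record_metrics(time.monotonic() - start, result)
        return result
    return wrapper

class FakeLLM:  # stand-in so the sketch runs without vLLM/GPU
    def generate(self, prompts):
        return [f"out:{p}" for p in prompts]

FakeLLM.generate = wrap_generate(FakeLLM.generate)
print(FakeLLM().generate(["hi"]))  # → ['out:hi']
```

The wrapper passes the result through unchanged, so instrumented and uninstrumented calls behave identically from the caller's perspective.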

Server-side metrics (the key differentiator):

Metric                              | Type          | Description
gen_ai.server.time_to_first_token   | Histogram (s) | Time from request to first output token
gen_ai.server.time_per_output_token | Histogram (s) | Time per output token after the first
gen_ai.server.request.duration      | Histogram (s) | Total server-side request duration
gen_ai.client.operation.duration    | Histogram (s) | Operation duration
gen_ai.client.token.usage           | Counter       | Token usage (input/output)
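The TTFT/TPOT definitions in the table reduce to simple arithmetic; the timestamps in this sketch are illustrative, while the real instrumentation derives them from vLLM request state:

```python
def ttft(start: float, first_token_time: float) -> float:
    """Time from request start to the first output token (seconds)."""
    return first_token_time - start

def tpot(first_token_time: float, end: float, output_tokens: int) -> float:
    """Mean time per output token after the first (seconds)."""
    if output_tokens <= 1:
        return 0.0  # no inter-token gaps to average
    return (end - first_token_time) / (output_tokens - 1)

print(ttft(0.0, 0.25))       # → 0.25
print(tpot(0.25, 1.25, 11))  # → 0.1
```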

Spans use SpanKind.SERVER with full GenAI semantic convention attributes:

  • gen_ai.system = vllm
  • gen_ai.operation.name = chat / generate
  • gen_ai.request.model, gen_ai.request.max_tokens, gen_ai.request.temperature
  • gen_ai.response.finish_reasons, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
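As an illustration, the attributes above could be assembled like this — a plain dict for clarity, with a hypothetical function name; the actual instrumentation sets them via span.set_attribute on the OTel span:

```python
def genai_span_attributes(model, operation, max_tokens, temperature,
                          finish_reasons, input_tokens, output_tokens):
    """Assemble the GenAI semconv attributes listed above (illustrative)."""
    return {
        "gen_ai.system": "vllm",
        "gen_ai.operation.name": operation,       # "chat" or "generate"
        "gen_ai.request.model": model,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.request.temperature": temperature,
        "gen_ai.response.finish_reasons": finish_reasons,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = genai_span_attributes("llama-3-8b", "generate", 256, 0.7,
                              ["stop"], 12, 34)
print(attrs["gen_ai.system"])  # → vllm
```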

Design decisions:

  • vLLM is an optional dependency — the package installs without GPU hardware and fails gracefully at instrument() time if vLLM is not available
  • All tests are mock-based (no GPU required for CI)
  • Bucket boundaries follow semconv v1.38.0 specification
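The graceful-failure behaviour in the first bullet can be sketched as a lazy import at instrument() time; the function name and warning text here are illustrative, not the instrumentor's actual API:

```python
import logging

logger = logging.getLogger(__name__)

def instrument() -> bool:
    """Attempt to instrument vLLM; no-op when vllm isn't importable."""
    try:
        import vllm  # noqa: F401 -- optional dependency, absent without GPU
    except ImportError:
        logger.warning("vllm is not installed; skipping instrumentation")
        return False
    # ... patch vllm.LLM.generate / vllm.LLM.chat here ...
    return True

print(instrument())
```

Importing lazily means the package itself installs and imports cleanly on GPU-less machines; only the activation step is conditional.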

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

35 mock-based tests covering spans, metrics (TTFT, TPOT, token usage, duration), error handling, uninstrument, and utility functions.

cd instrumentation-genai/opentelemetry-instrumentation-vllm
pytest tests/ -v
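A mock-based test in this spirit might look like the following; the stubbed return shape is an assumption for illustration, not vLLM's real output type:

```python
from unittest import mock

def fake_generate(prompts):
    # Stand-in for vLLM output: no GPU or model load required.
    return [{"text": "ok", "output_tokens": 3} for _ in prompts]

def test_generate_records_output_tokens():
    llm = mock.Mock()
    llm.generate.side_effect = fake_generate
    outputs = llm.generate(["hello"])
    total = sum(o["output_tokens"] for o in outputs)
    assert total == 3
    llm.generate.assert_called_once_with(["hello"])

test_generate_records_output_tokens()
print("ok")  # → ok
```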

Does This PR Require a Core Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated (README.rst included)

@Nik-Reddy Nik-Reddy requested a review from a team as a code owner April 13, 2026 06:43
@xrmx xrmx added the gen-ai Related to generative AI label Apr 13, 2026
Member

@MikeGoldsmith MikeGoldsmith left a comment


Looks like a good start - thanks @Nik-Reddy.

I've left some queries and suggestions. We also need to verify that semconv includes vllm, and to add tox files, CI tests, and a changelog entry.

@github-project-automation github-project-automation bot moved this to Reviewed PRs that need fixes in Python PR digest Apr 13, 2026
@Nik-Reddy Nik-Reddy force-pushed the feat/vllm-instrumentation branch 2 times, most recently from bc55265 to 4acfd9a on April 14, 2026 18:00
@Nik-Reddy
Author

Nik-Reddy commented Apr 14, 2026

Hi @MikeGoldsmith, thank you for the thorough review! I've addressed all four comments:

  1. Changed SpanKind to INTERNAL (the instrumented methods don't make external requests)
  2. Renamed attributes to first_token_latency and mean_time_per_output_token to match vLLM's own metric names
  3. Clarified the TTFT vs TPOT distinction with docstrings and inline comments
  4. Added a TODO noting that vllm needs to be registered in the semantic conventions

Will open a semconv issue/PR for the gen_ai.system value separately. Ready for re-review!

@Nik-Reddy Nik-Reddy force-pushed the feat/vllm-instrumentation branch from 4acfd9a to 208bc51 on April 15, 2026 01:08
@Nik-Reddy
Author

Rebased on latest main. All 4 review threads from @MikeGoldsmith have been addressed:

  • Fixed SpanKind to INTERNAL for server-side instrumentation
  • Corrected attribute names to use semantic conventions
  • Added documentation for TTFT/TPOT metric semantics
  • Added semconv TODO for pending attribute stabilization

All threads resolved. Ready for re-review.

@Nik-Reddy Nik-Reddy requested a review from MikeGoldsmith April 15, 2026 01:11
@MikeGoldsmith
Member

Thanks for the updates @Nik-Reddy.

There are a few things we still need to address:

For the client/server duration comment — I think there was a misunderstanding. My question was about client_operation_duration_histogram and server_request_duration_histogram both recording the same duration value (not about TTFT/TPOT). The comment in the code even says "same as client for local inference". Is recording both intentional, and do we expect them to diverge for other vLLM configurations?

Also:

  • Can you share a link to the semconv issue/PR for registering vllm as a gen_ai.system value?
  • The package is still missing tox.ini entries to wire it into CI. Could you please add those? You can use one of the other libraries as a reference, e.g. openai.

@Nik-Reddy Nik-Reddy force-pushed the feat/vllm-instrumentation branch from 208bc51 to 264bd87 on April 16, 2026 21:04
Member

@lmolkova lmolkova left a comment


@Nik-Reddy can you please discuss in https://cloud-native.slack.com/archives/C06KR7ARS3X before adding new instrumentations?

Adding new instrumentation requires finding component owners that are ready to maintain it in the long run.

I'm not an expert on vLLM side, but I don't think this instrumentation belongs in this repo.

If you asked an AI to review this and to question its usefulness, it would give you something like this:

vLLM already ships both natively:

  • Prometheus /metrics endpoint with vllm:time_to_first_token_seconds, TPOT, iteration tokens, etc. (vllm/v1/metrics/loggers.py)
  • OpenTelemetry tracing in vllm/tracing/utils.py with GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN — emitted from vllm/v1/engine/output_processor.py
  • Docs: https://github.com/vllm-project/vllm/blob/main/docs/usage/metrics.md

What the PR actually instruments:

  • vllm.LLM.generate() and vllm.LLM.chat() — these are the offline/batch Python API, not the serving path (vllm serve → OpenAI-compatible HTTP server). So the "server-side instrumentation — the first in this repo" framing is misleading; it's still an in-process library wrap, just of a different entry point.

Resulting issues:

  1. Duplicates metrics vLLM already emits, with different names/semantics (risk of drift from upstream).
  2. Misses the serving code path entirely — where TTFT/TPOT actually matter in production.
  3. Reviewer already flagged gen_ai.system=vllm isn't registered in semconv, missing tox/CI wiring, and duplicate client/server duration recording.
  4. Upstream is the natural home for this: improving vLLM's built-in OTel emitter (which already exists) would benefit everyone without a monkey-patch layer.

The right path is probably: contribute missing semconv attributes upstream to vLLM, not add a wrapper here. Worth raising on the PR/issue #3932 before more work goes in.

@Nik-Reddy
Author

@MikeGoldsmith All three points addressed.

Removed server_request_duration. It was recording the same value as client_operation_duration for local vLLM inference, so it was purely redundant.

Added semconv links in the module docstring and next to the gen_ai.system TODO, so the rationale for each attribute is traceable.

Wired tox.ini following the same pattern used by the other GenAI instrumentations, with oldest and latest requirement files. Tests pass in both configurations.
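For reference, a tox.ini entry in that style might look like the sketch below; the environment names and requirement-file paths are assumptions modeled on the repo's other GenAI packages, not the actual diff:

```ini
; Illustrative only -- env names and paths are assumptions, not the PR's diff.
[tox]
envlist = py3{9,10,11,12,13}-test-instrumentation-vllm-{oldest,latest}

[testenv]
deps =
  oldest: -r instrumentation-genai/opentelemetry-instrumentation-vllm/test-requirements.oldest.txt
  latest: -r instrumentation-genai/opentelemetry-instrumentation-vllm/test-requirements.latest.txt
commands =
  pytest instrumentation-genai/opentelemetry-instrumentation-vllm/tests
```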


Labels

gen-ai Related to generative AI

Projects

Status: Reviewed PRs that need fixes


4 participants