[TRTLLM-9523][chore] e2e tests for KV manager v2 with dis-agg by Shixiaowei02 · Pull Request #12689 · NVIDIA/TensorRT-LLM

Shixiaowei02 · 2026-04-02T11:23:20Z

Summary by CodeRabbit

Tests
- Added new accuracy test cases for disaggregated serving with KV cache v2 configuration across multiple GPU variants (H100, B300).
- Expanded test coverage for specialized transceiver backends to ensure reliability.
Chores
- Enhanced test infrastructure environment setup for disaggregated serving processes.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Shixiaowei02 · 2026-04-02T11:49:41Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-02T11:55:05Z

PR_Github #41429 [ run ] triggered by Bot. Commit: dfee1bf Link to invocation

coderabbitai · 2026-04-02T12:04:56Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7668ae7c-0993-4200-94a4-e56c8e8a227d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b4f54c and dfee1bf.

📒 Files selected for processing (4)

tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/integration/test_lists/test-db/l0_dgx_b300.yml
tests/integration/test_lists/test-db/l0_dgx_h100.yml

📝 Walkthrough

Walkthrough

The pull request adds KV-cache resource allocation logic to the disaggregated generation scheduler initialization path, establishes UCX InfiniBand transport exclusion in disaggregated serving environments, and introduces new accuracy tests for KV cache v2 with NIXL backend across multiple GPU configurations.
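
The UCX transport exclusion described above can be pictured with a small sketch. In UCX, a leading `^` in `UCX_TLS` negates the transport list, so `^ib` selects every available transport except the InfiniBand ones. The helper name `build_worker_env` and the sample base environment are hypothetical and exist only for illustration; the PR's actual test harness code is not shown in this thread.

```python
import os


def build_worker_env(base_env=None):
    """Return a copy of the environment with InfiniBand excluded from UCX.

    Hypothetical helper: the real test setup may build its environment
    differently.
    """
    env = dict(os.environ if base_env is None else base_env)
    # A leading "^" negates the UCX transport list, so "^ib" means
    # "any available transport except the InfiniBand ones".
    env["UCX_TLS"] = "^ib"
    return env


# Both the context and the generation server get the same override.
ctx_env = build_worker_env({"PATH": "/usr/bin"})
gen_env = build_worker_env({"PATH": "/usr/bin"})
```

A dictionary built this way could be passed as the `env` argument of `subprocess.Popen` when launching each worker, which leaves the parent process environment unchanged.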

Changes

| Cohort / File(s) | Summary |
|---|---|
| Scheduler KV-Cache Allocation `tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py` | Added inline KV-cache resource allocation (`prepare_context` and `resize_context` calls) within the `DISAGG_GENERATION_INIT` state path to validate availability before appending requests to disagg candidates; failures skip the request and continue iteration. |
| Disaggregated Serving Tests `tests/integration/defs/accuracy/test_disaggregated_serving.py` | Updated worker process environment setup to enforce `UCX_TLS="^ib"` for both context and generation servers; added new `test_kv_cache_v2_nixl_python` accuracy tests across multiple harness classes with KV cache v2 manager enabled, block reuse disabled, and NIXL Python transceiver backend. |
| Test List Configuration `tests/integration/test_lists/test-db/l0_dgx_b300.yml`, `tests/integration/test_lists/test-db/l0_dgx_h100.yml` | Added `test_kv_cache_v2_nixl_python` test entries for DeepSeekV3Lite (B300) and multiple model variants (H100, gpu2 configuration). |
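
The scheduler change summarized in the first row can be illustrated with a toy model. This is a minimal sketch, not the code in `scheduler_v2.py`: `ToyKVCache`, `Request`, `RequestState`, and `collect_disagg_candidates` are hypothetical names, and the signatures and return values given here to `prepare_context` and `resize_context` are assumptions made so the example runs on its own. It only shows the control flow the summary describes: try to allocate, and on failure skip the request and keep iterating.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RequestState(Enum):
    CONTEXT_INIT = auto()
    DISAGG_GENERATION_INIT = auto()


@dataclass
class Request:
    request_id: int
    state: RequestState
    num_tokens: int


class ToyKVCache:
    """Hypothetical stand-in for a KV-cache manager with a fixed block budget."""

    def __init__(self, free_blocks, tokens_per_block=4):
        self.free_blocks = free_blocks
        self.tokens_per_block = tokens_per_block

    def prepare_context(self, req):
        # Toy rule (assumption): preparation succeeds if any block is free.
        return self.free_blocks > 0

    def resize_context(self, req, num_tokens):
        # Reserve enough blocks for num_tokens; fail if they do not fit.
        needed = -(-num_tokens // self.tokens_per_block)  # ceiling division
        if needed > self.free_blocks:
            return False
        self.free_blocks -= needed
        return True


def collect_disagg_candidates(requests, kv_cache):
    """Keep only generation-init requests whose KV cache could be allocated."""
    candidates = []
    for req in requests:
        if req.state is not RequestState.DISAGG_GENERATION_INIT:
            continue
        ok = kv_cache.prepare_context(req) and kv_cache.resize_context(
            req, req.num_tokens)
        if not ok:
            continue  # allocation failed: skip this request, keep iterating
        candidates.append(req)
    return candidates


requests = [
    Request(1, RequestState.DISAGG_GENERATION_INIT, 8),   # needs 2 blocks
    Request(2, RequestState.DISAGG_GENERATION_INIT, 12),  # needs 3 blocks
    Request(3, RequestState.DISAGG_GENERATION_INIT, 4),   # needs 1 block
    Request(4, RequestState.CONTEXT_INIT, 4),             # wrong state
]
picked = collect_disagg_candidates(requests, ToyKVCache(free_blocks=3))
picked_ids = [r.request_id for r in picked]  # [1, 3]
```

With three free blocks, request 2 does not fit after request 1 has been admitted, so it is skipped, while the smaller request 3 behind it is still considered.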

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is largely a template with placeholder sections unfilled; critical sections like Description and Test Coverage lack substantive content. | Complete the Description section explaining the changes and their rationale, and clearly list all test cases added (`test_kv_cache_v2_nixl_python` for multiple harness classes). |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
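
To make the docstring coverage figure concrete, here is a rough sketch of how such a percentage could be computed with Python's standard `ast` module. How CodeRabbit actually measures coverage is not stated in this thread, so the counting rule below (functions and classes only) is an assumption, and `docstring_coverage` and `SAMPLE` are hypothetical names.

```python
import ast


def docstring_coverage(source):
    """Return the percentage of functions and classes that have a docstring."""
    tree = ast.parse(source)
    nodes = [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 100.0  # nothing to document
    documented = sum(1 for n in nodes if ast.get_docstring(n) is not None)
    return 100.0 * documented / len(nodes)


# One documented function out of three.
SAMPLE = '''
def a():
    """Documented."""

def b():
    pass

def c():
    pass
'''

coverage = docstring_coverage(SAMPLE)
meets_threshold = coverage >= 80.0
```

One documented definition out of three gives roughly 33.33%, which matches the shape of the reported number and falls below the 80% threshold.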

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding end-to-end tests for KV manager v2 with disaggregated serving. |


tensorrt-cicd · 2026-04-02T17:58:52Z

PR_Github #41429 [ run ] completed with state SUCCESS. Commit: dfee1bf
/LLM/main/L0_MergeRequest_PR pipeline #32361 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Shixiaowei02 · 2026-04-03T02:10:24Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-03T02:16:58Z

PR_Github #41547 [ run ] triggered by Bot. Commit: ae247d1 Link to invocation

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 · 2026-04-03T06:18:44Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-03T06:24:16Z

PR_Github #41607 [ run ] triggered by Bot. Commit: 00bd9a9 Link to invocation

xinhe-nv · 2026-04-03T09:22:14Z

tests/integration/test_lists/test-db/l0_dgx_b300.yml

  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[DeepSeek-V3-Lite-fp8]
  - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
+  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_kv_cache_v2_nixl_python


Don't forget to add the new tests to the QA test list, too.

Will do it. Thanks for the reminder.
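
The reminder above amounts to keeping two test lists in sync. A small sketch of one way to spot entries present in one list but missing from another follows. It assumes both lists contain entries written as `- <test id>` lines, as in the snippet above; the QA list contents here are hypothetical, and `parse_test_ids` and `missing_from` are illustrative names rather than part of the repository's tooling.

```python
def parse_test_ids(list_text):
    """Extract test IDs from lines of the form '- path.py::Class::test'."""
    ids = set()
    for line in list_text.splitlines():
        entry = line.strip()
        if entry.startswith("- "):
            ids.add(entry[2:].strip())
    return ids


def missing_from(reference_text, target_text):
    """Return test IDs present in the reference list but absent from the target."""
    return sorted(parse_test_ids(reference_text) - parse_test_ids(target_text))


L0_LIST = """
  - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_kv_cache_v2_nixl_python
"""

# Hypothetical QA list that has not yet received the new test.
QA_LIST = """
  - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
"""

missing = missing_from(L0_LIST, QA_LIST)
```

Here `missing` contains only the new `test_kv_cache_v2_nixl_python` entry, which is the one still to be added to the second list.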

tensorrt-cicd · 2026-04-03T13:55:42Z

PR_Github #41607 [ run ] completed with state SUCCESS. Commit: 00bd9a9
/LLM/main/L0_MergeRequest_PR pipeline #32516 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Shixiaowei02 requested a review from chuangz0 April 2, 2026 11:23

Shixiaowei02 assigned Shixiaowei02 and chuangz0 Apr 2, 2026

Shixiaowei02 changed the title ~~[TRTLLM-9523][chore] add tests for KV manager with dis-agg~~ [TRTLLM-9523][chore] add tests for KV manager v2 with dis-agg Apr 2, 2026

Shixiaowei02 changed the title ~~[TRTLLM-9523][chore] add tests for KV manager v2 with dis-agg~~ [TRTLLM-9523][chore] e2e tests for KV manager v2 with dis-agg Apr 2, 2026

Shixiaowei02 force-pushed the user/xiaoweis/kv_v2_e2e branch 2 times, most recently from 9aaad7d to dfee1bf Compare April 2, 2026 11:49

Shixiaowei02 marked this pull request as ready for review April 2, 2026 11:49

Shixiaowei02 requested review from a team as code owners April 2, 2026 11:49

Shixiaowei02 requested a review from lancelly April 2, 2026 11:49

Shixiaowei02 requested a review from pcastonguay April 2, 2026 11:50

Shixiaowei02 force-pushed the user/xiaoweis/kv_v2_e2e branch from dfee1bf to ae247d1 Compare April 3, 2026 02:10

Shixiaowei02 added 2 commits April 3, 2026 14:18

add tests for KV manager with dis-agg

64bdb1d

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

update

00bd9a9

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 force-pushed the user/xiaoweis/kv_v2_e2e branch from ae247d1 to 00bd9a9 Compare April 3, 2026 06:18

xinhe-nv reviewed Apr 3, 2026


4 participants
