
[TRTLLM-9523][chore] e2e tests for KV manager v2 with dis-agg#12689

Open
Shixiaowei02 wants to merge 2 commits into NVIDIA:main from Shixiaowei02:user/xiaoweis/kv_v2_e2e
Conversation

Collaborator

@Shixiaowei02 Shixiaowei02 commented Apr 2, 2026

Summary by CodeRabbit

  • Tests

    • Added new accuracy test cases for disaggregated serving with KV cache v2 configuration across multiple GPU variants (H100, B300).
    • Expanded test coverage for specialized transceiver backends to ensure reliability.
  • Chores

    • Enhanced test infrastructure environment setup for disaggregated serving processes.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@Shixiaowei02 Shixiaowei02 requested a review from chuangz0 April 2, 2026 11:23
@Shixiaowei02 Shixiaowei02 changed the title [TRTLLM-9523][chore] add tests for KV manager with dis-agg [TRTLLM-9523][chore] add tests for KV manager v2 with dis-agg Apr 2, 2026
@Shixiaowei02 Shixiaowei02 changed the title [TRTLLM-9523][chore] add tests for KV manager v2 with dis-agg [TRTLLM-9523][chore] e2e tests for KV manager v2 with dis-agg Apr 2, 2026
@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/kv_v2_e2e branch 2 times, most recently from 9aaad7d to dfee1bf Compare April 2, 2026 11:49
@Shixiaowei02 Shixiaowei02 marked this pull request as ready for review April 2, 2026 11:49
@Shixiaowei02 Shixiaowei02 requested review from a team as code owners April 2, 2026 11:49
@Shixiaowei02 Shixiaowei02 requested a review from lancelly April 2, 2026 11:49
@Shixiaowei02
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@Shixiaowei02 Shixiaowei02 requested a review from pcastonguay April 2, 2026 11:50
@tensorrt-cicd
Collaborator

PR_Github #41429 [ run ] triggered by Bot. Commit: dfee1bf Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7668ae7c-0993-4200-94a4-e56c8e8a227d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b4f54c and dfee1bf.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/integration/test_lists/test-db/l0_dgx_b300.yml
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml

📝 Walkthrough

The pull request adds KV-cache resource allocation logic to the disaggregated generation scheduler initialization path, establishes UCX InfiniBand transport exclusion in disaggregated serving environments, and introduces new accuracy tests for KV cache v2 with NIXL backend across multiple GPU configurations.
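The UCX InfiniBand exclusion mentioned in the walkthrough can be sketched as follows. This is a minimal illustration, not the code from test_disaggregated_serving.py: the helper name `build_worker_env` is hypothetical, and only the `UCX_TLS="^ib"` setting comes from the PR summary. In UCX, a leading `^` turns the transport list into an exclusion list, so `^ib` allows every transport except InfiniBand.

```python
import os
from typing import Dict, Mapping, Optional


def build_worker_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with UCX InfiniBand transports excluded.

    `build_worker_env` is a hypothetical helper name used for illustration;
    the real test harness may assemble its environment differently.
    """
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # A leading "^" negates the UCX transport list: use everything except "ib".
    env["UCX_TLS"] = "^ib"
    return env


# Example: the same environment would be handed to both the context and the
# generation server processes, e.g. via subprocess.Popen(..., env=worker_env).
worker_env = build_worker_env({"PATH": "/usr/bin"})
print(worker_env["UCX_TLS"])
```

Building a fresh dictionary per worker, rather than editing `os.environ` in place, keeps the setting scoped to the spawned server processes.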

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Scheduler KV-Cache Allocation: tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py | Added inline KV-cache resource allocation (prepare_context and resize_context calls) within the DISAGG_GENERATION_INIT state path to validate availability before appending requests to disagg candidates; failures skip the request and continue iteration. |
| Disaggregated Serving Tests: tests/integration/defs/accuracy/test_disaggregated_serving.py | Updated worker process environment setup to enforce UCX_TLS="^ib" for both context and generation servers; added new test_kv_cache_v2_nixl_python accuracy tests across multiple harness classes with KV cache v2 manager enabled, block reuse disabled, and NIXL Python transceiver backend. |
| Test List Configuration: tests/integration/test_lists/test-db/l0_dgx_b300.yml, tests/integration/test_lists/test-db/l0_dgx_h100.yml | Added test_kv_cache_v2_nixl_python test entries for DeepSeekV3Lite (B300) and multiple model variants (H100, gpu2 configuration). |
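
The scheduler change summarized above (allocate KV-cache resources first, skip the request on failure, keep iterating) could look roughly like the sketch below. Everything except the names `prepare_context`, `resize_context`, and `DISAGG_GENERATION_INIT` is an assumption: `ToyKVCache`, `Request`, and `select_disagg_candidates` are hypothetical stand-ins, and the boolean return convention is chosen for illustration because the real signatures in scheduler_v2.py are not shown on this page.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List


class State(Enum):
    DISAGG_GENERATION_INIT = auto()
    OTHER = auto()


@dataclass
class Request:
    request_id: int
    state: State
    num_tokens: int


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a KV cache manager with a fixed block pool."""
    free_blocks: int
    tokens_per_block: int = 4
    reserved: Dict[int, int] = field(default_factory=dict)

    def prepare_context(self, req: Request) -> bool:
        # Assumed behavior: register the request; refuse duplicates.
        if req.request_id in self.reserved:
            return False
        self.reserved[req.request_id] = 0
        return True

    def resize_context(self, req: Request, num_tokens: int) -> bool:
        # Assumed behavior: reserve enough blocks for num_tokens, or fail.
        needed = -(-num_tokens // self.tokens_per_block)  # ceiling division
        if needed > self.free_blocks:
            self.reserved.pop(req.request_id, None)  # undo the registration
            return False
        self.free_blocks -= needed
        self.reserved[req.request_id] = needed
        return True


def select_disagg_candidates(requests: List[Request], cache: ToyKVCache) -> List[Request]:
    """Keep only requests whose KV-cache allocation succeeds."""
    candidates = []
    for req in requests:
        if req.state is not State.DISAGG_GENERATION_INIT:
            continue
        # Allocation failure skips this request but does not stop the loop.
        if not cache.prepare_context(req):
            continue
        if not cache.resize_context(req, req.num_tokens):
            continue
        candidates.append(req)
    return candidates


cache = ToyKVCache(free_blocks=3)
reqs = [
    Request(1, State.DISAGG_GENERATION_INIT, 8),  # needs 2 blocks: fits
    Request(2, State.DISAGG_GENERATION_INIT, 8),  # needs 2 blocks: only 1 left
    Request(3, State.DISAGG_GENERATION_INIT, 4),  # needs 1 block: fits
    Request(4, State.OTHER, 4),                   # wrong state: ignored
]
picked = select_disagg_candidates(reqs, cache)
print([r.request_id for r in picked])
```

Continuing past a failed allocation, instead of breaking out of the loop, lets a smaller later request (request 3 here) still be scheduled when an earlier one does not fit.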

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is largely a template with placeholder sections unfilled; critical sections like Description and Test Coverage lack substantive content. | Complete the Description section explaining the changes and their rationale, and clearly list all test cases added (test_kv_cache_v2_nixl_python for multiple harness classes). |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding end-to-end tests for KV manager v2 with disaggregated serving. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd
Collaborator

PR_Github #41429 [ run ] completed with state SUCCESS. Commit: dfee1bf
/LLM/main/L0_MergeRequest_PR pipeline #32361 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/kv_v2_e2e branch from dfee1bf to ae247d1 Compare April 3, 2026 02:10
@Shixiaowei02
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41547 [ run ] triggered by Bot. Commit: ae247d1 Link to invocation

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/kv_v2_e2e branch from ae247d1 to 00bd9a9 Compare April 3, 2026 06:18
@Shixiaowei02
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41607 [ run ] triggered by Bot. Commit: 00bd9a9 Link to invocation

- disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[DeepSeek-V3-Lite-fp8]
- accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
- accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
- accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_kv_cache_v2_nixl_python
Collaborator

Don't forget to add the new tests to the QA test list, too.

Collaborator Author

Will do it. Thanks for the reminder.

@tensorrt-cicd
Collaborator

PR_Github #41607 [ run ] completed with state SUCCESS. Commit: 00bd9a9
/LLM/main/L0_MergeRequest_PR pipeline #32516 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

4 participants