[None][feat] optimize ucx build connection by chuangz0 · Pull Request #12722 · NVIDIA/TensorRT-LLM

chuangz0 · 2026-04-03T02:13:06Z

Summary by CodeRabbit

Refactor
- Improved cache transmission connection establishment by streamlining connection ID negotiation.
- Simplified connection synchronization mechanism for better efficiency.
- Enhanced connection setup robustness and reliability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-04-03T02:15:07Z

📝 Walkthrough

Walkthrough

Refactored UCX connection ID negotiation by shifting from tag-based exchange in the constructor to ZMQ-based handshake during setup. Updated UcxConnection to accept a peer connection ID parameter, removed negotiation logic from the constructor, and restructured UcxConnectionManager to accept explicit connection IDs rather than generating them asynchronously. Eliminated future-based connection tracking in favor of direct connection registration.

Changes

Cohort / File(s)	Summary
Connection class signatures `cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h`, `cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp`	Updated `UcxConnection` constructor to accept `connectionIdInPeer` parameter. Removed tag-based UCX negotiation logic (`tagRecv`/`tagSend` branches) from constructor; connection ID exchange now delegated to ZMQ setup phase.
UCX Manager refactoring `cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h`, `cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`	Changed `addConnection` to accept explicit `(workerAddress, connectionId, connectionIdInPeer)` instead of single-argument variant. Simplified `getNewConnectionId()` to accept no parameters. Extended `UcxCmMessage` to carry optional `connectionId` field in ZMQ handshake. Removed `mConnectionFutures` and async connection tracking; connections now registered directly upon validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is empty except for the template itself; no actual description, test coverage details, or checklist items were filled in by the author.	Fill in the Description section explaining why the UCX connection ID negotiation was moved from UCX tags to ZMQ handshake, provide test coverage details, and complete the PR checklist.
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title '[None][feat] optimize ucx build connection' is vague and lacks specificity about the actual changes made.	Use a more descriptive title that clarifies the specific optimization, such as '[None][feat] Refactor UCX connection initialization to pass connection IDs via constructor' or similar.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h (1)
1-3: ⚠️ Potential issue | 🟡 Minor

Update the NVIDIA copyright year to 2026.

This file was meaningfully modified in this PR, so the header should reflect the latest modification year. The same applies to the other modified UCX files in this change set.
Suggested fix
- * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines, "Add NVIDIA copyright header to ALL new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h` around
lines 1 - 3, Update the SPDX header year from 2025 to 2026 in the file's
copyright header: replace the "SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA
CORPORATION & AFFILIATES." line with 2026 in connection.h (and do the same for
any other modified UCX-related files in this change set) so the header reflects
the latest modification year.

🧹 Nitpick comments (2)

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (2)

533-535: Name the fromRequester flag at both constructor call sites.

With two same-typed connection IDs already in the call, the trailing raw true/false is easy to misread. Inline-name the boolean parameter at both sites.

Suggested fix

-        auto connection = std::make_shared<UcxConnection>(connectionId, connectionIdInPeer, newEp, this, false);
+        auto connection = std::make_shared<UcxConnection>(
+            connectionId, connectionIdInPeer, newEp, this, /* fromRequester = */ false);
...
-            connection = std::make_shared<UcxConnection>(connectionId, serverConnectionId, newEp, this, true);
+            connection = std::make_shared<UcxConnection>(
+                connectionId, serverConnectionId, newEp, this, /* fromRequester = */ true);

As per coding guidelines, "In C++ function calls where parameters are not obvious from inspection, use an inline C++ comment to document the parameter: `doSomeOperation(/* checkForErrors = */ false);`"

Also applies to: 591-593

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`
around lines 533 - 535, The UcxConnection constructor calls pass an unlabeled
boolean for the fromRequester flag which is easy to misread; update the two call
sites that create UcxConnection (the one currently using "false" after newEp and
the other call around the 591-593 region that uses "true"/"false") to use an
inline comment naming the parameter, e.g. replace the bare boolean with /*
fromRequester = */ false or /* fromRequester = */ true so the intent is explicit
when constructing UcxConnection and when inserting into mConnections.

63-66: Add a direct <optional> include for the new wire field.

UcxCmMessage now uses std::optional/std::nullopt directly, but this translation unit still relies on transitive includes for those symbols. Please include <optional> explicitly so the file stays self-contained.

Suggested fix

 `#include` <mutex>
 `#include` <numeric>
+#include <optional>
 `#include` <regex>
 `#include` <string>

Also applies to: 94-95

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`
around lines 63 - 66, The translation unit uses std::optional and std::nullopt
directly (e.g., member mConnectionId and the UcxCmMessage constructor) but lacks
a direct `#include` <optional>, relying on transitive includes; add an explicit
include for <optional> at the top of the file so std::optional/std::nullopt are
available, and ensure any other uses of std::optional in this file (the other
UcxCmMessage usages around the other constructor/fields noted) are covered by
this include so the file is self-contained.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`:
- Around line 457-465: The REP-thread handler must catch exceptions from
addConnection (which can propagate via TLLM_THROW) so the thread doesn't
terminate without replying; wrap the handler code that builds
serverMessage/UcxCmMessage, calls addConnection(workerAddress,
serverConnectionId, clientConnectionId), and sends mZmqRepSocket.send(...) in a
try-catch, and in the catch block serialize and send an explicit error response
back on mZmqRepSocket (using a clear error payload/string so the requester’s
recv() unblocks), then rethrow or cleanly exit; ensure you reference
addConnection, UcxCmMessage, serverMessageStr, and mZmqRepSocket when making the
changes.

---

Outside diff comments:
In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h`:
- Around line 1-3: Update the SPDX header year from 2025 to 2026 in the file's
copyright header: replace the "SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA
CORPORATION & AFFILIATES." line with 2026 in connection.h (and do the same for
any other modified UCX-related files in this change set) so the header reflects
the latest modification year.

---

Nitpick comments:
In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`:
- Around line 533-535: The UcxConnection constructor calls pass an unlabeled
boolean for the fromRequester flag which is easy to misread; update the two call
sites that create UcxConnection (the one currently using "false" after newEp and
the other call around the 591-593 region that uses "true"/"false") to use an
inline comment naming the parameter, e.g. replace the bare boolean with /*
fromRequester = */ false or /* fromRequester = */ true so the intent is explicit
when constructing UcxConnection and when inserting into mConnections.
- Around line 63-66: The translation unit uses std::optional and std::nullopt
directly (e.g., member mConnectionId and the UcxCmMessage constructor) but lacks
a direct `#include` <optional>, relying on transitive includes; add an explicit
include for <optional> at the top of the file so std::optional/std::nullopt are
available, and ensure any other uses of std::optional in this file (the other
UcxCmMessage usages around the other constructor/fields noted) are covered by
this include so the file is self-contained.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06994023-8027-419e-8b6e-a7ba80c0f603

📥 Commits

Reviewing files that changed from the base of the PR and between 61c8f5b and a34f020.

📒 Files selected for processing (4)

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h

coderabbitai · 2026-04-03T02:28:27Z

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp

+                        UcxCmMessage serverMessage(
+                            UcxCmMessage::MessageType::SERVER_WORKER_ADDRESS, selfWorkerAddress, serverConnectionId);
                        std::ostringstream oStream;
                        UcxCmMessage::serialize(serverMessage, oStream);
                        std::string serverMessageStr = oStream.str();
+                        // Register the connection (endpoint + future) BEFORE sending the ZMQ reply,
+                        // so the future is in mConnectionFutures when recvConnect looks it up.
+                        addConnection(workerAddress, serverConnectionId, clientConnectionId);
                        mZmqRepSocket.send(zmq::buffer(serverMessageStr), zmq::send_flags::none);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp' echo "REP-thread handshake handler:" sed -n '437,487p' "$FILE" echo echo "Synchronous addConnection implementation:" sed -n '525,541p' "$FILE" echo echo "TLLM_THROW definition:" fd -i 'tllmException.h' cpp | xargs -r rg -n -C2 '#\s*define\s+TLLM_THROW|TLLM_THROW\('

Repository: NVIDIA/TensorRT-LLM

Length of output: 4284

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp' # Get full TLLM_THROW macro definition echo "=== Full TLLM_THROW definition ===" rg -A 10 '#\s*define\s+TLLM_THROW' cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/tllmException.h 2>/dev/null | head -20 echo echo "=== Handler code around addConnection() (line 457-467) ===" sed -n '455,470p' "$FILE" echo echo "=== Complete REP-thread handler structure (lines 437-490) ===" sed -n '437,490p' "$FILE"

Repository: NVIDIA/TensorRT-LLM

Length of output: 97

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find tllmException.h echo "=== Finding tllmException.h ===" fd -type f 'tllmException' cpp echo echo "=== Handler code around addConnection() ===" FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp' sed -n '455,470p' "$FILE" echo echo "=== Finding TLLM_THROW in cpp directory ===" rg -l 'define\s+TLLM_THROW' cpp | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 282

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find tllmException.h correctly echo "=== Finding tllmException files ===" find cpp -name '*tllmException*' -o -name '*Exception*' | grep -i tllm | head -10 echo echo "=== Handler code around addConnection() ===" FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp' sed -n '455,470p' "$FILE" echo echo "=== TLLM_THROW definition from initial results - showing macro behavior ===" echo "Macro starts with: do { ... }" echo "This means TLLM_THROW throws an exception and does NOT return silently"

Repository: NVIDIA/TensorRT-LLM

Length of output: 1764

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Full TLLM_THROW macro definition ===" sed -n '29,40p' cpp/include/tensorrt_llm/common/tllmException.h echo echo "=== Complete handler structure (437-490) ===" FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp' sed -n '437,490p' "$FILE" echo echo "=== Check addConnection for internal error handling ===" sed -n '525,545p' "$FILE"

Repository: NVIDIA/TensorRT-LLM

Length of output: 5031

Wrap the REP-thread handler in try-catch to prevent connection-setup failures from crashing the thread.

The addConnection() call at line 461 lacks exception handling. The function's internal catch-block calls TLLM_THROW(), which throws an exception. If an error occurs during endpoint setup, the exception will unwind out of the std::thread entry function with no handler to catch it, terminating the thread. The outer try-catch (lines 485–487) only protects thread creation, not execution.

When this happens, the ZMQ REP socket never sends a reply, and the requester blocks indefinitely on its recv() call.

Wrap the handler logic in a try-catch block to catch any failure from addConnection() and send an explicit error response back to the requester before the thread exits.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp` around lines 457 - 465, The REP-thread handler must catch exceptions from addConnection (which can propagate via TLLM_THROW) so the thread doesn't terminate without replying; wrap the handler code that builds serverMessage/UcxCmMessage, calls addConnection(workerAddress, serverConnectionId, clientConnectionId), and sends mZmqRepSocket.send(...) in a try-catch, and in the catch block serialize and send an explicit error response back on mZmqRepSocket (using a clear error payload/string so the requester’s recv() unblocks), then rethrow or cleanly exit; ensure you reference addConnection, UcxCmMessage, serverMessageStr, and mZmqRepSocket when making the changes.

chuangz0 · 2026-04-03T02:35:50Z

/bot run

tensorrt-cicd · 2026-04-03T02:42:06Z

PR_Github #41559 [ run ] triggered by Bot. Commit: a34f020 Link to invocation

tensorrt-cicd · 2026-04-03T04:16:22Z

PR_Github #41559 [ run ] completed with state SUCCESS. Commit: a34f020
/LLM/main/L0_MergeRequest_PR pipeline #32470 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

chuangz0 · 2026-04-03T06:31:24Z

/bot run

tensorrt-cicd · 2026-04-03T06:36:50Z

PR_Github #41612 [ run ] triggered by Bot. Commit: 1722a1c Link to invocation

tensorrt-cicd · 2026-04-03T08:39:47Z

PR_Github #41612 [ run ] completed with state SUCCESS. Commit: 1722a1c
/LLM/main/L0_MergeRequest_PR pipeline #32521 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

github-actions bot assigned chuangz0 Apr 3, 2026

chuangz0 force-pushed the optimize_ucx_build_connection branch from ebefe97 to a34f020 Compare April 3, 2026 02:13

coderabbitai bot reviewed Apr 3, 2026

View reviewed changes

optimize ucx build connection

1722a1c

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

chuangz0 force-pushed the optimize_ucx_build_connection branch from a34f020 to 1722a1c Compare April 3, 2026 06:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[None][feat] optimize ucx build connection#12722

[None][feat] optimize ucx build connection#12722
chuangz0 wants to merge 1 commit intoNVIDIA:mainfrom
chuangz0:optimize_ucx_build_connection

chuangz0 commented Apr 3, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 3, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Apr 3, 2026

Uh oh!

chuangz0 commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

chuangz0 commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chuangz0 commented Apr 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

chuangz0 commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

chuangz0 commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chuangz0 commented Apr 3, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 3, 2026 •

edited

Loading