Skip to content

[None][feat] optimize ucx build connection#12722

Open
chuangz0 wants to merge 1 commit intoNVIDIA:mainfrom
chuangz0:optimize_ucx_build_connection
Open

[None][feat] optimize ucx build connection#12722
chuangz0 wants to merge 1 commit intoNVIDIA:mainfrom
chuangz0:optimize_ucx_build_connection

Conversation

@chuangz0
Copy link
Copy Markdown
Collaborator

@chuangz0 chuangz0 commented Apr 3, 2026

Summary by CodeRabbit

  • Refactor
    • Improved cache transmission connection establishment by streamlining connection ID negotiation.
    • Simplified connection synchronization mechanism for better efficiency.
    • Enhanced connection setup robustness and reliability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@chuangz0 chuangz0 force-pushed the optimize_ucx_build_connection branch from ebefe97 to a34f020 Compare April 3, 2026 02:13
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Refactored UCX connection ID negotiation by shifting from tag-based exchange in the constructor to ZMQ-based handshake during setup. Updated UcxConnection to accept a peer connection ID parameter, removed negotiation logic from the constructor, and restructured UcxConnectionManager to accept explicit connection IDs rather than generating them asynchronously. Eliminated future-based connection tracking in favor of direct connection registration.

Changes

Cohort / File(s) Summary
Connection class signatures
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h, cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
Updated UcxConnection constructor to accept connectionIdInPeer parameter. Removed tag-based UCX negotiation logic (tagRecv/tagSend branches) from constructor; connection ID exchange now delegated to ZMQ setup phase.
UCX Manager refactoring
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h, cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
Changed addConnection to accept explicit (workerAddress, connectionId, connectionIdInPeer) instead of single-argument variant. Simplified getNewConnectionId() to accept no parameters. Extended UcxCmMessage to carry optional connectionId field in ZMQ handshake. Removed mConnectionFutures and async connection tracking; connections now registered directly upon validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is empty except for the template itself; no actual description, test coverage details, or checklist items were filled in by the author. Fill in the Description section explaining why the UCX connection ID negotiation was moved from UCX tags to ZMQ handshake, provide test coverage details, and complete the PR checklist.
Docstring Coverage ⚠️ Warning Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title '[None][feat] optimize ucx build connection' is vague and lacks specificity about the actual changes made. Use a more descriptive title that clarifies the specific optimization, such as '[None][feat] Refactor UCX connection initialization to pass connection IDs via constructor' or similar.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h (1)

1-3: ⚠️ Potential issue | 🟡 Minor

Update the NVIDIA copyright year to 2026.

This file was meaningfully modified in this PR, so the header should reflect the latest modification year. The same applies to the other modified UCX files in this change set.

Suggested fix
- * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines, "Add NVIDIA copyright header to ALL new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h` around
lines 1 - 3, Update the SPDX header year from 2025 to 2026 in the file's
copyright header: replace the "SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA
CORPORATION & AFFILIATES." line with 2026 in connection.h (and do the same for
any other modified UCX-related files in this change set) so the header reflects
the latest modification year.
🧹 Nitpick comments (2)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (2)

533-535: Name the fromRequester flag at both constructor call sites.

With two same-typed connection IDs already in the call, the trailing raw true/false is easy to misread. Inline-name the boolean parameter at both sites.

Suggested fix
-        auto connection = std::make_shared<UcxConnection>(connectionId, connectionIdInPeer, newEp, this, false);
+        auto connection = std::make_shared<UcxConnection>(
+            connectionId, connectionIdInPeer, newEp, this, /* fromRequester = */ false);
...
-            connection = std::make_shared<UcxConnection>(connectionId, serverConnectionId, newEp, this, true);
+            connection = std::make_shared<UcxConnection>(
+                connectionId, serverConnectionId, newEp, this, /* fromRequester = */ true);
As per coding guidelines, "In C++ function calls where parameters are not obvious from inspection, use an inline C++ comment to document the parameter: `doSomeOperation(/* checkForErrors = */ false);`"

Also applies to: 591-593

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`
around lines 533 - 535, The UcxConnection constructor calls pass an unlabeled
boolean for the fromRequester flag which is easy to misread; update the two call
sites that create UcxConnection (the one currently using "false" after newEp and
the other call around the 591-593 region that uses "true"/"false") to use an
inline comment naming the parameter, e.g. replace the bare boolean with /*
fromRequester = */ false or /* fromRequester = */ true so the intent is explicit
when constructing UcxConnection and when inserting into mConnections.

63-66: Add a direct <optional> include for the new wire field.

UcxCmMessage now uses std::optional/std::nullopt directly, but this translation unit still relies on transitive includes for those symbols. Please include <optional> explicitly so the file stays self-contained.

Suggested fix
 `#include` <mutex>
 `#include` <numeric>
+#include <optional>
 `#include` <regex>
 `#include` <string>

Also applies to: 94-95

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`
around lines 63 - 66, The translation unit uses std::optional and std::nullopt
directly (e.g., member mConnectionId and the UcxCmMessage constructor) but lacks
a direct `#include` <optional>, relying on transitive includes; add an explicit
include for <optional> at the top of the file so std::optional/std::nullopt are
available, and ensure any other uses of std::optional in this file (the other
UcxCmMessage usages around the other constructor/fields noted) are covered by
this include so the file is self-contained.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`:
- Around line 457-465: The REP-thread handler must catch exceptions from
addConnection (which can propagate via TLLM_THROW) so the thread doesn't
terminate without replying; wrap the handler code that builds
serverMessage/UcxCmMessage, calls addConnection(workerAddress,
serverConnectionId, clientConnectionId), and sends mZmqRepSocket.send(...) in a
try-catch, and in the catch block serialize and send an explicit error response
back on mZmqRepSocket (using a clear error payload/string so the requester’s
recv() unblocks), then rethrow or cleanly exit; ensure you reference
addConnection, UcxCmMessage, serverMessageStr, and mZmqRepSocket when making the
changes.

---

Outside diff comments:
In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h`:
- Around line 1-3: Update the SPDX header year from 2025 to 2026 in the file's
copyright header: replace the "SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA
CORPORATION & AFFILIATES." line with 2026 in connection.h (and do the same for
any other modified UCX-related files in this change set) so the header reflects
the latest modification year.

---

Nitpick comments:
In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`:
- Around line 533-535: The UcxConnection constructor calls pass an unlabeled
boolean for the fromRequester flag which is easy to misread; update the two call
sites that create UcxConnection (the one currently using "false" after newEp and
the other call around the 591-593 region that uses "true"/"false") to use an
inline comment naming the parameter, e.g. replace the bare boolean with /*
fromRequester = */ false or /* fromRequester = */ true so the intent is explicit
when constructing UcxConnection and when inserting into mConnections.
- Around line 63-66: The translation unit uses std::optional and std::nullopt
directly (e.g., member mConnectionId and the UcxCmMessage constructor) but lacks
a direct `#include` <optional>, relying on transitive includes; add an explicit
include for <optional> at the top of the file so std::optional/std::nullopt are
available, and ensure any other uses of std::optional in this file (the other
UcxCmMessage usages around the other constructor/fields noted) are covered by
this include so the file is self-contained.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06994023-8027-419e-8b6e-a7ba80c0f603

📥 Commits

Reviewing files that changed from the base of the PR and between 61c8f5b and a34f020.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h

Comment on lines +457 to 465
UcxCmMessage serverMessage(
UcxCmMessage::MessageType::SERVER_WORKER_ADDRESS, selfWorkerAddress, serverConnectionId);
std::ostringstream oStream;
UcxCmMessage::serialize(serverMessage, oStream);
std::string serverMessageStr = oStream.str();
// Register the connection (endpoint + future) BEFORE sending the ZMQ reply,
// so the future is in mConnectionFutures when recvConnect looks it up.
addConnection(workerAddress, serverConnectionId, clientConnectionId);
mZmqRepSocket.send(zmq::buffer(serverMessageStr), zmq::send_flags::none);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'

echo "REP-thread handshake handler:"
sed -n '437,487p' "$FILE"

echo
echo "Synchronous addConnection implementation:"
sed -n '525,541p' "$FILE"

echo
echo "TLLM_THROW definition:"
fd -i 'tllmException.h' cpp | xargs -r rg -n -C2 '#\s*define\s+TLLM_THROW|TLLM_THROW\('

Repository: NVIDIA/TensorRT-LLM

Length of output: 4284


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'

# Get full TLLM_THROW macro definition
echo "=== Full TLLM_THROW definition ==="
rg -A 10 '#\s*define\s+TLLM_THROW' cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/tllmException.h 2>/dev/null | head -20

echo
echo "=== Handler code around addConnection() (line 457-467) ==="
sed -n '455,470p' "$FILE"

echo
echo "=== Complete REP-thread handler structure (lines 437-490) ==="
sed -n '437,490p' "$FILE"

Repository: NVIDIA/TensorRT-LLM

Length of output: 97


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find tllmException.h
echo "=== Finding tllmException.h ==="
fd -type f 'tllmException' cpp

echo
echo "=== Handler code around addConnection() ==="
FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'
sed -n '455,470p' "$FILE"

echo
echo "=== Finding TLLM_THROW in cpp directory ==="
rg -l 'define\s+TLLM_THROW' cpp | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 282


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find tllmException.h correctly
echo "=== Finding tllmException files ==="
find cpp -name '*tllmException*' -o -name '*Exception*' | grep -i tllm | head -10

echo
echo "=== Handler code around addConnection() ==="
FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'
sed -n '455,470p' "$FILE"

echo
echo "=== TLLM_THROW definition from initial results - showing macro behavior ==="
echo "Macro starts with: do { ... }"
echo "This means TLLM_THROW throws an exception and does NOT return silently"

Repository: NVIDIA/TensorRT-LLM

Length of output: 1764


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full TLLM_THROW macro definition ==="
sed -n '29,40p' cpp/include/tensorrt_llm/common/tllmException.h

echo
echo "=== Complete handler structure (437-490) ==="
FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'
sed -n '437,490p' "$FILE"

echo
echo "=== Check addConnection for internal error handling ==="
sed -n '525,545p' "$FILE"

Repository: NVIDIA/TensorRT-LLM

Length of output: 5031


Wrap the REP-thread handler in try-catch to prevent connection-setup failures from crashing the thread.

The addConnection() call at line 461 lacks exception handling. The function's internal catch-block calls TLLM_THROW(), which throws an exception. If an error occurs during endpoint setup, the exception will unwind out of the std::thread entry function with no handler to catch it, terminating the thread. The outer try-catch (lines 485–487) only protects thread creation, not execution.

When this happens, the ZMQ REP socket never sends a reply, and the requester blocks indefinitely on its recv() call.

Wrap the handler logic in a try-catch block to catch any failure from addConnection() and send an explicit error response back to the requester before the thread exits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`
around lines 457 - 465, The REP-thread handler must catch exceptions from
addConnection (which can propagate via TLLM_THROW) so the thread doesn't
terminate without replying; wrap the handler code that builds
serverMessage/UcxCmMessage, calls addConnection(workerAddress,
serverConnectionId, clientConnectionId), and sends mZmqRepSocket.send(...) in a
try-catch, and in the catch block serialize and send an explicit error response
back on mZmqRepSocket (using a clear error payload/string so the requester’s
recv() unblocks), then rethrow or cleanly exit; ensure you reference
addConnection, UcxCmMessage, serverMessageStr, and mZmqRepSocket when making the
changes.

@chuangz0
Copy link
Copy Markdown
Collaborator Author

chuangz0 commented Apr 3, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41559 [ run ] triggered by Bot. Commit: a34f020 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41559 [ run ] completed with state SUCCESS. Commit: a34f020
/LLM/main/L0_MergeRequest_PR pipeline #32470 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
@chuangz0 chuangz0 force-pushed the optimize_ucx_build_connection branch from a34f020 to 1722a1c Compare April 3, 2026 06:31
@chuangz0
Copy link
Copy Markdown
Collaborator Author

chuangz0 commented Apr 3, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41612 [ run ] triggered by Bot. Commit: 1722a1c Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41612 [ run ] completed with state SUCCESS. Commit: 1722a1c
/LLM/main/L0_MergeRequest_PR pipeline #32521 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants