[None][feat] optimize ucx build connection#12722
[None][feat] optimize ucx build connection#12722chuangz0 wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
ebefe97 to
a34f020
Compare
📝 WalkthroughWalkthroughRefactored UCX connection ID negotiation by shifting from tag-based exchange in the constructor to ZMQ-based handshake during setup. Updated Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h (1)
1-3:⚠️ Potential issue | 🟡 MinorUpdate the NVIDIA copyright year to 2026.
This file was meaningfully modified in this PR, so the header should reflect the latest modification year. The same applies to the other modified UCX files in this change set.
As per coding guidelines, "Add NVIDIA copyright header to ALL new files; update year on modified files".Suggested fix
- * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h` around lines 1 - 3, Update the SPDX header year from 2025 to 2026 in the file's copyright header: replace the "SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES." line with 2026 in connection.h (and do the same for any other modified UCX-related files in this change set) so the header reflects the latest modification year.
🧹 Nitpick comments (2)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (2)
533-535: Name thefromRequesterflag at both constructor call sites.With two same-typed connection IDs already in the call, the trailing raw
true/falseis easy to misread. Inline-name the boolean parameter at both sites.As per coding guidelines, "In C++ function calls where parameters are not obvious from inspection, use an inline C++ comment to document the parameter: `doSomeOperation(/* checkForErrors = */ false);`"Suggested fix
- auto connection = std::make_shared<UcxConnection>(connectionId, connectionIdInPeer, newEp, this, false); + auto connection = std::make_shared<UcxConnection>( + connectionId, connectionIdInPeer, newEp, this, /* fromRequester = */ false); ... - connection = std::make_shared<UcxConnection>(connectionId, serverConnectionId, newEp, this, true); + connection = std::make_shared<UcxConnection>( + connectionId, serverConnectionId, newEp, this, /* fromRequester = */ true);Also applies to: 591-593
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp` around lines 533 - 535, The UcxConnection constructor calls pass an unlabeled boolean for the fromRequester flag which is easy to misread; update the two call sites that create UcxConnection (the one currently using "false" after newEp and the other call around the 591-593 region that uses "true"/"false") to use an inline comment naming the parameter, e.g. replace the bare boolean with /* fromRequester = */ false or /* fromRequester = */ true so the intent is explicit when constructing UcxConnection and when inserting into mConnections.
63-66: Add a direct<optional>include for the new wire field.
UcxCmMessagenow usesstd::optional/std::nulloptdirectly, but this translation unit still relies on transitive includes for those symbols. Please include<optional>explicitly so the file stays self-contained.Suggested fix
`#include` <mutex> `#include` <numeric> +#include <optional> `#include` <regex> `#include` <string>Also applies to: 94-95
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp` around lines 63 - 66, The translation unit uses std::optional and std::nullopt directly (e.g., member mConnectionId and the UcxCmMessage constructor) but lacks a direct `#include` <optional>, relying on transitive includes; add an explicit include for <optional> at the top of the file so std::optional/std::nullopt are available, and ensure any other uses of std::optional in this file (the other UcxCmMessage usages around the other constructor/fields noted) are covered by this include so the file is self-contained.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`:
- Around line 457-465: The REP-thread handler must catch exceptions from
addConnection (which can propagate via TLLM_THROW) so the thread doesn't
terminate without replying; wrap the handler code that builds
serverMessage/UcxCmMessage, calls addConnection(workerAddress,
serverConnectionId, clientConnectionId), and sends mZmqRepSocket.send(...) in a
try-catch, and in the catch block serialize and send an explicit error response
back on mZmqRepSocket (using a clear error payload/string so the requester’s
recv() unblocks), then rethrow or cleanly exit; ensure you reference
addConnection, UcxCmMessage, serverMessageStr, and mZmqRepSocket when making the
changes.
---
Outside diff comments:
In `@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.h`:
- Around line 1-3: Update the SPDX header year from 2025 to 2026 in the file's
copyright header: replace the "SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA
CORPORATION & AFFILIATES." line with 2026 in connection.h (and do the same for
any other modified UCX-related files in this change set) so the header reflects
the latest modification year.
---
Nitpick comments:
In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`:
- Around line 533-535: The UcxConnection constructor calls pass an unlabeled
boolean for the fromRequester flag which is easy to misread; update the two call
sites that create UcxConnection (the one currently using "false" after newEp and
the other call around the 591-593 region that uses "true"/"false") to use an
inline comment naming the parameter, e.g. replace the bare boolean with /*
fromRequester = */ false or /* fromRequester = */ true so the intent is explicit
when constructing UcxConnection and when inserting into mConnections.
- Around line 63-66: The translation unit uses std::optional and std::nullopt
directly (e.g., member mConnectionId and the UcxCmMessage constructor) but lacks
a direct `#include` <optional>, relying on transitive includes; add an explicit
include for <optional> at the top of the file so std::optional/std::nullopt are
available, and ensure any other uses of std::optional in this file (the other
UcxCmMessage usages around the other constructor/fields noted) are covered by
this include so the file is self-contained.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 06994023-8027-419e-8b6e-a7ba80c0f603
📒 Files selected for processing (4)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cppcpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.hcpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cppcpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
| UcxCmMessage serverMessage( | ||
| UcxCmMessage::MessageType::SERVER_WORKER_ADDRESS, selfWorkerAddress, serverConnectionId); | ||
| std::ostringstream oStream; | ||
| UcxCmMessage::serialize(serverMessage, oStream); | ||
| std::string serverMessageStr = oStream.str(); | ||
| // Register the connection (endpoint + future) BEFORE sending the ZMQ reply, | ||
| // so the future is in mConnectionFutures when recvConnect looks it up. | ||
| addConnection(workerAddress, serverConnectionId, clientConnectionId); | ||
| mZmqRepSocket.send(zmq::buffer(serverMessageStr), zmq::send_flags::none); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'
echo "REP-thread handshake handler:"
sed -n '437,487p' "$FILE"
echo
echo "Synchronous addConnection implementation:"
sed -n '525,541p' "$FILE"
echo
echo "TLLM_THROW definition:"
fd -i 'tllmException.h' cpp | xargs -r rg -n -C2 '#\s*define\s+TLLM_THROW|TLLM_THROW\('Repository: NVIDIA/TensorRT-LLM
Length of output: 4284
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'
# Get full TLLM_THROW macro definition
echo "=== Full TLLM_THROW definition ==="
rg -A 10 '#\s*define\s+TLLM_THROW' cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/tllmException.h 2>/dev/null | head -20
echo
echo "=== Handler code around addConnection() (line 457-467) ==="
sed -n '455,470p' "$FILE"
echo
echo "=== Complete REP-thread handler structure (lines 437-490) ==="
sed -n '437,490p' "$FILE"Repository: NVIDIA/TensorRT-LLM
Length of output: 97
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find tllmException.h
echo "=== Finding tllmException.h ==="
fd -type f 'tllmException' cpp
echo
echo "=== Handler code around addConnection() ==="
FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'
sed -n '455,470p' "$FILE"
echo
echo "=== Finding TLLM_THROW in cpp directory ==="
rg -l 'define\s+TLLM_THROW' cpp | head -5Repository: NVIDIA/TensorRT-LLM
Length of output: 282
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find tllmException.h correctly
echo "=== Finding tllmException files ==="
find cpp -name '*tllmException*' -o -name '*Exception*' | grep -i tllm | head -10
echo
echo "=== Handler code around addConnection() ==="
FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'
sed -n '455,470p' "$FILE"
echo
echo "=== TLLM_THROW definition from initial results - showing macro behavior ==="
echo "Macro starts with: do { ... }"
echo "This means TLLM_THROW throws an exception and does NOT return silently"Repository: NVIDIA/TensorRT-LLM
Length of output: 1764
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Full TLLM_THROW macro definition ==="
sed -n '29,40p' cpp/include/tensorrt_llm/common/tllmException.h
echo
echo "=== Complete handler structure (437-490) ==="
FILE='cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp'
sed -n '437,490p' "$FILE"
echo
echo "=== Check addConnection for internal error handling ==="
sed -n '525,545p' "$FILE"Repository: NVIDIA/TensorRT-LLM
Length of output: 5031
Wrap the REP-thread handler in try-catch to prevent connection-setup failures from crashing the thread.
The addConnection() call at line 461 lacks exception handling. The function's internal catch-block calls TLLM_THROW(), which throws an exception. If an error occurs during endpoint setup, the exception will unwind out of the std::thread entry function with no handler to catch it, terminating the thread. The outer try-catch (lines 485–487) only protects thread creation, not execution.
When this happens, the ZMQ REP socket never sends a reply, and the requester blocks indefinitely on its recv() call.
Wrap the handler logic in a try-catch block to catch any failure from addConnection() and send an explicit error response back to the requester before the thread exits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp`
around lines 457 - 465, The REP-thread handler must catch exceptions from
addConnection (which can propagate via TLLM_THROW) so the thread doesn't
terminate without replying; wrap the handler code that builds
serverMessage/UcxCmMessage, calls addConnection(workerAddress,
serverConnectionId, clientConnectionId), and sends mZmqRepSocket.send(...) in a
try-catch, and in the catch block serialize and send an explicit error response
back on mZmqRepSocket (using a clear error payload/string so the requester’s
recv() unblocks), then rethrow or cleanly exit; ensure you reference
addConnection, UcxCmMessage, serverMessageStr, and mZmqRepSocket when making the
changes.
|
/bot run |
|
PR_Github #41559 [ run ] triggered by Bot. Commit: |
|
PR_Github #41559 [ run ] completed with state
|
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
a34f020 to
1722a1c
Compare
|
/bot run |
|
PR_Github #41612 [ run ] triggered by Bot. Commit: |
|
PR_Github #41612 [ run ] completed with state
|
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.