
Multi-Device TensorRT Runtime with Native NCCL Collectives #4157

Draft
apbose wants to merge 1 commit into main from abose/trt_MD_cpp_runtime

Conversation

@apbose (Collaborator) commented Apr 1, 2026

  • C++ runtime: NCCL communicator init via c10d, rank/world_size serialization, DynamicOutputAllocator, ABI version bump to 8
  • Python runtime: distributed support in PythonTorchTensorRTModule and TorchTensorRTModule, NCCL library auto-detection
  • Conversion: native TRT DistCollective API (AllGather, ReduceScatter, AllReduce) with TRT-LLM plugin fallback
  • Graph lowering: fuse c10d_functional collectives + wait_tensor into single ops
  • Feature detection: native_trt_collectives flag, platform validation, graceful fallback chain
  • Build: conditional NCCL compilation via torch_nccl toolchain
  • Examples: tensor_parallel_simple_example.py, tensor_parallel_llama_llm.py
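The "fuse c10d_functional collectives + wait_tensor into single ops" bullet can be sketched in miniature. This is a toy pass over a flat list of op names, purely illustrative; the real lowering operates on torch.fx graphs and the fused op names here are hypothetical.

```python
# Toy sketch of the collective-fusion lowering pass described above.
# The real pass rewrites torch.fx graph nodes; here a graph is just a
# list of op-name strings, and the fused names are made up.

COLLECTIVES = {"all_reduce", "all_gather_into_tensor", "reduce_scatter_tensor"}

def fuse_collectives(nodes):
    """Collapse each (collective, wait_tensor) pair into one fused node."""
    fused = []
    i = 0
    while i < len(nodes):
        op = nodes[i]
        if (op in COLLECTIVES
                and i + 1 < len(nodes)
                and nodes[i + 1] == "wait_tensor"):
            fused.append(f"fused_{op}")
            i += 2  # consume both the collective and its wait
        else:
            fused.append(op)
            i += 1
    return fused

graph = ["embedding", "all_reduce", "wait_tensor", "linear",
         "reduce_scatter_tensor", "wait_tensor"]
print(fuse_collectives(graph))
# → ['embedding', 'fused_all_reduce', 'linear', 'fused_reduce_scatter_tensor']
```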
@meta-cla meta-cla bot added the cla signed label Apr 1, 2026
@apbose apbose marked this pull request as draft April 1, 2026 23:52
@github-actions github-actions bot added component: lowering Issues re: The lowering / preprocessing passes component: conversion Issues re: Conversion stage component: core Issues re: The core compiler component: converters Issues re: Specific op converters component: build system Issues re: Build system component: api [Python] Issues re: Python API component: runtime component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Apr 1, 2026
@narendasan (Collaborator) left a comment

The high-order bits: there is too much NCCL management happening all over the place, and state is not managed tightly enough. Fundamentally, setting up NCCL is not our job. The user tells us the information we need (the communicator and which rank the process is on) and we trust them to do the rest.

The runtime modules do not need to care whether NCCL is set up, as long as the information needed to deserialize and set up the engine is available.

Also, it seems the C++ runtime only uses NCCL from c10d, but there's other code which falls back to NCCL from the NCCL Python bindings. What do we actually want to support? Just c10d, or both?
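The "trust the user" contract the review asks for could look something like the following sketch: the caller sets up distributed state, and the runtime only consumes rank/world_size. The names (`DistributedInfo`, `from_user_or_env`) are illustrative, not the actual Torch-TensorRT API; `RANK`/`WORLD_SIZE` are the standard torchrun environment variables.

```python
# Hypothetical sketch: the runtime accepts distributed info from the
# caller, falls back to the standard torchrun env vars, and otherwise
# fails loudly -- it never initializes NCCL itself.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class DistributedInfo:
    rank: int
    world_size: int

def from_user_or_env(rank=None, world_size=None):
    """Prefer explicit values from the caller; else read RANK/WORLD_SIZE."""
    if rank is not None and world_size is not None:
        return DistributedInfo(rank, world_size)
    try:
        return DistributedInfo(int(os.environ["RANK"]),
                               int(os.environ["WORLD_SIZE"]))
    except KeyError as e:
        raise RuntimeError(f"distributed context not provided: missing {e}")
```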

REQUIRES_OUTPUT_ALLOCATOR_IDX,
RESOURCE_ALLOCATION_STRATEGY_IDX,
RANK_IDX,
WORLD_SIZE_IDX,

Make sure to bump the ABI version


Fields that are only sometimes applicable should be prefixed with OPTIONAL_ and need guards
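The OPTIONAL_ convention being suggested can be sketched as follows: an optional serialized field is written behind a presence flag, so readers only touch it when it exists. The field names and flat-list encoding here are hypothetical stand-ins for the engine's serialization format.

```python
# Illustrative sketch of guarded optional fields in a serialized record.
# A presence flag precedes each OPTIONAL_ field; readers check the guard
# before consuming the value. Layout and names are made up.

def serialize(engine_blob, optional_rank=None):
    record = [engine_blob]
    has_rank = optional_rank is not None
    record.append(int(has_rank))        # guard flag for OPTIONAL_RANK
    if has_rank:
        record.append(optional_rank)    # only written when present
    return record

def deserialize(record):
    engine_blob = record[0]
    rank = record[2] if record[1] else None  # guarded read
    return engine_blob, rank
```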

? ResourceAllocationStrategy::kDynamic
: ResourceAllocationStrategy::kStatic)) {}
: ResourceAllocationStrategy::kStatic)) {
// Load distributed info if available (backward compatible with older ABI versions)

We don't need backwards compat unless some semantic definition changed; just bump the version.
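The "just bump the version" policy amounts to an exact-match check at load time, rather than per-field fallbacks for older layouts. A minimal sketch, assuming a single integer ABI version in the serialized form (the constant and message are illustrative):

```python
# Sketch: the runtime rejects any serialized engine whose ABI version it
# was not built for, instead of guessing at older layouts.

TARGET_ABI_VERSION = 8  # bumped whenever the serialized layout changes

def check_abi(serialized_version):
    if serialized_version != TARGET_ABI_VERSION:
        raise RuntimeError(
            f"engine serialized with ABI v{serialized_version}, "
            f"runtime expects v{TARGET_ABI_VERSION}; recompile the engine")
```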

return false;
}

if (this->nccl_comm == nullptr) {

If it can be avoided, do not let anything be null. Use a sentinel value or a smart pointer.
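The sentinel alternative to a nullable communicator, sketched in Python for brevity (the C++ analogue would be a non-null default object or a smart pointer): the absent state is a real object with defined failing behavior, so call sites never null-check. All class names here are hypothetical.

```python
# Sketch of the sentinel pattern: instead of nccl_comm == nullptr checks
# scattered around, the "no communicator" state is an object whose use is
# an immediate, descriptive error.

class NoComm:
    """Sentinel communicator: any collective call fails loudly."""
    def all_reduce(self, tensor):
        raise RuntimeError("collective called before a communicator was attached")

class RealComm:
    def all_reduce(self, tensor):
        return tensor  # stand-in for the actual NCCL call

class Engine:
    def __init__(self):
        self.comm = NoComm()  # never None; always safe to call through
```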

}

// Set NCCL communicator on TensorRT execution context
try {

We need a real state machine or some true semantics here. Nothing else in the runtime just try/catches.

LOG_INFO(" Current world_size: " << this->world_size);
LOG_INFO(" Current device_id: " << this->device_info.id);

try {

Same here: no nested try/catches. You need to provide guarantees about the state of the system.
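One way to give the init path real semantics, as requested: an explicit state enum with a single legal transition order, so an illegal call is a programming error rather than a swallowed exception. The states and method names below are illustrative, not the actual runtime API.

```python
# Sketch: explicit lifecycle states for the communicator setup path,
# replacing nested try/except. Every transition checks the current state
# and fails with a clear message if called out of order.
from enum import Enum, auto

class CommState(Enum):
    UNINITIALIZED = auto()
    COMM_ATTACHED = auto()
    READY = auto()

class CommLifecycle:
    def __init__(self):
        self.state = CommState.UNINITIALIZED

    def attach_comm(self):
        if self.state is not CommState.UNINITIALIZED:
            raise RuntimeError(f"attach_comm called from {self.state}")
        self.state = CommState.COMM_ATTACHED

    def bind_to_context(self):
        if self.state is not CommState.COMM_ATTACHED:
            raise RuntimeError(f"bind_to_context called from {self.state}")
        self.state = CommState.READY
```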

logger.error(f"Failed to set NCCL communicator: {e}")
raise

def get_nccl_communicator(self) -> Optional[Any]:

We don't need getters and setters; if you want people to use this field, just make it public.
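In Python there is no cost to exposing the attribute directly; a `get_`/`set_` pair adds indirection without encapsulation. A minimal sketch (class and attribute names illustrative):

```python
# Sketch: a public attribute replaces get_nccl_communicator()/set_...;
# callers read and write module.nccl_comm directly.

class Module:
    def __init__(self):
        self.nccl_comm = None  # public; no accessor methods needed
```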

"""Get the NCCL communicator if set."""
return self._nccl_comm

def setup_nccl(self, use_pytorch_comm: bool = True) -> None:

This should be on the user


uid = nccl.UniqueId.from_bytes(bytes(uid_tensor.cpu().numpy()))

comm = nccl.Communicator.init(world_size, rank, uid)

Is this a singleton owned by NCCL, or can you have multiple? Can a user just give this to us?

}
}

void TRTEngine::init_nccl_comm(const std::string& group_name) {

Why do we need a method that transparently calls another method?

LOG_INFO(" Got NCCL backend from ProcessGroup");

// Cast the backend to ProcessGroupNCCL
auto* nccl_pg = dynamic_cast<c10d::ProcessGroupNCCL*>(backend.get());

Try to use smart pointers if possible

weight_name_map: Optional[dict[Any, Any]] = None,
requires_output_allocator: bool = False,
symbolic_shape_expressions: Optional[Dict[str, List[Dict[str, Any]]]] = None,
rank: int = -1,

I think all we need is a flag that this is an md_engine; we can fetch the info from torch.distributed or the env vars internally.
