Add ml.p5e.48xlarge to EFA instance lists in sagemaker-train and sagemaker-core
Description
The SM_EFA_NCCL_INSTANCES and SM_EFA_RDMA_INSTANCES lists in the sagemaker-python-sdk are missing ml.p5e.48xlarge, causing NCCL hangs during distributed training initialization on P5e instances when using the SDK's container drivers.
Additionally, ml.p5.48xlarge is missing from SM_EFA_RDMA_INSTANCES (it's only in SM_EFA_NCCL_INSTANCES).
Current State
SM_EFA_NCCL_INSTANCES = [
    "ml.g4dn.8xlarge",
    "ml.g4dn.12xlarge",
    "ml.g5.48xlarge",
    "ml.p3dn.24xlarge",
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",
    "ml.trn1.32xlarge",
]
SM_EFA_RDMA_INSTANCES = [
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.trn1.32xlarge",
]
Expected State
SM_EFA_NCCL_INSTANCES = [
    "ml.g4dn.8xlarge",
    "ml.g4dn.12xlarge",
    "ml.g5.48xlarge",
    "ml.p3dn.24xlarge",
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",
    "ml.p5e.48xlarge",  # ADD
    "ml.trn1.32xlarge",
]
SM_EFA_RDMA_INSTANCES = [
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",  # ADD
    "ml.p5e.48xlarge",  # ADD
    "ml.trn1.32xlarge",
]
Impact
Without these entries, the SDK's container drivers don't set the required EFA environment variables (FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1, RDMAV_FORK_SAFE=1) for P5e instances, causing NCCL to hang during collective initialization in multi-node distributed training.
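To illustrate why a missing list entry matters, here is a minimal sketch of list-based gating. It is not the SDK's actual driver code: the helper name `efa_env_for` is hypothetical, and the split of variables between the two lists (`FI_PROVIDER` gated on the NCCL list, the two RDMA variables gated on the RDMA list) is an assumption made for illustration. The lists are the ones from Expected State above.

```python
# Instance lists as proposed under "Expected State" in this issue.
SM_EFA_NCCL_INSTANCES = [
    "ml.g4dn.8xlarge",
    "ml.g4dn.12xlarge",
    "ml.g5.48xlarge",
    "ml.p3dn.24xlarge",
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",
    "ml.p5e.48xlarge",
    "ml.trn1.32xlarge",
]
SM_EFA_RDMA_INSTANCES = [
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",
    "ml.p5e.48xlarge",
    "ml.trn1.32xlarge",
]


def efa_env_for(instance_type: str) -> dict:
    """Hypothetical helper: return the EFA variables for an instance type.

    Assumption: FI_PROVIDER is gated on the NCCL list and the RDMA
    variables on the RDMA list. The real drivers may differ.
    """
    env = {}
    if instance_type in SM_EFA_NCCL_INSTANCES:
        env["FI_PROVIDER"] = "efa"
    if instance_type in SM_EFA_RDMA_INSTANCES:
        env["FI_EFA_USE_DEVICE_RDMA"] = "1"
        env["RDMAV_FORK_SAFE"] = "1"
    return env


if __name__ == "__main__":
    for itype in ("ml.p5e.48xlarge", "ml.g5.48xlarge", "ml.m5.xlarge"):
        print(itype, efa_env_for(itype))
```

Under this kind of gating, an instance type absent from both lists gets an empty mapping, which is the situation P5e is in with the current lists.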
Related
- sagemaker-training-toolkit issue: "Add ml.p5e.48xlarge to EFA instance lists (SM_EFA_NCCL_INSTANCES and SM_EFA_RDMA_INSTANCES)" (sagemaker-training-toolkit#240)
- sagemaker-training-toolkit PR: "feat: Add ml.p5e.48xlarge and ml.p5.48xlarge to EFA instance lists" (sagemaker-training-toolkit#241)
- P5e instances use EFA with RDMA support, same as P4d/P4de/P5
Questions
- Is there a specific process for testing EFA/instance-specific changes on actual hardware before merging?
- Should integration tests be added for P5e EFA configuration, or are unit tests sufficient?