Add ml.p5e.48xlarge to EFA instance lists in sagemaker-train and sagemaker-core #5491

@srujithpoondla03

Description

The SM_EFA_NCCL_INSTANCES and SM_EFA_RDMA_INSTANCES lists in the sagemaker-python-sdk are missing ml.p5e.48xlarge, causing NCCL hangs during distributed training initialization on P5e instances when using the SDK's container drivers.

Additionally, ml.p5.48xlarge is missing from SM_EFA_RDMA_INSTANCES (it's only in SM_EFA_NCCL_INSTANCES).

Current State

SM_EFA_NCCL_INSTANCES = [
    "ml.g4dn.8xlarge",
    "ml.g4dn.12xlarge",
    "ml.g5.48xlarge",
    "ml.p3dn.24xlarge",
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",
    "ml.trn1.32xlarge",
]

SM_EFA_RDMA_INSTANCES = [
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.trn1.32xlarge",
]

Expected State

SM_EFA_NCCL_INSTANCES = [
    "ml.g4dn.8xlarge",
    "ml.g4dn.12xlarge",
    "ml.g5.48xlarge",
    "ml.p3dn.24xlarge",
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",
    "ml.p5e.48xlarge",  # ADD
    "ml.trn1.32xlarge",
]

SM_EFA_RDMA_INSTANCES = [
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",   # ADD
    "ml.p5e.48xlarge",  # ADD
    "ml.trn1.32xlarge",
]
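The difference between the two states can be checked mechanically. The sketch below copies the lists from this issue into local variables and reports what each list gains. The `CURRENT_*` / `EXPECTED_*` names and the `added` helper are illustrative only, not SDK identifiers.

```python
# Lists copied from the "Current State" section of this issue.
CURRENT_NCCL = [
    "ml.g4dn.8xlarge", "ml.g4dn.12xlarge", "ml.g5.48xlarge",
    "ml.p3dn.24xlarge", "ml.p4d.24xlarge", "ml.p4de.24xlarge",
    "ml.p5.48xlarge", "ml.trn1.32xlarge",
]
CURRENT_RDMA = ["ml.p4d.24xlarge", "ml.p4de.24xlarge", "ml.trn1.32xlarge"]

# Lists copied from the "Expected State" section of this issue.
EXPECTED_NCCL = [
    "ml.g4dn.8xlarge", "ml.g4dn.12xlarge", "ml.g5.48xlarge",
    "ml.p3dn.24xlarge", "ml.p4d.24xlarge", "ml.p4de.24xlarge",
    "ml.p5.48xlarge", "ml.p5e.48xlarge", "ml.trn1.32xlarge",
]
EXPECTED_RDMA = [
    "ml.p4d.24xlarge", "ml.p4de.24xlarge", "ml.p5.48xlarge",
    "ml.p5e.48xlarge", "ml.trn1.32xlarge",
]


def added(current, expected):
    """Return entries present in `expected` but not in `current`, in order."""
    return [name for name in expected if name not in current]


print(added(CURRENT_NCCL, EXPECTED_NCCL))  # ['ml.p5e.48xlarge']
print(added(CURRENT_RDMA, EXPECTED_RDMA))  # ['ml.p5.48xlarge', 'ml.p5e.48xlarge']
```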

Impact

Without these entries, the SDK's container drivers don't set the required EFA environment variables (FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1, RDMAV_FORK_SAFE=1) for P5e instances, causing NCCL to hang during collective initialization in multi-node distributed training.
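To illustrate the failure mode, here is a minimal sketch of how list membership could gate those variables. It is not the SDK's driver code: `efa_env_for` is a hypothetical helper, and the split shown (FI_PROVIDER keyed on the NCCL list, the other two variables keyed on the RDMA list) is my assumption about how the lists are used.

```python
def efa_env_for(instance_type, nccl_instances, rdma_instances):
    """Hypothetical helper: EFA env vars that would be set for an instance type.

    Assumption: FI_PROVIDER is gated on the NCCL list and the two
    RDMA-related variables are gated on the RDMA list.
    """
    env = {}
    if instance_type in nccl_instances:
        env["FI_PROVIDER"] = "efa"
    if instance_type in rdma_instances:
        env["FI_EFA_USE_DEVICE_RDMA"] = "1"
        env["RDMAV_FORK_SAFE"] = "1"
    return env


# Abbreviated copies of the lists in this issue (P4d/P5 family only).
current_nccl = ["ml.p4d.24xlarge", "ml.p4de.24xlarge", "ml.p5.48xlarge"]
current_rdma = ["ml.p4d.24xlarge", "ml.p4de.24xlarge"]
expected_nccl = current_nccl + ["ml.p5e.48xlarge"]
expected_rdma = current_rdma + ["ml.p5.48xlarge", "ml.p5e.48xlarge"]

# With the current lists, P5e gets no EFA variables at all.
print(efa_env_for("ml.p5e.48xlarge", current_nccl, current_rdma))
# With the expected lists, all three variables are present.
print(efa_env_for("ml.p5e.48xlarge", expected_nccl, expected_rdma))
```

Under the same assumption, ml.p5.48xlarge with the current lists would get FI_PROVIDER but neither RDMA variable, which matches the second gap described above.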

Questions

  1. Is there a specific process for testing EFA/instance-specific changes on actual hardware before merging?
  2. Should integration tests be added for P5e EFA configuration, or are unit tests sufficient?
