GitHub - aws/sagemaker-hyperpod-cluster-setup: This repository provides setup assets to create Amazon SageMaker HyperPod clusters orchestrated with either Slurm or Amazon EKS. These clusters help you quickly scale model development tasks such as training, fine-tuning, or inference across a cluster of hundreds or thousands of AI accelerators.

SageMaker HyperPod cluster setup assets

This repository contains the setup assets required to create Amazon SageMaker HyperPod clusters using either Slurm or Amazon EKS for orchestration. With these assets you can create all the resources needed for large-scale AI/ML workloads, including networking, storage, compute, and IAM permissions.

SageMaker HyperPod clusters are purpose-built for scalability and resilience, designed to accelerate large-scale distributed training and deployment of complex machine learning models like LLMs and diffusion models, as well as customization of Amazon Nova foundation models.

Prerequisites needed to set up a HyperPod cluster

The CloudFormation templates in this repository automate the provisioning of all necessary AWS resources along with your SageMaker HyperPod cluster. The templates are designed for flexibility, allowing you to either create a completely new stack of resources or integrate with your existing infrastructure by providing the IDs of existing components. The following resources will be managed by the templates:

Networking (VPC, subnets, security groups) - Provides the network foundation, optimized for high performance.
S3 bucket for lifecycle scripts - Stores the lifecycle scripts needed to bootstrap the cluster nodes for both Slurm and EKS.
FSx for Lustre - A high-performance, shared file system for datasets and model checkpoints.
Amazon EKS Cluster - A managed Kubernetes service provided by Amazon Web Services (AWS).
Helm charts - Deploy the Kubernetes components that HyperPod requires (e.g., health monitoring agent, training and inference operators) onto the EKS cluster.
IAM role - Allows the HyperPod cluster to run and communicate with the necessary AWS resources on your behalf.
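
The "create new or reuse existing" choice described above can be sketched as a small helper that builds CloudFormation parameters. This is only an illustration: the `Create<Name>` and `<Name>Id` parameter names are hypothetical and are not taken from the templates in this repository, so check the template you deploy for its actual parameter keys. Only the `ParameterKey`/`ParameterValue` shape is the standard CloudFormation one.

```python
from typing import Dict, List, Optional


def resource_parameters(name: str, existing_id: Optional[str] = None) -> List[Dict[str, str]]:
    """Build CloudFormation parameters for one resource.

    Hypothetical convention: a "Create<Name>" flag plus a "<Name>Id" value.
    The real templates define their own parameter names.
    """
    if existing_id is None:
        # No ID supplied: ask the template to create the resource.
        return [{"ParameterKey": f"Create{name}", "ParameterValue": "true"}]
    # ID supplied: skip creation and pass the existing resource through.
    return [
        {"ParameterKey": f"Create{name}", "ParameterValue": "false"},
        {"ParameterKey": f"{name}Id", "ParameterValue": existing_id},
    ]


# Reuse an existing VPC (placeholder ID), but let the stack create the S3 bucket.
params = resource_parameters("Vpc", "vpc-0123456789abcdef0") + resource_parameters("S3Bucket")
```

Keeping the flag and the ID together in one helper avoids the easy mistake of passing an existing ID while still asking the stack to create a new resource.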

Configure resources and deploy using CloudFormation

You can configure resources and deploy using the CloudFormation templates for SageMaker HyperPod. Follow the steps in the AWS documentation to get started.
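
For readers who prefer to script the deployment, here is a minimal sketch using boto3's CloudFormation client. It is not the procedure from the AWS documentation. The stack name, template URL, and the `ResourceNamePrefix` parameter are placeholders invented for this example, and the capabilities list is an assumption based on the templates creating IAM roles.

```python
from typing import Any, Dict


def build_create_stack_request(
    stack_name: str, template_url: str, parameters: Dict[str, str]
) -> Dict[str, Any]:
    """Assemble keyword arguments for CloudFormation's create_stack call."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in sorted(parameters.items())
        ],
        # Assumption: the template creates IAM resources and may use nested stacks.
        "Capabilities": [
            "CAPABILITY_IAM",
            "CAPABILITY_NAMED_IAM",
            "CAPABILITY_AUTO_EXPAND",
        ],
    }


def deploy(request: Dict[str, Any], region: str) -> str:
    """Create the stack and block until creation completes; returns the stack ID."""
    import boto3  # imported lazily so the request builder works without AWS access

    cfn = boto3.client("cloudformation", region_name=region)
    stack_id = cfn.create_stack(**request)["StackId"]
    cfn.get_waiter("stack_create_complete").wait(StackName=request["StackName"])
    return stack_id


# Placeholder values; replace them with the template location and parameters you use.
request = build_create_stack_request(
    "hyperpod-demo",
    "https://amzn-s3-demo-bucket.s3.amazonaws.com/placeholder-template.yaml",
    {"ResourceNamePrefix": "demo"},
)
```

Calling `deploy(request, "us-west-2")` would submit the stack with whatever AWS credentials are configured. Building the request separately makes it possible to review or log it before anything is created.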

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

