Merge latest changes from main to 'Documentation' branch #192

rsareddy0329 · 2025-08-05T23:05:11Z

PR Approval Steps

For Requester

Description
- Check the PR title and description for clarity. It should describe the changes made and the reason behind them.
- Ensure that the PR follows the contribution guidelines, if applicable.
Security requirements
- Use git-secrets to verify that the pull request (PR) does not expose passwords or other sensitive information, and upload the relevant evidence: https://github.com/awslabs/git-secrets
- Ensure each commit has a GitHub commit signature.
Manual review
1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
  - Code Quality: Check for coding standards, naming conventions, and readability.
  - Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
  - Security: Check for any security issues or vulnerabilities.
  - Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
Check for Merge Conflicts:
- Verify if there are any merge conflicts with the base branch. GitHub will usually highlight this. If there are conflicts, you should resolve them.

For Reviewer

Go through the For Requester section and double-check each item.
Request Changes or Approve the PR:
1. If the PR is ready to be merged, click Review changes and select Approve.
2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
Merging the PR
1. Check the Merge Method:
  1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
2. Merge the PR:
  1. Click the Merge pull request button.
  2. Confirm the merge by clicking Confirm merge.

* Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * CLI: Enable Telemetry * CLI: Enable Telemetry --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

…8s (#138)
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Fix unit test cases
* Move regex to a constant.
  **Description**
  - Removed an integration test case that was being mocked.
  - Moved a regex to a constant.
  **Testing Done**
  Unit test cases pass. No changes were made to integration test cases, and they should not be affected.
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Add ref link for version compatibility constraints
  **Description**
  Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.
  **Testing Done**
  No breaking changes.
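
The version check described in this commit might look roughly like the sketch below. This is not the code from the PR: the constant and function names are hypothetical, and the rule of at most one minor version of skew between client and server is an assumption based on the published kubectl version skew policy, since the commit message does not state the exact constraint.

```python
import re

# Hypothetical constant name; the commit only says a regex was moved to a constant.
K8S_VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")

# Assumed rule: client and server may differ by at most one minor version.
MAX_MINOR_SKEW = 1


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.30.1-eks-1de2ab1'."""
    match = K8S_VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(client_version: str, server_version: str) -> bool:
    """Return True when the client is within the assumed skew of the server."""
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= MAX_MINOR_SKEW


if __name__ == "__main__":
    print(is_compatible("v1.29.3", "v1.30.1-eks-1de2ab1"))
    print(is_compatible("v1.27.0", "v1.30.1"))
```

Keeping the pattern in a module-level constant means it is compiled once and can be reused by the tests, which is presumably the motivation for the "Move regex to a constant" change.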

* Update documentation-with-new-changes branch with latest changes from main (#190)
  * Fix training test (#184)
    * Fix SDK training test: Add wait time before refresh
    * Fix training tests in canaries
  * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <pintaoz@amazon.com>
  --------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: pintaoz <pintaoz@amazon.com>
* Documentation Fixes (#191) Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* update documentation with new changes branch with latest changes (#194)
  * Fix training test (#184)
    * Fix SDK training test: Add wait time before refresh
    * Fix training tests in canaries
  * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <pintaoz@amazon.com>
  --------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: pintaoz <pintaoz@amazon.com>
* Documentation Fixes (#195)
  * Documentation Fixes
  * Documentation Fixes
  --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation Fixes (#197)
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation Fixes (#198)
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation fixes (#199)
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  * Documentation Fixes
  --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
--------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: pintaoz <pintaoz@amazon.com> Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

* set template-version flag to optional for cluster create, add support for efa for pytorch job, remove default request and limits when instance type is none
* fix gpu allocation validation error
* remove redundant
* fix unit test and expand logic to memory and vcpu field
* Follow up on merge conflict in release
* consolidate all debug flags to show Kubernetes exception
* Revert "Follow up on merge conflict in release" This reverts commit c816838.
* fix unit and integ test for space
* fix more unit test for space
* change dependency for delete in init integ test

* Upgrade Inference Operator Version (#327)
* pyproj version update (#328) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* version change (#329) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* elastic training to keynote3 (#307)
  * feat: Implement elastic training cli arguments (#273)
    * feat: Implement elastic training cli arguments
    * Add elastic training unified config and unit test
    * Add graceful shutdown and scaling timeout to cli args
  * Revert "feat: Implement elastic training cli arguments (#273)" This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.
  * feat: Implement elastic training cli arguments (#295)
    * feat: implement elastic training cli args
    * Rename args name to match crd for elastic training
    * Add unit test for replica discrete values
    * Add integ test for elastic training cli
  --------- Co-authored-by: Sophia <yungwenh@amazon.com> Co-authored-by: Molly He <mollyhe@amazon.com> Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* version update for v3.5.0
--------- Co-authored-by: Shantanu Tripathi <shantanutripathi237@gmail.com> Co-authored-by: Mohamed Zeidan <81834882+mohamedzeidan2021@users.noreply.github.com> Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com> Co-authored-by: Sophia <yungwenh@amazon.com>

* Update documentation for elastic training arguments * nit: Add detail descriptions for array type * Add efa support for training jobs * address comment and add unit test for efa support * fix: add efa check in quota allocation test * Modify efa arg name and fix gpu integ test

) Fix for #359

Add support for 4 newly GA regions in health-monitoring-agent:
- ca-central-1 (YUL): ECR account 843976229209
- ap-southeast-3 (CGK): ECR account 971422672635
- ap-southeast-4 (MEL): ECR account 084375568333
- eu-south-2 (ZAZ): ECR account 626887787726

This brings the total number of supported regions to 17 GA regions.

Fix a fallback logic bug where unmapped regions would produce malformed image URIs. Previously, when a region was not in the mapping, only the account ID fell back to us-east-1 while the region remained unchanged, resulting in invalid URIs such as:

767398015722.dkr.ecr.ap-southeast-3.amazonaws.com/hyperpod-health-monitoring-agent:1.0.1038.0_1.0.305.0

Now both the region and the account ID fall back together to us-east-1, generating valid URIs:

767398015722.dkr.ecr.us-east-1.amazonaws.com/...

This ensures the health-monitoring-agent can pull images even when deployed in unmapped or future regions (if the cluster nodes have internet access).
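
The corrected fallback can be illustrated with a small Python sketch. It is only an illustration of the logic described above, not the code changed in the PR (which may well live in a template rather than in Python). The function and constant names are hypothetical, the mapping lists only the regions named in this commit message, and the us-east-1 account ID is inferred from the example URIs.

```python
# Partial region-to-account mapping, limited to values that appear in the
# commit message. The us-east-1 entry is inferred from the example URIs.
HMA_ECR_ACCOUNTS = {
    "us-east-1": "767398015722",
    "ca-central-1": "843976229209",
    "ap-southeast-3": "971422672635",
    "ap-southeast-4": "084375568333",
    "eu-south-2": "626887787726",
}

FALLBACK_REGION = "us-east-1"
REPOSITORY = "hyperpod-health-monitoring-agent"


def hma_image_uri(region: str, tag: str) -> str:
    """Build the image URI, falling back on region and account together."""
    if region not in HMA_ECR_ACCOUNTS:
        # The fix: replace the region as well, so that the account ID and
        # the registry host always belong to the same region.
        region = FALLBACK_REGION
    account = HMA_ECR_ACCOUNTS[region]
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{REPOSITORY}:{tag}"


if __name__ == "__main__":
    tag = "1.0.1038.0_1.0.305.0"
    print(hma_image_uri("eu-south-2", tag))
    print(hma_image_uri("xx-unmapped-1", tag))  # hypothetical unmapped region
```

Resolving the region first and then looking up the account from that same key makes it impossible to combine one region's account with another region's hostname, which is the mismatch the bug produced.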

…with minor improvements and bug fixes. (#361) * New Feature Adds more enhanced Nvidia Timeout analysis * Enhanced health reporting and job execution stability * Fix bugs in cluster health status reporting issues * Optimized error detection to reduce noise and focus on critical issues

Aditi2424 and others added 25 commits July 18, 2025 12:24

Update telemetry status to be Integer for parity (#130)

223af40

Co-authored-by: adishaa <adishaa@amazon.com>

Release new version for Health Monitoring Agent (1.0.643.0_1.0.192.0)…

cf77296

… with minor improvements and bug fixes (#137)

Release new version for Health Monitoring Agent (1.0.674.0_1.0.199.0)…

0342f60

… with minor improvements and bug fixes. (#139)

update inference CLI describe command print for better visualization …

631ddf9

…and ux (#136)

Update inference integ test to add dependency to improve telemetry ex…

dc440c3

…ception count data (#140)

Manual release v3.0.1 (#143)

cc08405

* manual release v3.0.1

change security-monitoring metrics data destination to us-east-2 for …

079fafd

…alarm fix (#147)

feat: Add region detection to install Health Monitoring Agent and use…

29a16c5

… regionalized HMA URI (#141)

Add unique time string to integ test (#150)

66232ed

* Add unique time string to integ test * Update syntax

update example notebook for inference CLI (#151)

9fbec4a

Training: Main documentation update (#153)

8034a24

* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

Update inference SDK examples (#155)

0bcee6d

* Update inference SDK examples * Update readme

update help text to avoid truncation (#158)

d2130e9

Add an option to disable the deployment of KubeFlow TrainingOperator (#…

293f9b9

…102)

Remove unused param from documentation (#170)

9f534b4

Update volume flag to support hostPath and pvc (#171)

ec8800d

* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed

Restructure list-cluster output (#173)

95e073e

Co-authored-by: pintaoz <pintaoz@amazon.com>

Update inference config and integ tests (#167)

a8a2baf

* Update inference config and integ tests * Update integ tests for new canaries

Update readme for volume flag (#176)

2908a62

Manual release v3.0.2 (#177)

9b7220c

* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <pintaoz@amazon.com>

Add schema pattern check to pytorch-job template (#178)

36fac66

* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally

Fix training test (#184)

dcbc8fb

* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries

Update logging information for submitting and deleting training job (#…

28424e4

…189) Co-authored-by: pintaoz <pintaoz@amazon.com>

rsareddy0329 requested a review from a team as a code owner August 5, 2025 23:05

rsareddy0329 and others added 4 commits August 6, 2025 13:51

Added new column 'deployment configs' to the itable that allows user'…

6553766

…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

Add instance type support for ml.p6e-gb200.36xlarge (#204)

63ff3b4

* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin

changed endpoint name from value user has to manually insert to place…

e3f697a

…holder value (#206) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

Mohamed Zeidan and others added 13 commits November 21, 2025 00:52

custom template pr

4b22625

Merge branch 'main' of https://github.com/aws/sagemaker-hyperpod-cli

73b71bc

pyproj version update (#328)

e0f6732

Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

version change (#329)

0eba08f

Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

Documentation update for init experience (#271) (#331)

75933a6

* update documentation to add init experience for all templates * minor update * add example notebooks to documentation, add delete SDK command to readme, update init experience documentation flow

Update to include additional permissions needed by the inference oper…

36140e3

…ator (#332)

integration test for jumpstart with mig profile (#334)

3b93c33

* integration test for jumpstart with mig profile * template fix for mig with jumpstart * skipped mig tests until instances setup finished --------- Co-authored-by: Chad Chiang <chadchc@amazon.com>

Patch fix release v1.1.1 for template 1.1 issue (#336)

530792a

Enable the integration tests for MIG (#337)

1aafd60

* integration test for jumpstart with mig profile * template fix for mig with jumpstart * skipped mig tests until instances setup finished * enable the mig integration tests --------- Co-authored-by: Chad Chiang <chadchc@amazon.com>

Update documentation for elastic training arguments (#343)

f89b989

* Update documentation for elastic training arguments * nit: Add detail descriptions for array type

Upgrade Inference Operator helm chart (#346)

170bf15

mollyheamazon had a problem deploying to manual-approval December 11, 2025 22:47 — with GitHub Actions Error

Remove command flag from init pytorch job integ test: (#351)

a824151

mollyheamazon had a problem deploying to manual-approval December 13, 2025 07:54 — with GitHub Actions Error

pintaoz-aws had a problem deploying to manual-approval December 18, 2025 18:16 — with GitHub Actions Error

Skipping expensive integ tests (#355)

7095543

* Skipping expensive integ tests * Properly skipping all tests

aviruthen had a problem deploying to manual-approval December 19, 2025 20:27 — with GitHub Actions Error

Add cleanup in test_invalid_no_node_count_or_quota_parameter() (#356)

c6b408c

Co-authored-by: pintaoz <pintaoz@amazon.com>

pintaoz-aws had a problem deploying to manual-approval December 20, 2025 00:33 — with GitHub Actions Error

Add end-to-end example (#350)

a4551aa

nicolasj92 had a problem deploying to manual-approval January 8, 2026 06:09 — with GitHub Actions Error

Changes to default MIG config (#358)

c6b2eed

* Add GPU operator MIG support with NVIDIA license notice * Add regional values for ap-northeast-2 and ca-central-1 GPU operator MIG support * Update MIG config for GPU operator --------- Co-authored-by: Sean Archer <searche@amazon.com>

zhaoqizqwang had a problem deploying to manual-approval January 15, 2026 16:40 — with GitHub Actions Error

zhaoqizqwang had a problem deploying to manual-approval January 15, 2026 19:11 — with GitHub Actions Error

zhaoqizqwang requested a deployment to manual-approval January 15, 2026 23:53 — with GitHub Actions Waiting

20 participants
