@haardm haardm commented Jan 14, 2026

Fix for #359

This ensures the health-monitoring-agent can pull images even when deployed in unmapped or future regions (if the cluster nodes have internet access).

What's changing and why?

Add support for four newly GA regions in health-monitoring-agent:

  • ca-central-1 (YUL): ECR account 843976229209
  • ap-southeast-3 (CGK): ECR account 971422672635
  • ap-southeast-4 (MEL): ECR account 084375568333
  • eu-south-2 (ZAZ): ECR account 626887787726

This brings the total number of supported GA regions to 17.

Fix a fallback logic bug where unmapped regions produced malformed image URIs. Previously, when a region was not in the mapping, only the account ID fell back to the us-east-1 value while the region remained unchanged, resulting in invalid URIs such as:
767398015722.dkr.ecr.ap-southeast-3.amazonaws.com/hyperpod-health-monitoring-agent:1.0.1038.0_1.0.305.0

Now both the region AND the account ID fall back together to us-east-1, generating valid URIs:
767398015722.dkr.ecr.us-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.1038.0_1.0.305.0
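
The difference between the old and new fallback behavior can be sketched in Python. This is only an illustration, not the chart's actual template logic: the function names, the `REGION_ACCOUNTS` dictionary, and the placeholder region `xx-unmapped-1` are hypothetical, and the mapping lists only the account IDs that appear in this PR.

```python
# Illustrative sketch only; names are hypothetical and the mapping is partial
# (it contains just the account IDs quoted in this PR).
REGION_ACCOUNTS = {
    "us-east-1": "767398015722",
    "ca-central-1": "843976229209",
    "ap-southeast-3": "971422672635",
    "ap-southeast-4": "084375568333",
    "eu-south-2": "626887787726",
}
FALLBACK_REGION = "us-east-1"
REPOSITORY = "hyperpod-health-monitoring-agent"
TAG = "1.0.1038.0_1.0.305.0"


def image_uri_before(region: str) -> str:
    """Old behavior: only the account ID falls back; the region is kept as given."""
    account = REGION_ACCOUNTS.get(region, REGION_ACCOUNTS[FALLBACK_REGION])
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{REPOSITORY}:{TAG}"


def image_uri_after(region: str) -> str:
    """Fixed behavior: region and account ID fall back together."""
    if region not in REGION_ACCOUNTS:
        region = FALLBACK_REGION
    account = REGION_ACCOUNTS[region]
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{REPOSITORY}:{TAG}"


if __name__ == "__main__":
    # "xx-unmapped-1" is a made-up region used only to trigger the fallback.
    print(image_uri_before("xx-unmapped-1"))
    # 767398015722.dkr.ecr.xx-unmapped-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.1038.0_1.0.305.0
    print(image_uri_after("xx-unmapped-1"))
    # 767398015722.dkr.ecr.us-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.1038.0_1.0.305.0
```

In the old variant the us-east-1 account ID is paired with a registry host in a different region, which is why the pull fails; the fixed variant keeps the pair consistent.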

Before/After UX

Before:
For regions missing from the mapping, the malformed URI caused the ECR image pull to fail.

After:
For a region missing from the mapping, the URI is constructed correctly and falls back to the IAD (us-east-1) region. If the cluster nodes have internet access, the image pull succeeds.

How was this change tested?

1/ Created a cluster in ca-central-1, which is one of the new regions being added.
2/ Updated the helm chart dependencies after making the committed changes.

╭─haardm at dev-dsk-haardm-2b-2e2fb30a in /local/home/haardm/workplace/HyperPodEKSTests/src/HyperPodEKSTests/sagemaker-hyperpod-cli on main✘✘✘
╰─± helm dependencies update helm_chart/HyperPodHelmChart
Getting updates for unmanaged Helm repositories...
...Successfully got an update from the "https://nvidia.github.io/k8s-device-plugin" chart repository
...Successfully got an update from the "https://aws.github.io/eks-charts/" chart repository
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "gpu-helm-charts" chart repository
...Successfully got an update from the "nvidia" chart repository
...Successfully got an update from the "jetstack" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 17 charts
Downloading cert-manager from repo oci://quay.io/jetstack/charts
Pulled: quay.io/jetstack/charts/cert-manager:v1.18.2
Digest: sha256:6b3b94c6af27a03390f0ee55a63d2524027d2517bf2f23d46de08211800d1a80
Downloading nvidia-device-plugin from repo https://nvidia.github.io/k8s-device-plugin
Downloading aws-efa-k8s-device-plugin from repo https://aws.github.io/eks-charts/
Deleting outdated charts

3/ Installed the updated helm charts

╭─haardm at dev-dsk-haardm-2b-2e2fb30a in /local/home/haardm/workplace/HyperPodEKSTests/src/HyperPodEKSTests/sagemaker-hyperpod-cli on main✘✘✘
╰─± helm install dependencies helm_chart/HyperPodHelmChart --namespace kube-system --set health-monitoring-agent.region=ca-central-1
NAME: dependencies
LAST DEPLOYED: Wed Jan 14 21:36:14 2026
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

4/ Validated that the health-monitoring-agent daemonset constructed the image URI for the ca-central-1 region correctly.

╭─haardm at dev-dsk-haardm-2b-2e2fb30a in /local/home/haardm/workplace/HyperPodEKSTests/src/HyperPodEKSTests/sagemaker-hyperpod-cli on main✘✘✘
╰─± kubectl get daemonset health-monitoring-agent -n aws-hyperpod -o yaml | grep image:
        image: 843976229209.dkr.ecr.ca-central-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.1038.0_1.0.305.0

5/ Also validated the fallback case, where the install command does not provide a region explicitly and the image URI should therefore fall back to the IAD (us-east-1) region.

╭─haardm at dev-dsk-haardm-2b-2e2fb30a in /local/home/haardm/workplace/HyperPodEKSTests/src/HyperPodEKSTests/sagemaker-hyperpod-cli on main✘✘✘
╰─± helm install dependencies helm_chart/HyperPodHelmChart --namespace kube-system                                                  
NAME: dependencies
LAST DEPLOYED: Wed Jan 14 21:53:36 2026
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

╭─haardm at dev-dsk-haardm-2b-2e2fb30a in /local/home/haardm/workplace/HyperPodEKSTests/src/HyperPodEKSTests/sagemaker-hyperpod-cli on main✘✘✘
╰─± kubectl get daemonset health-monitoring-agent -n aws-hyperpod -o yaml | grep image:
        image: 767398015722.dkr.ecr.us-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.1038.0_1.0.305.0
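
The manual checks in steps 4 and 5 could also be expressed as a small script that verifies the account ID and region in an image URI belong together. The sketch below is a hypothetical helper, not part of this PR: `check_image_line`, `ECR_URI`, and `EXPECTED_ACCOUNTS` are made-up names, and the mapping covers only the account IDs quoted in this PR.

```python
import re

# Hypothetical helper for illustration; the mapping is partial and contains
# only the region/account pairs quoted in this PR.
EXPECTED_ACCOUNTS = {
    "us-east-1": "767398015722",
    "ca-central-1": "843976229209",
    "ap-southeast-3": "971422672635",
    "ap-southeast-4": "084375568333",
    "eu-south-2": "626887787726",
}

# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:]+):(?P<tag>.+)$"
)


def check_image_line(line: str) -> bool:
    """Return True if the URI's account ID matches the one expected for its region.

    Accepts either a bare URI or a line such as the `image: <uri>` output
    produced by the kubectl/grep command shown above.
    """
    uri = line.split("image:", 1)[-1].strip()
    match = ECR_URI.match(uri)
    if match is None:
        return False
    return EXPECTED_ACCOUNTS.get(match["region"]) == match["account"]


if __name__ == "__main__":
    tag = "hyperpod-health-monitoring-agent:1.0.1038.0_1.0.305.0"
    for line in (
        f"        image: 843976229209.dkr.ecr.ca-central-1.amazonaws.com/{tag}",
        f"        image: 767398015722.dkr.ecr.us-east-1.amazonaws.com/{tag}",
        # The malformed pre-fix URI: us-east-1 account paired with another region.
        f"767398015722.dkr.ecr.ap-southeast-3.amazonaws.com/{tag}",
    ):
        print(check_image_line(line), line.strip())
```

The first two lines pass and the pre-fix URI is rejected, because its account ID does not belong to the region in its hostname.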

Are unit tests added?

N/A

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@haardm haardm requested a review from a team as a code owner January 14, 2026 22:05
@haardm haardm deployed to manual-approval January 14, 2026 22:05 — with GitHub Actions Active

@emeraldbay emeraldbay left a comment

LGTM

@zhaoqizqwang zhaoqizqwang merged commit 700b5e1 into aws:main Jan 15, 2026
6 checks passed