
Add MIG profile support for ml.p6-b300.48xlarge (Blackwell Ultra) #398

Open
KeitaW wants to merge 1 commit into aws:main from KeitaW:feat/add-p6-b300-mig-profiles


KeitaW commented Mar 27, 2026

Summary

  • Add ml.p6-b300.48xlarge to INSTANCE_TYPE_MIG_PROFILES in constants.py with the B300 MIG profiles: mig-1g.34gb, mig-1g.67gb, mig-2g.67gb, mig-3g.135gb, mig-4g.135gb, mig-7g.269gb
  • Add 17 B300-specific MIG partition profiles (7 uniform + 10 mixed) to the Helm chart default-mig-config.yaml ConfigMap
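
As a rough illustration of the first change, the new entry might look like the sketch below. The six profile names come from this description; the shape of the mapping (instance type to a list of partition type strings) and the helper list name `B300_MIG_PROFILES` are assumptions, and the real `constants.py` may be structured differently.

```python
# Sketch only: the actual constants.py may organize this mapping differently.
# B300_MIG_PROFILES is a hypothetical helper name used for illustration.
B300_MIG_PROFILES = [
    "mig-1g.34gb",
    "mig-1g.67gb",
    "mig-2g.67gb",
    "mig-3g.135gb",
    "mig-4g.135gb",
    "mig-7g.269gb",
]

# Assumed shape: instance type -> list of supported MIG partition types.
INSTANCE_TYPE_MIG_PROFILES = {
    "ml.p6-b300.48xlarge": B300_MIG_PROFILES,
}
```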

Relationship to #396

PR #396 ("Added profiles for B300") was merged on 2026-03-23 and added 2 ConfigMap profiles (all-1g.67gb and mixed-2-1g.34gb-1-2g.67gb-1-3g.135gb). However, it left two critical gaps:

1. constants.py was not updated — MIG requests on B300 are rejected before the ConfigMap is ever consulted.

_validate_accelerator_partition_parameters() in accelerator_partition_util.py checks INSTANCE_TYPE_MIG_PROFILES at line 26 as a gate. Because ml.p6-b300.48xlarge is absent from that dict, the CLI returns:

"Instance type 'ml.p6-b300.48xlarge' does not support accelerator partitions."

This blocks all MIG usage on B300 — HyperPodPyTorchJob submissions, inference endpoints with acceleratorPartitionType, and hyp list-accelerator-partition-type. The ConfigMap profiles from #396 are unreachable.
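
The gate behaviour described above can be sketched as follows. This is not the code in `accelerator_partition_util.py`: the function name `check_partition_gate`, its signature, and the second error message are hypothetical. Only the first error message is quoted from the CLI output above.

```python
from typing import Dict, List, Optional


def check_partition_gate(
    instance_type: str,
    partition_type: str,
    profiles: Dict[str, List[str]],
) -> Optional[str]:
    """Return an error message, or None if the request passes the gate.

    Hypothetical stand-in for the real validation function.
    """
    # Gate: an instance type absent from the mapping is rejected outright,
    # before any ConfigMap profile could be consulted.
    if instance_type not in profiles:
        return (
            f"Instance type '{instance_type}' does not support "
            "accelerator partitions."
        )
    # Hypothetical wording for the per-profile check.
    if partition_type not in profiles[instance_type]:
        return f"Partition type '{partition_type}' is not valid for '{instance_type}'."
    return None


# Mapping without a B300 entry (the situation after #396).
before = {}
# Mapping with a B300 entry (abbreviated to two profiles for the example).
after = {"ml.p6-b300.48xlarge": ["mig-1g.34gb", "mig-1g.67gb"]}

error_before = check_partition_gate("ml.p6-b300.48xlarge", "mig-1g.34gb", before)
error_after = check_partition_gate("ml.p6-b300.48xlarge", "mig-1g.34gb", after)
```

With the empty mapping the call returns the rejection message; once the instance type is present it returns `None`.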

2. 15 of 17 ConfigMap profiles are missing.

Cross-referencing the profiles from #396 against the NVIDIA GPU Operator v25.3.0 upstream ConfigMap (B300 section, device-filter 0x318210DE) and the NVIDIA MIG product page (Blackwell Ultra: 7x34GB, 4x69GB, 2x139GB, 1x279GB) gives the following coverage:

| Profile | Upstream | After #396 | This PR |
| --- | --- | --- | --- |
| all-1g.34gb (x7) | Yes | Missing | Added |
| all-1g.67gb (x4) | Yes | Added | |
| all-2g.67gb (x3) | Yes | Missing | Added |
| all-3g.135gb (x2) | Yes | Missing | Added |
| all-4g.135gb (x1) | Yes | Missing | Added |
| all-7g.269gb (x1) | Yes | Missing | Added |
| 10 mixed profiles | Yes | 1 of 10 | +9 added |
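
For orientation, the uniform profiles in the table can be expressed as data. The sketch below builds them as Python dicts using the per-GPU counts from the table and the entry layout of the upstream mig-parted configuration (`devices`, `mig-enabled`, `mig-devices`). The function name `uniform_mig_configs` is hypothetical, and the entries in this repository's `default-mig-config.yaml` may carry additional fields such as a device filter.

```python
from typing import Dict, List

# Per-GPU partition counts for the uniform B300 profiles, from the table.
UNIFORM_B300_COUNTS: Dict[str, int] = {
    "1g.34gb": 7,
    "1g.67gb": 4,
    "2g.67gb": 3,
    "3g.135gb": 2,
    "4g.135gb": 1,
    "7g.269gb": 1,
}


def uniform_mig_configs(counts: Dict[str, int]) -> Dict[str, List[dict]]:
    """Build one 'all-<profile>' entry per uniform profile.

    Hypothetical helper; the layout is an assumption based on the
    upstream mig-parted config format.
    """
    return {
        f"all-{profile}": [
            {
                "devices": "all",
                "mig-enabled": True,
                "mig-devices": {profile: count},
            }
        ]
        for profile, count in counts.items()
    }


configs = uniform_mig_configs(UNIFORM_B300_COUNTS)
```

Here `configs["all-1g.34gb"]` holds a single entry that enables MIG on all devices with seven `1g.34gb` partitions each.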

Profile Source

MIG profiles are derived from the NVIDIA GPU Operator upstream ConfigMap (v25.3.0), which defines the B300 profiles under its # B300 comment section, with the all-balanced entry using device-filter 0x318210DE. The NVIDIA MIG User Guide (r580) has not yet been updated for B300.

Additional Note

The existing p6-b200.48xlarge key in INSTANCE_TYPE_MIG_PROFILES is missing the ml. prefix that all other entries have. To keep its scope focused, this PR does not address that issue, but it may warrant a separate fix.
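
A small check like the one below could surface such keys. It is a sketch only: `keys_missing_ml_prefix` is a hypothetical helper, and the sample mapping is a stand-in with empty profile lists rather than the real contents of `INSTANCE_TYPE_MIG_PROFILES`.

```python
from typing import Dict, List


def keys_missing_ml_prefix(mapping: Dict[str, List[str]]) -> List[str]:
    """Return the instance-type keys that do not start with 'ml.'."""
    return sorted(key for key in mapping if not key.startswith("ml."))


# Stand-in data for illustration; profile lists are left empty on purpose.
sample_mapping = {
    "ml.p6-b300.48xlarge": [],
    "p6-b200.48xlarge": [],
}

missing = keys_missing_ml_prefix(sample_mapping)
```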

Test plan

  • Verify INSTANCE_TYPE_MIG_PROFILES['ml.p6-b300.48xlarge'] returns the correct 6 profiles
  • Verify ALLOWED_ACCELERATOR_PARTITION_TYPES includes all B300 MIG types (mig-1g.34gb, mig-1g.67gb, mig-2g.67gb, mig-3g.135gb, mig-4g.135gb, mig-7g.269gb)
  • Verify default-mig-config.yaml parses as valid YAML
  • Verify _validate_accelerator_partition("mig-1g.34gb", ..., "ml.p6-b300.48xlarge") passes validation
  • Integration test: deploy a MIG-enabled instance group with ml.p6-b300.48xlarge and nvidia.com/mig.config: all-1g.34gb

KeitaW requested a review from a team as a code owner March 27, 2026 12:21
Add ml.p6-b300.48xlarge to INSTANCE_TYPE_MIG_PROFILES in constants.py
with the correct B300 MIG profiles derived from the NVIDIA GPU Operator
v25.3.0 upstream ConfigMap (device-filter 0x318210DE):

- mig-1g.34gb, mig-1g.67gb, mig-2g.67gb
- mig-3g.135gb, mig-4g.135gb, mig-7g.269gb

Also add the corresponding uniform and mixed MIG partition profiles
to the Helm chart default-mig-config.yaml ConfigMap, following the
same pattern used for existing GPU types (H100, H200, B200).

The B300 GPU (288GB HBM3e, ~269GB usable) was already registered in
INSTANCE_RESOURCES but had no MIG profile mapping, causing HyperPod
MIG validation to reject accelerator partition requests on this
instance type.
KeitaW force-pushed the feat/add-p6-b300-mig-profiles branch from 045470a to c98fd6e Compare March 28, 2026 00:06
KeitaW added a commit to KeitaW/sagemaker-hyperpod-cli that referenced this pull request Mar 28, 2026
Covers ml.p6-b300.48xlarge MIG profile support added in PR aws#398:
- Profile presence in INSTANCE_TYPE_MIG_PROFILES
- Complete profile list verification (6 profiles)
- All profiles in ALLOWED_ACCELERATOR_PARTITION_TYPES
- GPU slice extraction for all B300 profiles (1g→1, 2g→2, ..., 7g→7)
- CPU/memory default calculation for each profile at max instances
- Validation acceptance for valid B300 profiles
- Validation rejection for invalid profiles on B300 instance type
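
The GPU slice extraction covered by these tests (1g→1, 2g→2, ..., 7g→7) can be sketched as below. This is an illustrative reimplementation, not the CLI's own code: the function name `gpu_slices` and the regular expression are assumptions based on the `mig-<N>g.<M>gb` naming used throughout this PR.

```python
import re

# Assumed profile naming: mig-<slices>g.<memory>gb, e.g. mig-3g.135gb.
_PROFILE_RE = re.compile(r"^mig-(\d+)g\.(\d+)gb$")


def gpu_slices(profile: str) -> int:
    """Return the number of GPU slices encoded in a MIG profile name.

    Hypothetical helper for illustration.
    """
    match = _PROFILE_RE.match(profile)
    if match is None:
        raise ValueError(f"Unrecognized MIG profile name: {profile!r}")
    return int(match.group(1))


B300_PROFILES = [
    "mig-1g.34gb",
    "mig-1g.67gb",
    "mig-2g.67gb",
    "mig-3g.135gb",
    "mig-4g.135gb",
    "mig-7g.269gb",
]

slices = {profile: gpu_slices(profile) for profile in B300_PROFILES}
```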