Closed
28 commits
78ae6c9
feat: make --num-slices and --num-cubes optional for reservations
jamOne- Mar 3, 2026
a967251
Make --num-slices optional when using reservation
jamOne- Mar 4, 2026
1da1274
fix: avoid double calculation of assess_available_slices for GPU when…
jamOne- Mar 5, 2026
cd97fa5
Revert changes to feature_flags.py and workload.py
jamOne- Mar 6, 2026
e795644
Remove setting of num_cubes when assessing available slices
jamOne- Mar 6, 2026
c8da4b4
Fix tests and apply refactoring for cluster capacity and defaulting
jamOne- Mar 9, 2026
5824841
Refactor cluster capacity default logic per requirements
jamOne- Mar 10, 2026
b5bbb38
chore: add comments explaining vms_per_slice=1 for GPUs
jamOne- Mar 19, 2026
96f9f0d
Merge main into optional-num-slices, resolving conflicts
jamOne- Mar 19, 2026
733e480
chore: regenerate goldens and fix types
jamOne- Mar 19, 2026
3fed438
docs: add PyDoc to _set_cluster_topology_defaults
jamOne- Mar 19, 2026
482cf05
refactor: simplify _set_cluster_topology_defaults with helper functions
jamOne- Mar 19, 2026
6502a36
fix: actually gate auto-capacity behind OPTIONAL_NUM_SLICES flag
jamOne- Mar 20, 2026
d157ffb
fix: rename function and address test failures by ignoring feature fl…
jamOne- Mar 20, 2026
6204435
Remove conftest.py
jamOne- Mar 20, 2026
2bbe146
Update golden files to reflect optional num slices change
jamOne- Mar 20, 2026
6404f9e
fix: resolve empty reservation regression and replace feature flag
jamOne- Mar 20, 2026
3e6c629
fix: change dry_run_json reservation count to 1 to reduce golden reci…
jamOne- Mar 23, 2026
34545db
refactor: simplify conditions and remove redundant checks in _assess_…
jamOne- Mar 23, 2026
7ab3b3e
Fix TypeError and add validation for optional num-slices/num-nodes (#…
jamOne- Mar 23, 2026
05f66ba
Address reviewer feedback: move validation, fix sorting, simplify for…
jamOne- Mar 25, 2026
528751d
Support num-slices/num-nodes inference for cluster adapt
jamOne- Mar 26, 2026
d213c1b
fix(capacity): allow zero capacity during slice assessment
jamOne- Mar 26, 2026
e62f3e1
revert: Revert changes to reservation.py and system_characteristics.p…
jamOne- Mar 26, 2026
afb6598
test(recipes): restore num-slices omissions for non-reservation tests
jamOne- Mar 26, 2026
ca836a7
test(recipes): update gb200-4 recipe to use 2 nodes
jamOne- Mar 26, 2026
399c1a9
test(recipes): update gb200-4 recipe to use --spot instead of explici…
jamOne- Mar 27, 2026
15ec0db
refactor(cluster_adapt): move topology inference to private function …
jamOne- Mar 27, 2026
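
The commit titles above describe inferring the slice count from reservation capacity when `--num-slices` is omitted. A minimal sketch of that idea follows. It is not the PR's implementation: `resolve_num_slices`, its parameters, the fallback of one slice when no reservation is given, and the error on zero capacity are all assumptions made for illustration. Only the name `assess_available_slices` and the `vms_per_slice` field appear in the source.

```python
from typing import Optional


def assess_available_slices(available_vms: int, vms_per_slice: int) -> int:
    """Return how many whole slices fit in the unused reservation capacity.

    Zero is a valid result here; the caller decides whether it is an error.
    The signature is hypothetical, only the function name comes from the PR.
    """
    if vms_per_slice < 1:
        raise ValueError("vms_per_slice must be at least 1")
    if available_vms < 0:
        raise ValueError("available_vms must not be negative")
    return available_vms // vms_per_slice


def resolve_num_slices(
    requested: Optional[int],
    available_vms: Optional[int],
    vms_per_slice: int,
) -> int:
    """Pick a slice count (hypothetical helper, not part of xpk)."""
    # An explicit --num-slices value always wins.
    if requested is not None:
        if requested < 1:
            raise ValueError("--num-slices must be at least 1")
        return requested
    # No reservation to inspect: assumed fallback of a single slice.
    if available_vms is None:
        return 1
    # Otherwise use every whole slice the reservation can still hold.
    slices = assess_available_slices(available_vms, vms_per_slice)
    if slices == 0:
        raise ValueError("reservation has no capacity for a full slice")
    return slices


if __name__ == "__main__":
    # tpu7x-16 has vms_per_slice=2 in the goldens; 5 free VMs is made up.
    print(resolve_num_slices(None, 5, 2))
```

Keeping the capacity assessment separate from the defaulting decision lets the assessment report zero without failing, which leaves the choice of how to handle an exhausted reservation to the caller.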
4 changes: 1 addition & 3 deletions recipes/Basic_cluster_create.md
@@ -45,11 +45,9 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
[XPK] Creating 1 node pool or pools of tpu7x-8
We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
[XPK] Creating 1 node pool or pools of tpu7x-8
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
9 changes: 4 additions & 5 deletions recipes/Cluster_create_RayCluster.md
@@ -3,15 +3,16 @@ Creates a GKE cluster optimized for Ray workloads, installing KubeRay component.

# Running the command
```shell #golden
xpk cluster create-ray --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --ray-version=2.39.0 --reservation=golden-reservation
xpk cluster create-ray --num-slices=1 --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --ray-version=2.39.0 --reservation=golden-reservation
```
<!--
$ xpk cluster create-ray --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --ray-version=2.39.0 --reservation=golden-reservation
$ xpk cluster create-ray --num-slices=1 --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --ray-version=2.39.0 --reservation=golden-reservation
[XPK] Starting xpk v0.0.0
[XPK] Starting cluster create for cluster golden-cluster:
[XPK] Working on golden-project and us-central1-a
[XPK] Task: `Get reservation golden-reservation` is implemented by the following command not running since it is a dry run.
gcloud beta compute reservations describe golden-reservation --project=golden-project --zone=us-central1-a --format="json(specificReservation,aggregateReservation,status,deploymentType,resourcePolicies)"
[XPK] Assessing reservation capacity...
[XPK] Task: `Determine server supported GKE versions for default gke version` is implemented by the following command not running since it is a dry run.
gcloud container get-server-config --project=golden-project --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.defaultVersion)"
[XPK] Task: `Determine server supported GKE versions for valid versions` is implemented by the following command not running since it is a dry run.
@@ -47,11 +48,9 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
[XPK] Creating 1 node pool or pools of tpu7x-8
We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
[XPK] Creating 1 node pool or pools of tpu7x-8
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
4 changes: 1 addition & 3 deletions recipes/Cluster_create_for_multi-host_nodepool.md
@@ -45,11 +45,9 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
[XPK] Creating 1 node pool or pools of tpu7x-16
We assume that the underlying system is: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=True, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=True, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
[XPK] Creating 1 node pool or pools of tpu7x-16
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=True, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
4 changes: 1 addition & 3 deletions recipes/Cluster_create_for_single-host_nodepool.md
@@ -45,11 +45,9 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
[XPK] Creating 1 node pool or pools of v4-8
We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v4-podslice', gce_machine_type='ct4p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v4-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv4')
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v4-podslice', gce_machine_type='ct4p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v4-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv4')
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
[XPK] Creating 1 node pool or pools of v4-8
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v4-podslice', gce_machine_type='ct4p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v4-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv4')
[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
9 changes: 4 additions & 5 deletions recipes/Cluster_create_private.md
@@ -10,10 +10,11 @@ $ xpk cluster create-pathways --project=golden-project --zone=us-central1-a --cl
[XPK] Starting xpk v0.0.0
[XPK] Starting cluster create for cluster golden-cluster-private:
[XPK] Working on golden-project and us-central1-a
[XPK] Task: `Retrieve available pathways machine types` is implemented by the following command not running since it is a dry run.
gcloud compute machine-types list --filter "guestCpus >= 49 AND memoryMb >= 238592 AND zone = 'us-central1-a'" --format="value(name)" --project=golden-project
[XPK] Task: `Get reservation golden-reservation` is implemented by the following command not running since it is a dry run.
gcloud beta compute reservations describe golden-reservation --project=golden-project --zone=us-central1-a --format="json(specificReservation,aggregateReservation,status,deploymentType,resourcePolicies)"
[XPK] Assessing reservation capacity...
[XPK] Task: `Retrieve available pathways machine types` is implemented by the following command not running since it is a dry run.
gcloud compute machine-types list --filter "guestCpus >= 49 AND memoryMb >= 238592 AND zone = 'us-central1-a'" --format="value(name)" --project=golden-project
[XPK] Task: `Determine server supported GKE versions for default gke version` is implemented by the following command not running since it is a dry run.
gcloud container get-server-config --project=golden-project --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.defaultVersion)"
[XPK] Task: `Determine server supported GKE versions for valid versions` is implemented by the following command not running since it is a dry run.
@@ -51,11 +52,9 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
gcloud beta container clusters describe golden-cluster-private --location us-central1 --project golden-project --format="value(currentMasterVersion)"
[XPK] Creating 1 node pool or pools of v5p-8
We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v5p-slice', gce_machine_type='ct5p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v5p-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv5')
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v5p-slice', gce_machine_type='ct5p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v5p-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv5')
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster-private --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
[XPK] Creating 1 node pool or pools of v5p-8
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v5p-slice', gce_machine_type='ct5p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v5p-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv5')
[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools describe 0 --cluster golden-cluster-private --project=golden-project --location=us-central1 --format="value(locations)"
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.