Add DGXCloud Ray backend and improve DGXCloudExecutor workload management #480

Draft
rapaul-nv wants to merge 1 commit into NVIDIA-NeMo:main from rapaul-nv:rapaul/dgxcloud-runai-fix
rapaul-nv commented Apr 4, 2026

Summary

  • Add DGXCloud Ray backend (nemo_run/run/ray/dgxcloud.py): New DGXCloudRayCluster and DGXCloudRayJob classes that enable Ray orchestration on DGX Cloud via distributed workloads. Pods self-organise into a Ray head + worker topology using hostname-derived rank and a shared PVC for head-IP discovery.
  • Improve DGXCloudExecutor workload management (nemo_run/core/execution/dgxcloud.py): Migrate auth from app_token to client_credentials grant type; add generic _run_workspace_and_wait helper for polling workspace workloads; support large data transfers via chunked per-file fallback when the tarball exceeds the API arg limit; improve fetch_logs with terminal-state detection and non-cluster log polling; add deploy_script_to_pvc and largeShmRequest support.
  • Register DGXCloud backend in RayCluster and RayJob factory maps (run/ray/cluster.py, run/ray/job.py).
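
The head/worker self-organisation described in the first bullet could look roughly like the sketch below. It is an illustration only, not the code in this PR: the hostname pattern (a trailing integer such as `myjob-worker-2`), the `head_ip` file name, and all three function names are assumptions.

```python
import re
import time
from pathlib import Path


def rank_from_hostname(hostname: str) -> int:
    """Return the trailing integer of a hostname, e.g. 'myjob-worker-2' -> 2.

    The naming scheme is an assumption; the real pod hostnames may differ.
    """
    match = re.search(r"(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no trailing rank in hostname {hostname!r}")
    return int(match.group(1))


def publish_head_ip(shared_dir: Path, ip: str) -> Path:
    """Called by the head pod (assumed to be rank 0): record its IP on the shared volume."""
    shared_dir.mkdir(parents=True, exist_ok=True)
    tmp = shared_dir / "head_ip.tmp"
    final = shared_dir / "head_ip"
    tmp.write_text(ip)
    # Write-then-rename so readers do not see a half-written file on a
    # POSIX filesystem; a network-backed PVC may give weaker guarantees.
    tmp.replace(final)
    return final


def wait_for_head_ip(shared_dir: Path, timeout: float = 60.0, interval: float = 1.0) -> str:
    """Called by worker pods: poll the shared volume until the head IP appears."""
    path = shared_dir / "head_ip"
    deadline = time.monotonic() + timeout
    while True:
        if path.exists():
            ip = path.read_text().strip()
            if ip:
                return ip
        if time.monotonic() >= deadline:
            raise TimeoutError(f"head IP not published in {shared_dir} within {timeout}s")
        time.sleep(interval)
```

In this sketch a pod would compute its rank once at startup, publish its IP if it is the head, and otherwise block in `wait_for_head_ip` before joining the Ray cluster.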

Fixes: #478

Test plan

  • Verify auth token retrieval works with client_credentials grant type
  • Test single-node and multi-node training job creation on DGX Cloud
  • Test data movement with tarball under and over the MAX_ARGS_CHARS limit (chunked fallback)
  • Test Ray cluster bootstrap on DGX Cloud (head discovery, worker join)
  • Test fetch_logs for both cluster-launched and non-cluster-launched jobs
  • Verify deploy_script_to_pvc deploys and sets executable permissions correctly
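
For the polling and terminal-state checks in this plan, a generic wait loop can be exercised without a live cluster. The sketch below is a hypothetical stand-in, not `_run_workspace_and_wait` itself: the function name, the state names `Completed` and `Failed`, and the default timings are all assumptions.

```python
import time
from typing import Callable, Iterable


def wait_for_terminal(
    get_status: Callable[[], str],
    terminal_states: Iterable[str] = ("Completed", "Failed"),  # assumed state names
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal state, then return that state.

    Raises TimeoutError if no terminal state is seen before the deadline.
    """
    terminal = frozenset(terminal_states)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"workload still {status!r} after {timeout}s")
        sleep(interval)
```

Injecting `get_status` and `sleep` keeps the loop testable: a test can feed it a scripted sequence of states and a no-op sleep instead of calling the real API.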

- Renamed app_id and app_secret to client_id and client_secret for clarity.
- Introduced new methods for deleting workloads and checking workspace status.
- Enhanced data movement functionality to use a tarball when within character limits, falling back to individual file deployment otherwise.
- Updated RayCluster and RayJob to integrate DGXCloudExecutor and its corresponding classes.
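
The tarball-or-fallback decision in the data movement item can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: only the name `MAX_ARGS_CHARS` comes from the description, while the base64 encoding, the helper names, and the chunking scheme are hypothetical.

```python
import base64
import io
import tarfile


def encode_tarball(files: dict[str, bytes]) -> str:
    """Pack files into an in-memory .tar.gz and return it base64-encoded."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return base64.b64encode(buf.getvalue()).decode("ascii")


def plan_transfer(files: dict[str, bytes], max_args_chars: int) -> list[tuple[str, str, str]]:
    """Return (kind, file_name, payload) steps for moving files.

    One 'tarball' step if the encoded archive fits within max_args_chars,
    otherwise per-file 'file_chunk' steps that each fit within the limit.
    """
    tarball = encode_tarball(files)
    if len(tarball) <= max_args_chars:
        return [("tarball", "", tarball)]
    plan = []
    for name, data in sorted(files.items()):
        encoded = base64.b64encode(data).decode("ascii")
        # max(..., 1) ensures an empty file still produces one (empty) chunk.
        for start in range(0, max(len(encoded), 1), max_args_chars):
            plan.append(("file_chunk", name, encoded[start:start + max_args_chars]))
    return plan
```

In this scheme the receiver must concatenate all chunks of a file before base64-decoding, because an individual chunk is not guaranteed to be valid base64 on its own.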

Fixes NVIDIA-NeMo#478

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Development

Successfully merging this pull request may close these issues.

Run:ai executor is not working out of the box