
cp: fix(kubeflow): tail only first and last node logs (538) into r0.10.0 #544

Merged
ko3n1g merged 1 commit into r0.10.0 from cherry-pick-538-r0.10.0
Jun 4, 2026
Conversation

@svcnvidia-nemo-ci

Contributor

beep boop [🤖]: Hi @ko3n1g 👋,

we've cherry picked #538 into  for you! 🚀

Please review and approve this cherry pick at your convenience!

* fix(kubeflow): stream only rank 0 + last rank, write all ranks to disk

KubeflowExecutor.fetch_logs followed every replica and forwarded all ranks to
the caller, so at scale the aggregate output overran CI/runner job-log size
limits (a 16-node x 8-GPU run exceeded GitLab's 128MB cap). Now it still tails
every rank (kubectl logs -l <jobset> --prefix --max-log-requests num_nodes) and
writes the complete multi-rank output to <job_dir>/log-allranks_0.out, but
forwards only global rank 0 (node 0, [default0]) and the last global rank
(node num_nodes-1, [default nproc_per_node-1]) to stdout. Downstream log
validation that globs log*.out still sees every rank via the on-disk file.

Signed-off-by: oliver könig <okoenig@nvidia.com>
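
A minimal sketch of the tee-and-filter idea described above: every line goes to the all-ranks file, and only (first pod, local rank 0) and (last pod, local rank nproc_per_node-1) are forwarded. The line shape `[pod/<pod>/<container>] [default<N>]:<message>`, the pod names, and the helper names `parse_line` and `tee_and_filter` are assumptions for illustration, not the executor's actual code.

```python
import io
import re
from typing import Iterable, Iterator, Optional, TextIO, Tuple

# Assumed line shape: "[pod/<pod>/<container>] [default<local_rank>]:<message>"
_LINE_RE = re.compile(
    r"^\[pod/(?P<pod>[^/\]]+)/[^\]]+\]\s*\[default(?P<local>\d+)\]:?\s?(?P<msg>.*)$"
)


def parse_line(line: str) -> Optional[Tuple[str, int, str]]:
    """Return (pod, local_rank, message), or None if the line has no rank marker."""
    m = _LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    return m.group("pod"), int(m.group("local")), m.group("msg")


def tee_and_filter(
    lines: Iterable[str],
    all_ranks_file: TextIO,
    first_pod: str,
    last_pod: str,
    nproc_per_node: int,
) -> Iterator[str]:
    """Write every line to all_ranks_file; yield only the two forwarded ranks."""
    for line in lines:
        all_ranks_file.write(line if line.endswith("\n") else line + "\n")
        parsed = parse_line(line)
        if parsed is None:
            continue
        pod, local, _ = parsed
        is_first = pod == first_pod and local == 0
        is_last = pod == last_pod and local == nproc_per_node - 1
        if is_first or is_last:
            yield line
```

Unmarked lines are still written to disk in this sketch, so a downstream glob over log*.out would see them even though they never reach stdout.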

* fix(kubeflow): resolve last pod via completion-index label + full-history streaming

fetch_logs identified the last global rank's pod by parsing the pod name and
tailed only the last `--tail <lines>` window, so on (re)attach the last rank's
mid-run canonical "iteration | lm loss | ..." line (print_rank_last) was
dropped — on K8s the job log showed only rank 0's "Step Time" line.

Resolve the first/last pod from the authoritative
batch.kubernetes.io/job-completion-index label (== torchrun PET_NODE_RANK),
mapped from the --prefix pod name and refreshed on every (re)connect (gang
restarts spawn new pod names), and stream each pod's full history (--tail=-1)
so no mid-run line is missed. All ranks are still written to
log-allranks_0.out; only global rank 0 and the true last global rank are
forwarded to stdout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix(kubeflow): forward by global rank (node_rank*nproc+local), not pod heuristic

Kubeflow Trainer sets torchrun PET_NODE_RANK statically from the JobSet
batch.kubernetes.io/job-completion-index, so global_rank = completion_index *
nproc_per_node + local_rank. Compute that explicitly and forward only global
rank 0 and world_size-1 to stdout (all ranks still go to log-allranks_0.out).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
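
The rank arithmetic in this commit can be written out as a small sketch. It holds only under the commit's stated assumption that the node rank equals the completion index; the function names are illustrative.

```python
from typing import Set


def global_rank(completion_index: int, nproc_per_node: int, local_rank: int) -> int:
    """global_rank = completion_index * nproc_per_node + local_rank."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    if completion_index < 0:
        raise ValueError("completion_index must be non-negative")
    return completion_index * nproc_per_node + local_rank


def forwarded_ranks(num_nodes: int, nproc_per_node: int) -> Set[int]:
    """The two global ranks sent to stdout: rank 0 and world_size - 1."""
    world_size = num_nodes * nproc_per_node
    return {0, world_size - 1}
```

For a 16-node x 8-GPU run, the last forwarded rank is `global_rank(15, 8, 7)`, which equals `world_size - 1 = 127`.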

* fix(kubeflow): make TrainJob launch idempotent on 409 conflict

When a TrainJob with the target name already exists, launch() raised and aborted. On CI the name is derived from the experiment id (commit SHA), so a 409 is a stale leftover from a prior attempt the launcher declared FAILED after a slow pod start. That blocked setup_experiment's 'attempt N of M' retry — every retry re-collided. Now launch() deletes the stale job (cancel(wait=True)) and recreates, so the retry can actually recover.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
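
The delete-and-recreate behaviour can be sketched against a stand-in API. `FakeTrainJobApi` and `ConflictError` are hypothetical placeholders for the real Kubernetes client and its 409 exception, and `delete` stands in for `cancel(wait=True)`.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for an HTTP 409 from the API server."""


class FakeTrainJobApi:
    """In-memory stand-in for the TrainJob API, for illustration only."""

    def __init__(self) -> None:
        self.jobs = set()
        self.deleted = []

    def create(self, name: str) -> None:
        if name in self.jobs:
            raise ConflictError(name)
        self.jobs.add(name)

    def delete(self, name: str) -> None:
        # Stands in for cancel(wait=True): returns once the job is gone.
        self.jobs.discard(name)
        self.deleted.append(name)


def launch_idempotent(api: FakeTrainJobApi, name: str) -> str:
    """Create the job; on a name conflict, remove the stale job and retry once."""
    try:
        api.create(name)
    except ConflictError:
        api.delete(name)
        api.create(name)
    return name
```

The retry is deliberately single-shot here: a second conflict right after a confirmed delete would point to a concurrent launcher, which this sketch lets propagate.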

* fix(kubeflow): reload kube client across cert rotation for long runs

The kubernetes SDK bakes the client cert into its SSLContext at client
construction and never re-reads it. When credentials come from a rotating
source (Teleport tbot refreshing the cert on disk), a KubeflowExecutor
created once at launch keeps presenting the original cert until it expires
mid-run, so status polls fail with SSLV3_ALERT_CERTIFICATE_EXPIRED once the
run outlives the cert TTL (~60 min). Short jobs finish in time; multi-hour
jobs go blind.

Rebuild the API clients from the on-disk kubeconfig past a refresh interval
(below the cert TTL) via lazy properties, and reactively reload+retry once
in status() on a non-API connection error. fetch_logs already shells out to
kubectl, which re-reads creds per call, so it was unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
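
A sketch of the two reload paths described above, assuming a generic `factory` callable that rebuilds the client from the on-disk kubeconfig. The class name, the use of `ConnectionError` as the "non-API connection error", and the injected clock are illustrative choices, not the executor's actual implementation.

```python
import time
from typing import Any, Callable


class RefreshingClient:
    """Lazily rebuilds a client once it is older than refresh_interval_s."""

    def __init__(
        self,
        factory: Callable[[], Any],
        refresh_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory
        self._refresh_interval_s = refresh_interval_s
        self._clock = clock
        self._client = None
        self._built_at = 0.0

    def reload(self) -> None:
        """Force a rebuild, e.g. after the old certificate was rejected."""
        self._client = self._factory()
        self._built_at = self._clock()

    @property
    def client(self) -> Any:
        # Proactive path: rebuild when the cached client is past the interval.
        age = self._clock() - self._built_at
        if self._client is None or age >= self._refresh_interval_s:
            self.reload()
        return self._client

    def call_with_retry(self, fn: Callable[[Any], Any]) -> Any:
        # Reactive path: on a connection-level error, reload and retry once.
        try:
            return fn(self.client)
        except ConnectionError:
            self.reload()
            return fn(self.client)
```

The refresh interval would be chosen below the certificate TTL so that the proactive path normally wins and the reactive retry is only a safety net.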

* fix(kubeflow): scope code_dir per job to avoid concurrent clobber

code_dir was scoped only per user (<pvc>/<username>/code), but package()
rsyncs each job's job_dir into it. Two concurrent jobs from the same user
(e.g. parallel CI test cases) therefore overwrite each other's launcher code
mid-run. Scope it per job (<username>/<experiment_id>/<job_name>/code),
matching how dgxcloud/lepton mirror job_dir into a per-job PVC subdir and how
slurm keys packaging by experiment_id:job_name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
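
The path change can be illustrated with a small helper. The function name and the `/workspace` root in the usage note are hypothetical; only the `<username>/<experiment_id>/<job_name>/code` layout comes from the commit message.

```python
from pathlib import PurePosixPath


def per_job_code_dir(pvc_root: str, username: str, experiment_id: str, job_name: str) -> str:
    """Build <pvc_root>/<username>/<experiment_id>/<job_name>/code."""
    return str(PurePosixPath(pvc_root) / username / experiment_id / job_name / "code")
```

Two concurrent jobs from the same user, say `per_job_code_dir("/workspace", "alice", "exp1", "jobA")` and the same call with `"jobB"`, now resolve to different directories, so one rsync cannot overwrite the other's files.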

* fix(kubeflow): unique TrainJob name + forward all ranks (deduped)

- TrainJob name is now <basename>-<uuid6> (RFC-1123, <=33 chars) via the new
  train_job_basename field, decoupled from the experiment name. The uuid makes
  every launch unique, so concurrent/retried jobs never collide on the API
  server (the descriptive experiment name is intentionally non-unique).
- fetch_logs now forwards every rank to stdout, de-duplicated: torchrun runs
  the same entrypoint on all ranks so startup/config/NCCL lines are identical;
  we strip the per-rank [pod/...]/[defaultN] markers and forward each distinct
  message once. This stops dropping the per-step loss line and wandb URL, which
  Megatron emits from a single layout-dependent rank (neither rank 0 nor last).
  The full per-rank stream still goes to log-allranks_0.out untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
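
One way to produce a `<basename>-<uuid6>` name that stays within 33 characters and is a valid RFC-1123 label is sketched below. The sanitization rules and the `"job"` fallback are assumptions; the commit message only states the resulting format and length limit.

```python
import re
import uuid
from typing import Optional

MAX_NAME_LEN = 33  # limit stated in the commit message


def train_job_name(basename: str, suffix: Optional[str] = None) -> str:
    """Return '<sanitized-basename>-<6 hex chars>', at most MAX_NAME_LEN long."""
    suffix = suffix or uuid.uuid4().hex[:6]
    # RFC-1123 labels allow lowercase alphanumerics and '-' only.
    base = re.sub(r"[^a-z0-9-]+", "-", basename.lower()).strip("-")
    # Leave room for '-' plus the suffix, and avoid a trailing '-' after the cut.
    base = base[: MAX_NAME_LEN - len(suffix) - 1].rstrip("-")
    if not base:
        base = "job"  # hypothetical fallback for an empty basename
    return f"{base}-{suffix}"
```

Passing `suffix` explicitly is only for deterministic tests; in normal use the random suffix is what makes each launch unique.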

* fix(kubeflow): stream logs once, not per replica

torchx calls scheduler.log_iter(app_id, role_name, k=...) once per replica
(k = 0..num_nodes-1). The Kubeflow log_iter ignored k and re-ran
fetch_logs — which tails the entire jobset via the jobset-name selector — for
every replica, producing N independent tail streams (each with its own dedup
state) and N-fold-duplicating every console line (prefixed <role>/<k>). At 16
nodes that's 16x the log volume, which also overruns the CI job-log limit on
long runs. Stream only for k == 0; that single tail already covers all ranks
(and writes log-allranks_0.out once).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
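
The guard amounts to a one-line check. The signature below is a simplified, hypothetical version of the scheduler's `log_iter` (the real torchx method takes more parameters), and `fetch_logs` is injected so the sketch is self-contained.

```python
from typing import Callable, Iterable, Iterator


def log_iter(
    app_id: str,
    role_name: str,
    k: int,
    fetch_logs: Callable[[str], Iterable[str]],
) -> Iterator[str]:
    """Stream logs only for replica 0; the single tail already covers all ranks."""
    if k != 0:
        return iter(())
    return iter(fetch_logs(app_id))
```

With 16 replicas, the caller still invokes `log_iter` 16 times, but only the `k == 0` call opens a tail stream.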

* fix(kubeflow): forward rank 0 + last rank to stdout (not all-ranks dedup)

Revert the all-ranks sliding-window dedup back to forwarding only global rank 0
(setup/config) and the last global rank (print_rank_last per-step loss), like a
SLURM job log. The last rank is resolved at stream time from each pod's
batch.kubernetes.io/job-completion-index label (== torchrun --node-rank
$PET_NODE_RANK), so global_rank = completion_index * nproc_per_node +
local_rank is deterministic without any topology enforcement. The full per-rank
stream is still captured in log-allranks_0.out. Combined with the per-replica
log_iter guard, this stops the N-fold duplication and yields a clean two-rank
console.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix(kubeflow): forward rank-0 + the actual loss-rank slot to stdout

The c10d rendezvous assigns torch ranks by join order, not by JobSet
completion-index, so torch's world_size-1 (print_rank_last's loss line) does
NOT land on the highest completion-index. Verified on a live 16-node job:
the loss prints on completion-index 9 (= num_nodes//2 + 1), local rank
nproc-1 — not index 15. Forward exactly (index 0, local 0) and
(index num_nodes//2 + 1, local nproc-1) so the console shows rank 0 setup +
the per-step loss/throughput. Full per-rank capture remains in
log-allranks_0.out. A deterministic completion-index->rank mapping
(topology/static rank ordering) would let us compute this rather than match
the observed slot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix(kubeflow): robust log streaming across pod/container restarts

fetch_logs ran a single 'kubectl logs -l -f --max-log-requests <num_nodes>'.
That follow only attaches to pods present at start, never re-attaches to a
container that restarts, and --max-log-requests == pod count has no headroom —
so a gang/NCCL-init restart that transiently doubled the matching-pod count
errored the whole command ('maximum allowed concurrency') and silently dropped
pods. Observed: a 16-node job streamed only node-0-0; the loss rank (node-0-9)
never appeared even though it was emitting per-step loss.

- --max-log-requests = max(num_nodes*2, 8): headroom for restart-transient pods.
- Periodically re-attach (threading.Timer terminates the follow every 120s) so
  pods that (re)started after the initial attach are picked up.
- Resume reconnects with --since-time (via --timestamps), tracking the max
  RFC3339 stamp, so re-attaching never replays already-emitted history; only the
  first attach uses --tail=-1. The kubectl timestamp is stripped from each line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
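
The flag selection described in the list above can be sketched as an argv builder. The helper names and the selector argument are assumptions (namespace and other options are omitted); the kubectl flags themselves are the ones named in the commit message.

```python
from typing import List, Optional


def max_log_requests(num_nodes: int) -> int:
    """Headroom for restart-transient pods: max(num_nodes * 2, 8)."""
    return max(num_nodes * 2, 8)


def kubectl_logs_argv(
    selector: str, num_nodes: int, since_time: Optional[str] = None
) -> List[str]:
    """Build the follow command for a first attach or a resume."""
    argv = [
        "kubectl", "logs",
        "-l", selector,
        "-f",
        "--prefix",
        "--timestamps",
        f"--max-log-requests={max_log_requests(num_nodes)}",
    ]
    if since_time is None:
        # First attach: replay the full history.
        argv.append("--tail=-1")
    else:
        # Re-attach: resume from the newest RFC3339 stamp already seen.
        argv.append(f"--since-time={since_time}")
    return argv
```

A caller would pass the largest timestamp observed so far as `since_time` on every re-attach, which is what keeps the periodic reconnect from replaying history.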

* fix(kubeflow): resolve rank-0/last pods from worker GROUP_RANK, not a heuristic

The console forwarded rank 0 + the loss rank using completion-index with an
empirical 'num_nodes//2+1' slot for world_size-1. That's fragile: the c10d
rendezvous assigns torch ranks by join order, not JobSet completion-index, so
the loss rank lands on an unpredictable pod (observed: completion-index 9 was
actually GROUP_RANK 15 = RANK 63 = world_size-1).

Read the ground truth instead: torchrun exports GROUP_RANK into every worker's
/proc/<pid>/environ, so 'kubectl exec <pod> -- ' reading it tells us exactly
which pod holds GROUP_RANK 0 (RANK 0, local 0) and GROUP_RANK num_nodes-1
(RANK world_size-1, local nproc-1). Resolve the pod->GROUP_RANK map once the
workers exist, cache it, and re-resolve when the rank-0/last pod is no longer
covered (gang restart reshuffles ranks). Until workers come up (empty map),
fall back to the completion-index-0 pod so early setup still streams.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
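
A sketch of the parsing side of this approach, assuming the raw bytes of a worker's `/proc/<pid>/environ` have already been read (for example through a `kubectl exec`). The file is a sequence of NUL-separated `KEY=VALUE` entries; the helper names and the pod names in the usage note are illustrative.

```python
from typing import Dict, Optional, Tuple


def parse_environ(raw: bytes) -> Dict[str, str]:
    """Parse NUL-separated KEY=VALUE entries from /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def group_rank(raw: bytes) -> Optional[int]:
    """Return GROUP_RANK as an int, or None if absent or malformed."""
    value = parse_environ(raw).get("GROUP_RANK")
    return int(value) if value is not None and value.isdigit() else None


def pick_first_and_last(
    pod_to_group_rank: Dict[str, int], num_nodes: int
) -> Tuple[Optional[str], Optional[str]]:
    """Return the pods holding GROUP_RANK 0 and GROUP_RANK num_nodes - 1."""
    by_rank = {rank: pod for pod, rank in pod_to_group_rank.items()}
    return by_rank.get(0), by_rank.get(num_nodes - 1)
```

With a map such as `{"node-0-0": 0, "node-0-9": 15}` and 16 nodes, `pick_first_and_last` returns `("node-0-0", "node-0-9")`, matching the observation that completion-index 9 held the last group rank. A `None` in either slot signals that the map needs to be resolved again.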

* fix(kubeflow): emit forwarded log lines in timestamp order

'kubectl logs -l ... -f' multiplexes every pod into one stream in ARRIVAL
order, not timestamp order. Because the console forwards two pods (rank 0 and
the last rank), their lines could interleave wrong — e.g. two rank-0 'Step
Time' lines bunching before the last rank's 'iteration N' line, or a step time
landing under the next iteration.

Add a small reorder buffer on the forwarded (yielded) subset only: each line
already carries the kubelet --timestamps value (parsed to epoch via the new
_ts_epoch), so hold lines until they are older than reorder_hold_s (2s) and
emit sorted by timestamp. The window comfortably absorbs cross-node clock skew
+ flush jitter while keeping the console near-live. The buffer is drained in
order after each proc ends (re-attach) — outside finally, since yielding during
generator close is unsafe. The full all-ranks debug file is untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
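
The reorder buffer can be sketched with a min-heap keyed on the line's timestamp. This version takes already-parsed epoch seconds, so timestamp parsing is out of scope here; the class and method names are illustrative, and the 2-second default mirrors `reorder_hold_s`.

```python
import heapq
from typing import List, Tuple


class ReorderBuffer:
    """Hold lines briefly and release them sorted by timestamp."""

    def __init__(self, hold_s: float = 2.0) -> None:
        self._hold_s = hold_s
        self._heap: List[Tuple[float, int, str]] = []
        self._seq = 0  # tie-breaker keeps arrival order for equal timestamps

    def push(self, ts_epoch: float, line: str) -> None:
        heapq.heappush(self._heap, (ts_epoch, self._seq, line))
        self._seq += 1

    def pop_ready(self, now: float) -> List[str]:
        """Release, oldest first, every line at least hold_s seconds old."""
        out = []
        while self._heap and now - self._heap[0][0] >= self._hold_s:
            out.append(heapq.heappop(self._heap)[2])
        return out

    def drain(self) -> List[str]:
        """Release everything in timestamp order, e.g. when a follow ends."""
        out = []
        while self._heap:
            out.append(heapq.heappop(self._heap)[2])
        return out
```

A line that arrives more than `hold_s` later than a newer line from the other pod would still be emitted out of order, so the hold window trades console latency against tolerance for skew.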

* feat(kubeflow): support pod-template annotations/labels (podTemplateOverrides metadata)

The executor's existing 'annotations' land on the TrainJob object. GKE multi-network
attach (networking.gke.io/interfaces, for GPUDirect-RDMA/gIB) is read off the trainer
POD, not the TrainJob — add pod_annotations (and pod_labels) that flow into
podTemplateOverrides[].metadata, which the Kubeflow Trainer v2 CRD supports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
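
As a rough sketch, the new fields could be turned into a `podTemplateOverrides` entry like this. The `targetJobs` key, the `"node"` target name, and the helper name are assumptions about the CRD shape and should be checked against the installed Kubeflow Trainer version; the annotation value is a placeholder.

```python
from typing import Dict, List, Optional


def pod_template_overrides(
    target_job: str,
    pod_annotations: Optional[Dict[str, str]] = None,
    pod_labels: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Build a podTemplateOverrides list carrying only pod metadata."""
    metadata = {}
    if pod_annotations:
        metadata["annotations"] = dict(pod_annotations)
    if pod_labels:
        metadata["labels"] = dict(pod_labels)
    if not metadata:
        return []  # nothing to override
    # Assumed shape: each entry names its target jobs and the metadata to merge.
    return [{"targetJobs": [{"name": target_job}], "metadata": metadata}]


overrides = pod_template_overrides(
    "node",  # hypothetical target job name
    pod_annotations={"networking.gke.io/interfaces": "<interfaces-json>"},
)
```

Returning an empty list when neither field is set keeps the TrainJob spec unchanged for users who do not need pod-level metadata.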

* fix(kubeflow): resolve rank-0 and last rank before forwarding logs

On first attach the GROUP_RANK pod map is empty until the torchrun workers
finish rendezvous, so _forward_to_stdout fell back to rank-0-only and the
last rank's early per-step loss/throughput lines (replayed via --tail=-1)
were written to log-allranks but never forwarded to stdout — the CI log
silently dropped the beginning of the run until a re-attach ~120s later,
by which point --since-time skips the replayed history.

Poll on the first attach until both rank 0 and the last rank resolve before
forwarding, capped at 600s (then fall back). The wait is gated on a
non-empty pod list, so it is a no-op when pods can't be listed (no kubectl
/ unit tests) and engages only for real runs.

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix(kubeflow): wait for rank-0/last to resolve, never fall back to completion-index

The first-attach barrier capped the wait at 600s and then forwarded with the
completion-index heuristic, which streams the wrong rank. A job can legitimately
sit Pending (starved for nodes) far longer than 600s, so it would time out and
mis-forward. Drop the timeout/fallback: keep polling while the job is alive and
stop only when it reaches a terminal state. --tail=-1 on first attach replays
history, so waiting loses nothing.

Signed-off-by: oliver könig <okoenig@nvidia.com>
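
The barrier without a timeout can be sketched as a loop that exits on exactly two conditions: both pods resolved, or the job reached a terminal state. The state names, the callables, and the poll interval are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}  # hypothetical names


def wait_for_ranks(
    resolve: Callable[[], Tuple[Optional[str], Optional[str]]],
    get_state: Callable[[], str],
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Tuple[str, str]]:
    """Poll until rank 0 and the last rank resolve; give up only on a terminal job."""
    while True:
        first, last = resolve()
        if first is not None and last is not None:
            return first, last
        if get_state() in TERMINAL_STATES:
            return None
        sleep(poll_s)
```

A job that sits Pending for an hour simply keeps this loop polling; since the first attach replays history with --tail=-1, the wait costs latency but no log lines.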

* style(kubeflow): ruff-format kubeflow.py

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test(kubeflow): update stale tests for uuid names, idempotent 409, rank-0/last log forwarding

The base kubeflow rewrite changed behavior the tests still asserted the old way:
TrainJob names are now <base>-<uuid6>; a 409 cancels the stale job and recreates
(idempotent) rather than raising; and fetch_logs writes every rank to
<job_dir>/log-allranks_0.out while forwarding only rank-0 + the last rank to
stdout. Set job_dir, patch status/time.sleep to avoid the retry-loop hang, and
assert the all-ranks file + uuid-suffixed names.

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test(kubeflow): cover GROUP_RANK resolution, log forwarding, client reload

Raises codecov/patch on the diff from 58% to ~98% by exercising the
previously-untested branches: GROUP_RANK resolution via worker environ
(incl. the first-attach resolve barrier), rank-0/last-rank forwarding +
reorder buffer, the completion-index fallback, pod-template labels/
annotations, stale kube-client reload, and the status() connection-error
retry path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* feat(kubeflow): add copy_to_workspace/copy_from_workspace for arbitrary PVC paths

package()/pull_results() already bridge launcher↔PVC via a throw-away data-mover
pod, but only for the per-job code_dir. Downloading results (or persisting any
auxiliary cross-run state) from another path on the volume had no public API.

Add copy_to_workspace(local, remote) and copy_from_workspace(remote, local) that
run the same data-mover against an arbitrary path under workdir_pvc_path, and
refactor package()/pull_results() to delegate to them (behavior unchanged). Tests
cover the happy path, the no-PVC no-op, and pod teardown on copy error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

2 participants