Harden CI for self-hosted SLURM runners #1135
sbryngelson wants to merge 27 commits into MFlowCode:master
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`read -t 0.1` (a sub-second timeout) in a loop with process-substitution file descriptors triggers a bash internal error (`unwind_frame_run: read_builtin: frame not found`) leading to a segfault. Use an integer timeout (`read -t 1`) instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
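A minimal sketch of the fixed pattern, assuming a process-substitution file descriptor like the monitor uses (the FD number and input here are illustrative):

```shell
#!/usr/bin/env bash
# Drain lines non-blockingly from a process-substitution FD.
# An integer `read -t 1` avoids the bash internal error
# ("unwind_frame_run: read_builtin: frame not found") that fractional
# timeouts like `read -t 0.1` can trigger in this pattern.
exec 3< <(printf 'one\ntwo\n'; sleep 2)

drained=0
while IFS= read -r -t 1 -u 3 line; do
  drained=$((drained + 1))   # process "$line" here
done                         # read times out once no input arrives within 1s
exec 3<&-
```

The one-second timeout also conveniently doubles as a heartbeat cadence while tailing job output.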
📝 Walkthrough
Refactors SLURM monitoring to use state-driven polling, improves exit-code extraction and tailing behavior, and adds cancellation on abnormal exit; updates GPU SBATCH requests to H200 resources.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Monitor as monitor_slurm_job.sh
    participant SlurmCtl as squeue/sacct/scontrol
    participant JobFS as Job output file (filesystem)
    participant Tail as tail process
    Monitor->>SlurmCtl: get_job_state(job_id) (squeue → sacct fallback)
    SlurmCtl-->>Monitor: STATE (PENDING/RUNNING/COMPLETED/UNKNOWN)
    alt STATE PENDING/CONFIGURING
        Monitor->>Monitor: sleep (~10s), continue polling
    else STATE RUNNING/COMPLETING
        Monitor->>Tail: start non-blocking tail (1s timeout, burst cap)
        Tail->>JobFS: read new lines
        JobFS-->>Tail: new output
        Tail-->>Monitor: heartbeat + output
    else STATE TERMINAL
        Monitor->>Tail: stop tail, drain remaining lines (1s timeout, cap)
        Monitor->>SlurmCtl: scontrol show job -> ExitCode (fallback sacct)
        SlurmCtl-->>Monitor: ExitCode or UNKNOWN
        alt ExitCode == 0
            Monitor-->>Monitor: set monitor_success, exit 0
        else
            Monitor->>SlurmCtl: scancel job_id (if needed)
            Monitor-->>Monitor: exit non-zero
        end
    else STATE UNKNOWN
        Monitor->>Monitor: periodic warning, longer sleep, retry
    end
```
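The loop in the diagram can be sketched as below; `get_job_state` is stubbed with canned states so the control flow runs stand-alone (state names and branch actions are illustrative, not the exact script):

```shell
# Hypothetical stub: returns PENDING, then RUNNING, then COMPLETED.
states=(PENDING RUNNING COMPLETED)
poll=0
get_job_state() { echo "${states[$poll]:-COMPLETED}"; }

seen=""
while :; do
  state=$(get_job_state 12345)
  seen="$seen$state "
  case "$state" in
    PENDING|CONFIGURING)      : ;;      # keep waiting (real script sleeps ~10s)
    RUNNING|COMPLETING)       : ;;      # real script tails job output here
    COMPLETED)                break ;;  # terminal, success path
    FAILED|CANCELLED|TIMEOUT) break ;;  # terminal, failure path (scancel, exit 1)
    *)                        : ;;      # UNKNOWN: warn, sleep longer, retry
  esac
  poll=$((poll + 1))
done
```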
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 warning
Pull request overview
Updates the Phoenix (GT) GitHub Actions SLURM submission scripts to target H200 GPU resources for improved scheduling, and adjusts the SLURM job monitor script to avoid a Bash segfault caused by fractional `read -t` timeouts.
Changes:
- Update Phoenix GPU `sbatch` directives to request H200 GPUs and adjust the task count per node.
- Replace fractional `read -t` timeouts with integer timeouts in the SLURM job monitor script.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `.github/workflows/phoenix/submit.sh` | Switch GPU submission options to request H200 GPUs and increase `--ntasks-per-node`. |
| `.github/workflows/phoenix/submit-bench.sh` | Same as above for benchmark submissions. |
| `.github/scripts/monitor_slurm_job.sh` | Use integer `read -t` timeouts to prevent Bash segfaults during output streaming. |
PR MFlowCode#1124 changed bench.yml to use workflow_run (triggered after Test Suite completes), which broke the approve-to-run flow for fork PRs. Revert to the original pull_request + pull_request_review triggers while keeping improvements (frontier_amd matrix, concurrency group, timeout, run_parallel_benchmarks.sh). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master    #1135   +/-   ##
=======================================
  Coverage   44.07%   44.07%
=======================================
  Files          70       70
  Lines       20431    20431
  Branches     1974     1974
=======================================
  Hits         9004     9004
  Misses      10291    10291
  Partials     1136     1136
```

☔ View full report in Codecov by Sentry.
Write failed test UUIDs to tests/failed_uuids.txt after a test run. In CI, if 1-5 tests fail, automatically re-run just those tests. If 6+ fail, treat it as a real issue and fail immediately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 issue found across 2 files (changes from recent commits).
<file name=".github/workflows/test.yml">
<violation number="1" location=".github/workflows/test.yml:138">
P2: The initial test run is forced to succeed with `|| true`, so real failures can be masked when `tests/failed_uuids.txt` is missing. Preserve the exit code and fail the job if no retry is performed.</violation>
</file>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 137-153: The test step currently uses "|| true" after invoking
mfc.sh which hides crashes; remove that and capture the test command's exit
status (run mfc.sh test -v ... and set TEST_EXIT=$? or use if ! ...; then
TEST_EXIT=$? else TEST_EXIT=0) so you can decide later; keep the existing retry
logic that checks tests/failed_uuids.txt (NUM_FAILED, FAILED) and, after that
block, if tests/failed_uuids.txt does not exist and TEST_EXIT is non‑zero then
exit with TEST_EXIT (or otherwise propagate the original non‑zero exit code) to
avoid reporting success on a crash. Ensure you reference the same mfc.sh
invocation and tests/failed_uuids.txt and use TEST_EXIT when making the final
exit decision.
🧹 Nitpick comments (2)

toolchain/mfc/test/test.py (1)

209-216: Race condition: `failed_tests` is mutated from worker threads without synchronization.
`failed_tests` is appended to in `handle_case` (line 505), which runs in worker threads. While CPython's GIL makes `list.append` atomic, iterating over `failed_tests` here (line 213) is safe only because `sched.sched` (line 179) has already joined all workers by this point. This is fine but fragile; a future refactor that moves this block could introduce a bug. A brief comment would help.

.github/workflows/test.yml (1)

144-144: Minor: useless use of `cat`.
`tr '\n' ' ' < tests/failed_uuids.txt` is slightly cleaner, though this is purely cosmetic.
Don't mask non-zero exit codes when tests crash before writing failed_uuids.txt. Only suppress the exit code when the file exists (meaning the test framework ran to completion and we can retry). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
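A sketch of that guard, with a stand-in for the real `./mfc.sh test` invocation (the `run_tests` helper and verdict strings are illustrative, not the exact workflow step):

```shell
run_tests() { "$@"; }   # stand-in for ./mfc.sh test -v ...

rm -f failed_uuids.txt
TEST_EXIT=0
run_tests false || TEST_EXIT=$?   # capture the status instead of `|| true`

if [ -f failed_uuids.txt ]; then
  # Framework ran to completion and recorded failures: safe to retry a few.
  NUM_FAILED=$(wc -l < failed_uuids.txt)
  verdict="retry $NUM_FAILED"
else
  # Crash before the file was written: propagate the original status.
  verdict="crash"
fi
```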
Replace squeue exit-code polling with get_job_state() that parses the actual state string (squeue + sacct fallback). Never give up on UNKNOWN state — CI timeout is the backstop. Cancel orphaned SLURM jobs on abnormal monitor exit. Include job state in heartbeats. Incorporates changes from PR MFlowCode#1140. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
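A sketch of such a helper, assuming common squeue/sacct flags (`-h -o '%T'` for the live state, `-X -P -n -o State` for accounting); the `|| true` guards keep `set -euo pipefail` from killing the monitor during a transient SLURM outage:

```shell
get_job_state() {
  local job_id=$1 state
  # Primary: live queue (fast; covers PENDING/RUNNING).
  state=$(squeue -j "$job_id" -h -o '%T' 2>/dev/null || true)
  if [ -z "$state" ]; then
    # Fallback: accounting DB (covers jobs that have left the queue).
    state=$(sacct -j "$job_id" -X -P -n -o State 2>/dev/null | head -n 1 || true)
  fi
  echo "${state:-UNKNOWN}"
}
```

Callers never see a hard failure: an unreachable controller simply yields UNKNOWN, and the CI-level timeout remains the backstop.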
1 issue found across 1 file (changes from recent commits).
<file name=".github/scripts/monitor_slurm_job.sh">
<violation number="1" location=".github/scripts/monitor_slurm_job.sh:40">
P2: Guard the `squeue` pipeline so transient command failures don't abort the script under `set -euo pipefail`; otherwise a temporary SLURM outage exits the monitor instead of returning "UNKNOWN".</violation>
</file>
The H200 switch needs to land on master first so both PR and master benchmark builds use the same node type. Split into a separate PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use shopt dotglob/nullglob for cleaner glob expansion instead of manual dotfile patterns. Keep retry loop for Lustre ESTALE resilience. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
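A sketch of the pattern (hypothetical helper name; backoff times illustrative):

```shell
clean_dir() {
  local dir=$1 attempt
  # dotglob: `*` also matches dotfiles; nullglob: an empty directory expands
  # to nothing (so `rm -f` gets zero operands and succeeds, rather than
  # seeing a literal unexpanded `*`).
  shopt -s dotglob nullglob
  for attempt in 1 2 3; do
    rm -rf -- "$dir"/* 2>/dev/null && break
    sleep "$attempt"   # back off before retrying (Lustre ESTALE resilience)
  done
  shopt -u dotglob nullglob
}
```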
Add --requeue to Phoenix sbatch scripts so preempted embers-QOS jobs are automatically rescheduled. Remove PREEMPTED from the monitor's terminal state list so it keeps waiting through the requeue cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
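The sbatch side is a one-line directive (`#SBATCH --requeue`); on the monitor side, the state classifier must stop treating PREEMPTED as terminal. A sketch, with assumed state names:

```shell
# Returns success (0) only for states that end the monitoring loop.
# PREEMPTED is deliberately absent: with --requeue the job goes back to
# PENDING, so the monitor should keep polling through the requeue cycle.
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY) return 0 ;;
    *) return 1 ;;   # PENDING, RUNNING, PREEMPTED, UNKNOWN, ...
  esac
}
```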
The `rm -rf *` was destroying the build cache, causing CMake to rebuild from scratch and hit Lustre ioctl errors. With `clean: false` on checkout, `git clean` is already disabled, so no pre-cleanup is needed. Keep full cleanup only in bench.yml, where pr/master are fresh clones. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without `|| exit $?`, a failed retry would silently exit 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
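A sketch of the failure mode: when a failing command is followed by further statements, its non-zero status is replaced by theirs and the step reports success; `|| exit $?` propagates the status immediately (the `retry` stand-in is illustrative):

```shell
set +e
retry() { "$@"; }   # stand-in for re-running the failed tests

# Without propagation: the trailing command resets the status to 0.
( retry false; true ); without=$?

# With propagation: the subshell exits with retry's status (1).
( retry false || exit $?; true ); with=$?
```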
Revert test.yml to clean: true (default) — the corrupted build cache from the ioctl failure was causing 100% test failures. The Lustre-safe cleanup is only needed for bench.yml where pr/master are separate trees. Also tune qodo PR reviewer: reduce max findings to 5, lower suggestion depth to medium, and add instructions to focus on correctness over style for CI scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Lustre-safe cleanup step was wiping the build cache (pr/build/, master/build/), forcing full rebuilds every run. This added ~32 min of build time and pushed NVHPC gpu-omp benchmarks past the 4h SLURM limit. Restore default checkout behavior to preserve build cache across runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split Frontier GPU test configs into 2 shards (~75 min each) so they fit within the batch partition's 2h wall time limit. This allows all Frontier SLURM jobs to run concurrently instead of serially on the extended partition (which has a 1-job-per-user limit), reducing total CI wall clock from ~4.5h to ~2h. Changes: - Add --shard CLI argument (e.g., --shard 1/2) with modulo-based round-robin distribution across shards - Switch Frontier submit scripts from extended to batch/hackathon (CFD154 account, 1h59m wall time) - Shard the 3 Frontier GPU matrix entries into 6 (2 shards each) - CPU entries remain unsharded Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
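The modulo distribution can be sketched in shell (hypothetical helper; the real implementation is the Python snippet in test.py): case i goes to shard (i mod n) + 1, so `--shard 1/2` and `--shard 2/2` partition the list with no overlap.

```shell
# Print the cases belonging to shard `shard_idx` of `shard_count`.
select_shard() {
  local shard_idx=$1 shard_count=$2; shift 2
  local i=0 out="" c
  for c in "$@"; do
    if [ $(( i % shard_count )) -eq $(( shard_idx - 1 )) ]; then
      out="$out $c"
    fi
    i=$(( i + 1 ))
  done
  echo "${out# }"
}
```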
```python
if ARG("shard") is not None:
    shard_idx, shard_count = (int(x) for x in ARG("shard").split("/"))
    skipped_cases += [c for i, c in enumerate(cases) if i % shard_count != shard_idx - 1]
    cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
```

✅ Suggestion: Add robust parsing and validation for the shard argument to prevent crashes from empty or malformed input, such as an empty string from the CI matrix. [possible issue, importance: 8]

```python
shard = ARG("shard")
if shard:
    parts = shard.split("/", 1)
    if len(parts) != 2:
        raise MFCException(f"Invalid shard format: {shard!r} (expected 'i/n')")
    shard_idx, shard_count = (int(x) for x in parts)
    if shard_count <= 0 or not (1 <= shard_idx <= shard_count):
        raise MFCException(f"Invalid shard value: {shard!r} (expected 1 <= i <= n, n > 0)")
    skipped_cases += [c for i, c in enumerate(cases) if i % shard_count != shard_idx - 1]
    cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
```
1 issue found across 7 files (changes from recent commits).
<file name="toolchain/mfc/test/test.py">
<violation number="1" location="toolchain/mfc/test/test.py:103">
P2: Validate the `--shard` argument before using it; as written, invalid or `0` shard counts will raise exceptions (including ZeroDivisionError) during modulo operations.</violation>
</file>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
User description
Summary
Systematic hardening of CI infrastructure for self-hosted runners (Phoenix, Frontier) on Lustre filesystems:

- Rewrite SLURM job monitoring around `get_job_state()` (squeue primary, sacct fallback); adds orphan job cleanup via an EXIT trap, terminal state detection, and informative heartbeat messages
- Change fractional `read -t 0.1` to integer `read -t 1` to avoid process substitution FD corruption
- `rm -rf` before checkout with `clean: false` to avoid ESTALE/ENOTEMPTY errors from `git clean` on Lustre
- `-X -P` flags for parsable, job-level-only output
- `|| true` on squeue/sacct command substitutions
- Retry `shutil.rmtree` up to 5 times with backoff in the Python toolchain
- Concurrency fix so `pull_request_review` doesn't cancel `pull_request` runs
- Revert `workflow_run` back to direct `pull_request` + `pull_request_review` triggers
- `@page`/`@ref` patterns

Test plan
🤖 Generated with Claude Code
CodeAnt-AI Description
Harden CI reliability for self-hosted SLURM runners and flaky tests
Impact

✅ Fewer orphaned SLURM jobs after CI runner interruptions
✅ Fewer sporadic CI failures due to automatic retry of small test flakiness
✅ Shorter CI reruns and clearer job state logs during cluster runs