[https://nvbugs/6029864][fix] Fix flaky ray test failure #12697
brb-nv wants to merge 1 commit into NVIDIA:main from
Conversation
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
🧹 Nitpick comments (1)
tests/unittest/conftest.py (1)
453-454: Prefer readiness polling instead of a fixed 2s sleep.
`time.sleep(2)` is still timing-dependent: slow CI nodes can remain flaky, while fast nodes always pay 2s. A bounded poll for Ray node readiness is more deterministic.

♻️ Suggested change
```diff
-    # Allow raylet to complete GCS registration before tests create actors
-    time.sleep(2)
+    # Wait for at least one alive Ray node to appear in GCS.
+    deadline_s = time.monotonic() + 10.0
+    while time.monotonic() < deadline_s:
+        if any(node.get("Alive", False) for node in ray.nodes()):
+            break
+        time.sleep(0.1)
+    else:
+        raise RuntimeError("Ray cluster did not become ready within 10 seconds")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/conftest.py` around lines 453 - 454, Replace the fixed time.sleep(2) in tests/unittest/conftest.py with a bounded readiness poll that checks Ray node registration: implement a loop (use time.monotonic for timing) that repeatedly queries ray.nodes() (or ray.cluster_resources()/ray.state.cluster_resources) and returns once nodes are present and show an "alive"/registered status (e.g., check node["Alive"] or equivalent), sleeping a short interval (e.g., 0.1s) between attempts and raising/failed-asserting after a configurable timeout (e.g., 10s); update the code around the existing sleep call to use this polling logic so fast CI doesn't wait unnecessarily and slow CI gets a bounded wait.
Force-pushed from 1fb85a4 to 0081e3b (Compare)
/bot run --disable-fail-fast

PR_Github #41464 [ run ] triggered by Bot. Commit:

PR_Github #41464 [ run ] completed with state
Description
The test test_cp_tp_broadcast_object[cp_broadcast-dict] in tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py intermittently fails with a Ray cluster timeout error during fixture setup:
The `setup_ray_cluster` fixture calls `ray.init()`, which returns successfully, but the raylet nodes have not yet fully registered their resources with the Ray Global Control Store (GCS). When tests immediately attempt to create Ray actors after the fixture yields, they fail because the GCS cannot find the node information. This is a race condition between `ray.init()` completing and the raylet finishing its registration with the GCS.
Add a 2-second delay after ray.init() to allow the raylet to complete GCS registration before tests create actors. This is a minimal, low-risk fix that addresses the timing issue without adding complex retry or polling logic.
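For comparison, the polling alternative raised in review can be written as a small standalone helper. This is a sketch, not the actual conftest.py change; `wait_for_alive_node` and its `get_nodes` parameter are hypothetical names, with `get_nodes` wired to `ray.nodes` in practice:

```python
import time


def wait_for_alive_node(get_nodes, timeout_s=10.0, poll_s=0.1):
    """Block until get_nodes() reports at least one alive node, else raise.

    get_nodes: zero-arg callable returning a list of node dicts; each entry
    is expected to carry an "Alive" flag, matching the shape of ray.nodes().
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if any(node.get("Alive", False) for node in get_nodes()):
            return
        time.sleep(poll_s)
    raise RuntimeError(
        f"Ray cluster did not become ready within {timeout_s} seconds")
```

In the fixture, `wait_for_alive_node(ray.nodes)` would replace the fixed sleep, bounding the wait on slow CI nodes while returning almost immediately on fast ones.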
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.