
[https://nvbugs/6029864][fix] Fix flaky ray test failure#12697

Open
brb-nv wants to merge 1 commit into NVIDIA:main from brb-nv:user/brb/unwaive-cp-tp-broadcast-unittest

Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Apr 2, 2026

Description

  • Background
    The test test_cp_tp_broadcast_object[cp_broadcast-dict] in tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py intermittently fails with a Ray cluster timeout error during fixture setup:
Exception: The current node timed out during startup. This could happen because some of the raylet failed to startup or the GCS has become overloaded.
INFO node.py:375 -- Failed to get node info 'GCS cannot find the node with node ID ... The node registration may not be complete yet before the timeout.'
  • Root Cause
    The setup_ray_cluster fixture calls ray.init(), which returns successfully, but the raylet nodes have not yet fully registered their resources with the Ray Global Control Store (GCS). When tests attempt to create Ray actors immediately after the fixture yields, they fail because the GCS cannot find the node information.

This is a race condition between ray.init() completing and the raylet finishing its registration with the GCS.

  • Fix
    Add a 2-second delay after ray.init() to allow the raylet to complete GCS registration before tests create actors. This is a minimal, low-risk fix that addresses the timing issue without adding complex retry or polling logic.
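The fix described above can be sketched as follows. This is a simplified, hypothetical stand-in for the real fixture logic in tests/unittest/conftest.py: the call to ray.init() is injected as init_fn so the sketch runs without a Ray installation, and the function name is illustrative.

```python
import time

def initialize_with_grace_period(init_fn, delay_s=2.0):
    # Hypothetical stand-in for the fixture logic: run ray.init()
    # (injected here as init_fn), then pause so the raylet can finish
    # registering its resources with the GCS before any actor is created.
    context = init_fn()
    time.sleep(delay_s)
    return context
```

In the real fixture, init_fn would be ray.init and the fixture would yield to the test only after the delay has elapsed.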

Test Coverage

$ pytest tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py::test_cp_tp_broadcast_object[cp_broadcast-dict] --run-ray -s -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Tests

  • Improved test initialization reliability by adding a delay to ensure proper system registration completes before test execution begins.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough


The setup_ray_cluster function in the test configuration now imports the time module and introduces a fixed 2-second delay after Ray initialization to allow Raylet and GCS registration to complete before test execution begins.

Changes

  • Ray Cluster Test Setup (tests/unittest/conftest.py): Added a time module import and a 2-second delay (time.sleep(2)) after Ray initialization and GCS address/port extraction, so that registration completes before test execution.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly identifies the main change, fixing a flaky Ray test failure, with a specific NVBugs reference and fix type.
  • Description check ✅ Passed: The description includes all required sections (background, root cause, fix, test coverage) and clearly explains the issue, solution, and testing approach with actionable details.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/unittest/conftest.py (1)

453-454: Prefer readiness polling instead of a fixed 2s sleep.

time.sleep(2) is still timing-dependent: slow CI nodes can remain flaky, while fast nodes always pay 2s. A bounded poll for Ray node readiness is more deterministic.

♻️ Suggested change
-        # Allow raylet to complete GCS registration before tests create actors
-        time.sleep(2)
+        # Wait for at least one alive Ray node to appear in GCS.
+        deadline_s = time.monotonic() + 10.0
+        while time.monotonic() < deadline_s:
+            if any(node.get("Alive", False) for node in ray.nodes()):
+                break
+            time.sleep(0.1)
+        else:
+            raise RuntimeError("Ray cluster did not become ready within 10 seconds")
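The polling alternative in the suggested change can also be written as a standalone helper. This is an illustrative sketch, not code from the PR: the Ray-specific probe (checking ray.nodes() for an alive node) is replaced by an injected callable so the sketch runs without a live cluster, and the helper name is hypothetical.

```python
import time

def wait_until_ready(probe, timeout_s=10.0, interval_s=0.1):
    # Poll `probe` until it reports readiness or the deadline passes.
    # Returns the elapsed time on success; raises RuntimeError on timeout.
    start = time.monotonic()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return time.monotonic() - start
        time.sleep(interval_s)
    raise RuntimeError(f"cluster not ready within {timeout_s} seconds")
```

In the fixture, the probe would be something like `lambda: any(node.get("Alive", False) for node in ray.nodes())`, giving fast nodes an early exit while slow CI nodes get a bounded wait.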
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/conftest.py` around lines 453 - 454, Replace the fixed
time.sleep(2) in tests/unittest/conftest.py with a bounded readiness poll that
checks Ray node registration: implement a loop (use time.monotonic for timing)
that repeatedly queries ray.nodes() (or
ray.cluster_resources()/ray.state.cluster_resources) and returns once nodes are
present and show an "alive"/registered status (e.g., check node["Alive"] or
equivalent), sleeping a short interval (e.g., 0.1s) between attempts and
raising/failed-asserting after a configurable timeout (e.g., 10s); update the
code around the existing sleep call to use this polling logic so fast CI doesn't
wait unnecessarily and slow CI gets a bounded wait.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d9ff24d-b448-4e3e-9ec2-de2b8ec7a154

📥 Commits

Reviewing files that changed from the base of the PR and between b9ba730 and 1fb85a4.

📒 Files selected for processing (1)
  • tests/unittest/conftest.py

@brb-nv brb-nv force-pushed the user/brb/unwaive-cp-tp-broadcast-unittest branch from 1fb85a4 to 0081e3b Compare April 2, 2026 16:34
@brb-nv
Collaborator Author

brb-nv commented Apr 2, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41464 [ run ] triggered by Bot. Commit: 0081e3b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41464 [ run ] completed with state SUCCESS. Commit: 0081e3b
/LLM/main/L0_MergeRequest_PR pipeline #32394 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

