
[https://nvbugs/6029864][fix] Fix flaky ray test failure#12697

Open
brb-nv wants to merge 1 commit into NVIDIA:main from brb-nv:user/brb/unwaive-cp-tp-broadcast-unittest

Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Apr 2, 2026

Description

  • Background
    The test test_cp_tp_broadcast_object[cp_broadcast-dict] in tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py intermittently fails with a Ray cluster timeout error during fixture setup:
Exception: The current node timed out during startup. This could happen because some of the raylet failed to startup or the GCS has become overloaded.
INFO node.py:375 -- Failed to get node info 'GCS cannot find the node with node ID ... The node registration may not be complete yet before the timeout.'
  • Root Cause
    The setup_ray_cluster fixture calls ray.init(), which returns successfully, but the raylet nodes have not yet fully registered their resources with the Ray Global Control Store (GCS). When tests attempt to create Ray actors immediately after the fixture yields, they fail because the GCS cannot find the node information.

This is a race condition between ray.init() completing and the raylet finishing its registration with the GCS.

  • Fix
    Add a 2-second delay after ray.init() to allow the raylet to complete GCS registration before tests create actors. This is a minimal, low-risk fix that addresses the timing issue without adding complex retry or polling logic.
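The fix described above can be sketched as follows. This is a simplified, hypothetical stand-in for the real fixture logic in tests/unittest/conftest.py: the call to ray.init() is injected as init_fn so the sketch runs without a Ray installation, and the function name is illustrative.

```python
import time

def initialize_with_grace_period(init_fn, delay_s=2.0):
    # Hypothetical stand-in for the fixture logic: run ray.init()
    # (injected here as init_fn), then pause so the raylet can finish
    # registering its resources with the GCS before any actor is created.
    context = init_fn()
    time.sleep(delay_s)
    return context
```

In the real fixture, init_fn would be ray.init and the fixture would yield to the test only after the delay has elapsed.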

Test Coverage

$ pytest tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py::test_cp_tp_broadcast_object[cp_broadcast-dict] --run-ray -s -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Tests

  • Improved test initialization reliability by adding a delay to ensure proper system registration completes before test execution begins.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough


The setup_ray_cluster function in the test configuration now imports the time module and introduces a fixed 2-second delay after Ray initialization to allow Raylet and GCS registration to complete before test execution begins.

Changes

  • Ray Cluster Test Setup (tests/unittest/conftest.py): Added a time module import and a 2-second delay (time.sleep(2)) after Ray initialization and GCS address/port extraction, so that registration completes before test execution.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly identifies the main change, fixing a flaky Ray test failure, with a specific NVBugs reference and fix type.
  • Description check ✅ Passed: The description includes all required sections (background, root cause, fix, test coverage) and clearly explains the issue, solution, and testing approach with actionable details.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/unittest/conftest.py (1)

453-454: Prefer readiness polling instead of a fixed 2s sleep.

time.sleep(2) is still timing-dependent: slow CI nodes can remain flaky, while fast nodes always pay 2s. A bounded poll for Ray node readiness is more deterministic.

♻️ Suggested change
-        # Allow raylet to complete GCS registration before tests create actors
-        time.sleep(2)
+        # Wait for at least one alive Ray node to appear in GCS.
+        deadline_s = time.monotonic() + 10.0
+        while time.monotonic() < deadline_s:
+            if any(node.get("Alive", False) for node in ray.nodes()):
+                break
+            time.sleep(0.1)
+        else:
+            raise RuntimeError("Ray cluster did not become ready within 10 seconds")
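The polling alternative in the suggested change can also be written as a standalone helper. This is an illustrative sketch, not code from the PR: the Ray-specific probe (checking ray.nodes() for an alive node) is replaced by an injected callable so the sketch runs without a live cluster, and the helper name is hypothetical.

```python
import time

def wait_until_ready(probe, timeout_s=10.0, interval_s=0.1):
    # Poll `probe` until it reports readiness or the deadline passes.
    # Returns the elapsed time on success; raises RuntimeError on timeout.
    start = time.monotonic()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return time.monotonic() - start
        time.sleep(interval_s)
    raise RuntimeError(f"cluster not ready within {timeout_s} seconds")
```

In the fixture, the probe would be something like `lambda: any(node.get("Alive", False) for node in ray.nodes())`, giving fast nodes an early exit while slow CI nodes get a bounded wait.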
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/conftest.py` around lines 453 - 454, Replace the fixed
time.sleep(2) in tests/unittest/conftest.py with a bounded readiness poll that
checks Ray node registration: implement a loop (use time.monotonic for timing)
that repeatedly queries ray.nodes() (or
ray.cluster_resources()/ray.state.cluster_resources) and returns once nodes are
present and show an "alive"/registered status (e.g., check node["Alive"] or
equivalent), sleeping a short interval (e.g., 0.1s) between attempts and
raising/failed-asserting after a configurable timeout (e.g., 10s); update the
code around the existing sleep call to use this polling logic so fast CI doesn't
wait unnecessarily and slow CI gets a bounded wait.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d9ff24d-b448-4e3e-9ec2-de2b8ec7a154

📥 Commits

Reviewing files that changed from the base of the PR and between b9ba730 and 1fb85a4.

📒 Files selected for processing (1)
  • tests/unittest/conftest.py

@brb-nv brb-nv force-pushed the user/brb/unwaive-cp-tp-broadcast-unittest branch from 1fb85a4 to 0081e3b Compare April 2, 2026 16:34
@brb-nv
Collaborator Author

brb-nv commented Apr 2, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41464 [ run ] triggered by Bot. Commit: 0081e3b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41464 [ run ] completed with state SUCCESS. Commit: 0081e3b
/LLM/main/L0_MergeRequest_PR pipeline #32394 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

