
feat: Add CUDA Graph configuration support to MegatronPolicyWorker#1736

Open
sahgerlad wants to merge 5 commits into NVIDIA-NeMo:main from sahgerlad:feat/cuda-graph-support

Conversation

@sahgerlad
Contributor

@sahgerlad sahgerlad commented Jan 7, 2026

What does this PR do ?

Add CUDA Graph configuration support to MegatronPolicyWorker

Issues

N/A - Feature addition

Usage

# In config YAML, under megatron_cfg:
megatron_cfg:
  enabled: true
  enable_cuda_graph: true   # Enable CUDA graph capture
  cuda_graph_scope: "attn"  # Optional: scope for CUDA graph capture
  # ... other megatron config options

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

N/A

Summary by CodeRabbit

  • New Features

    • Added CUDA Graph configuration support for the Megatron policy worker, allowing users to enable CUDA graphs and specify their scope (e.g., full_iteration).
    • Integrated RNG configuration that automatically enables RNG tracking when CUDA graphs are activated.
  • Tests

    • Added comprehensive test coverage for CUDA graph configuration parsing with various parameter combinations.


@sahgerlad sahgerlad requested review from a team as code owners January 7, 2026 19:56
@sahgerlad sahgerlad changed the title from "Add CUDA Graph configuration support to MegatronPolicyWorker" to "feat: Add CUDA Graph configuration support to MegatronPolicyWorker" Jan 7, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jan 7, 2026

📝 Walkthrough

Walkthrough

Added CUDA Graph and RNG configuration support to MegatronPolicyWorker by importing RNGConfig and conditionally propagating enable_cuda_graph, cuda_graph_scope, and rng settings into the Megatron ConfigContainer. Extended test configuration builder and added test coverage for CUDA graph configuration parsing.

Changes

Cohort / File(s) Summary
CUDA Graph and RNG Configuration
nemo_rl/models/policy/workers/megatron_policy_worker.py
Imported RNGConfig from megatron.bridge.training.config. Added conditional logic to extract enable_cuda_graph and cuda_graph_scope from megatron_cfg and propagate them into model_cfg. When enable_cuda_graph is true, set use_te_rng_tracker to True. Created RNGConfig instance based on enable_cuda_graph and passed it to Megatron ConfigContainer as rng parameter.
Test Configuration and Validation
tests/unit/models/policy/test_megatron_worker.py
Extended create_megatron_test_config signature with optional enable_cuda_graph and cuda_graph_scope parameters. Conditionally merge these parameters into megatron_cfg when provided. Added new test_cuda_graph_config_parsing unit test to verify CUDA graph configuration propagation across default, enabled-only, and enabled-with-scope scenarios.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes ⚠️ Warning: The PR introduces a major CUDA Graph configuration feature but lacks documented test execution results; the PR states test execution was not fully confirmed. Resolution: document test execution results for test_cuda_graph_config_parsing and the functional tests, and include performance benchmarks demonstrating CUDA graph benefits without regressions.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title accurately summarizes the main change: adding CUDA Graph configuration support to MegatronPolicyWorker, which matches the core functionality introduced across both modified files.
  • Description check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)

632-639: Consider more explicit conditional logic for use_te_rng_tracker.

The current code only sets model_cfg.use_te_rng_tracker = True when CUDA graphs are enabled (line 638), but doesn't explicitly set it to False when disabled. While this may be correct if the default is False, the logic could be clearer:

     if model_cfg.enable_cuda_graph:
         model_cfg.use_te_rng_tracker = True
+    else:
+        model_cfg.use_te_rng_tracker = False

Or more concisely:

model_cfg.use_te_rng_tracker = model_cfg.enable_cuda_graph

Additionally, there's no validation that cuda_graph_scope (lines 635-636) is only set when enable_cuda_graph=True. While this may be acceptable, it could lead to confusion if someone configures a scope without enabling CUDA graphs.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a39bd6 and 13751df.

📒 Files selected for processing (2)
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
  • tests/unit/models/policy/test_megatron_worker.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • tests/unit/models/policy/test_megatron_worker.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/unit/models/policy/test_megatron_worker.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/unit/models/policy/test_megatron_worker.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/models/policy/workers/megatron_policy_worker.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
nemo_rl/models/policy/workers/megatron_policy_worker.py (2)

717-719: Verify consistency between model_cfg.use_te_rng_tracker and rng_config.te_rng_tracker.

The rng_config.te_rng_tracker is set based on enable_cuda_graph config (defaulting to False if not present), while model_cfg.use_te_rng_tracker at line 638 is only set to True conditionally. Ensure these two settings remain consistent:

  • When enable_cuda_graph=True: Both should be True
  • When enable_cuda_graph=False: rng_config.te_rng_tracker=False, but model_cfg.use_te_rng_tracker is not explicitly set
  • When key absent: rng_config.te_rng_tracker=False, but model_cfg.use_te_rng_tracker is not explicitly set

This should work correctly if model_cfg.use_te_rng_tracker defaults to False, but please verify this assumption holds in the Megatron Bridge configuration.
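
One way to keep the two settings consistent is to derive both from a single boolean, in the spirit of the suggestion above. A hedged sketch (`derive_rng_settings` is a hypothetical helper; the field names come from this review thread):

```python
def derive_rng_settings(megatron_cfg: dict) -> tuple[bool, bool]:
    # Both use_te_rng_tracker and te_rng_tracker are driven by the same
    # flag, so they cannot drift apart regardless of defaults upstream.
    cuda_graphs_on = megatron_cfg.get("enable_cuda_graph", False)
    return cuda_graphs_on, cuda_graphs_on  # (use_te_rng_tracker, te_rng_tracker)
```
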


632-639: Verify documentation and example YAML updates per coding guidelines.

Based on the coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/.yaml"*

Please confirm that:

  1. The new config keys (enable_cuda_graph and cuda_graph_scope) are documented with their purpose, valid values, and defaults
  2. Example YAMLs under examples/configs/ have been updated to reflect these new options
  3. If these keys are added to a TypedDict for PolicyConfig, they use typing.NotRequired to mark them as optional

As per coding guidelines.

tests/unit/models/policy/test_megatron_worker.py (3)

68-69: LGTM! Parameter declarations are well-typed.

The new optional parameters are properly declared with appropriate types and defaults.


137-138: Correct handling of optional config parameters.

The conditional dictionary unpacking correctly uses is not None checks, which ensures:

  • enable_cuda_graph=False is added to the config (correct, since False is not None)
  • enable_cuda_graph=None omits the key (correct)
  • cuda_graph_scope="value" adds the value (correct)
  • cuda_graph_scope=None omits the key (correct)
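
The unpacking pattern described above can be sketched like this. Hedged: `build_megatron_cfg` is a hypothetical stand-in for the test helper, not its actual implementation.

```python
from typing import Optional


def build_megatron_cfg(
    enable_cuda_graph: Optional[bool] = None,
    cuda_graph_scope: Optional[str] = None,
) -> dict:
    # Keys are merged only when the argument is not None, so False is
    # still written into the config while None omits the key entirely.
    return {
        "enabled": True,
        **(
            {"enable_cuda_graph": enable_cuda_graph}
            if enable_cuda_graph is not None
            else {}
        ),
        **(
            {"cuda_graph_scope": cuda_graph_scope}
            if cuda_graph_scope is not None
            else {}
        ),
    }
```
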

2593-2624: Good test coverage for CUDA graph config parsing.

The test validates the four main scenarios:

  1. Default config (keys absent)
  2. CUDA graph enabled (key present, scope absent)
  3. CUDA graph enabled with scope (both present)
  4. CUDA graph disabled (key present with False value)

This provides solid coverage for the config parsing logic. The test correctly verifies that optional config keys are only included when explicitly provided.

@sahgerlad sahgerlad force-pushed the feat/cuda-graph-support branch 2 times, most recently from 304fb1b to 240f6cf, January 7, 2026 20:05
@terrykong
Collaborator

@shanmugamr1992 to review

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad force-pushed the feat/cuda-graph-support branch from 240f6cf to bbdeb7e, February 4, 2026 01:40
@chtruong814 chtruong814 added the needs-follow-up label Mar 21, 2026
Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented Mar 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sahgerlad
Contributor Author

@shanmugamr1992 Wanted to follow up on this PR. It can provide a performance improvement, and it would be great to get it reviewed and merged.

@chtruong814 chtruong814 removed the needs-follow-up label Mar 21, 2026
sahgerlad added a commit to sahgerlad/RL that referenced this pull request Mar 21, 2026
…d, CUDA graph) onto v0.5.0

Apply Megatron policy changes from setup.py (main) into megatron_policy_worker.py
for the v0.5.0 layout: attention_backend, enable_cuda_graph/cuda_graph_scope,
RNGConfig for TE RNG tracker, tests and MegatronConfig typing.

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
Made-with: Cursor
sahgerlad added a commit to sahgerlad/RL that referenced this pull request Mar 21, 2026
…d, CUDA graph) onto v0.5.0

Apply Megatron policy changes from setup.py (main) into megatron_policy_worker.py
for the v0.5.0 layout: attention_backend, enable_cuda_graph/cuda_graph_scope,
RNGConfig for TE RNG tracker, tests and MegatronConfig typing.

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@chtruong814 chtruong814 added the needs-follow-up label Mar 23, 2026
Collaborator

@terrykong terrykong left a comment


@guyueh1 @ananthsub can you review?

Comment thread nemo_rl/models/megatron/setup.py
@terrykong terrykong requested review from ananthsub and guyueh1 March 23, 2026 17:48
@chtruong814 chtruong814 removed the needs-follow-up label Mar 23, 2026
…id test scope

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad force-pushed the feat/cuda-graph-support branch from 8eff9fc to 6e1d8e0, March 23, 2026 21:27
@chtruong814 chtruong814 added the needs-follow-up label Mar 26, 2026
Contributor

@guyueh1 guyueh1 left a comment


LGTM; not necessarily have to be noted in PR, but I wonder what recipes can use it. IIUC it can only be used in non-colocated, and the full_iteration only works for dense model?

@guyueh1
Contributor

guyueh1 commented Mar 31, 2026

This is a simple interface addition; it LGTM. @seonjinn, do you want to take a look as well?

@chtruong814 chtruong814 removed the needs-follow-up label Mar 31, 2026
@sahgerlad
Contributor Author

sahgerlad commented Apr 1, 2026

LGTM; not necessarily have to be noted in PR, but I wonder what recipes can use it. IIUC it can only be used in non-colocated, and the full_iteration only works for dense model?

I haven't played with various setups to optimize perf gains, but you are likely correct (non-colocated; full_iteration for dense only). When I tested this a few months ago, I saw a small perf improvement with a MoE-type model on Blackwell, but I was memory-bandwidth bound. It would be good to keep exploring the possible gains.

@chtruong814 chtruong814 added the needs-follow-up label Apr 1, 2026
Collaborator

@terrykong terrykong left a comment


Review Summary

⚠️ Merge conflicts detected — this PR has conflicts with main. Please rebase and resolve before further review.

This PR adds CUDA Graph configuration support — a valuable feature. However, there are 2 blocking issues and 6 suggestions that should be addressed before merge.

Blocking

  1. TypedDict fields must be NotRequired: enable_cuda_graph and cuda_graph_scope are declared Required, breaking ~160 config validation tests
  2. enable_cuda_graph is deprecated in Megatron-LM, replaced by cuda_graph_impl (emits DeprecationWarning, risks an assertion crash)

Performance Evidence

Since this feature affects training performance, it would be helpful to include some benchmark numbers in the PR description (e.g. tokens/sec or step time with and without CUDA graphs, on a representative config). This helps reviewers and future users understand the expected benefit and any trade-offs. The .coderabbit.yaml pre-merge check also flags this.

Additional: Example YAMLs

Per docs/design-docs/design-and-philosophy.md, exemplar YAMLs under examples/configs/ should document new config keys. No example YAML includes enable_cuda_graph or cuda_graph_scope. Consider adding to representative Megatron configs (e.g. grpo_math_1B_megatron.yaml) with commented defaults.

Devil's Advocate

3 findings confirmed | 3 downgraded | 1 disputed | PR value: confirmed (real feature, not trivial)

Generated by Claude Code

Comment on lines +231 to +232
enable_cuda_graph: bool
cuda_graph_scope: CudaGraphScope
Collaborator

@terrykong terrykong Apr 18, 2026


__init__.py:231-232

[BUG — blocking] These fields are declared as Required but setup.py:486 treats them as optional (if "enable_cuda_graph" in config[...]).

This will break test_all_config_files_have_required_keys for all ~160 example YAMLs that don't include these keys — that test uses Pydantic TypeAdapter which enforces Required vs NotRequired on TypedDicts.

Suggested change
-enable_cuda_graph: bool
-cuda_graph_scope: CudaGraphScope
+enable_cuda_graph: NotRequired[bool]
+cuda_graph_scope: NotRequired[CudaGraphScope]


# CUDA Graph configuration
if "enable_cuda_graph" in config["megatron_cfg"]:
    model_cfg.enable_cuda_graph = config["megatron_cfg"]["enable_cuda_graph"]
Collaborator

@terrykong terrykong Apr 18, 2026


setup.py:486-487

[BUG — blocking] TransformerConfig.enable_cuda_graph is explicitly deprecated in Megatron-LM, replaced by cuda_graph_impl. At runtime, __post_init__ emits a DeprecationWarning and asserts cuda_graph_impl == "none" before migrating.

Megatron-Bridge already uses the non-deprecated API — see _set_cuda_graph_overrides:

# How Megatron-Bridge does it (overrides.py:108-113):
if cuda_graph_impl is not None:
    recipe.model.cuda_graph_impl = cuda_graph_impl
    if cuda_graph_impl != "none":
        recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = True
    else:
        recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = False

Recommend following this same pattern: replace enable_cuda_graph: bool with cuda_graph_impl: str (values: "none", "local", "transformer_engine") and set model_cfg.cuda_graph_impl directly.

cluster.shutdown()


def test_cuda_graph_config_parsing():
Collaborator

@terrykong terrykong Apr 18, 2026


test_megatron_worker.py:2695-2726

[TEST — nit] The test verifies the config helper builds the dict correctly, which is useful. One thing to consider: it might also be worth adding a case for an invalid cuda_graph_scope value to make sure the ValueError at setup.py:492 is exercised — but that's optional and could come in a follow-up.

f"Valid options are: {valid_scopes}"
)
model_cfg.cuda_graph_scope = scope
if not model_cfg.enable_cuda_graph:
Collaborator

@terrykong terrykong Apr 18, 2026


setup.py:496-501

[BUG] Warning says "cuda_graph_scope will have no effect" but line 496 already assigned model_cfg.cuda_graph_scope = scope. If cuda_graph_impl is non-"none" from another source (Megatron supports cuda_graph_impl: "local" | "transformer_engine"), the scope WILL take effect despite the warning. Either skip the assignment when disabled, or correct the warning.
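
A minimal sketch of the first option suggested above: skip the assignment when CUDA graphs are disabled, so the warning and the behavior agree. Hedged: `apply_scope` is a hypothetical helper, not the code in setup.py.

```python
import warnings
from types import SimpleNamespace


def apply_scope(model_cfg, scope, cuda_graphs_enabled: bool) -> None:
    # Only assign cuda_graph_scope when CUDA graphs are enabled, so the
    # "will have no effect" warning is accurate.
    if scope is None:
        return
    if not cuda_graphs_enabled:
        warnings.warn(
            "cuda_graph_scope is set but CUDA graphs are disabled; ignoring it."
        )
        return  # skip the assignment instead of writing a dead value
    model_cfg.cuda_graph_scope = scope
```
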

from nemo_rl.models.generation.interfaces import GenerationConfig

CudaGraphScope = Literal[
    "full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"
Collaborator

@terrykong terrykong Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__init__.py:19-21

[BUG] This Literal is brittle — it's already missing full_iteration_inference and can't express comma-separated compound scopes like "attn,mlp". It will go out of sync whenever Megatron-LM updates the enum.

Recommendation: Drop the CudaGraphScope type alias. Declare both fields as str in the TypedDict with inline comments pointing to the upstream source of truth. Following the Megatron-Bridge pattern, the fields should be:

    # CUDA graph implementation. Valid values: "none", "local", "transformer_engine".
    # See: https://github.com/NVIDIA/Megatron-LM/blob/23dd639cf3de30f3b9d8d0fae71ee31180be9ddd/megatron/core/transformer/transformer_config.py#L791-L797
    cuda_graph_impl: NotRequired[str]
    # CUDA graph capture scope. Supports single values (e.g. "attn", "full_iteration")
    # or comma-separated compounds (e.g. "attn,mlp"). Scope validation depends on impl.
    # See: https://github.com/NVIDIA/Megatron-LM/blob/23dd639cf3de30f3b9d8d0fae71ee31180be9ddd/megatron/core/transformer/enums.py#L70-L80
    cuda_graph_scope: NotRequired[str]

The get_args(CudaGraphScope) validation at setup.py:490 can then be removed — Megatron's own __post_init__ already validates and normalizes scopes.

More broadly: new config fields that proxy upstream Megatron args should have inline comments referencing the upstream docs for full semantics and valid values (similar to defer_fp32_logits and use_linear_ce_fusion_loss).

)
if model_cfg.enable_cuda_graph:
    model_cfg.use_te_rng_tracker = True
else:
Collaborator

@terrykong terrykong Apr 18, 2026


setup.py:502-505

[BUG] Setting enable_cuda_graph=False relies on the deprecated flag to disable CUDA graphs. In Megatron-LM, the actual runtime check is cuda_graph_impl != "none" — this gates all CUDA graph setup, scope validation, and layer wrapping. The deprecated enable_cuda_graph is only consulted in __post_init__ to migrate to cuda_graph_impl="local".

Currently this works because nemo-rl builds model_cfg from scratch (so cuda_graph_impl starts at its default "none"). But it's fragile — if cuda_graph_impl is ever added to the nemo-rl config surface or set via a checkpoint, setting enable_cuda_graph=False would not override it. This is another reason to migrate to cuda_graph_impl directly.

@chtruong814 chtruong814 added the waiting-on-customer and needs-follow-up labels and removed the needs-follow-up and waiting-on-customer labels Apr 18, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers label and removed the needs-follow-up label Apr 21, 2026

Labels

community-request, waiting-on-maintainers
