feat: Add CUDA Graph configuration support to MegatronPolicyWorker #1736
sahgerlad wants to merge 5 commits into NVIDIA-NeMo:main
Conversation
📝 Walkthrough
Added CUDA Graph and RNG configuration support to `MegatronPolicyWorker` by importing `RNGConfig` and conditionally propagating `enable_cuda_graph`, `cuda_graph_scope`, and RNG settings into the Megatron `ConfigContainer`. Extended the test configuration builder and added test coverage for CUDA graph configuration parsing.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)
632-639: Consider more explicit conditional logic for `use_te_rng_tracker`.

The current code only sets `model_cfg.use_te_rng_tracker = True` when CUDA graphs are enabled (line 638), but doesn't explicitly set it to `False` when disabled. While this may be correct if the default is `False`, the logic could be clearer:

```diff
 if model_cfg.enable_cuda_graph:
     model_cfg.use_te_rng_tracker = True
+else:
+    model_cfg.use_te_rng_tracker = False
```

Or more concisely:

```python
model_cfg.use_te_rng_tracker = model_cfg.enable_cuda_graph
```

Additionally, there's no validation that `cuda_graph_scope` (lines 635-636) is only set when `enable_cuda_graph=True`. While this may be acceptable, it could lead to confusion if someone configures a scope without enabling CUDA graphs.
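To make the intent of this nitpick concrete, here is a minimal, self-contained sketch of stricter parsing; only the key names come from the PR, while the helper name and the plain-dict config surface are hypothetical:

```python
def cuda_graph_overrides(megatron_cfg: dict) -> dict:
    """Return model-config overrides implied by the CUDA graph keys (sketch)."""
    enabled = megatron_cfg.get("enable_cuda_graph", False)
    # Mirror the enable flag explicitly instead of only setting True conditionally.
    overrides = {"enable_cuda_graph": enabled, "use_te_rng_tracker": enabled}
    if "cuda_graph_scope" in megatron_cfg:
        if not enabled:
            # Fail loudly rather than silently accept a scope that has no effect.
            raise ValueError("cuda_graph_scope is set but enable_cuda_graph is False")
        overrides["cuda_graph_scope"] = megatron_cfg["cuda_graph_scope"]
    return overrides
```

This keeps both the RNG-tracker mirroring and the scope/enable consistency check in one place.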
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_rl/models/policy/workers/megatron_policy_worker.py
tests/unit/models/policy/test_megatron_worker.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
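The try-except guidance above (narrow exception set, minimal try body, logic in the else block) can be sketched as follows; the config-parsing example is purely illustrative and not from the codebase:

```python
def read_port(cfg: dict) -> int:
    """Read a required 'port' key from a config dict (illustrative sketch)."""
    try:
        raw = cfg["port"]  # keep the try body to just the call that can raise
    except KeyError:       # catch the narrowest applicable exception
        raise ValueError("missing required config key: port")
    else:
        # Follow-up logic lives in else, outside the guarded lookup.
        return int(raw)
```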
Files:
tests/unit/models/policy/test_megatron_worker.py
nemo_rl/models/policy/workers/megatron_policy_worker.py
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
tests/unit/models/policy/test_megatron_worker.py
nemo_rl/models/policy/workers/megatron_policy_worker.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
tests/unit/models/policy/test_megatron_worker.py
nemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/**/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
nemo_rl/models/policy/workers/megatron_policy_worker.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
nemo_rl/models/policy/workers/megatron_policy_worker.py (2)
717-719: Verify consistency between `model_cfg.use_te_rng_tracker` and `rng_config.te_rng_tracker`.

`rng_config.te_rng_tracker` is set based on the `enable_cuda_graph` config (defaulting to `False` if not present), while `model_cfg.use_te_rng_tracker` at line 638 is only set to `True` conditionally. Ensure these two settings remain consistent:

- When `enable_cuda_graph=True`: both should be `True` ✓
- When `enable_cuda_graph=False`: `rng_config.te_rng_tracker=False`, but `model_cfg.use_te_rng_tracker` is not explicitly set
- When the key is absent: `rng_config.te_rng_tracker=False`, but `model_cfg.use_te_rng_tracker` is not explicitly set

This should work correctly if `model_cfg.use_te_rng_tracker` defaults to `False`, but please verify this assumption holds in the Megatron Bridge configuration.
632-639: Verify documentation and example YAML updates per coding guidelines.

Per the coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml"

Please confirm that:

- The new config keys (`enable_cuda_graph` and `cuda_graph_scope`) are documented with their purpose, valid values, and defaults
- Example YAMLs under `examples/configs/` have been updated to reflect these new options
- If these keys are added to a TypedDict for `PolicyConfig`, they use `typing.NotRequired` to mark them as optional
tests/unit/models/policy/test_megatron_worker.py (3)
68-69: LGTM! Parameter declarations are well-typed.The new optional parameters are properly declared with appropriate types and defaults.
137-138: Correct handling of optional config parameters.

The conditional dictionary unpacking correctly uses `is not None` checks, which ensures:

- `enable_cuda_graph=False` is added to the config (correct, since `False is not None`)
- `enable_cuda_graph=None` omits the key (correct)
- `cuda_graph_scope="value"` adds the value (correct)
- `cuda_graph_scope=None` omits the key (correct)
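The `is not None` unpacking pattern described above can be reduced to a small sketch; the helper name and base config key here are illustrative, not the actual test builder:

```python
def build_megatron_cfg(enable_cuda_graph=None, cuda_graph_scope=None) -> dict:
    """Build a config dict, including optional keys only when explicitly passed."""
    return {
        "tensor_model_parallel_size": 1,  # illustrative base setting
        # `False` is a real value and must survive, so the check is `is not None`,
        # not truthiness; `None` means "caller did not set this key at all".
        **({"enable_cuda_graph": enable_cuda_graph} if enable_cuda_graph is not None else {}),
        **({"cuda_graph_scope": cuda_graph_scope} if cuda_graph_scope is not None else {}),
    }
```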
2593-2624: Good test coverage for CUDA graph config parsing.The test validates the four main scenarios:
- Default config (keys absent)
- CUDA graph enabled (key present, scope absent)
- CUDA graph enabled with scope (both present)
- CUDA graph disabled (key present with False value)
This provides solid coverage for the config parsing logic. The test correctly verifies that optional config keys are only included when explicitly provided.
sahgerlad force-pushed from 304fb1b to 240f6cf
@shanmugamr1992 to review
Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
sahgerlad force-pushed from 240f6cf to bbdeb7e
Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com>
@shanmugamr1992 Wanted to follow up on this PR. It can provide a performance improvement, and it would be great to get it reviewed/merged.
…d, CUDA graph) onto v0.5.0

Apply Megatron policy changes from setup.py (main) into megatron_policy_worker.py for the v0.5.0 layout: attention_backend, enable_cuda_graph/cuda_graph_scope, RNGConfig for TE RNG tracker, tests and MegatronConfig typing.

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
Made-with: Cursor
terrykong left a comment
@guyueh1 @ananthsub can you review?
…id test scope

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
sahgerlad force-pushed from 8eff9fc to 6e1d8e0
guyueh1 left a comment
LGTM; this doesn't necessarily have to be noted in the PR, but I wonder which recipes can use it. IIUC it can only be used in non-colocated setups, and full_iteration only works for dense models?
This is a simple interface addition and it LGTM. @seonjinn do you want to take a look as well?
I haven't played with various setups to optimize perf gains, but you are likely correct (non-colocated;
Review Summary
⚠️ Merge conflicts detected — this PR has conflicts with `main`. Please rebase and resolve before further review.
This PR adds CUDA Graph configuration support — a valuable feature. However, there are 2 blocking issues and 6 suggestions that should be addressed before merge.
Blocking

- TypedDict fields must be `NotRequired` — `enable_cuda_graph` and `cuda_graph_scope` are declared Required, breaking ~160 config validation tests
- `enable_cuda_graph` is deprecated in Megatron-LM — replaced by `cuda_graph_impl` (emits a DeprecationWarning and risks an assertion crash)
Performance Evidence
Since this feature affects training performance, it would be helpful to include some benchmark numbers in the PR description (e.g. tokens/sec or step time with and without CUDA graphs, on a representative config). This helps reviewers and future users understand the expected benefit and any trade-offs. The .coderabbit.yaml pre-merge check also flags this.
Additional: Example YAMLs
Per docs/design-docs/design-and-philosophy.md, exemplar YAMLs under examples/configs/ should document new config keys. No example YAML includes enable_cuda_graph or cuda_graph_scope. Consider adding to representative Megatron configs (e.g. grpo_math_1B_megatron.yaml) with commented defaults.
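A commented fragment along those lines might look like the following; the key names come from this PR, but the defaults and the exact placement inside the example YAML are assumptions, not verified against the repository:

```yaml
policy:
  megatron_cfg:
    # Optional: capture CUDA graphs to reduce per-step launch overhead.
    # enable_cuda_graph: false
    # Capture scope when CUDA graphs are enabled, e.g. "attn", "mlp",
    # or "full_iteration" (dense models, non-colocated generation).
    # cuda_graph_scope: "attn"
```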
Devil's Advocate
3 findings confirmed | 3 downgraded | 1 disputed | PR value: confirmed (real feature, not trivial)
Generated by Claude Code
> enable_cuda_graph: bool
> cuda_graph_scope: CudaGraphScope

[BUG — blocking] These fields are declared as Required, but setup.py:486 treats them as optional (`if "enable_cuda_graph" in config[...]`).

This will break `test_all_config_files_have_required_keys` for all ~160 example YAMLs that don't include these keys — that test uses Pydantic `TypeAdapter`, which enforces Required vs `NotRequired` on TypedDicts.

```diff
-    enable_cuda_graph: bool
-    cuda_graph_scope: CudaGraphScope
+    enable_cuda_graph: NotRequired[bool]
+    cuda_graph_scope: NotRequired[CudaGraphScope]
```
> # CUDA Graph configuration
> if "enable_cuda_graph" in config["megatron_cfg"]:
>     model_cfg.enable_cuda_graph = config["megatron_cfg"]["enable_cuda_graph"]
[BUG — blocking] TransformerConfig.enable_cuda_graph is explicitly deprecated in Megatron-LM, replaced by cuda_graph_impl. At runtime, __post_init__ emits a DeprecationWarning and asserts cuda_graph_impl == "none" before migrating.
Megatron-Bridge already uses the non-deprecated API — see _set_cuda_graph_overrides:
```python
# How Megatron-Bridge does it (overrides.py:108-113):
if cuda_graph_impl is not None:
    recipe.model.cuda_graph_impl = cuda_graph_impl
    if cuda_graph_impl != "none":
        recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = True
    else:
        recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = False
```

Recommend following this same pattern: replace `enable_cuda_graph: bool` with `cuda_graph_impl: str` (values: `"none"`, `"local"`, `"transformer_engine"`) and set `model_cfg.cuda_graph_impl` directly.
> cluster.shutdown()
>
>
> def test_cuda_graph_config_parsing():

test_megatron_worker.py:2695-2726

[TEST — nit] The test verifies the config helper builds the dict correctly, which is useful. One thing to consider: it might also be worth adding a case for an invalid `cuda_graph_scope` value to make sure the `ValueError` at `setup.py:492` is exercised — but that's optional and could come in a follow-up.
>     f"Valid options are: {valid_scopes}"
> )
> model_cfg.cuda_graph_scope = scope
> if not model_cfg.enable_cuda_graph:

[BUG] The warning says "cuda_graph_scope will have no effect", but line 496 already assigned `model_cfg.cuda_graph_scope = scope`. If `cuda_graph_impl` is non-`"none"` from another source (Megatron supports `cuda_graph_impl: "local" | "transformer_engine"`), the scope WILL take effect despite the warning. Either skip the assignment when disabled, or correct the warning.
> from nemo_rl.models.generation.interfaces import GenerationConfig
>
>
> CudaGraphScope = Literal[
>     "full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"

[BUG] This Literal is brittle — it's already missing `full_iteration_inference` and can't express comma-separated compound scopes like `"attn,mlp"`. It will go out of sync whenever Megatron-LM updates the enum.

Recommendation: Drop the `CudaGraphScope` type alias. Declare both fields as `str` in the TypedDict with inline comments pointing to the upstream source of truth. Following the Megatron-Bridge pattern, the fields should be:

```python
# CUDA graph implementation. Valid values: "none", "local", "transformer_engine".
# See: https://github.com/NVIDIA/Megatron-LM/blob/23dd639cf3de30f3b9d8d0fae71ee31180be9ddd/megatron/core/transformer/transformer_config.py#L791-L797
cuda_graph_impl: NotRequired[str]
# CUDA graph capture scope. Supports single values (e.g. "attn", "full_iteration")
# or comma-separated compounds (e.g. "attn,mlp"). Scope validation depends on impl.
# See: https://github.com/NVIDIA/Megatron-LM/blob/23dd639cf3de30f3b9d8d0fae71ee31180be9ddd/megatron/core/transformer/enums.py#L70-L80
cuda_graph_scope: NotRequired[str]
```

The `get_args(CudaGraphScope)` validation at setup.py:490 can then be removed — Megatron's own `__post_init__` already validates and normalizes scopes.
More broadly: new config fields that proxy upstream Megatron args should have inline comments referencing the upstream docs for full semantics and valid values (similar to `defer_fp32_logits` and `use_linear_ce_fusion_loss`).
> )
> if model_cfg.enable_cuda_graph:
>     model_cfg.use_te_rng_tracker = True
> else:
[BUG] Setting enable_cuda_graph=False relies on the deprecated flag to disable CUDA graphs. In Megatron-LM, the actual runtime check is cuda_graph_impl != "none" — this gates all CUDA graph setup, scope validation, and layer wrapping. The deprecated enable_cuda_graph is only consulted in __post_init__ to migrate to cuda_graph_impl="local".
Currently this works because nemo-rl builds model_cfg from scratch (so cuda_graph_impl starts at its default "none"). But it's fragile — if cuda_graph_impl is ever added to the nemo-rl config surface or set via a checkpoint, setting enable_cuda_graph=False would not override it. This is another reason to migrate to cuda_graph_impl directly.
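A migration along the lines this comment and the Megatron-Bridge snippet suggest could be sketched as follows; the helper name and dict-based config surface are hypothetical, and only the `cuda_graph_impl` semantics come from the review discussion:

```python
def apply_cuda_graph_impl(megatron_cfg: dict) -> dict:
    """Map a `cuda_graph_impl` config key onto model-config overrides (sketch)."""
    impl = megatron_cfg.get("cuda_graph_impl", "none")
    # Gate on `impl != "none"`, matching the runtime check the comment describes,
    # rather than the deprecated `enable_cuda_graph` boolean.
    graphs_on = impl != "none"
    overrides = {"cuda_graph_impl": impl, "use_te_rng_tracker": graphs_on}
    # Only forward a scope when graphs are actually enabled.
    if graphs_on and "cuda_graph_scope" in megatron_cfg:
        overrides["cuda_graph_scope"] = megatron_cfg["cuda_graph_scope"]
    return overrides
```

With this shape, `cuda_graph_impl: "none"` (or an absent key) disables graphs regardless of any stale `enable_cuda_graph` value, addressing the fragility noted above.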
What does this PR do?
Add CUDA Graph configuration support to MegatronPolicyWorker
Issues
N/A - Feature addition
Usage
Before your PR is "Ready for review"
Pre checks:
Additional Information
N/A
Summary by CodeRabbit
New Features
Tests