Add Minitron pruning support for GatedDeltaNet, MLA, and latent MoE#1747
Add Minitron pruning support for GatedDeltaNet, MLA, and latent MoE#1747kevalmorabia97 wants to merge 4 commits into
Conversation
Support hidden_size pruning for Megatron-Core models with GatedDeltaNet (linear-attention, e.g. Qwen3.5 incl. MoE) and Multi-Latent Attention (MLA, e.g. DeepSeek) layers. Introduces a _DynamicAttention base whose policy decides whether only hidden_size is pruned or attention heads too, and extends it for gated attention, GatedDeltaNet and MLA. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 WalkthroughWalkthroughExtends Minitron pruning support to GatedDeltaNet, Multi-Latent Attention (MLA), and Latent MoE Megatron-Core models. New dynamic module classes and a shared ChangesExtended Minitron pruning for GatedDeltaNet, MLA, and Latent MoE
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
/claude review |
|
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py (1)
561-564: ⚡ Quick winAssert all pruned MoE config fields in the new GDN+MoE test.
This test prunes
moe_shared_expert_intermediate_size, but it does not assertmodel.config.moe_shared_expert_intermediate_size. Adding that check will catch config-propagation regressions.Suggested assertion
assert model.config.hidden_size == pruned_hidden assert model.config.moe_ffn_hidden_size == pruned_moe_ffn assert model.config.num_moe_experts == pruned_experts + assert model.config.moe_shared_expert_intermediate_size == pruned_moe_shared🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py` around lines 561 - 564, The test in the GDN+MoE section is missing an assertion for the moe_shared_expert_intermediate_size config field even though the test prunes this value. Add an assertion after the existing assertions (following the pattern of asserting model.config.moe_ffn_hidden_size and model.config.num_moe_experts) to verify that model.config.moe_shared_expert_intermediate_size matches the expected pruned value, which will help catch config-propagation regressions for this field.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/_test_utils/torch/megatron/models.py`:
- Around line 167-180: The multi_latent_attention and
experimental_attention_variant flags are currently able to both be set to True
simultaneously, which creates an ambiguous configuration. Add a fast-fail
assertion guard at the beginning of the code section where these flags are
interpreted (before the if multi_latent_attention block) to ensure that both
flags cannot be True at the same time. The assertion should validate that these
two attention mode flags are mutually exclusive and raise an informative error
if both are enabled.
---
Nitpick comments:
In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py`:
- Around line 561-564: The test in the GDN+MoE section is missing an assertion
for the moe_shared_expert_intermediate_size config field even though the test
prunes this value. Add an assertion after the existing assertions (following the
pattern of asserting model.config.moe_ffn_hidden_size and
model.config.num_moe_experts) to verify that
model.config.moe_shared_expert_intermediate_size matches the expected pruned
value, which will help catch config-propagation regressions for this field.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d2c81799-5b69-4444-83d4-5d3f81350a95
📒 Files selected for processing (6)
CHANGELOG.rstexamples/pruning/README.mdmodelopt/torch/nas/plugins/megatron.pymodelopt/torch/prune/plugins/mcore_minitron.pytests/_test_utils/torch/megatron/models.pytests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1747 +/- ##
===========================================
+ Coverage 58.45% 76.54% +18.08%
===========================================
Files 510 511 +1
Lines 56271 56399 +128
===========================================
+ Hits 32896 43172 +10276
+ Misses 23375 13227 -10148
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Support hidden_size pruning for latent MoE layers (config.moe_latent_size, e.g. Nemotron-3): the fc1/fc2 latent projections carry hidden_size while the routed experts stay in the static latent dim. Add a TODO outlining how to also prune moe_latent_size itself. Refactor the minitron pruning tests to share scaffolding via _build_and_prune_variant (attention/MoE variants), _sort_and_capture (parameter sorting), _make_forward_loop, and _assert_reprune_matches (checkpoint re-prune). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
|
/claude review |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/nas/plugins/megatron.py (2)
680-683:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate the MLA docstring to match the implemented layernorm behavior.
The docstring says the separate MLA input layernorm stays static, but
_DynamicTransformerLayer._setupconvertsinput_layernormwithnum_features=hidden_sizeand downstream MLA tests assert it is pruned to the active hidden size. Please removeinput layernormfrom the “stay static” list.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/nas/plugins/megatron.py` around lines 680 - 683, The docstring for the MLA layer class incorrectly states that the input layernorm stays static, but the actual implementation in _DynamicTransformerLayer._setup converts and prunes input_layernorm based on hidden_size, and tests verify this behavior. Remove the reference to input layernorm from the "stay static" list in the docstring to accurately reflect that only Q/KV latent ranks and head dims remain static during dynamic pruning.
26-31:⚠️ Potential issue | 🟡 MinorGuard optional Megatron-Core API features or explicitly declare minimum version requirement.
This module imports
TELinear,GatedDeltaNet, andMLASelfAttentionat the top level and builds_ATTENTION_TYPESfrom them. If the supported Megatron-Core minimum version does not include all three symbols, importing this plugin fails before users opt into the new variants. Additionally, line 607 readsself.config.attention_output_gatedirectly without a fallback.Either:
- Wrap these imports and
_ATTENTION_TYPESin a try-except or conditional check, defaulting missing symbols gracefully.- Explicitly bump the Megatron-Core minimum version in
pyproject.tomlto guarantee these features are available.- Use
getattr(self.config, "attention_output_gate", False)on line 607 to handle older configs.The test utilities (e.g.,
tests/_test_utils/torch/megatron/models.py:253–261) show a pattern for lazy-importing experimental features when they may be absent in older builds; the same principle applies here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/nas/plugins/megatron.py` around lines 26 - 31, The imports of TELinear, GatedDeltaNet, and MLASelfAttention at the top level may fail if the Megatron-Core version does not include these symbols, causing the entire plugin to fail before users can opt-in to new variants. Wrap these three imports in a try-except block and set them to None or a fallback value if they are not available. Then conditionally build the _ATTENTION_TYPES dictionary to only include the types that were successfully imported. Additionally, at line 607 where self.config.attention_output_gate is accessed directly, replace it with getattr(self.config, "attention_output_gate", False) to handle configurations from older Megatron-Core versions that may not have this attribute.
🧹 Nitpick comments (1)
tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py (1)
652-653: ⚡ Quick winTighten latent-MoE config-contract assertions in the new variant test.
This test validates latent MoE weight shapes, but it currently only asserts
model.config.hidden_size. To guard checkpoint/export regressions for this new path, assertmodel.config.moe_ffn_hidden_sizeandmodel.config.num_moe_expertstoo (as done in other pruning tests here).Suggested additions
assert model.config.hidden_size == pruned_hidden + assert model.config.moe_ffn_hidden_size == pruned_moe_ffn + assert model.config.num_moe_experts == pruned_experts🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py` around lines 652 - 653, The test currently only asserts model.config.hidden_size but does not validate the latent MoE configuration contract fully. Add two additional assertions after the existing assert model.config.hidden_size == pruned_hidden line to also assert model.config.moe_ffn_hidden_size and model.config.num_moe_experts, matching the pattern used in other pruning tests in this file to ensure comprehensive validation of MoE-specific config values and guard against checkpoint/export regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/nas/plugins/megatron.py`:
- Around line 755-776: The assertion in _DynamicMLP._setup currently requires
each expert's input_size to equal self.config.hidden_size, but for latent MoE
configurations, experts run on moe_latent_size instead. Locate the assertion in
_DynamicMLP._setup that validates input_size against hidden_size and modify it
to also accept moe_latent_size when that configuration is present. The assertion
should allow input_size to match either self.config.hidden_size or
self.config.moe_latent_size (if configured) so that expert conversion succeeds
before pruning/export when using latent MoE.
In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py`:
- Around line 70-80: The registry.cleanup() call in this function only executes
if all prior operations succeed. If run_mcore_inference,
run_mcore_inference_with_dummy_input, or mtn.utils.sort_parameters raise an
exception, the cleanup will be skipped and hooks will leak into subsequent
tests. Wrap the code from the ImportanceEstimatorRegistry instantiation through
mtn.utils.sort_parameters in a try block, and move registry.cleanup() into a
finally block to ensure cleanup always runs regardless of exceptions during
inference or sorting operations.
---
Outside diff comments:
In `@modelopt/torch/nas/plugins/megatron.py`:
- Around line 680-683: The docstring for the MLA layer class incorrectly states
that the input layernorm stays static, but the actual implementation in
_DynamicTransformerLayer._setup converts and prunes input_layernorm based on
hidden_size, and tests verify this behavior. Remove the reference to input
layernorm from the "stay static" list in the docstring to accurately reflect
that only Q/KV latent ranks and head dims remain static during dynamic pruning.
- Around line 26-31: The imports of TELinear, GatedDeltaNet, and
MLASelfAttention at the top level may fail if the Megatron-Core version does not
include these symbols, causing the entire plugin to fail before users can opt-in
to new variants. Wrap these three imports in a try-except block and set them to
None or a fallback value if they are not available. Then conditionally build the
_ATTENTION_TYPES dictionary to only include the types that were successfully
imported. Additionally, at line 607 where self.config.attention_output_gate is
accessed directly, replace it with getattr(self.config, "attention_output_gate",
False) to handle configurations from older Megatron-Core versions that may not
have this attribute.
---
Nitpick comments:
In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py`:
- Around line 652-653: The test currently only asserts model.config.hidden_size
but does not validate the latent MoE configuration contract fully. Add two
additional assertions after the existing assert model.config.hidden_size ==
pruned_hidden line to also assert model.config.moe_ffn_hidden_size and
model.config.num_moe_experts, matching the pattern used in other pruning tests
in this file to ensure comprehensive validation of MoE-specific config values
and guard against checkpoint/export regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 24971815-a307-4a76-ae56-41c0ab8f97a4
📒 Files selected for processing (4)
CHANGELOG.rstmodelopt/torch/nas/plugins/megatron.pytests/_test_utils/torch/megatron/models.pytests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/_test_utils/torch/megatron/models.py
|
/claude review |
Wrap the activation/sort body of _sort_and_capture in try/finally so the ImportanceEstimatorRegistry hooks and patched TE return_layernorm_output state are always cleaned up, preventing leakage into subsequent tests on failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
There was a problem hiding this comment.
Claude review summary
Findings: 0 CRITICAL, 2 IMPORTANT, 3 SUGGESTION.
IMPORTANT — Compatibility with older Megatron-Core. Two places will break for users on a slightly older mcore who have not opted into the new variants:
-
modelopt/torch/nas/plugins/megatron.py:26-51 — TELinear, GatedDeltaNet, and MLASelfAttention are imported unconditionally. import_plugin swallows the resulting ImportError but logs a warning and silently disables the entire mcore NAS/pruning plugin, so existing users of dense GPT or Mamba pruning lose their workflow with no obvious error. Either bump the effective minimum mcore version (and document it) or wrap these three imports in their own try/except and conditionally compose _ATTENTION_TYPES / register decorators.
-
modelopt/torch/nas/plugins/megatron.py:607 — self.config.attention_output_gate raises AttributeError on TransformerConfig versions that predate the field. The codebase already uses getattr(..., "attention_output_gate", False) defensively in megatron_model_stats.py:304; please match that pattern. Trivial inline suggestion posted.
SUGGESTIONs (non-blocking):
- _DynamicMLASelfAttention docstring (megatron.py:680-683) still claims the input layernorm stays static, but _DynamicTransformerLayer._setup converts it and _test_mcore_mla_pruning asserts the LN weight is pruned to the active hidden size.
- _test_mcore_qwen35_gdn_moe_pruning does not assert model.config.moe_shared_expert_intermediate_size even though it is in export_config.
- _test_mcore_latent_moe_pruning only asserts model.config.hidden_size; please also assert moe_ffn_hidden_size and num_moe_experts to match the existing MoE pruning tests.
Risk assessment: Medium-low. The algorithm itself is sound: _DynamicAttention correctly factors out the hidden-size-only path; the MLA / GDN / latent-MoE projection wiring matches the documented pruning scope; the importance hook correctly dispatches between fused and non-fused input layernorms. The only real risk is the silent compatibility regression for existing dense-GPT users on an older Megatron-Core build — addressing finding 1 or 2 above (preferably both) clears that.
- _DynamicSelfAttention._prunes_attention_heads reads attention_output_gate via getattr(..., False) so dense GPT/MoE pruning works on older mcore builds that predate the field. - Fix _DynamicMLASelfAttention docstring: the separate input layernorm is pruned to hidden_size by _DynamicTransformerLayer (not static). - Assert moe_shared_expert_intermediate_size config in the GDN+MoE test and moe_ffn_hidden_size/num_moe_experts config in the latent-MoE test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
|
/claude review |
There was a problem hiding this comment.
Claude review (round 4) — author addressed the prior round's gate-field compat, MLA docstring and MoE-test config asserts. One IMPORTANT compatibility risk from the previous review is still unresolved.
Findings: 0 CRITICAL, 1 IMPORTANT, 0 SUGGESTION.
IMPORTANT — Compatibility (modelopt/torch/nas/plugins/megatron.py:26–51, :100):
TELinear, GatedDeltaNet, and MLASelfAttention are imported at module top level. The outer import_plugin("megatron") catches ModuleNotFoundError but only by silently disabling the entire mcore NAS/pruning plugin — so on an older mcore build that lacks any one of these symbols (e.g. predates gated_delta_net or multi_latent_attention), existing dense-GPT and Mamba pruning users lose their workflow with only a log warning. The rest of the file already uses try: ... HAS_MAMBA / HAS_HYBRID = True / except: ... = False (lines 70–97) for exactly this version-skew reason. Mirror that pattern here, conditionally compose _ATTENTION_TYPES, and guard the three new @DMRegistry.register decorators. Inline comment with a sketch posted on the import block. Alternatively, document and pin a minimum mcore version that guarantees all three symbols.
Verified resolved from prior round:
_DynamicSelfAttention._prunes_attention_headsnow usesgetattr(self.config, "attention_output_gate", False)— older mcore builds without the field no longer crash._DynamicMLASelfAttentiondocstring no longer claims input layernorm is static.- GDN+MoE test now asserts
model.config.moe_shared_expert_intermediate_size; latent-MoE test now assertsmoe_ffn_hidden_sizeandnum_moe_experts.
Risk assessment: Low-medium. Algorithmic correctness, mode/state composition, and export wiring look sound; the only blocker is the silent compatibility regression for existing users on a slightly older Megatron-Core. Addressing the conditional-import sketch (or pinning a minimum mcore version) clears it.
|
/claude review |
|
/claude review |
What does this PR do?
Type of change: New feature
Adds Minitron (
mcore_minitron) pruning support for Megatron-Core models with attention and MoE variants that were previously unsupported. For all of these, onlyhidden_sizeis pruned (alongside the usualffn_hidden_size/num_layers/ MoE dimensions); the variant-internal dimensions are kept:attention_output_gate) — e.g. Qwen3.5 (hybrid GatedDeltaNet + gated-attention), including MoE variants. Attention / linear-attention heads are not pruned.moe_latent_size) — e.g. Nemotron-3.hidden_sizepruning resizes thefc1/fc2latent projections while the experts stay in the (static) latent dim.Introduces a
_DynamicAttentionbase class whose policy decides whether onlyhidden_sizeis pruned or attention heads too —_DynamicSelfAttention,_DynamicGatedDeltaNetand_DynamicMLASelfAttentionbuild on it (extensible for future attention types). A TODO documents how to also prunemoe_latent_sizeas a follow-up.Note that the new
megatron.coreimports are not guarded because they are available in all containers we support (e.g.nemo:26.02onwards)Usage
Testing
New GPU tests in
tests/gpu_megatron/.../test_mcore_gpt_minitron_pruning.py—test_mcore_qwen35_gdn_moe_pruning,test_mcore_mla_pruning,test_mcore_latent_moe_pruning— run across available GPUs (covers pipeline parallel). Existing GPT/MoE prune + parameter-sorting tests still pass; the file's tests were refactored to share scaffolding (_build_and_prune_variant,_sort_and_capture,_make_forward_loop,_assert_reprune_matches).pre-commit(ruff, mypy) clean.Before your PR is "Ready for review"
CONTRIBUTING.md: N/A🤖 Generated with Claude Code
Summary by CodeRabbit
num_attention_headspruning.