Skip to content

Add Minitron pruning support for GatedDeltaNet, MLA, and latent MoE#1747

Open
kevalmorabia97 wants to merge 4 commits into
mainfrom
kmorabia/minitron-gdn-mla-attn-support
Open

Add Minitron pruning support for GatedDeltaNet, MLA, and latent MoE#1747
kevalmorabia97 wants to merge 4 commits into
mainfrom
kmorabia/minitron-gdn-mla-attn-support

Conversation

@kevalmorabia97

@kevalmorabia97 kevalmorabia97 commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Type of change: New feature

Adds Minitron (mcore_minitron) pruning support for Megatron-Core models with attention and MoE variants that were previously unsupported. For all of these, only hidden_size is pruned (alongside the usual ffn_hidden_size / num_layers / MoE dimensions); the variant-internal dimensions are kept:

  • GatedDeltaNet (linear attention) and gated attention (attention_output_gate) — e.g. Qwen3.5 (hybrid GatedDeltaNet + gated-attention), including MoE variants. Attention / linear-attention heads are not pruned.
  • Multi-Latent Attention (MLA) — e.g. DeepSeek. MLA latent ranks are not pruned.
  • Latent MoE (moe_latent_size) — e.g. Nemotron-3. hidden_size pruning resizes the fc1/fc2 latent projections while the experts stay in the (static) latent dim.

Introduces a _DynamicAttention base class whose policy decides whether only hidden_size is pruned or attention heads too — _DynamicSelfAttention, _DynamicGatedDeltaNet and _DynamicMLASelfAttention build on it (extensible for future attention types). A TODO documents how to also prune moe_latent_size as a follow-up.

Note that the new megatron.core imports are not guarded because they are available in all containers we support (e.g. nemo:26.02 onwards)

Usage

import modelopt.torch.prune as mtp

# Qwen3.5 (GatedDeltaNet), DeepSeek (MLA), or Nemotron-3 (latent MoE) GPTModel
model, _ = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints={"export_config": {"hidden_size": 2048, "ffn_hidden_size": 8192}},
    dummy_input=None,
    config={"forward_loop": forward_loop},
)

Testing

New GPU tests in tests/gpu_megatron/.../test_mcore_gpt_minitron_pruning.pytest_mcore_qwen35_gdn_moe_pruning, test_mcore_mla_pruning, test_mcore_latent_moe_pruning — run across available GPUs (covers pipeline parallel). Existing GPT/MoE prune + parameter-sorting tests still pass; the file's tests were refactored to share scaffolding (_build_and_prune_variant, _sort_and_capture, _make_forward_loop, _assert_reprune_matches). pre-commit (ruff, mypy) clean.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ✅
  • Did you get Claude approval on this PR?: ✅

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Expanded Minitron pruning for Megatron-Core to support GatedDeltaNet (including gated-attention MoE variants) and Multi-Latent Attention (MLA).
    • Added optional latent MoE pruning behavior: pruning targets hidden-size projections while keeping the MoE latent dimension unchanged.
  • Documentation
    • Updated the pruning support matrix to clarify attention types that don’t support num_attention_heads pruning.
  • Bug Fixes
    • Improved attention importance/activation capture so pruning hooks only run when attention-head pruning is actually enabled.
  • Tests
    • Added/expanded GPU coverage for gated-delta-net MoE, MLA, and latent MoE pruning, with shape/config and inference checks.

Support hidden_size pruning for Megatron-Core models with GatedDeltaNet
(linear-attention, e.g. Qwen3.5 incl. MoE) and Multi-Latent Attention
(MLA, e.g. DeepSeek) layers. Introduces a _DynamicAttention base whose
policy decides whether only hidden_size is pruned or attention heads too,
and extends it for gated attention, GatedDeltaNet and MLA.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners June 16, 2026 16:25
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Extends Minitron pruning support to GatedDeltaNet, Multi-Latent Attention (MLA), and Latent MoE Megatron-Core models. New dynamic module classes and a shared _DynamicAttention base are introduced. Hidden-size importance hook registration is generalized to dispatch between fused and non-fused layernorm strategies. Test utilities and three new distributed pruning tests for model variants are added.

Changes

Extended Minitron pruning for GatedDeltaNet, MLA, and Latent MoE

Layer / File(s) Summary
Dynamic module classes for attention variants and MoE latent support
modelopt/torch/nas/plugins/megatron.py
Updates imports for TELinear, GatedDeltaNet, MLASelfAttention; introduces _ATTENTION_TYPES registry; adds _DynamicTELinear for TE linear projections; introduces _DynamicAttention base class with hidden-size-only policy; refactors _DynamicSelfAttention to inherit _DynamicAttention and conditionally export core_attention; adds _DynamicGatedDeltaNet and _DynamicMLASelfAttention with Q-LoRA vs non-LoRA projection name handling; extends _DynamicMoELayer._setup for moe_latent_size with dynamic latent projections and expert IO sizing; updates MoE export; generalizes _DynamicTransformerLayer._setup and .export to handle any _ATTENTION_TYPES instance and conditionally convert MLA input_layernorm.
Pruning importance hook registration for attention variants
modelopt/torch/prune/plugins/mcore_minitron.py
Imports _DynamicAttention; guards self-attention importance registration on _DynamicSelfAttention._prunes_attention_heads(); reworks _register_hidden_size_importance to handle _DynamicAttention generically by selecting between fused-layernorm hook (column projection) and non-fused hook (layer.input_layernorm) based on attn._has_fused_input_layernorm.
Test model builder extensions for MLA and experimental attention variants
tests/_test_utils/torch/megatron/models.py
Adds experimental_attention_variant, multi_latent_attention, and moe_latent_size parameters to get_mcore_gpt_model; switches to MLATransformerConfig for MLA with LoRA/dimension defaults; adds experimental spec branch; forwards multi_latent_attention and moe_latent_size through config and TE layer spec.
Test helper utilities for activation collection and checkpoint re-pruning
tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py
Imports SelfAttention, MoELayer, MLASelfAttention; introduces _make_forward_loop, _sort_and_capture, and _assert_reprune_matches helper utilities for centralized forward-loop generation, dynamic-space conversion with baseline capture and parameter sorting, and checkpoint-based re-pruning validation.
Refactor existing GPT and MoE tests to reuse helpers
tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py
Refactors existing GPT parameter-sorting test to use _sort_and_capture instead of inlined logic; updates GPT pruning forward-loop to use _make_forward_loop; replaces GPT checkpoint re-validation with _assert_reprune_matches; similarly refactors MoE parameter-sorting, forward-loop, and checkpoint re-validation tests; removes intervening comments in output comparison assertions.
End-to-end pruning tests for GDN MoE, MLA, and Latent MoE variants
tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py
Adds _build_and_prune_variant scaffolding; implements Qwen3.5-like gated-delta-net MoE test (asserting attention projection shapes and router/expert dimensions), MLA-only test (asserting projection and layernorm shapes), and Latent MoE test (asserting latent projection and expert shapes while keeping latent dimensions static); all three validate config updates and verify inference.
Changelog and README support matrix updates
CHANGELOG.rst, examples/pruning/README.md
Adds version 0.46 changelog entry documenting GDN and MLA Minitron pruning with pruning scope details; updates README support matrix with new footnote clarifying num_attention_heads limitations for linear attention, gated-attention variants, and MLA, and renumbers Puzzletron footnotes accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • ChenhanYu
  • jenchen13
  • hychiang-git
  • claude
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main changes: adding Minitron pruning support for three previously unsupported attention/MoE variants (GatedDeltaNet, MLA, latent MoE).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PR introduces no security anti-patterns: no torch.load(weights_only=False), numpy.load(allow_pickle=True), hardcoded trust_remote_code=True, eval/exec on external input, # nosec comments, or new un...
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kmorabia/minitron-gdn-mla-attn-support

Comment @coderabbitai help to get the list of available commands and usage tips.

@kevalmorabia97 kevalmorabia97 requested review from ChenhanYu and removed request for AAnoosheh and Edwardf0t1 June 16, 2026 16:26
@kevalmorabia97

Copy link
Copy Markdown
Collaborator Author

/claude review

@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1747/

Built to branch gh-pages at 2026-06-16 18:19 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py (1)

561-564: ⚡ Quick win

Assert all pruned MoE config fields in the new GDN+MoE test.

This test prunes moe_shared_expert_intermediate_size, but it does not assert model.config.moe_shared_expert_intermediate_size. Adding that check will catch config-propagation regressions.

Suggested assertion
     assert model.config.hidden_size == pruned_hidden
     assert model.config.moe_ffn_hidden_size == pruned_moe_ffn
     assert model.config.num_moe_experts == pruned_experts
+    assert model.config.moe_shared_expert_intermediate_size == pruned_moe_shared
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py`
around lines 561 - 564, The test in the GDN+MoE section is missing an assertion
for the moe_shared_expert_intermediate_size config field even though the test
prunes this value. Add an assertion after the existing assertions (following the
pattern of asserting model.config.moe_ffn_hidden_size and
model.config.num_moe_experts) to verify that
model.config.moe_shared_expert_intermediate_size matches the expected pruned
value, which will help catch config-propagation regressions for this field.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/_test_utils/torch/megatron/models.py`:
- Around line 167-180: The multi_latent_attention and
experimental_attention_variant flags are currently able to both be set to True
simultaneously, which creates an ambiguous configuration. Add a fast-fail
assertion guard at the beginning of the code section where these flags are
interpreted (before the if multi_latent_attention block) to ensure that both
flags cannot be True at the same time. The assertion should validate that these
two attention mode flags are mutually exclusive and raise an informative error
if both are enabled.

---

Nitpick comments:
In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py`:
- Around line 561-564: The test in the GDN+MoE section is missing an assertion
for the moe_shared_expert_intermediate_size config field even though the test
prunes this value. Add an assertion after the existing assertions (following the
pattern of asserting model.config.moe_ffn_hidden_size and
model.config.num_moe_experts) to verify that
model.config.moe_shared_expert_intermediate_size matches the expected pruned
value, which will help catch config-propagation regressions for this field.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d2c81799-5b69-4444-83d4-5d3f81350a95

📥 Commits

Reviewing files that changed from the base of the PR and between d1fd121 and ec59fe1.

📒 Files selected for processing (6)
  • CHANGELOG.rst
  • examples/pruning/README.md
  • modelopt/torch/nas/plugins/megatron.py
  • modelopt/torch/prune/plugins/mcore_minitron.py
  • tests/_test_utils/torch/megatron/models.py
  • tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py

Comment thread tests/_test_utils/torch/megatron/models.py
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 98.46154% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.54%. Comparing base (1e461dd) to head (bad7525).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/nas/plugins/megatron.py 98.30% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1747       +/-   ##
===========================================
+ Coverage   58.45%   76.54%   +18.08%     
===========================================
  Files         510      511        +1     
  Lines       56271    56399      +128     
===========================================
+ Hits        32896    43172    +10276     
+ Misses      23375    13227    -10148     
Flag Coverage Δ
examples 41.82% <64.61%> (+19.38%) ⬆️
gpu 57.81% <98.46%> (+37.21%) ⬆️
regression 14.68% <0.00%> (+0.05%) ⬆️
unit 54.29% <0.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Support hidden_size pruning for latent MoE layers (config.moe_latent_size,
e.g. Nemotron-3): the fc1/fc2 latent projections carry hidden_size while the
routed experts stay in the static latent dim. Add a TODO outlining how to also
prune moe_latent_size itself.

Refactor the minitron pruning tests to share scaffolding via _build_and_prune_variant
(attention/MoE variants), _sort_and_capture (parameter sorting), _make_forward_loop,
and _assert_reprune_matches (checkpoint re-prune).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 changed the title Add Minitron pruning support for GatedDeltaNet and MLA attention Add Minitron pruning support for GatedDeltaNet, MLA, and latent MoE Jun 16, 2026
@kevalmorabia97

Copy link
Copy Markdown
Collaborator Author

/claude review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modelopt/torch/nas/plugins/megatron.py (2)

680-683: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the MLA docstring to match the implemented layernorm behavior.

The docstring says the separate MLA input layernorm stays static, but _DynamicTransformerLayer._setup converts input_layernorm with num_features=hidden_size and downstream MLA tests assert it is pruned to the active hidden size. Please remove input layernorm from the “stay static” list.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/nas/plugins/megatron.py` around lines 680 - 683, The docstring
for the MLA layer class incorrectly states that the input layernorm stays
static, but the actual implementation in _DynamicTransformerLayer._setup
converts and prunes input_layernorm based on hidden_size, and tests verify this
behavior. Remove the reference to input layernorm from the "stay static" list in
the docstring to accurately reflect that only Q/KV latent ranks and head dims
remain static during dynamic pruning.

26-31: ⚠️ Potential issue | 🟡 Minor

Guard optional Megatron-Core API features or explicitly declare minimum version requirement.

This module imports TELinear, GatedDeltaNet, and MLASelfAttention at the top level and builds _ATTENTION_TYPES from them. If the supported Megatron-Core minimum version does not include all three symbols, importing this plugin fails before users opt into the new variants. Additionally, line 607 reads self.config.attention_output_gate directly without a fallback.

Either:

  1. Wrap these imports and _ATTENTION_TYPES in a try-except or conditional check, defaulting missing symbols gracefully.
  2. Explicitly bump the Megatron-Core minimum version in pyproject.toml to guarantee these features are available.
  3. Use getattr(self.config, "attention_output_gate", False) on line 607 to handle older configs.

The test utilities (e.g., tests/_test_utils/torch/megatron/models.py:253–261) show a pattern for lazy-importing experimental features when they may be absent in older builds; the same principle applies here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/nas/plugins/megatron.py` around lines 26 - 31, The imports of
TELinear, GatedDeltaNet, and MLASelfAttention at the top level may fail if the
Megatron-Core version does not include these symbols, causing the entire plugin
to fail before users can opt-in to new variants. Wrap these three imports in a
try-except block and set them to None or a fallback value if they are not
available. Then conditionally build the _ATTENTION_TYPES dictionary to only
include the types that were successfully imported. Additionally, at line 607
where self.config.attention_output_gate is accessed directly, replace it with
getattr(self.config, "attention_output_gate", False) to handle configurations
from older Megatron-Core versions that may not have this attribute.
🧹 Nitpick comments (1)
tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py (1)

652-653: ⚡ Quick win

Tighten latent-MoE config-contract assertions in the new variant test.

This test validates latent MoE weight shapes, but it currently only asserts model.config.hidden_size. To guard checkpoint/export regressions for this new path, assert model.config.moe_ffn_hidden_size and model.config.num_moe_experts too (as done in other pruning tests here).

Suggested additions
     assert model.config.hidden_size == pruned_hidden
+    assert model.config.moe_ffn_hidden_size == pruned_moe_ffn
+    assert model.config.num_moe_experts == pruned_experts
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py`
around lines 652 - 653, The test currently only asserts model.config.hidden_size
but does not validate the latent MoE configuration contract fully. Add two
additional assertions after the existing assert model.config.hidden_size ==
pruned_hidden line to also assert model.config.moe_ffn_hidden_size and
model.config.num_moe_experts, matching the pattern used in other pruning tests
in this file to ensure comprehensive validation of MoE-specific config values
and guard against checkpoint/export regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/nas/plugins/megatron.py`:
- Around line 755-776: The assertion in _DynamicMLP._setup currently requires
each expert's input_size to equal self.config.hidden_size, but for latent MoE
configurations, experts run on moe_latent_size instead. Locate the assertion in
_DynamicMLP._setup that validates input_size against hidden_size and modify it
to also accept moe_latent_size when that configuration is present. The assertion
should allow input_size to match either self.config.hidden_size or
self.config.moe_latent_size (if configured) so that expert conversion succeeds
before pruning/export when using latent MoE.

In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py`:
- Around line 70-80: The registry.cleanup() call in this function only executes
if all prior operations succeed. If run_mcore_inference,
run_mcore_inference_with_dummy_input, or mtn.utils.sort_parameters raise an
exception, the cleanup will be skipped and hooks will leak into subsequent
tests. Wrap the code from the ImportanceEstimatorRegistry instantiation through
mtn.utils.sort_parameters in a try block, and move registry.cleanup() into a
finally block to ensure cleanup always runs regardless of exceptions during
inference or sorting operations.

---

Outside diff comments:
In `@modelopt/torch/nas/plugins/megatron.py`:
- Around line 680-683: The docstring for the MLA layer class incorrectly states
that the input layernorm stays static, but the actual implementation in
_DynamicTransformerLayer._setup converts and prunes input_layernorm based on
hidden_size, and tests verify this behavior. Remove the reference to input
layernorm from the "stay static" list in the docstring to accurately reflect
that only Q/KV latent ranks and head dims remain static during dynamic pruning.
- Around line 26-31: The imports of TELinear, GatedDeltaNet, and
MLASelfAttention at the top level may fail if the Megatron-Core version does not
include these symbols, causing the entire plugin to fail before users can opt-in
to new variants. Wrap these three imports in a try-except block and set them to
None or a fallback value if they are not available. Then conditionally build the
_ATTENTION_TYPES dictionary to only include the types that were successfully
imported. Additionally, at line 607 where self.config.attention_output_gate is
accessed directly, replace it with getattr(self.config, "attention_output_gate",
False) to handle configurations from older Megatron-Core versions that may not
have this attribute.

---

Nitpick comments:
In `@tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py`:
- Around line 652-653: The test currently only asserts model.config.hidden_size
but does not validate the latent MoE configuration contract fully. Add two
additional assertions after the existing assert model.config.hidden_size ==
pruned_hidden line to also assert model.config.moe_ffn_hidden_size and
model.config.num_moe_experts, matching the pattern used in other pruning tests
in this file to ensure comprehensive validation of MoE-specific config values
and guard against checkpoint/export regressions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 24971815-a307-4a76-ae56-41c0ab8f97a4

📥 Commits

Reviewing files that changed from the base of the PR and between ec59fe1 and f32711e.

📒 Files selected for processing (4)
  • CHANGELOG.rst
  • modelopt/torch/nas/plugins/megatron.py
  • tests/_test_utils/torch/megatron/models.py
  • tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/_test_utils/torch/megatron/models.py

Comment thread modelopt/torch/nas/plugins/megatron.py
@kevalmorabia97

Copy link
Copy Markdown
Collaborator Author

/claude review

Wrap the activation/sort body of _sort_and_capture in try/finally so the
ImportanceEstimatorRegistry hooks and patched TE return_layernorm_output state
are always cleaned up, preventing leakage into subsequent tests on failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review summary

Findings: 0 CRITICAL, 2 IMPORTANT, 3 SUGGESTION.

IMPORTANT — Compatibility with older Megatron-Core. Two places will break for users on a slightly older mcore who have not opted into the new variants:

  1. modelopt/torch/nas/plugins/megatron.py:26-51 — TELinear, GatedDeltaNet, and MLASelfAttention are imported unconditionally. import_plugin swallows the resulting ImportError but logs a warning and silently disables the entire mcore NAS/pruning plugin, so existing users of dense GPT or Mamba pruning lose their workflow with no obvious error. Either bump the effective minimum mcore version (and document it) or wrap these three imports in their own try/except and conditionally compose _ATTENTION_TYPES / register decorators.

  2. modelopt/torch/nas/plugins/megatron.py:607 — self.config.attention_output_gate raises AttributeError on TransformerConfig versions that predate the field. The codebase already uses getattr(..., "attention_output_gate", False) defensively in megatron_model_stats.py:304; please match that pattern. Trivial inline suggestion posted.

SUGGESTIONs (non-blocking):

  • _DynamicMLASelfAttention docstring (megatron.py:680-683) still claims the input layernorm stays static, but _DynamicTransformerLayer._setup converts it and _test_mcore_mla_pruning asserts the LN weight is pruned to the active hidden size.
  • _test_mcore_qwen35_gdn_moe_pruning does not assert model.config.moe_shared_expert_intermediate_size even though it is in export_config.
  • _test_mcore_latent_moe_pruning only asserts model.config.hidden_size; please also assert moe_ffn_hidden_size and num_moe_experts to match the existing MoE pruning tests.

Risk assessment: Medium-low. The algorithm itself is sound: _DynamicAttention correctly factors out the hidden-size-only path; the MLA / GDN / latent-MoE projection wiring matches the documented pruning scope; the importance hook correctly dispatches between fused and non-fused input layernorms. The only real risk is the silent compatibility regression for existing dense-GPT users on an older Megatron-Core build — addressing finding 1 or 2 above (preferably both) clears that.

Comment thread modelopt/torch/nas/plugins/megatron.py Outdated
Comment thread modelopt/torch/nas/plugins/megatron.py
- _DynamicSelfAttention._prunes_attention_heads reads attention_output_gate
  via getattr(..., False) so dense GPT/MoE pruning works on older mcore builds
  that predate the field.
- Fix _DynamicMLASelfAttention docstring: the separate input layernorm is
  pruned to hidden_size by _DynamicTransformerLayer (not static).
- Assert moe_shared_expert_intermediate_size config in the GDN+MoE test and
  moe_ffn_hidden_size/num_moe_experts config in the latent-MoE test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97

Copy link
Copy Markdown
Collaborator Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review (round 4) — author addressed the prior round's gate-field compat, MLA docstring and MoE-test config asserts. One IMPORTANT compatibility risk from the previous review is still unresolved.

Findings: 0 CRITICAL, 1 IMPORTANT, 0 SUGGESTION.

IMPORTANT — Compatibility (modelopt/torch/nas/plugins/megatron.py:26–51, :100):
TELinear, GatedDeltaNet, and MLASelfAttention are imported at module top level. The outer import_plugin("megatron") catches ModuleNotFoundError but only by silently disabling the entire mcore NAS/pruning plugin — so on an older mcore build that lacks any one of these symbols (e.g. predates gated_delta_net or multi_latent_attention), existing dense-GPT and Mamba pruning users lose their workflow with only a log warning. The rest of the file already uses try: ... HAS_MAMBA / HAS_HYBRID = True / except: ... = False (lines 70–97) for exactly this version-skew reason. Mirror that pattern here, conditionally compose _ATTENTION_TYPES, and guard the three new @DMRegistry.register decorators. Inline comment with a sketch posted on the import block. Alternatively, document and pin a minimum mcore version that guarantees all three symbols.

Verified resolved from prior round:

  • _DynamicSelfAttention._prunes_attention_heads now uses getattr(self.config, "attention_output_gate", False) — older mcore builds without the field no longer crash.
  • _DynamicMLASelfAttention docstring no longer claims input layernorm is static.
  • GDN+MoE test now asserts model.config.moe_shared_expert_intermediate_size; latent-MoE test now asserts moe_ffn_hidden_size and num_moe_experts.

Risk assessment: Low-medium. Algorithmic correctness, mode/state composition, and export wiring look sound; the only blocker is the silent compatibility regression for existing users on a slightly older Megatron-Core. Addressing the conditional-import sketch (or pinning a minimum mcore version) clears it.

@kevalmorabia97

Copy link
Copy Markdown
Collaborator Author

/claude review
Note that the new megatron.core imports are not guarded because they are available in all containers we support (e.g. nemo:26.02 onwards)

@kevalmorabia97

Copy link
Copy Markdown
Collaborator Author

/claude review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant