Skip to content

Fix Mamba conv1d compatibility [OMNIML-5199]#1750

Open
jenchen13 wants to merge 3 commits into
mainfrom
jennifchen/mamba-conv1d-export
Open

Fix Mamba conv1d compatibility [OMNIML-5199]#1750
jenchen13 wants to merge 3 commits into
mainfrom
jennifchen/mamba-conv1d-export

Conversation

@jenchen13

@jenchen13 jenchen13 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: Bug Fix

Megatron-LM changed MambaMixer conv1d fields, causing this bug
120B_PTQ_0/0 [rank2]: AttributeError: 'MambaMixer' object has no attribute 'conv1d'

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes

    • Improved Mamba model export compatibility by correctly handling both legacy convolution-style parameters and newer weight/bias layouts during conversion.
  • Tests

    • Added end-to-end GPU tests covering export to safetensors and re-import into fresh models for a small Mamba + attention + MoE hybrid.
    • Added focused unit coverage to ensure convolution parameter handling matches expected checkpoint key patterns.

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested review from a team as code owners June 16, 2026 18:30
@jenchen13 jenchen13 requested a review from meenchen June 16, 2026 18:30
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4a0ddb81-d180-4e83-881a-837d05a20200

📥 Commits

Reviewing files that changed from the base of the PR and between cdda314 and 47f5e65.

📒 Files selected for processing (1)
  • tests/_test_utils/torch/transformers_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/_test_utils/torch/transformers_models.py

📝 Walkthrough

Walkthrough

Adds _MambaConv1dCompat and _get_mamba_conv1d to megatron_importer.py to handle both legacy conv1d module and newer conv1d_weight/conv1d_bias parameter layouts in Mamba mixers. Both the importer and unified exporter are updated to use this helper. Nemotron-H test fixtures and a new GPU test module covering unit and end-to-end export/import flows are added.

Changes

Mamba conv1d compatibility wrapper

Layer / File(s) Summary
Compat helper definition and import/export wiring
modelopt/torch/export/plugins/megatron_importer.py, modelopt/torch/export/unified_export_megatron.py
Defines _MambaConv1dCompat and _get_mamba_conv1d, which dispatch to the existing conv1d module or wrap conv1d_weight/conv1d_bias into a compatible module. Updates _import_mamba_layer and the unified exporter's conv1d rule to use this helper.
Nemotron-H tiny model test fixtures
tests/_test_utils/torch/transformers_models.py
Adds NemotronHConfig import and two helpers — get_tiny_nemotron_h and create_tiny_nemotron_h_dir — for building and persisting a tiny Mamba+MoE hybrid model for tests.
Unit and end-to-end importer tests
tests/gpu_megatron/torch/export/test_megatron_importer.py
Adds _Mixer stub and two unit tests for _get_mamba_conv1d (legacy passthrough and wrapper parameter identity), plus an end-to-end distributed test that builds a NemotronH mcore hybrid, exports to safetensors, asserts Mamba/MoE key patterns, and reimports into a fresh model.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested reviewers

  • ChenhanYu
  • hychiang-git
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: fixing Mamba conv1d compatibility issues. It directly relates to the core problem (AttributeError on MambaMixer.conv1d) and the solution implemented across multiple files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed No security anti-patterns found. All trust_remote_code parameters are properly configurable with False defaults, no unsafe torch.load/numpy.load/eval/exec usage, no # nosec comments, and no new non...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jennifchen/mamba-conv1d-export

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1750/

Built to branch gh-pages at 2026-06-16 20:48 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unit/torch/export/test_megatron_importer.py (1)

25-47: ⚡ Quick win

Add a negative-path regression test for unsupported mixer layouts.

Current coverage validates both success paths, but not the explicit failure contract. Add a test that asserts AttributeError when neither conv1d nor conv1d_weight/conv1d_bias exists.

♻️ Suggested test addition
+def test_get_mamba_conv1d_raises_for_unsupported_layout():
+    mixer = _Mixer()
+    with pytest.raises(AttributeError, match="does not have `conv1d`"):
+        _get_mamba_conv1d(mixer)

As per coding guidelines, checked-in tests should document expected behavior and protect against regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/export/test_megatron_importer.py` around lines 25 - 47, Add
a new test function to document the expected failure behavior when the _Mixer
object lacks both the legacy conv1d module and the direct
conv1d_weight/conv1d_bias parameters. Create a test that instantiates a _Mixer
object without any of these attributes and asserts that calling
_get_mamba_conv1d on it raises an AttributeError. This negative-path test will
ensure the function's failure contract is protected against future regressions
and documents the unsupported mixer layout scenario.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unit/torch/export/test_megatron_importer.py`:
- Around line 25-47: Add a new test function to document the expected failure
behavior when the _Mixer object lacks both the legacy conv1d module and the
direct conv1d_weight/conv1d_bias parameters. Create a test that instantiates a
_Mixer object without any of these attributes and asserts that calling
_get_mamba_conv1d on it raises an AttributeError. This negative-path test will
ensure the function's failure contract is protected against future regressions
and documents the unsupported mixer layout scenario.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 545462bf-1a21-4b5c-ab87-333ef9a869d9

📥 Commits

Reviewing files that changed from the base of the PR and between 1cccf66 and fd8e9a9.

📒 Files selected for processing (3)
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/export/unified_export_megatron.py
  • tests/unit/torch/export/test_megatron_importer.py

@jenchen13 jenchen13 changed the title Fix Mamba conv1d compatibility Fix Mamba conv1d compatibility [OMNIML-5199] Jun 16, 2026
@ChenhanYu

Copy link
Copy Markdown
Collaborator

/claude review

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.31%. Comparing base (1e461dd) to head (47f5e65).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1750      +/-   ##
==========================================
+ Coverage   58.45%   68.31%   +9.85%     
==========================================
  Files         510      511       +1     
  Lines       56271    56367      +96     
==========================================
+ Hits        32896    38507    +5611     
+ Misses      23375    17860    -5515     
Flag Coverage Δ
examples 41.79% <28.57%> (+19.35%) ⬆️
gpu 31.77% <100.00%> (+11.17%) ⬆️
regression 14.70% <28.57%> (+0.07%) ⬆️
unit 54.34% <28.57%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kevalmorabia97 kevalmorabia97 added the cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this to tests/gpu_megatron/ which has all required megatron dependencies. It may fail with import error in unit test env

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not really a GPU test though, is there a CPU megatron path?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we dont have cpu-only megatron path. Given its unit test so likely <1s to run so fine to just run in gpu test path

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have moved them to gpu_megatron and added more robust NemotronH import + export tests

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found.

Summary
Narrowly-scoped compatibility shim for the Megatron-LM API change (PR #4899) that flattened MambaMixer.conv1d (Conv1d submodule) into direct conv1d_weight/conv1d_bias parameters.

Verified

  • Algorithm/state correctness: _MambaConv1dCompat.__init__ assigns mixer.conv1d_weight/conv1d_bias as attributes — since they are nn.Parameter instances, PyTorch auto-registers them via __setattr__, so state_dict() yields {weight, bias} matching the legacy nn.Conv1d shape. The wrapper holds the same Parameter objects as the mixer, so load_state_dict() writes through to the underlying MambaMixer parameters (test confirms is identity).
  • Importer (_name_remapping): With REPLICATE parallel_config, the conv1d weight/bias are loaded from HF without sharding, then module.load_state_dict({weight, bias}) succeeds for both legacy Conv1d and the wrapper (matching keys, no quantizer state, no _extra_state).
  • Exporter (_get_weight_bias_name_remapping): hasattr(module, 'weight') and module.weight.numel() > 0 are satisfied by the wrapper; the export writes backbone.layers.{i}.mixer.conv1d.weight/.bias matching HF format (Nemotron-H mapping).
  • Plugin laziness: Both new symbols are defined outside the import_plugin("megatron") block but don't import megatron, so the unit test loads cleanly.
  • Backward compat: Legacy code path is preserved (early return when conv1d attr exists); HF-side keys are unchanged in both directions; no modelopt_state schema impact.
  • Cross-file usage: unified_export_megatron.py correctly imports the helper from megatron_importer.py. Other call sites of mixer.conv1d (modelopt/torch/nas/plugins/megatron.py) are NAS-specific and unrelated to the export flow this PR fixes.

LGTM.

…round-trip

Move test_megatron_importer.py from tests/unit to tests/gpu_megatron/torch/export
and add a tiny NemotronH (Mamba + MoE + attention hybrid) export/import round-trip
test that exercises the mamba conv1d export path. Adds get_tiny_nemotron_h /
create_tiny_nemotron_h_dir helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested a review from kevalmorabia97 June 16, 2026 20:38
@jenchen13

Copy link
Copy Markdown
Contributor Author

/claude review

NemotronHConfig only exists in newer transformers; the top-level import broke every
test importing transformers_models (e.g. test_hf_dflash) under the min-transformers
CI job. Move it to a lazy import inside get_tiny_nemotron_h, matching the existing
Qwen3VL pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Comment on lines +54 to +75
class _MambaConv1dCompat(torch.nn.Module):
"""Expose direct Mamba conv params through the legacy Conv1d state_dict keys."""

def __init__(self, mixer):
super().__init__()
self.weight = mixer.conv1d_weight
self.bias = mixer.conv1d_bias


def _get_mamba_conv1d(mixer):
conv1d = getattr(mixer, "conv1d", None)
if conv1d is not None:
return conv1d

if hasattr(mixer, "conv1d_weight") and hasattr(mixer, "conv1d_bias"):
# Megatron-LM PR #4899 / commit 35992ba changed MambaMixer fields from
# `conv1d` to direct `conv1d_weight` and `conv1d_bias` parameters.
return _MambaConv1dCompat(mixer)

raise AttributeError(
"MambaMixer does not have `conv1d` or `conv1d_weight`/`conv1d_bias` fields."
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The compat wrapper exposes conv1d_weight/conv1d_bias under the legacy weight/bias keys, which is correct as long as Megatron-LM's new conv1d_weight keeps the same shape as the old nn.Conv1d.weight — i.e. 3D (conv_dim, 1, conv_kernel) matching the HF NemotronH mixer.conv1d.weight layout.

Worth double-checking against commit 35992ba: if the refactor stores conv1d_weight as a squeezed 2D (conv_dim, conv_kernel) parameter, then on export the safetensor would be written 2D and would fail to load into HF transformers' 3D nn.Conv1d (and symmetrically on import). The export/import paths here pass the tensor through verbatim without reshaping, so a layout divergence would silently produce an HF-incompatible checkpoint.

If there's any chance of a 2D layout, consider normalizing to the HF 3D shape inside _MambaConv1dCompat (e.g. self.weight = mixer.conv1d_weight.view(conv_dim, 1, -1) on export and reverse on import). If the shape is guaranteed identical, no change needed — this is just a verification ask.

Comment on lines +94 to +108
with safe_open(str(sf), framework="pt", device="cpu") as f:
keys.extend(f.keys())
# Mamba mixer (conv1d is the layer fixed by _get_mamba_conv1d), plus MoE experts.
assert any("mixer.conv1d" in k for k in keys), "mamba conv1d weights missing from export"
assert any("mixer.in_proj" in k for k in keys), "mamba in_proj weights missing from export"
assert any("mixer.experts" in k for k in keys), "moe expert weights missing from export"

# Import the exported checkpoint back into a fresh mcore model
# (megatron is already initialized from the first build).
imported = _build_mcore_nemotron_h(config, size, initialize=False)
import_mcore_gpt_from_hf(imported, str(export_dir))


def test_nemotron_h_export_import(dist_workers_size_1, tmp_path):
model_dir = create_tiny_nemotron_h_dir(tmp_path)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Two coverage gaps worth noting (non-blocking):

  1. The end-to-end test only exercises whichever conv1d layout the installed Megatron-LM has. With an older Megatron (still exposing mixer.conv1d), _get_mamba_conv1d returns the legacy module and the new _MambaConv1dCompat branch — the actual subject of this fix — is never hit here. Only the unit test test_get_mamba_conv1d_wraps_direct_params covers the new path, and it uses a hand-authored 3D tensor. So the new-layout path isn't validated against a real Megatron mixer end-to-end unless CI pins the post-35992ba version.

  2. This round-trip validates mcore → safetensors → mcore self-consistency, not HF-loadability. Since import_mcore_gpt_from_hf reads the exported safetensors back into a same-shaped mcore param, a conv1d weight stored in a non-HF (e.g. 2D) shape would round-trip cleanly here yet still fail to load into transformers NemotronH. Consider adding an explicit assertion on the exported mixer.conv1d.weight shape/ndim (expecting the 3D HF layout) to catch a layout divergence — this ties directly to the shape concern flagged in megatron_importer.py.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 2

This is a clean, minimal, well-targeted bug fix for the Megatron-LM MambaMixer field rename (conv1dconv1d_weight/conv1d_bias, commit 35992ba).

Verification done:

  • _get_mamba_conv1d is wired symmetrically into both the importer (megatron_importer.py:588) and exporter (unified_export_megatron.py:667), the only two call sites that touch mixer.conv1d in the export path.
  • The compat wrapper round-trips correctly through both consumption mechanisms: import uses module.state_dict() + load_state_dict (in-place copy into the aliased params — covered by the unit test), and export reads module.weight/module.bias attributes directly via _get_weight_bias.
  • The legacy branch (getattr(mixer, "conv1d")) preserves existing behavior, so this is backward compatible with older Megatron.
  • The e2e test's mixer.conv1d/mixer.in_proj/mixer.experts key assertions all match the mcore_nemotron.py rule book.
  • The NAS path (nas/plugins/megatron.py) also references mixer.conv1d but is out of scope for this export-focused fix.

Suggestions (non-blocking):

  1. Verify the new conv1d_weight retains the 3D (conv_dim, 1, kernel) shape of the old nn.Conv1d.weight; a squeezed 2D layout would silently produce HF-incompatible checkpoints since the tensor is passed through without reshaping.
  2. The e2e round-trip validates mcore↔safetensors self-consistency on whichever conv1d layout the installed Megatron has — it doesn't independently assert HF-loadable shape, and won't exercise the new-field branch unless CI pins post-35992ba Megatron.

Risk: Low. Additive, backward-compatible, with both unit and e2e test coverage.

Comment thread modelopt/torch/export/plugins/megatron_importer.py
Comment thread tests/gpu_megatron/torch/export/test_megatron_importer.py
@jenchen13 jenchen13 enabled auto-merge (squash) June 16, 2026 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants