Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support ] by juhi10071998 · Pull Request #1707 · NVIDIA/Model-Optimizer

juhi10071998 · 2026-06-13T01:04:03Z

What does this PR do?

Type of change: new feature

Adds end-to-end PTQ + HF-checkpoint export support for block-diffusion encoder-decoder LLMs (e.g. DiffusionGemma) whose encoder/decoder stacks share parameters via HF _tied_weights_keys. Six source commits + one test commit + one CHANGELOG entry — purely additive for existing modelopt users (non-tied models see no behavioral change).

Source commits:

Onboard DiffusionGemma to hf_ptq.py — substring-list additions in model_type_is_enc_dec and MODEL_NAME_TO_TYPE (so calibration routes through .generate()), and a .sequences unwrap in the preview decode for ModelOutput-returning .generate()s.
MoE experts dedup in _export_fused_experts — when two fused-expert modules share their 3-D source params, alias the per-expert packed weight + scales on cache hit so downstream postprocess_state_dict dedup catches them. ~42% storage reduction on nvfp4_experts_only for tied 26B MoE checkpoints.
Dense Linear dedup in _export_quantized_weight — symmetric to (2) for dense Linears; no-op for nvfp4_experts_only (dense early-returns at QUANTIZATION_NONE).
Opt-in canonical-side reorder (--canonical_tied_naming, default off) — partitions state_dict so canonical-side tied keys iterate before alias-side, letting first-wins dedup keep canonical names.
sync_tied_input_amax — max-merges per-side input_quantizer.amax across tied modules BEFORE export, so single-backbone consumers (vLLM) that load one input_scale per parameter don't clip on either side. Extends commits 2+3's alias loops to include input_scale.
Default *self_conditioning* exclude — adds the diffusion-model self-conditioning wildcard to default_disabled_quantizers.yaml. Companion to PR Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762) #1691 which already added the vision-module excludes.

Test commit:

New test fixture (tests/_test_utils/torch/quantization/tied_modules.py) with three factories (make_tied_linear_pair, tie_fused_experts_3d_params, wrap_in_parent_with_tied_keys) + 10 unit tests covering commits 2–5 across tests/unit/torch/export/test_unified_export_hf.py (new) and tests/unit/torch/quantization/plugins/test_fused_experts.py (extended). Pure-Python, CPU-only, ~1s wall total.

Docs commit:

CHANGELOG entry under 0.46 New Features.

Usage

# Standard usage — no API change for non-tied models. Existing recipes still work:
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <ckpt-dir> \
    --qformat nvfp4_experts_only \
    --calib_size 32 \
    --trust_remote_code \
    --export_path <export-dir>

# For tied-weight models (e.g. DiffusionGemma), opt-in to canonical-side
# naming so the exported state_dict uses the canonical (e.g. decoder-side)
# names per HF's _tied_weights_keys declaration:
    --canonical_tied_naming true

Testing

10 new unit tests covering commits 2–5. Pure-Python, CPU-only, ~1s wall total.
Full local unit suite: 2588 passed, 17 intentional skips. 9 pre-existing failures observed in unrelated test files (PEFT/LoRA, ONNX, speculative-decoding) — verified pre-existing via static analysis (none import any file modified by this PR).
End-to-end PTQ export validated on DiffusionGemma v10 via nvfp4_experts_only recipe + --canonical_tied_naming true — produces a 2-shard, ~18 GB safetensors checkpoint with decoder-canonical naming.

Before your PR is "Ready for review"

Is this change backward compatible?: ✅ — --canonical_tied_naming is opt-in default off; the dedup/alias logic is cache-based on data_ptr() and no-ops for non-tied models; sync_tied_input_amax no-ops when no two modules share a weight data_ptr; the *self_conditioning* wildcard is a no-op for models without matching module names.
If you copied code from any other sources or added a new PIP dependency: N/A
Did you write any new necessary tests?: ✅ 10 unit tests across two files; fixture helpers shared via _test_utils
Did you update CHANGELOG?: ✅ — new bullet under 0.46 → New Features
Did you get Claude approval on this PR?: pending /claude review

Additional Information

Validated locally against DiffusionGemma v10 (DiffusionGemmaForBlockDiffusion, model_type: diffusion_gemma). Substring patterns are also chosen to match the older DiffusionGemma4ModelForBlockDiffusion / diffusion_gemma4 spelling, so the same code path works on earlier and current model versions.

Summary by CodeRabbit

New Features

Added support for tied-weight post-training quantization (PTQ) and HuggingFace checkpoint export for encoder-decoder LLMs with parameter sharing.
Introduced --canonical_tied_naming flag for optimized state dictionary ordering during export.
Added automatic tied-weight input quantization range synchronization across shared parameters.
Extended DiffusionGemma model support.

Documentation

Added PTQ recipe documentation and configuration templates for DiffusionGemma, including guidance on self-conditioning quantization handling.

coderabbitai · 2026-06-13T01:04:17Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cba9fc5f-525a-49e3-808b-a71c57bea5a1

📥 Commits

Reviewing files that changed from the base of the PR and between 9f5e0e1 and d2acf0a.

📒 Files selected for processing (13)

CHANGELOG.rst
examples/llm_ptq/example_utils.py
examples/llm_ptq/hf_ptq.py
modelopt/torch/export/model_utils.py
modelopt/torch/export/moe_utils.py
modelopt/torch/export/unified_export_hf.py
modelopt/torch/utils/dataset_utils.py
modelopt_recipes/huggingface/diffusiongemma/ptq/README.md
modelopt_recipes/huggingface/diffusiongemma/ptq/disabled_quantizers.yaml
modelopt_recipes/huggingface/diffusiongemma/ptq/nvfp4_experts_only.yaml
tests/_test_utils/torch/quantization/tied_modules.py
tests/unit/torch/export/test_unified_export_hf.py
tests/unit/torch/quantization/plugins/test_fused_experts.py

✅ Files skipped from review due to trivial changes (4)

modelopt_recipes/huggingface/diffusiongemma/ptq/disabled_quantizers.yaml
modelopt_recipes/huggingface/diffusiongemma/ptq/README.md
examples/llm_ptq/example_utils.py
CHANGELOG.rst

🚧 Files skipped from review as they are similar to previous changes (8)

modelopt/torch/utils/dataset_utils.py
tests/_test_utils/torch/quantization/tied_modules.py
modelopt/torch/export/moe_utils.py
modelopt/torch/export/model_utils.py
tests/unit/torch/quantization/plugins/test_fused_experts.py
examples/llm_ptq/hf_ptq.py
tests/unit/torch/export/test_unified_export_hf.py
modelopt/torch/export/unified_export_hf.py

📝 Walkthrough

Walkthrough

Adds tied-weight PTQ and HuggingFace checkpoint export support for block-diffusion encoder-decoder LLMs. Core changes include tensor-aliased weight dedup in quantized export, input quantizer amax synchronization across tied modules, opt-in canonical state-dict key reordering, fused-expert tied handling in moe_utils.py, DiffusionGemma recipe files, and a ModelOutput unwrap fix for diffusion preview decoding.

Changes

Tied-weight PTQ and HF export for encoder-decoder models

Layer / File(s)	Summary
DiffusionGemma model type and enc-dec recognition `modelopt/torch/export/model_utils.py`, `modelopt/torch/utils/dataset_utils.py`, `examples/llm_ptq/example_utils.py`	Registers `DiffusionGemma` in `MODEL_NAME_TO_TYPE` before the `Gemma` entry; extends `model_type_is_enc_dec` to include `"diffusiongemma"` via substring match; updates `is_enc_dec` docstring to note diffusion_gemma exclusion from encoder-decoder preview decode.
CLI flag and generate() output unwrapping `examples/llm_ptq/hf_ptq.py`	Adds `--canonical_tied_naming` boolean CLI argument in `parse_args()`; threads it through `export_quantized()` into `export_hf_checkpoint()`; unwraps `ModelOutput` objects returned by `.generate()` to extract `.sequences` tensor before slicing/decoding.
DiffusionGemma PTQ recipe files and docs `modelopt_recipes/huggingface/diffusiongemma/ptq/*`, `CHANGELOG.rst`	Adds `disabled_quantizers.yaml` composing default disabled quantizers plus a `self_conditioning` wildcard rule; adds `nvfp4_experts_only.yaml` importing `base_disable_all`/`experts_nvfp4` with self-conditioning exclusions; adds README explaining recipe composition; documents all tied-weight features in CHANGELOG.
Test helper factories for tied-weight scenarios `tests/_test_utils/torch/quantization/tied_modules.py`	Introduces `make_tied_linear_pair`, `tie_fused_experts_3d_params`, and `wrap_in_parent_with_tied_keys` with assertion-driven tie validation and HF-style `_tied_weights_keys` dict/list support for encoder-decoder parent modules.
Core tied-weight dedup and canonical naming in export pipeline `modelopt/torch/export/unified_export_hf.py`	Adds tied-weight aliasing in `_export_quantized_weight` via pre-replacement `data_ptr` capture and a per-invocation `_tied_cache`; adds `_collect_canonical_tied_patterns`/`_reorder_canonical_first` for canonical-first state-dict ordering; adds `sync_tied_input_amax()` to max-merge `input_quantizer.amax`; wires `_tied_cache` and `_moe_tied_cache` through `_process_quantized_modules`; exposes `canonical_tied_naming` on the public `export_hf_checkpoint` API.
Fused experts tied-weight aliasing `modelopt/torch/export/moe_utils.py`	Extends `_export_fused_experts` with optional `_moe_tied_cache`/`_tied_cache`; captures `(gate_up_proj_dp, down_proj_dp)` source key before unpacking; on cache hit aliases per-expert `gate_proj`/`up_proj`/`down_proj` weights and scale buffers from the prior module; on cache miss records unpacked module for subsequent tied modules.
Unit tests for tied-weight export utilities `tests/unit/torch/export/test_unified_export_hf.py`	Tests `_collect_canonical_tied_patterns` for dict-style and legacy list-style inputs, `_reorder_canonical_first` key ordering, `sync_tied_input_amax` max-merge for tied/untied linears, and `_export_quantized_weight` aliasing for tied/untied/partially-unquantized pairs.
Unit tests for fused experts tied deduplication `tests/unit/torch/quantization/plugins/test_fused_experts.py`	Adds `TestExportFusedExpertsTiedDedup` verifying `data_ptr()` equality for tied fused-expert projection weights/scale buffers and inequality for untied; updates existing `_spy_export` monkeypatches to accept `**_kwargs`.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as hf_ptq.py
  participant API as export_hf_checkpoint
  participant Sync as sync_tied_input_amax
  participant Process as _process_quantized_modules
  participant ExportW as _export_quantized_weight
  participant FusedExp as _export_fused_experts
  participant Reorder as _reorder_canonical_first

  CLI->>API: canonical_tied_naming=True
  API->>Sync: max-merge input_quantizer.amax<br/>across tied dense linears and MoE modules
  Sync-->>API: amax synchronized
  API->>Process: _tied_cache keyed by data_ptr,<br/>_moe_tied_cache keyed by (gate_up_proj_dp, down_proj_dp)
  loop for each quantized linear
    Process->>ExportW: pass _tied_cache
    alt data_ptr in _tied_cache
      ExportW->>ExportW: alias packed weight and scale buffers
    else first occurrence
      ExportW->>ExportW: pack weight, store in _tied_cache
    end
  end
  loop for each fused MoE block
    Process->>FusedExp: pass _moe_tied_cache and _tied_cache
    alt source key in _moe_tied_cache
      FusedExp->>FusedExp: alias per-expert weight Parameters<br/>and scale buffers, return early
    else cache miss
      FusedExp->>ExportW: unpack with _tied_cache
      FusedExp->>FusedExp: store in _moe_tied_cache
    end
  end
  API->>Reorder: sort state_dict so canonical keys precede alias keys
  Reorder-->>API: ordered state_dict → postprocess_state_dict

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

cjluo-nv
sychen52
ChenhanYu
realAsma

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the main change: adding support for DiffusionGemma (a dLLM encoder-decoder model) with tied-weight PTQ export support, which aligns with the PR's primary objective of onboarding encoder-decoder block-diffusion LLMs to the hf_ptq flow.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	PR introduces no security anti-patterns: trust_remote_code is parameterized via CLI (default False); canonical_tied_naming uses custom type converter; no torch.load/numpy.load/eval/exec vulnerabili...

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-13T01:12:59Z

Codecov Report

❌ Patch coverage is 88.52459% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.55%. Comparing base (1cccf66) to head (d2acf0a).

Files with missing lines	Patch %	Lines
modelopt/torch/export/unified_export_hf.py	86.95%	12 Missing ⚠️
modelopt/torch/export/moe_utils.py	93.10%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1707       +/-   ##
===========================================
+ Coverage   58.45%   76.55%   +18.09%     
===========================================
  Files         510      511        +1     
  Lines       56274    56457      +183     
===========================================
+ Hits        32896    43221    +10325     
+ Misses      23378    13236    -10142

Flag	Coverage Δ
examples	`41.79% <36.06%> (+19.35%)`	⬆️
gpu	`57.71% <33.60%> (+37.11%)`	⬆️
regression	`14.67% <3.27%> (+0.04%)`	⬆️
unit	`54.44% <77.86%> (+0.08%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/unit/torch/quantization/plugins/test_fused_experts.py (1)

561-568: 💤 Low value

Move import to top of file or add justification comment.

The import at line 566 is inside a function without a comment explaining why. Per CONTRIBUTING.md, imports belong at the top so errors surface at collection time. _export_quantized_weight is from a core modelopt module with no circular-import or optional-dependency concern.

♻️ Suggested fix

Add the import near the other modelopt.torch.export imports at the top of the file:

 from modelopt.torch.export.moe_utils import _export_fused_experts
 from modelopt.torch.export.quant_utils import get_quant_config
+from modelopt.torch.export.unified_export_hf import _export_quantized_weight

Then simplify the helper:

 def _clear_fused_experts_caches():
     """Clear function-static alias caches in both export entry points."""
     _export_fused_experts.__dict__.pop("_tied_unpacked_cache", None)
-    # _export_fused_experts internally calls _export_quantized_weight per per-expert
-    # wrapper; clear that cache too so each test sees a pristine state.
-    from modelopt.torch.export.unified_export_hf import _export_quantized_weight
-
     _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 561
- 568, Move the local import of _export_quantized_weight out of
_clear_fused_experts_caches and place it with the other modelopt.torch.export
imports at the top of the test file (or, if there is a deliberate reason to
import inside the function, add a one-line justification comment explaining
why); then simplify _clear_fused_experts_caches to directly reference
_export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
alongside the existing _export_fused_experts cache pop. Ensure you reference the
symbols _clear_fused_experts_caches, _export_quantized_weight, and
_export_fused_experts when making the change.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 45-51: The docstring for the tied-experts dedup block contradicts
the implementation by saying "input_scale is left per-side" while the code in
moe_utils.py aliases input_scale alongside weight_scale and weight_scale_2 (due
to sync_tied_input_amax running earlier); update the docstring to state that
input_scale is aliased too and mention the reason (sync_tied_input_amax runs
prior), keeping the rest of the explanation about caching data_ptr() and
aliasing behavior intact so docstring matches the implementation.

In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1545-1547: The call to _export_transformers_checkpoint is dropping
**kwargs (so options like accelerator are ignored); update the call in
unified_export_hf.py where post_state_dict, hf_quant_config =
_export_transformers_checkpoint(model, dtype,
canonical_tied_naming=canonical_tied_naming) to forward any incoming **kwargs
(e.g., pass **kwargs alongside the existing named args) so that options consumed
inside _export_transformers_checkpoint (such as accelerator) are preserved
during export; keep the canonical_tied_naming arg as-is when forwarding
**kwargs.
- Around line 721-744: The tied-weight alias cache (_tied_weight_alias_cache) is
stored on the function object of _export_quantized_weight and persists across
exports; reset it at the start of each export invocation to avoid stale aliases:
in the _export_quantized_weight function (or the outer export entrypoint that
calls it) initialize or clear
_export_quantized_weight.__dict__['_tied_weight_alias_cache'] (or replace with a
fresh dict) before using _tied_source_data_ptr so each export gets a fresh cache
and no stale module references are reused.

---

Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 561-568: Move the local import of _export_quantized_weight out of
_clear_fused_experts_caches and place it with the other modelopt.torch.export
imports at the top of the test file (or, if there is a deliberate reason to
import inside the function, add a one-line justification comment explaining
why); then simplify _clear_fused_experts_caches to directly reference
_export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
alongside the existing _export_fused_experts cache pop. Ensure you reference the
symbols _clear_fused_experts_caches, _export_quantized_weight, and
_export_fused_experts when making the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f4075f4-f09e-491d-b111-ffe09bfe2a7a

📥 Commits

Reviewing files that changed from the base of the PR and between 2640551 and 0543907.

📒 Files selected for processing (11)

CHANGELOG.rst
examples/llm_ptq/example_utils.py
examples/llm_ptq/hf_ptq.py
modelopt/torch/export/model_utils.py
modelopt/torch/export/moe_utils.py
modelopt/torch/export/unified_export_hf.py
modelopt/torch/utils/dataset_utils.py
modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
tests/_test_utils/torch/quantization/tied_modules.py
tests/unit/torch/export/test_unified_export_hf.py
tests/unit/torch/quantization/plugins/test_fused_experts.py

cjluo-nv

Bot review — DM the bot to share feedback.

PR adds tied-weight PTQ/HF-export support for encoder-decoder dLLMs (DiffusionGemma). Design review: the change correctly reuses the existing data_ptr-based first-wins dedup in postprocess_state_dict rather than introducing a competing dedup system — the new helpers (sync_tied_input_amax, the alias loops, _reorder_canonical_first) all feed that existing mechanism, and the PR body explains this. Design is settled. Licensing: new test files carry the project's standard Apache header (matches existing source files), no license-file or third-party-copy changes — no block. Tests: 10 CPU-only unit tests covering dedup/alias/reorder/amax-sync paths; e2e validated manually (not in CI), acceptable for the large-model path.

Main concern (correctness): both alias caches — _export_quantized_weight._tied_weight_alias_cache and _export_fused_experts._tied_unpacked_cache — are function-static, keyed by data_ptr(), and are never cleared in production code (only tests clear them). This risks (1) cross-export contamination when export_hf_checkpoint runs more than once in a process — a freed tensor's address from export #1 can be recycled by export #2 and falsely match a stale _prior module; and (2) intra-export false positives, since the captured source data_ptr() belongs to the float weight that gets freed after the setattr replacement and its address may be reused by a later unrelated allocation. Recommend resetting both caches at the start of _export_transformers_checkpoint (or passing an explicit per-export dict). Secondary: _reorder_canonical_first's substring matcher can over-match when canonical tokens are common words. Flagging for the owner — the core mechanism is sound for the intended single-export tied case, but the persistent cache is a latent footgun.

cjluo-nv · 2026-06-15T06:00:07Z

+    # input_scale is safe to alias because sync_tied_input_amax (earlier in
+    # this export) already max-merged the per-side amaxes.
+    _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {})
+    _prior = _cache.get(_tied_source_data_ptr)


Bot comment.

This alias cache is function-static (_export_quantized_weight.__dict__[...]) and keyed by data_ptr(), but it is never cleared in production — only the tests pop it. Two failure modes:

Cross-export contamination: if export_hf_checkpoint is invoked more than once in the same process (batch export, two models), the freed data_ptr of a tensor from the first export can be recycled by an allocation in the second export and falsely hit a stale _prior module — aliasing this module's weight to a dead module from the prior run.

Intra-export reuse: _tied_source_data_ptr is captured from the float weight that gets replaced by setattr below; after replacement that storage may be freed and its address reused by a later unrelated module's weight, producing a false-positive match.

Suggest clearing this cache (and the _export_fused_experts one) at the start of _export_transformers_checkpoint, or threading an explicit per-export dict through these calls so identity is scoped to a single export.

Thanks @cjluo-nv , addressed in 9f5e0e1 commit

cjluo-nv · 2026-06-15T06:00:07Z

+    return patterns, side_substrings
+
+
+def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict:


Bot comment.

side_substrings are derived as canonical-only tokens and then matched as dot-separated path components across the whole state_dict. If a canonical-side token is a common identifier (e.g. ends up including something like weight or a generic submodule name shared by unrelated keys), _has_side_substring will pull unrelated keys into the head partition. Worth asserting/filtering the derived substrings to the tokens that are actually side-discriminating, or restricting matching to the tied subtree prefix.

Edwardf0t1

Thorough, well-documented PR with genuinely good unit tests (tie / no-tie / one-side-unquantized / canonical-reorder all covered). The design — let the existing postprocess_state_dict data_ptr dedup do the collapsing and just re-alias packed buffers so it can see them — is clean and minimally invasive.

One real latent-correctness issue to fix before merge (the process-global alias caches), plus a few medium/minor items inline. See comments.

Edwardf0t1 · 2026-06-15T06:42:20Z

+    # downstream data_ptr dedup in postprocess_state_dict can collapse them.
+    # input_scale is safe to alias because sync_tied_input_amax (earlier in
+    # this export) already max-merged the per-side amaxes.
+    _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {})


🔴 High — this cache is process-global and never scoped to an export run.

_export_quantized_weight._tied_weight_alias_cache (and the sibling _tied_unpacked_cache in moe_utils.py) live on the function object and are only ever cleared by the tests (_clear_export_quantized_weight_cache). The fact that the tests need those helpers is the tell that the cache leaks across calls.

Two consequences:

Cross-export contamination → silent wrong export. The key is weight.data_ptr() (an int) and the value is a live module. If export_hf_checkpoint() runs more than once per process (library/notebook use, a loop over models, or the test session itself), run Update README.md #1 entries survive into run [E] Uncaught exception detected: Unable to open library: libnvinfer_plugin.so.9 due to libnvinfer_plugin.so.9: cannot open shared object file #2. Source weights freed during run Update README.md #1 can have their addresses recycled by run [E] Uncaught exception detected: Unable to open library: libnvinfer_plugin.so.9 due to libnvinfer_plugin.so.9: cannot open shared object file #2's allocations, so a run-[E] Uncaught exception detected: Unable to open library: libnvinfer_plugin.so.9 due to libnvinfer_plugin.so.9: cannot open shared object file #2 module can hit a stale _prior from run Update README.md #1 and get its packed weight/weight_scale/input_scale re-pointed at a different model's buffers. The single-shot hf_ptq.py CLI is safe (one export per process); the public export_hf_checkpoint API is not.

Memory leak — cache values are modules, so every exported module is pinned for the process lifetime; the model can't be GC'd after export.

Fix: scope the cache to a single export invocation — build a dict in _export_transformers_checkpoint and thread it through _process_quantized_modules → _export_quantized_weight / _export_fused_experts. If you'd rather not touch signatures, clear both caches at the top of _export_transformers_checkpoint. Either way it shouldn't be process-global state keyed by an allocator-recyclable int.

Thanks @Edwardf0t1 - addressed in 9f5e0e1 commit

Edwardf0t1 · 2026-06-15T06:42:21Z

+    # runs earlier in _export_transformers_checkpoint and max-merges the
+    # shared input_quantizer amaxes across tied fused-experts modules, so
+    # both sides now derive bit-identical input_scale values.
+    _cache = _export_fused_experts.__dict__.setdefault("_tied_unpacked_cache", {})


🔴 High (companion to the unified_export_hf.py comment). Same process-global cache problem here: _export_fused_experts._tied_unpacked_cache is keyed by a (data_ptr, data_ptr) tuple, persists across export calls, and pins every fused-experts module for the process lifetime. Fix it the same way — scope the cache to one _export_transformers_checkpoint invocation rather than to the function object.

Yes, addressed in 9f5e0e1 commit

Edwardf0t1 · 2026-06-15T06:42:21Z

+    # both sides now derive bit-identical input_scale values.
+    _cache = _export_fused_experts.__dict__.setdefault("_tied_unpacked_cache", {})
+    _prior = _cache.get(_source_key)
+    if _prior is not None and _prior is not module:


🟡 Medium — second tied module is fully unpacked, then discarded. On a cache hit the later tied module has already run the entire unpack/pack path above; this block then aliases all of it back to _prior and throws the freshly-computed buffers away. For a tied 26B MoE that transiently doubles peak experts memory and burns the compute the dedup is meant to save. If you can detect the _source_key hit before the unpack loop and short-circuit to pure aliasing, you get the storage win without the spike. At minimum, a comment noting the transient cost would help.

Makes sense, updated it, could you please review in 9f5e0e1 commit

Edwardf0t1 · 2026-06-15T06:42:21Z

+    if not canonical_patterns and not side_substrings:
+        return state_dict
+
+    def _has_side_substring(key: str) -> bool:


🟡 Medium — substring matching is broader than "reorder the tied keys". side_substrings is canonical_tokens - alias_tokens (e.g. decoder), so _has_side_substring sends every key with that path component to the head, not just the genuinely-tied ones. Harmless to correctness (the downstream dedup only drops true data_ptr duplicates), but on a model where decoder is canonical-only this reorders the entire decoder subtree. Fine for DiffusionGemma and the flag is opt-in/default-off — just flagging that the heuristic is coarser than the docstring implies and could surprise on an asymmetrically-named model.

Edwardf0t1 · 2026-06-15T06:42:21Z

+            quantizer_attrs.weight_scale_2,
+            quantizer_attrs.input_scale,
+        ):
+            if _attr is None or not hasattr(_prior, _attr):


🟢 Minor — dead None check. quantizer_attrs.weight_scale_2 / input_scale are always non-None strings (see quantizer_attr_names), so _attr is None never fires; the hasattr(_prior, _attr) guard is doing all the work. Harmless, slightly misleading.

Edwardf0t1 · 2026-06-15T06:42:21Z

+    return head
+
+
+def sync_tied_input_amax(model: nn.Module) -> int:


🟢 Minor — document the in-place model mutation. sync_tied_input_amax overwrites q.amax on the live model. That's consistent with export already destroying the model (weights replaced with packed bytes), so it's fine — but a one-line note in the docstring that the model is not reusable after this runs would save a future reader. Also worth confirming the real DiffusionGemma recipe uses per-tensor input quantizers; the numel != 1 guard silently skips the merge for non-scalar amax, and that path isn't exercised by the CPU unit tests.

Edwardf0t1 · 2026-06-15T06:42:21Z

+    # Substring match against `model.__class__.__name__.lower()` — entries are
+    # the lowercased class-name form (no underscores). Calibration then uses
+    # `model.generate` to run the full denoising loop.
+    enc_dec_model_list = ["t5", "bart", "whisper", "diffusiongemma"]


🟢 Minor — two is_enc_decs now diverge intentionally. This one (model_type_is_enc_dec, by class name) includes diffusiongemma to route calibration through .generate(), while is_enc_dec in example_utils.py (by model_type) deliberately excludes it so preview decode stays AR-style. The docstrings explain each, but same-named functions doing opposite things for the same model is a future-reader trap — a cross-reference comment would help.

Six fixes responding to comments on PR NVIDIA#1707 from coderabbitai, cjluo-nv, and Edwardf0t1. CACHE LIFECYCLE — scope tied-weight dedup caches to one export invocation (flagged by all three reviewers). The function-static caches _tied_weight_alias_cache and _tied_unpacked_cache persisted for the lifetime of the Python interpreter, never cleared in production — only the tests cleared them. Two failure modes: (a) cross-export contamination across multiple export_hf_checkpoint calls in one process, (b) recycled data_ptr() collisions causing silent false-positive aliasing. Replaced with explicit _tied_cache (int keys, per-weight dedup) and _moe_tied_cache (tuple keys, MoE-module dedup) parameters owned by the caller. _process_quantized_modules creates fresh dicts at the top of every invocation and threads them through to _export_quantized_weight and _export_fused_experts. Both dicts are GC'd when the function returns — no cross-export contamination, no stale data_ptr lookups. FORWARD **kwargs (coderabbitai, unified_export_hf.py): export_hf_checkpoint was dropping **kwargs when calling _export_transformers_checkpoint, silently ignoring options like `accelerator` consumed downstream. Fixed. MoE EARLY-EXIT ON CACHE HIT (Edwardf0t1, moe_utils.py): _export_fused_experts now checks _moe_tied_cache at the top of the function. On cache hit, builds the per-expert subtree by aliasing from _prior directly — skipping the unpack-then-pack-then-discard pattern. Saves ~10-12 GB transient memory and ~30-60 sec on the second tied side of a 26B MoE. DROP DEAD `_attr is None` CHECK (Edwardf0t1, unified_export_hf.py): quantizer_attrs.weight_scale / weight_scale_2 / input_scale are always non-None strings; the hasattr(_prior, _attr) guard below was doing all the work. Removed the misleading dead branch. DOCUMENT IN-PLACE MUTATION (Edwardf0t1, sync_tied_input_amax): Added a paragraph at the top of the docstring noting that affected .amax buffers are overwritten in place and the model is not reusable afterwards. CROSS-REFERENCE THE TWO `is_enc_dec`s (Edwardf0t1, dataset_utils.py): Added a comment in model_type_is_enc_dec explaining the intentional divergence from is_enc_dec in example_utils.py (one routes calibration to .generate(); the other gates preview-decode prompt-prefix slicing). TEST UPDATES: Removed the _clear_export_quantized_weight_cache() and _clear_fused_experts_caches() helpers — tests now construct and pass their own local cache dicts, mirroring the production pattern. 10/10 unit tests pass. NOT ADDRESSED (deferred to follow-up): Broad substring matching in _has_side_substring (Edwardf0t1 marked it as "harmless to correctness"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Line 224: The call to _export_quantized_weight at line 224 unconditionally
passes the _tied_cache keyword argument, which breaks backward compatibility
with existing monkeypatched shims that still expect the older (wrapper, dtype)
signature. Modify the function call to conditionally pass _tied_cache only when
it is defined or when the underlying function supports it, using either a
conditional check or try-except handling to gracefully fall back to the older
signature for incompatible shims.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2a53e035-5a70-47dd-bd51-f1b41072c27c

📥 Commits

Reviewing files that changed from the base of the PR and between 0543907 and 9f5e0e1.

📒 Files selected for processing (5)

modelopt/torch/export/moe_utils.py
modelopt/torch/export/unified_export_hf.py
modelopt/torch/utils/dataset_utils.py
tests/unit/torch/export/test_unified_export_hf.py
tests/unit/torch/quantization/plugins/test_fused_experts.py

🚧 Files skipped from review as they are similar to previous changes (2)

modelopt/torch/utils/dataset_utils.py
modelopt/torch/export/unified_export_hf.py

coderabbitai · 2026-06-16T15:15:12Z

            wrapper.input_quantizer = i_quantizer

-            _export_quantized_weight(wrapper, dtype)
+            _export_quantized_weight(wrapper, dtype, _tied_cache=_tied_cache)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid unconditional _tied_cache kwarg forwarding here.

This call now always passes _tied_cache=..., which breaks existing monkeypatched shims that still use the older (wrapper, dtype) signature (e.g., in tests/unit/torch/quantization/plugins/test_fused_experts.py). That introduces a TypeError path unrelated to the feature under test.

💡 Suggested fix

- _export_quantized_weight(wrapper, dtype, _tied_cache=_tied_cache) + if _tied_cache is None: + _export_quantized_weight(wrapper, dtype) + else: + _export_quantized_weight(wrapper, dtype, _tied_cache=_tied_cache)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/moe_utils.py` at line 224, The call to _export_quantized_weight at line 224 unconditionally passes the _tied_cache keyword argument, which breaks backward compatibility with existing monkeypatched shims that still expect the older (wrapper, dtype) signature. Modify the function call to conditionally pass _tied_cache only when it is defined or when the underlying function supports it, using either a conditional check or try-except handling to gracefully fall back to the older signature for incompatible shims.

…ptq.py DiffusionGemmaForBlockDiffusion is an encoder-decoder block-diffusion text LLM (Gemma4 MoE per-layer transformer wrapped in an encoder + iterative-decoder + self-conditioning + 48-step denoising loop). Four small additions make stock examples/llm_ptq/hf_ptq.py work end-to-end for it; no new entry points needed. Substring patterns are chosen so they ALSO match the previous class name DiffusionGemma4ModelForBlockDiffusion (and previous model_type "diffusion_gemma4"). The model was renamed in transformers mid-development; using "diffusiongemma" / "DiffusionGemma" / "diffusion_gemma" matches both the current and the older class so we don't silently regress on either checkpoint generation. 1. modelopt/torch/utils/dataset_utils.py Add "diffusiongemma" to the substring list in model_type_is_enc_dec. This routes ModelOpt's built-in calibration forward_loop to model.generate() instead of a single model.forward(), so calibration exercises the full 48-step inner denoising loop and sees the entire noise->clean activation distribution. 2. modelopt/torch/export/model_utils.py Add "DiffusionGemma": "diffusion_gemma" to MODEL_NAME_TO_TYPE, *before* "Gemma". get_model_type does substring matching; "gemma" is a substring of "diffusiongemma" so without this entry the class is silently mis-classified as plain "gemma" -- which then mis-routes downstream model-type-dependent logic. 3. examples/llm_ptq/hf_ptq.py (output_decode) DiffusionGemma.generate() returns a DiffusionGemmaGenerationOutput ModelOutput dataclass, not a bare token tensor. The preview decode code's tensor-slicing crashes on ModelOutput. Unwrap to .sequences at the top of output_decode so both the enc-dec and AR slicing branches work uniformly. Generic shim -- helps any model whose .generate returns a ModelOutput with a .sequences attribute. 4. examples/llm_ptq/example_utils.py (is_enc_dec) Comment-only clarification on the semantics. is_enc_dec controls how hf_ptq.py:output_decode slices the preview result (whether to strip the prompt prefix). For T5/BART/Whisper generate returns only new tokens. For DiffusionGemma it returns prompt+canvas concatenated, so it belongs with the AR slicing path and stays out of this list. No behavioral change; documents the prior intent so future contributors don't re-add diffusion_gemma here. End-to-end smoke test: python hf_ptq.py --pyt_ckpt_path <local northbloom checkpoint> \ --qformat nvfp4_experts_only \ --calib_size 32 --trust_remote_code \ --export_path <somewhere> produces a working NVFP4 checkpoint with coherent pre/post-PTQ preview. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

When a model has multiple fused-experts modules whose 3-D source params share storage via HF _tied_weights_keys (e.g. the encoder and decoder transformer stacks of a block-diffusion encoder-decoder LLM like DiffusionGemma4 / northbloom), the unpacking loop in _export_fused_experts ordinarily creates fresh per-expert tensors for each call — destroying the tied identity and writing two full sets of expert weights + scales to disk. This adds a function-local cache keyed by (gate_up_proj.data_ptr(), down_proj.data_ptr()). On a cache miss the existing unpacking path runs unchanged. On a cache hit, after the normal unpacking completes, the per-expert weight / weight_scale / weight_scale_2 buffers are re-pointed at the prior module's tensors so they share storage. The downstream postprocess_state_dict data_ptr()-based dedup then catches them and drops the duplicates from the saved checkpoint. input_scale is intentionally NOT aliased: encoder and decoder paths have legitimately different activation distributions (verified across all 60 tied pairs of a 512-prompt calibration — down_proj_input divergence median 2.26x, max 18.4x), so each side keeps its own per-side calibrated scale. Model-agnostic: no name regex, no model-type lookup. Cache miss falls through to existing behavior, so non-tied models are unaffected. Empirical result on DiffusionGemma4 26B at nvfp4_experts_only: safetensors 28.43 GB -> 16.47 GB (-42%) shards 4 -> 2 decoder weight/weight_scale/weight_scale_2 entries: 3840 -> 0 each decoder input_scale entries: 3840 -> 3840 (kept per-side, as intended) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Symmetric companion to the data_ptr-cache alias added to _export_fused_experts in the prior commit. Catches plain nn.Linear modules whose .weight Parameters are tied via HF tie_weights() (e.g. encoder and decoder attention QKV/O, MLP gate/up/down, router proj of an encoder-decoder LLM like DiffusionGemma4 / northbloom) when they are quantized under recipes that route through _export_quantized_weight. Mechanism: capture weight.data_ptr() at the top of the function, before the setattr further down wraps the packed bytes in a fresh nn.Parameter (which destroys the tie). At the end of the function, consult a function-local cache keyed by that captured data_ptr. On cache miss: register this sub_module as the canonical owner. On cache hit (a previously-processed module shared the same source weight memory): alias .weight, weight_scale, weight_scale_2 to the prior module's tensors so downstream data_ptr-based dedup in postprocess_state_dict drops the duplicates. input_scale is intentionally NOT aliased — calibration legitimately diverges per-side (verified in Q2 analysis: down_proj_input ratio up to 18x across 60 tied pairs on the northbloom DiffusionGemma4 model). Recipe-agnostic: under nvfp4_experts_only this is a true no-op (dense Linears early-return at QUANTIZATION_NONE; per-expert wrappers reach this function but have fresh data_ptrs from upstream slice+contiguous so cache misses always). Under full nvfp4 it fires for every tied dense Linear pair. Same safety guarantees as the existing data_ptr-based dedup: cannot false-positive because it only aliases when source memory was already shared by the model author (via tie_weights or equivalent). No name regex, no _tied_weights_keys lookup, no model introspection. Empirical results on DiffusionGemma4 26B (calib_size=8 smoke): nvfp4_experts_only: 16.47 GB -> 16.47 GB (byte-identical with prior experts-only export; this patch a no-op as expected) nvfp4 (full): ~27 GB est -> 14.24 GB (-12.7 GB; both patches firing on disjoint module sets; per-name dedup verified in index) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Adds an opt-in pass (--canonical_tied_naming, default off) that reorders the state_dict before postprocess_state_dict so that keys matching the canonical side of HF's _tied_weights_keys declaration iterate before their aliases. The existing first-wins data_ptr dedup at quant_utils.py:1148-1163 then drops the alias names, leaving the canonical names in the exported safetensors. Motivation: for models like DiffusionGemma4 (northbloom), HF declares {alias: canonical} via _tied_weights_keys, where the encoder side is the alias and the decoder side is canonical. The original HF safetensors index uses decoder-prefixed names for all tied weights (661/691 keys vs 30/691 encoder-only layer_scalar keys). The single-backbone vLLM mockup loader strips both prefixes to model.* and relies on this canonical naming. The default modelopt export today walks the model in registration order (encoder before decoder, per the model's __init__ order), so encoder names win in the first-wins dedup. The exported checkpoint thus uses 46 677 encoder-prefixed keys for tied tensors -- backwards relative to the upstream HF naming and to what downstream consumers expect. Implementation. _tied_weights_keys is declared per model class with paths relative to that class. In nested models (e.g. DiffusionGemma4) multiple submodules declare their own ties: the outer wrapper at DiffusionGemma4ModelForBlockDiffusion ties lm_head.weight to model.decoder.embed_tokens.weight, while the inner DiffusionGemma4Model at model.model declares the much larger encoder<->decoder dict with paths relative to itself. _collect_canonical_tied_patterns walks model.named_modules() and collects every dict-style _tied_weights_keys declaration, prefixing each pattern with the submodule's qualified path so the regexes match against root-level state_dict keys. Without the prefix, the inner dict's patterns (which lack a "model." prefix) silently fail to match keys like "model.decoder.layers.0.self_attn.q_proj.weight" -- a bug that would cause only the outer dict's single entry (embed_tokens) to flip in the dedup. _reorder_canonical_first then partitions the state_dict into head (canonical-pattern matches) and tail (everything else), preserving original order within each partition. head.update(tail) yields a single dict with canonical keys first. The downstream dedup loop iterates this in insertion order and records the canonical names in seen_tensors; alias names then arrive as duplicates and are dropped. Scope of behavior change: - Models with no _tied_weights_keys, or only legacy list-of-strings declarations: _collect returns an empty pattern list, helper short-circuits, state_dict returned unchanged. Zero effect on any existing modelopt user. - Models with dict-style declarations (e.g. DiffusionGemma4): when the flag is set, canonical-side names win. When the flag is unset (default), behavior is identical to before this commit. No changes to dedup logic itself, to the existing tied-weight alias patches in _export_quantized_weight and _export_fused_experts, or to _process_quantized_modules iteration. Strictly additive. Verified on DiffusionGemma4 26B / nvfp4_experts_only / v4 / calib_size 32: 35 127 encoder keys removed (decoder kept), 0 decoder keys removed; layer_scalar (30 encoder-only keys) and per-side input_scale (11 520 keys, intentionally not deduped) unaffected; total safetensors bytes 17.68 GB matching the prior export to within 500 KB of safetensors-metadata-ordering noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Companion piece to the existing tied-weight alias patches in _export_fused_experts (commit 8b00e85) and _export_quantized_weight (commit e8c36b0), which already alias bit-identical weight / weight_scale / weight_scale_2 between tied modules but leave input_scale per-side. This commit closes the loop on input_scale so consumers that load a single canonical scale per Linear (e.g. vLLM's single-backbone DiffusionGemma4 mockup) see a value consistent across all tied sides. Implementation has two parts. 1. New sync_tied_input_amax(model) helper. Walks named_modules(), groups by source weight data_ptr (same signature our existing dedup patches use), and max-merges input_quantizer.amax across each group. Uses the canonical 4-line idiom shared with preprocess_linear_fusion (quant_utils.py:1394-1401) and sync_moe_gate_up_amax (layer_utils.py:1197): merged = torch.max(torch.stack([q.amax for q in qs])) for q in qs: q.amax = merged.clone() Handles both dense Linears (keyed by weight.data_ptr) and fused MoE modules (keyed by (gate_up_proj, down_proj) data_ptr tuple, merging gate_up_proj_input_quantizer and down_proj_input_quantizer independently across the group). Scalar-only, matching preprocess_linear_fusion's contract. Called unconditionally from _export_transformers_checkpoint after sync_moe_gate_up_amax, BEFORE _process_quantized_modules so the merged amax flows into _export_quantized_weight's input_scale derivation. Mirrors sync_moe_gate_up_amax's "no-flag, fires when applicable, no-op otherwise" convention. 2. Extend the existing tied-weight alias loops in _export_quantized_weight and _export_fused_experts to include input_scale alongside weight_scale / weight_scale_2. Before this commit those loops intentionally skipped input_scale because encoder/decoder amaxes legitimately differed (Q2 analysis showed up to 18x divergence for down_proj_input on v1). With sync_tied_input_amax in place, both sides now derive bit-identical input_scale values; aliasing the buffers is safe and lets the existing data_ptr dedup in postprocess_state_dict collapse them so only one canonical entry per Linear survives in the exported safetensors. Also extends the Q-B canonical-side reorder pass added in commit 837768f with an auto-derived side-substring matcher. HF's _tied_weights_keys regex patterns target the pre-export module structure (fused gate_up_proj), but after _export_fused_experts unpacks them into per-expert gate_proj/up_proj/down_proj submodules, post-export keys like ...experts.Y.gate_proj.input_scale are not covered by HF's regex. Without the substring fallback, those keys fell through Q-B to the "alias-first" partition, so when the new input_scale alias step shared data_ptrs, the encoder name won the dedup instead of the decoder name. _collect_canonical_tied_patterns now returns (patterns, side_substrings). The side_substrings list is auto-derived from each _tied_weights_keys entry as the set of dot-separated tokens that appear in canonical patterns but not in alias patterns. For DiffusionGemma4 this resolves to ["decoder"]: every canonical pattern contains "decoder", no alias pattern does. _reorder_canonical_first treats a key as canonical if it matches a regex pattern OR contains a side substring as a proper path component (bordered by "." or at start/end). The path-component requirement avoids false positives from accidental name collisions. Net effect for DiffusionGemma4 nvfp4_experts_only / v4 / calib_size 32: the 11 520 encoder.X.gate_proj/up_proj/down_proj.input_scale entries that the prior export carried are removed; the 11 520 decoder-side entries remain with the merged amax-derived value. Total bytes drops by ~1 MB (scalar entries). Other tied-tensor entries (weight, weight_scale, weight_scale_2) and encoder-only entries (layer_scalar, 30 keys) are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

The diffusion self-conditioning network (block-diffusion models like DiffusionGemma) is text-only and not exercised by typical calibration data. Without exclusion its TensorQuantizers never see input, never set _amax, and export crashes at _export_quantized_weight: AttributeError: 'TensorQuantizer' object has no attribute '_amax' Companion to the upstream vision-tower / visual / embed_vision excludes already in this unit (PR NVIDIA#1691). Pattern is a no-op for non-diffusion models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Adds the tied-modules test fixture plus 10 unit tests covering the tied-weight machinery introduced earlier in this series. tests/_test_utils/torch/quantization/tied_modules.py (new): Three small factory helpers shared by the unit tests: - make_tied_linear_pair() -- two nn.Linears whose .weight Parameter is shared via setattr (mimics HF tie_weights() after __init__). - tie_fused_experts_3d_params(enc, dec) -- in-place tie of gate_up_proj / down_proj between two fused-experts modules (paired with the existing _SyntheticFusedExperts fixture). - wrap_in_parent_with_tied_keys(enc, dec, ...) -- builds a parent nn.Module with HF-style _tied_weights_keys (dict-style for the canonical case, list-style for the legacy negative case). Each factory asserts post-conditions on the tie so a misuse fails loudly at construction. tests/unit/torch/export/test_unified_export_hf.py (new): 8 tests Commit f3e9543ab -- canonical-side reorder: - dict-style _tied_weights_keys yields patterns + canonical substrings - list-style yields no canonical info (reorder becomes a no-op) - _reorder_canonical_first puts decoder-side keys ahead of encoder-side keys Commit 3fb3ba053 -- sync_tied_input_amax: - tied Linears with divergent amaxes (2.0 vs 5.0) get both sides overwritten with the elementwise max (5.0) - untied Linears keep per-side amaxes (no-op when there's no tie) Commit 29674a7e1 -- dense Linear tied-weight dedup: - tied Linears share data_ptr for packed .weight + scale buffers - untied Linears keep independent data_ptrs - asymmetric quant: unquantized side early-returns at QUANTIZATION_NONE, stays at the original shared Parameter tests/unit/torch/quantization/plugins/test_fused_experts.py (extended): 2 tests Commit 10a8fdbd5 -- MoE experts dedup: - two _SyntheticSparseMoeBlock instances with tied 3-D source params share data_ptr across every per-expert buffer - untied counterparts keep independent per-expert data_ptrs Pure-Python; CPU-only; ~1s wall total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Adds a New Features bullet under 0.46 covering the tied-weight dedup, canonical-side reorder, sync_tied_input_amax helper, and the *self_conditioning* default exclude introduced earlier in this series. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Six fixes responding to comments on PR NVIDIA#1707 from coderabbitai, cjluo-nv, and Edwardf0t1. CACHE LIFECYCLE — scope tied-weight dedup caches to one export invocation (flagged by all three reviewers). The function-static caches _tied_weight_alias_cache and _tied_unpacked_cache persisted for the lifetime of the Python interpreter, never cleared in production — only the tests cleared them. Two failure modes: (a) cross-export contamination across multiple export_hf_checkpoint calls in one process, (b) recycled data_ptr() collisions causing silent false-positive aliasing. Replaced with explicit _tied_cache (int keys, per-weight dedup) and _moe_tied_cache (tuple keys, MoE-module dedup) parameters owned by the caller. _process_quantized_modules creates fresh dicts at the top of every invocation and threads them through to _export_quantized_weight and _export_fused_experts. Both dicts are GC'd when the function returns — no cross-export contamination, no stale data_ptr lookups. FORWARD **kwargs (coderabbitai, unified_export_hf.py): export_hf_checkpoint was dropping **kwargs when calling _export_transformers_checkpoint, silently ignoring options like `accelerator` consumed downstream. Fixed. MoE EARLY-EXIT ON CACHE HIT (Edwardf0t1, moe_utils.py): _export_fused_experts now checks _moe_tied_cache at the top of the function. On cache hit, builds the per-expert subtree by aliasing from _prior directly — skipping the unpack-then-pack-then-discard pattern. Saves ~10-12 GB transient memory and ~30-60 sec on the second tied side of a 26B MoE. DROP DEAD `_attr is None` CHECK (Edwardf0t1, unified_export_hf.py): quantizer_attrs.weight_scale / weight_scale_2 / input_scale are always non-None strings; the hasattr(_prior, _attr) guard below was doing all the work. Removed the misleading dead branch. DOCUMENT IN-PLACE MUTATION (Edwardf0t1, sync_tied_input_amax): Added a paragraph at the top of the docstring noting that affected .amax buffers are overwritten in place and the model is not reusable afterwards. CROSS-REFERENCE THE TWO `is_enc_dec`s (Edwardf0t1, dataset_utils.py): Added a comment in model_type_is_enc_dec explaining the intentional divergence from is_enc_dec in example_utils.py (one routes calibration to .generate(); the other gates preview-decode prompt-prefix slicing). TEST UPDATES: Removed the _clear_export_quantized_weight_cache() and _clear_fused_experts_caches() helpers — tests now construct and pass their own local cache dicts, mirroring the production pattern. 10/10 unit tests pass. NOT ADDRESSED (deferred to follow-up): Broad substring matching in _has_side_substring (Edwardf0t1 marked it as "harmless to correctness"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Mirrors the phi4mm / nemotron_vl model-specific recipe pattern under modelopt_recipes/huggingface/diffusiongemma/ptq/. Moves the DiffusionGemma-only *self_conditioning* exclude out of the shared default_disabled_quantizers unit and into a model-specific disabled_quantizers unit that splices the standard default plus *self_conditioning*. The shared default no longer carries a diffusion-specific exclude. New files: - modelopt_recipes/huggingface/diffusiongemma/ptq/nvfp4_experts_only.yaml Three imports -- base_disable_all, experts_nvfp4, disabled_quantizers. Skips the block_sparse_moe_nvfp4 unit because Gemma4's MoE container is named `experts` directly with no `block_sparse_moe` wrapper, so that pattern matched zero modules on DiffusionGemma. Numerically identical to the general nvfp4_experts_only preset with the additional *self_conditioning* exclude (verified by re-export: bit-identical safetensors + index.json vs the same model exported against the general preset plus the prior in-default *self_conditioning* line). - modelopt_recipes/huggingface/diffusiongemma/ptq/disabled_quantizers.yaml Reusable QuantizerCfgList unit. Imports the standard default_disabled_quantizers and appends the model-specific *self_conditioning* exclude. - modelopt_recipes/huggingface/diffusiongemma/ptq/README.md Documents the recipe family. Modified: - modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml Removes the *self_conditioning* exclude and its comment block. - CHANGELOG.rst Rewrites the *self_conditioning* sentence to describe the model-specific recipe pattern. Invocation: General preset (no diffusion-specific exclude): python hf_ptq.py --qformat nvfp4_experts_only ... DiffusionGemma-specific: python hf_ptq.py --recipe huggingface/diffusiongemma/ptq/nvfp4_experts_only ... Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

cjluo-nv

Bot review — DM the bot to share feedback.

Re-review of tied-weight PTQ/HF-export support for dLLM encoder-decoder models (DiffusionGemma). +880/-13, 13 files, 6 directories.

Design review (settled). The protocol applies (6 directories), but the design question is explicitly addressed: the PR reuses the existing postprocess_state_dict data_ptr first-wins dedup rather than introducing a competing dedup system — sync_tied_input_amax, the alias loops, and _reorder_canonical_first all feed that existing mechanism. Two maintainer reviewers (cjluo-nv, Edwardf0t1) independently confirmed this in prior comments. Proceeding to correctness.

Prior critical comments — all resolved:

💬 Process-global alias caches (flagged "High" by both cjluo-nv and Edwardf0t1; cross-export contamination + module leak). Author replied "addressed in 9f5e0e1" — verified: caches are now caller-owned (_tied_cache/_moe_tied_cache created fresh in _process_quantized_modules on every call) and threaded through _export_quantized_weight/_export_fused_experts. Function-static state eliminated.
💬 **kwargs dropped on _export_transformers_checkpoint call (CodeRabbit Major). Verified forwarded.
💬 MoE second-tied-module fully unpacked then discarded (Edwardf0t1 Medium). Verified: the MoE tied fast-path now short-circuits to pure aliasing before the unpack loop, avoiding the transient memory spike.
Docstring input_scale contradiction, diverging is_enc_dec cross-reference, in-place mutation note — all addressed.

New CodeRabbit comment (moe_utils.py:224, "breaks monkeypatched shims") is not valid — the in-tree spy functions in test_fused_experts.py were updated to def _spy_export(wrapper, dtype, **_kwargs), so the unconditional _tied_cache= forwarding does not break them.

Why nudge rather than approve (needs export-owner sign-off):

E2E validated manually on DiffusionGemma v10 (not in CI); the large tied-MoE path and the accelerator-gathered distributed export path have no automated coverage. 10 CPU unit tests cover the dedup/alias/reorder/amax-sync logic well.
💬 _reorder_canonical_first substring breadth (cjluo-nv:786, Edwardf0t1:798) — author did not directly reply; acknowledged by reviewers as harmless-to-correctness and opt-in/default-off, and _has_side_substring now requires dot-separated path components. Still coarser than the docstring implies on asymmetrically-named models; worth a human confirming the DiffusionGemma naming is the only intended consumer.
The numel != 1 (non-scalar amax) branch in sync_tied_input_amax silently skips the merge and is not exercised by the CPU tests; confirm the real recipe uses per-tensor input quantizers.

Licensing clean: new recipe YAMLs carry the standard NVIDIA Apache header (matches LICENSE_HEADER), no license-file or third-party-code changes.

cjluo-nv · 2026-06-16T18:01:37Z

    )
    parser.add_argument("--export_path", default="exported_model")
+    parser.add_argument(
+        "--canonical_tied_naming",


What does this do? Can we just detect difussion gemma and apply this automatically instead of introducing a new flag?

yes let me move that, agreed

cjluo-nv · 2026-06-16T18:02:21Z

    n = module.num_experts
    expert_dim = _get_fused_expert_intermediate_dim(module)

+    # Capture source tensor identities BEFORE unpacking (the source


move these into a helper method for readability

cjluo-nv · 2026-06-16T18:03:21Z

    if weight_scale is not None:
        sub_module.register_buffer(quantizer_attrs.weight_scale, weight_scale)

+    # Tied-weight dedup: if a previously-processed module shared the same


do we dedupe the quantized the weights or we also need to have them tied?

if the module was originally weight tied, we add the quantizers to both of them irrespective (in this case the expert layers of encoder and decoder) and then we calibrate the quantizers. When exporting if we don't do the de-duplication, we will end up with duplicate entries in ckpt.

cjluo-nv

Bot review — DM the bot to share feedback.

Re-review of tied-weight PTQ/HF-export support for dLLM encoder-decoder models (DiffusionGemma). +880/-13, 13 files, 6 dirs. Design-review protocol applied; design is settled (reuses the existing postprocess_state_dict data_ptr first-wins dedup — sync_tied_input_amax, the alias loops and _reorder_canonical_first all feed that one mechanism rather than introducing a competing dedup system; confirmed by maintainers cjluo-nv + Edwardf0t1). Licensing clean (recipe YAMLs carry the standard NVIDIA Apache header). No prompt-injection in the untrusted content.

Prior critical comments — verified resolved in current diff:

💬 Process-global alias caches (flagged High by cjluo-nv + Edwardf0t1). Author: "addressed in 9f5e0e1" — confirmed: _tied_cache/_moe_tied_cache are now created fresh in _process_quantized_modules on every call and threaded through _export_quantized_weight/_export_fused_experts. Function-static state eliminated.
💬 **kwargs dropped on _export_transformers_checkpoint call — verified forwarded (so accelerator is preserved).
💬 MoE second-tied-module fully unpacked then discarded — verified: the tied fast path now short-circuits to pure aliasing before the unpack loop.
💬 Docstring input_scale contradiction / is_enc_dec cross-reference — verified addressed.
CodeRabbit "moe_utils.py:224 breaks monkeypatched shims" — not valid; the in-tree spy fns were updated to (wrapper, dtype, **_kwargs).

Why nudge rather than approve (needs export-owner sign-off):

Operator's complexity question is unaddressed. The operator asks whether this can be simplified — "can we dedupe before PTQ and postprocess?" The PR re-aliases packed buffers post-quantization so the existing dedup catches them; it does not discuss the alternative of deduping the shared source weight before/around PTQ (which could avoid sync_tied_input_amax, the per-expert alias loop, and _export_quantized_weight's alias step entirely). Worth the author justifying in the PR body why post-pack aliasing is preferred over a pre-PTQ approach.
cjluo-nv's flag question (hf_ptq.py:1262) is still open: "Can we just detect diffusion gemma and apply this automatically instead of introducing a new flag?" --canonical_tied_naming adds public surface; auto-detection from _tied_weights_keys would avoid it. Author has not replied.
Minor regression introduced by the cache fix: in _process_quantized_modules the two cache-init statements now precede the docstring, so the triple-quoted block is no longer the function docstring (__doc__ becomes None). Move the _tied_cache = {} / _moe_tied_cache = {} lines below the docstring. Private fn, low impact.
Coverage gaps: E2E validated manually on DiffusionGemma v10 (not in CI). The tied-MoE path and the distributed accelerator-gathered export path have no automated coverage; the numel != 1 (non-scalar amax) branch in sync_tied_input_amax silently skips and is untested — confirm the real recipe uses per-tensor input quantizers. 10 CPU unit tests cover the dedup/alias/reorder/scalar-amax logic well.
💬 _reorder_canonical_first substring breadth (cjluo-nv:786, Edwardf0t1:798) — _has_side_substring now requires dot-separated path components, but it's still coarser than the docstring on asymmetrically-named models; opt-in/default-off limits blast radius. Worth a human confirming DiffusionGemma is the only intended consumer.

juhi10071998 requested review from a team as code owners June 13, 2026 01:04

juhi10071998 requested review from realAsma and sugunav14 June 13, 2026 01:04

juhi10071998 changed the title ~~Add tied-weight PTQ export support for block-diffusion encoder-decoder models (DiffusionGemma)~~ Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support ] Jun 13, 2026

coderabbitai Bot reviewed Jun 13, 2026

View reviewed changes

Comment thread modelopt/torch/export/moe_utils.py Outdated

Comment thread modelopt/torch/export/unified_export_hf.py Outdated

Comment thread modelopt/torch/export/unified_export_hf.py

cjluo-nv reviewed Jun 15, 2026

View reviewed changes

Edwardf0t1 reviewed Jun 15, 2026

View reviewed changes

juhi10071998 requested a review from a team as a code owner June 16, 2026 14:44

coderabbitai Bot approved these changes Jun 16, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

juhi10071998 and others added 10 commits June 16, 2026 17:05

juhi10071998 force-pushed the northbloom branch from 9f5e0e1 to d2acf0a Compare June 16, 2026 17:26

cjluo-nv reviewed Jun 16, 2026

View reviewed changes

		return patterns, side_substrings


		def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict:

		return head


		def sync_tied_input_amax(model: nn.Module) -> int:

Conversation

juhi10071998 commented Jun 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

New Features

Documentation

Uh oh!

coderabbitai Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juhi10071998 commented Jun 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 13, 2026 •

edited

Loading

codecov Bot commented Jun 13, 2026 •

edited

Loading