Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support ]#1707
Add support for dLLM encoder-decoder models (DiffusionGemma) [tied-weight PTQ export support ]#1707juhi10071998 wants to merge 10 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (13)
✅ Files skipped from review due to trivial changes (4)
🚧 Files skipped from review as they are similar to previous changes (8)
📝 WalkthroughWalkthroughAdds tied-weight PTQ and HuggingFace checkpoint export support for block-diffusion encoder-decoder LLMs. Core changes include tensor-aliased weight dedup in quantized export, input quantizer ChangesTied-weight PTQ and HF export for encoder-decoder models
Sequence Diagram(s)sequenceDiagram
participant CLI as hf_ptq.py
participant API as export_hf_checkpoint
participant Sync as sync_tied_input_amax
participant Process as _process_quantized_modules
participant ExportW as _export_quantized_weight
participant FusedExp as _export_fused_experts
participant Reorder as _reorder_canonical_first
CLI->>API: canonical_tied_naming=True
API->>Sync: max-merge input_quantizer.amax<br/>across tied dense linears and MoE modules
Sync-->>API: amax synchronized
API->>Process: _tied_cache keyed by data_ptr,<br/>_moe_tied_cache keyed by (gate_up_proj_dp, down_proj_dp)
loop for each quantized linear
Process->>ExportW: pass _tied_cache
alt data_ptr in _tied_cache
ExportW->>ExportW: alias packed weight and scale buffers
else first occurrence
ExportW->>ExportW: pack weight, store in _tied_cache
end
end
loop for each fused MoE block
Process->>FusedExp: pass _moe_tied_cache and _tied_cache
alt source key in _moe_tied_cache
FusedExp->>FusedExp: alias per-expert weight Parameters<br/>and scale buffers, return early
else cache miss
FusedExp->>ExportW: unpack with _tied_cache
FusedExp->>FusedExp: store in _moe_tied_cache
end
end
API->>Reorder: sort state_dict so canonical keys precede alias keys
Reorder-->>API: ordered state_dict → postprocess_state_dict
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1707 +/- ##
===========================================
+ Coverage 58.45% 76.55% +18.09%
===========================================
Files 510 511 +1
Lines 56274 56457 +183
===========================================
+ Hits 32896 43221 +10325
+ Misses 23378 13236 -10142
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
561-568: 💤 Low valueMove import to top of file or add justification comment.
The import at line 566 is inside a function without a comment explaining why. Per CONTRIBUTING.md, imports belong at the top so errors surface at collection time.
_export_quantized_weightis from a core modelopt module with no circular-import or optional-dependency concern.♻️ Suggested fix
Add the import near the other
modelopt.torch.exportimports at the top of the file:from modelopt.torch.export.moe_utils import _export_fused_experts from modelopt.torch.export.quant_utils import get_quant_config +from modelopt.torch.export.unified_export_hf import _export_quantized_weightThen simplify the helper:
def _clear_fused_experts_caches(): """Clear function-static alias caches in both export entry points.""" _export_fused_experts.__dict__.pop("_tied_unpacked_cache", None) - # _export_fused_experts internally calls _export_quantized_weight per per-expert - # wrapper; clear that cache too so each test sees a pristine state. - from modelopt.torch.export.unified_export_hf import _export_quantized_weight - _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 561 - 568, Move the local import of _export_quantized_weight out of _clear_fused_experts_caches and place it with the other modelopt.torch.export imports at the top of the test file (or, if there is a deliberate reason to import inside the function, add a one-line justification comment explaining why); then simplify _clear_fused_experts_caches to directly reference _export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None) alongside the existing _export_fused_experts cache pop. Ensure you reference the symbols _clear_fused_experts_caches, _export_quantized_weight, and _export_fused_experts when making the change.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 45-51: The docstring for the tied-experts dedup block contradicts
the implementation by saying "input_scale is left per-side" while the code in
moe_utils.py aliases input_scale alongside weight_scale and weight_scale_2 (due
to sync_tied_input_amax running earlier); update the docstring to state that
input_scale is aliased too and mention the reason (sync_tied_input_amax runs
prior), keeping the rest of the explanation about caching data_ptr() and
aliasing behavior intact so docstring matches the implementation.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1545-1547: The call to _export_transformers_checkpoint is dropping
**kwargs (so options like accelerator are ignored); update the call in
unified_export_hf.py where post_state_dict, hf_quant_config =
_export_transformers_checkpoint(model, dtype,
canonical_tied_naming=canonical_tied_naming) to forward any incoming **kwargs
(e.g., pass **kwargs alongside the existing named args) so that options consumed
inside _export_transformers_checkpoint (such as accelerator) are preserved
during export; keep the canonical_tied_naming arg as-is when forwarding
**kwargs.
- Around line 721-744: The tied-weight alias cache (_tied_weight_alias_cache) is
stored on the function object of _export_quantized_weight and persists across
exports; reset it at the start of each export invocation to avoid stale aliases:
in the _export_quantized_weight function (or the outer export entrypoint that
calls it) initialize or clear
_export_quantized_weight.__dict__['_tied_weight_alias_cache'] (or replace with a
fresh dict) before using _tied_source_data_ptr so each export gets a fresh cache
and no stale module references are reused.
---
Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 561-568: Move the local import of _export_quantized_weight out of
_clear_fused_experts_caches and place it with the other modelopt.torch.export
imports at the top of the test file (or, if there is a deliberate reason to
import inside the function, add a one-line justification comment explaining
why); then simplify _clear_fused_experts_caches to directly reference
_export_quantized_weight.__dict__.pop("_tied_weight_alias_cache", None)
alongside the existing _export_fused_experts cache pop. Ensure you reference the
symbols _clear_fused_experts_caches, _export_quantized_weight, and
_export_fused_experts when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1f4075f4-f09e-491d-b111-ffe09bfe2a7a
📒 Files selected for processing (11)
CHANGELOG.rstexamples/llm_ptq/example_utils.pyexamples/llm_ptq/hf_ptq.pymodelopt/torch/export/model_utils.pymodelopt/torch/export/moe_utils.pymodelopt/torch/export/unified_export_hf.pymodelopt/torch/utils/dataset_utils.pymodelopt_recipes/configs/ptq/units/default_disabled_quantizers.yamltests/_test_utils/torch/quantization/tied_modules.pytests/unit/torch/export/test_unified_export_hf.pytests/unit/torch/quantization/plugins/test_fused_experts.py
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
PR adds tied-weight PTQ/HF-export support for encoder-decoder dLLMs (DiffusionGemma). Design review: the change correctly reuses the existing data_ptr-based first-wins dedup in postprocess_state_dict rather than introducing a competing dedup system — the new helpers (sync_tied_input_amax, the alias loops, _reorder_canonical_first) all feed that existing mechanism, and the PR body explains this. Design is settled. Licensing: new test files carry the project's standard Apache header (matches existing source files), no license-file or third-party-copy changes — no block. Tests: 10 CPU-only unit tests covering dedup/alias/reorder/amax-sync paths; e2e validated manually (not in CI), acceptable for the large-model path.
Main concern (correctness): both alias caches — _export_quantized_weight._tied_weight_alias_cache and _export_fused_experts._tied_unpacked_cache — are function-static, keyed by data_ptr(), and are never cleared in production code (only tests clear them). This risks (1) cross-export contamination when export_hf_checkpoint runs more than once in a process — a freed tensor's address from export #1 can be recycled by export #2 and falsely match a stale _prior module; and (2) intra-export false positives, since the captured source data_ptr() belongs to the float weight that gets freed after the setattr replacement and its address may be reused by a later unrelated allocation. Recommend resetting both caches at the start of _export_transformers_checkpoint (or passing an explicit per-export dict). Secondary: _reorder_canonical_first's substring matcher can over-match when canonical tokens are common words. Flagging for the owner — the core mechanism is sound for the intended single-export tied case, but the persistent cache is a latent footgun.
| # input_scale is safe to alias because sync_tied_input_amax (earlier in | ||
| # this export) already max-merged the per-side amaxes. | ||
| _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {}) | ||
| _prior = _cache.get(_tied_source_data_ptr) |
There was a problem hiding this comment.
Bot comment.
This alias cache is function-static (_export_quantized_weight.__dict__[...]) and keyed by data_ptr(), but it is never cleared in production — only the tests pop it. Two failure modes:
- Cross-export contamination: if
export_hf_checkpointis invoked more than once in the same process (batch export, two models), the freeddata_ptrof a tensor from the first export can be recycled by an allocation in the second export and falsely hit a stale_priormodule — aliasing this module's weight to a dead module from the prior run. - Intra-export reuse:
_tied_source_data_ptris captured from the floatweightthat gets replaced bysetattrbelow; after replacement that storage may be freed and its address reused by a later unrelated module's weight, producing a false-positive match.
Suggest clearing this cache (and the _export_fused_experts one) at the start of _export_transformers_checkpoint, or threading an explicit per-export dict through these calls so identity is scoped to a single export.
| return patterns, side_substrings | ||
|
|
||
|
|
||
| def _reorder_canonical_first(state_dict: dict, model: nn.Module) -> dict: |
There was a problem hiding this comment.
Bot comment.
side_substrings are derived as canonical-only tokens and then matched as dot-separated path components across the whole state_dict. If a canonical-side token is a common identifier (e.g. ends up including something like weight or a generic submodule name shared by unrelated keys), _has_side_substring will pull unrelated keys into the head partition. Worth asserting/filtering the derived substrings to the tokens that are actually side-discriminating, or restricting matching to the tied subtree prefix.
Edwardf0t1
left a comment
There was a problem hiding this comment.
Thorough, well-documented PR with genuinely good unit tests (tie / no-tie / one-side-unquantized / canonical-reorder all covered). The design — let the existing postprocess_state_dict data_ptr dedup do the collapsing and just re-alias packed buffers so it can see them — is clean and minimally invasive.
One real latent-correctness issue to fix before merge (the process-global alias caches), plus a few medium/minor items inline. See comments.
| # downstream data_ptr dedup in postprocess_state_dict can collapse them. | ||
| # input_scale is safe to alias because sync_tied_input_amax (earlier in | ||
| # this export) already max-merged the per-side amaxes. | ||
| _cache = _export_quantized_weight.__dict__.setdefault("_tied_weight_alias_cache", {}) |
There was a problem hiding this comment.
🔴 High — this cache is process-global and never scoped to an export run.
_export_quantized_weight._tied_weight_alias_cache (and the sibling _tied_unpacked_cache in moe_utils.py) live on the function object and are only ever cleared by the tests (_clear_export_quantized_weight_cache). The fact that the tests need those helpers is the tell that the cache leaks across calls.
Two consequences:
- Cross-export contamination → silent wrong export. The key is
weight.data_ptr()(anint) and the value is a live module. Ifexport_hf_checkpoint()runs more than once per process (library/notebook use, a loop over models, or the test session itself), run Update README.md #1 entries survive into run [E] Uncaught exception detected: Unable to open library: libnvinfer_plugin.so.9 due to libnvinfer_plugin.so.9: cannot open shared object file #2. Source weights freed during run Update README.md #1 can have their addresses recycled by run [E] Uncaught exception detected: Unable to open library: libnvinfer_plugin.so.9 due to libnvinfer_plugin.so.9: cannot open shared object file #2's allocations, so a run-[E] Uncaught exception detected: Unable to open library: libnvinfer_plugin.so.9 due to libnvinfer_plugin.so.9: cannot open shared object file #2 module can hit a stale_priorfrom run Update README.md #1 and get its packedweight/weight_scale/input_scalere-pointed at a different model's buffers. The single-shothf_ptq.pyCLI is safe (one export per process); the publicexport_hf_checkpointAPI is not. - Memory leak — cache values are modules, so every exported module is pinned for the process lifetime; the model can't be GC'd after export.
Fix: scope the cache to a single export invocation — build a dict in _export_transformers_checkpoint and thread it through _process_quantized_modules → _export_quantized_weight / _export_fused_experts. If you'd rather not touch signatures, clear both caches at the top of _export_transformers_checkpoint. Either way it shouldn't be process-global state keyed by an allocator-recyclable int.
There was a problem hiding this comment.
Thanks @Edwardf0t1 - addressed in 9f5e0e1 commit
| # runs earlier in _export_transformers_checkpoint and max-merges the | ||
| # shared input_quantizer amaxes across tied fused-experts modules, so | ||
| # both sides now derive bit-identical input_scale values. | ||
| _cache = _export_fused_experts.__dict__.setdefault("_tied_unpacked_cache", {}) |
There was a problem hiding this comment.
🔴 High (companion to the unified_export_hf.py comment). Same process-global cache problem here: _export_fused_experts._tied_unpacked_cache is keyed by a (data_ptr, data_ptr) tuple, persists across export calls, and pins every fused-experts module for the process lifetime. Fix it the same way — scope the cache to one _export_transformers_checkpoint invocation rather than to the function object.
| # both sides now derive bit-identical input_scale values. | ||
| _cache = _export_fused_experts.__dict__.setdefault("_tied_unpacked_cache", {}) | ||
| _prior = _cache.get(_source_key) | ||
| if _prior is not None and _prior is not module: |
There was a problem hiding this comment.
🟡 Medium — second tied module is fully unpacked, then discarded. On a cache hit the later tied module has already run the entire unpack/pack path above; this block then aliases all of it back to _prior and throws the freshly-computed buffers away. For a tied 26B MoE that transiently doubles peak experts memory and burns the compute the dedup is meant to save. If you can detect the _source_key hit before the unpack loop and short-circuit to pure aliasing, you get the storage win without the spike. At minimum, a comment noting the transient cost would help.
There was a problem hiding this comment.
Makes sense, updated it, could you please review in 9f5e0e1 commit
| if not canonical_patterns and not side_substrings: | ||
| return state_dict | ||
|
|
||
| def _has_side_substring(key: str) -> bool: |
There was a problem hiding this comment.
🟡 Medium — substring matching is broader than "reorder the tied keys". side_substrings is canonical_tokens - alias_tokens (e.g. decoder), so _has_side_substring sends every key with that path component to the head, not just the genuinely-tied ones. Harmless to correctness (the downstream dedup only drops true data_ptr duplicates), but on a model where decoder is canonical-only this reorders the entire decoder subtree. Fine for DiffusionGemma and the flag is opt-in/default-off — just flagging that the heuristic is coarser than the docstring implies and could surprise on an asymmetrically-named model.
| quantizer_attrs.weight_scale_2, | ||
| quantizer_attrs.input_scale, | ||
| ): | ||
| if _attr is None or not hasattr(_prior, _attr): |
There was a problem hiding this comment.
🟢 Minor — dead None check. quantizer_attrs.weight_scale_2 / input_scale are always non-None strings (see quantizer_attr_names), so _attr is None never fires; the hasattr(_prior, _attr) guard is doing all the work. Harmless, slightly misleading.
| return head | ||
|
|
||
|
|
||
| def sync_tied_input_amax(model: nn.Module) -> int: |
There was a problem hiding this comment.
🟢 Minor — document the in-place model mutation. sync_tied_input_amax overwrites q.amax on the live model. That's consistent with export already destroying the model (weights replaced with packed bytes), so it's fine — but a one-line note in the docstring that the model is not reusable after this runs would save a future reader. Also worth confirming the real DiffusionGemma recipe uses per-tensor input quantizers; the numel != 1 guard silently skips the merge for non-scalar amax, and that path isn't exercised by the CPU unit tests.
| # Substring match against `model.__class__.__name__.lower()` — entries are | ||
| # the lowercased class-name form (no underscores). Calibration then uses | ||
| # `model.generate` to run the full denoising loop. | ||
| enc_dec_model_list = ["t5", "bart", "whisper", "diffusiongemma"] |
There was a problem hiding this comment.
🟢 Minor — two is_enc_decs now diverge intentionally. This one (model_type_is_enc_dec, by class name) includes diffusiongemma to route calibration through .generate(), while is_enc_dec in example_utils.py (by model_type) deliberately excludes it so preview decode stays AR-style. The docstrings explain each, but same-named functions doing opposite things for the same model is a future-reader trap — a cross-reference comment would help.
Six fixes responding to comments on PR NVIDIA#1707 from coderabbitai, cjluo-nv, and Edwardf0t1. CACHE LIFECYCLE — scope tied-weight dedup caches to one export invocation (flagged by all three reviewers). The function-static caches _tied_weight_alias_cache and _tied_unpacked_cache persisted for the lifetime of the Python interpreter, never cleared in production — only the tests cleared them. Two failure modes: (a) cross-export contamination across multiple export_hf_checkpoint calls in one process, (b) recycled data_ptr() collisions causing silent false-positive aliasing. Replaced with explicit _tied_cache (int keys, per-weight dedup) and _moe_tied_cache (tuple keys, MoE-module dedup) parameters owned by the caller. _process_quantized_modules creates fresh dicts at the top of every invocation and threads them through to _export_quantized_weight and _export_fused_experts. Both dicts are GC'd when the function returns — no cross-export contamination, no stale data_ptr lookups. FORWARD **kwargs (coderabbitai, unified_export_hf.py): export_hf_checkpoint was dropping **kwargs when calling _export_transformers_checkpoint, silently ignoring options like `accelerator` consumed downstream. Fixed. MoE EARLY-EXIT ON CACHE HIT (Edwardf0t1, moe_utils.py): _export_fused_experts now checks _moe_tied_cache at the top of the function. On cache hit, builds the per-expert subtree by aliasing from _prior directly — skipping the unpack-then-pack-then-discard pattern. Saves ~10-12 GB transient memory and ~30-60 sec on the second tied side of a 26B MoE. DROP DEAD `_attr is None` CHECK (Edwardf0t1, unified_export_hf.py): quantizer_attrs.weight_scale / weight_scale_2 / input_scale are always non-None strings; the hasattr(_prior, _attr) guard below was doing all the work. Removed the misleading dead branch. DOCUMENT IN-PLACE MUTATION (Edwardf0t1, sync_tied_input_amax): Added a paragraph at the top of the docstring noting that affected .amax buffers are overwritten in place and the model is not reusable afterwards. CROSS-REFERENCE THE TWO `is_enc_dec`s (Edwardf0t1, dataset_utils.py): Added a comment in model_type_is_enc_dec explaining the intentional divergence from is_enc_dec in example_utils.py (one routes calibration to .generate(); the other gates preview-decode prompt-prefix slicing). TEST UPDATES: Removed the _clear_export_quantized_weight_cache() and _clear_fused_experts_caches() helpers — tests now construct and pass their own local cache dicts, mirroring the production pattern. 10/10 unit tests pass. NOT ADDRESSED (deferred to follow-up): Broad substring matching in _has_side_substring (Edwardf0t1 marked it as "harmless to correctness"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Line 224: The call to _export_quantized_weight at line 224 unconditionally
passes the _tied_cache keyword argument, which breaks backward compatibility
with existing monkeypatched shims that still expect the older (wrapper, dtype)
signature. Modify the function call to conditionally pass _tied_cache only when
it is defined or when the underlying function supports it, using either a
conditional check or try-except handling to gracefully fall back to the older
signature for incompatible shims.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2a53e035-5a70-47dd-bd51-f1b41072c27c
📒 Files selected for processing (5)
modelopt/torch/export/moe_utils.pymodelopt/torch/export/unified_export_hf.pymodelopt/torch/utils/dataset_utils.pytests/unit/torch/export/test_unified_export_hf.pytests/unit/torch/quantization/plugins/test_fused_experts.py
🚧 Files skipped from review as they are similar to previous changes (2)
- modelopt/torch/utils/dataset_utils.py
- modelopt/torch/export/unified_export_hf.py
| wrapper.input_quantizer = i_quantizer | ||
|
|
||
| _export_quantized_weight(wrapper, dtype) | ||
| _export_quantized_weight(wrapper, dtype, _tied_cache=_tied_cache) |
There was a problem hiding this comment.
Avoid unconditional _tied_cache kwarg forwarding here.
This call now always passes _tied_cache=..., which breaks existing monkeypatched shims that still use the older (wrapper, dtype) signature (e.g., in tests/unit/torch/quantization/plugins/test_fused_experts.py). That introduces a TypeError path unrelated to the feature under test.
💡 Suggested fix
- _export_quantized_weight(wrapper, dtype, _tied_cache=_tied_cache)
+ if _tied_cache is None:
+ _export_quantized_weight(wrapper, dtype)
+ else:
+ _export_quantized_weight(wrapper, dtype, _tied_cache=_tied_cache)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/export/moe_utils.py` at line 224, The call to
_export_quantized_weight at line 224 unconditionally passes the _tied_cache
keyword argument, which breaks backward compatibility with existing
monkeypatched shims that still expect the older (wrapper, dtype) signature.
Modify the function call to conditionally pass _tied_cache only when it is
defined or when the underlying function supports it, using either a conditional
check or try-except handling to gracefully fall back to the older signature for
incompatible shims.
…ptq.py
DiffusionGemmaForBlockDiffusion is an encoder-decoder block-diffusion
text LLM (Gemma4 MoE per-layer transformer wrapped in an encoder +
iterative-decoder + self-conditioning + 48-step denoising loop). Four
small additions make stock examples/llm_ptq/hf_ptq.py work end-to-end
for it; no new entry points needed.
Substring patterns are chosen so they ALSO match the previous class
name DiffusionGemma4ModelForBlockDiffusion (and previous model_type
"diffusion_gemma4"). The model was renamed in transformers
mid-development; using "diffusiongemma" / "DiffusionGemma" /
"diffusion_gemma" matches both the current and the older class so we
don't silently regress on either checkpoint generation.
1. modelopt/torch/utils/dataset_utils.py
Add "diffusiongemma" to the substring list in model_type_is_enc_dec.
This routes ModelOpt's built-in calibration forward_loop to
model.generate() instead of a single model.forward(), so calibration
exercises the full 48-step inner denoising loop and sees the entire
noise->clean activation distribution.
2. modelopt/torch/export/model_utils.py
Add "DiffusionGemma": "diffusion_gemma" to MODEL_NAME_TO_TYPE,
*before* "Gemma". get_model_type does substring matching; "gemma" is
a substring of "diffusiongemma" so without this entry the class is
silently mis-classified as plain "gemma" -- which then mis-routes
downstream model-type-dependent logic.
3. examples/llm_ptq/hf_ptq.py (output_decode)
DiffusionGemma.generate() returns a DiffusionGemmaGenerationOutput
ModelOutput dataclass, not a bare token tensor. The preview decode
code's tensor-slicing crashes on ModelOutput. Unwrap to .sequences
at the top of output_decode so both the enc-dec and AR slicing
branches work uniformly. Generic shim -- helps any model whose
.generate returns a ModelOutput with a .sequences attribute.
4. examples/llm_ptq/example_utils.py (is_enc_dec)
Comment-only clarification on the semantics. is_enc_dec controls
how hf_ptq.py:output_decode slices the preview result (whether to
strip the prompt prefix). For T5/BART/Whisper generate returns only
new tokens. For DiffusionGemma it returns prompt+canvas
concatenated, so it belongs with the AR slicing path and stays out
of this list. No behavioral change; documents the prior intent so
future contributors don't re-add diffusion_gemma here.
End-to-end smoke test:
python hf_ptq.py --pyt_ckpt_path <local northbloom checkpoint> \
--qformat nvfp4_experts_only \
--calib_size 32 --trust_remote_code \
--export_path <somewhere>
produces a working NVFP4 checkpoint with coherent pre/post-PTQ preview.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
When a model has multiple fused-experts modules whose 3-D source params share storage via HF _tied_weights_keys (e.g. the encoder and decoder transformer stacks of a block-diffusion encoder-decoder LLM like DiffusionGemma4 / northbloom), the unpacking loop in _export_fused_experts ordinarily creates fresh per-expert tensors for each call — destroying the tied identity and writing two full sets of expert weights + scales to disk. This adds a function-local cache keyed by (gate_up_proj.data_ptr(), down_proj.data_ptr()). On a cache miss the existing unpacking path runs unchanged. On a cache hit, after the normal unpacking completes, the per-expert weight / weight_scale / weight_scale_2 buffers are re-pointed at the prior module's tensors so they share storage. The downstream postprocess_state_dict data_ptr()-based dedup then catches them and drops the duplicates from the saved checkpoint. input_scale is intentionally NOT aliased: encoder and decoder paths have legitimately different activation distributions (verified across all 60 tied pairs of a 512-prompt calibration — down_proj_input divergence median 2.26x, max 18.4x), so each side keeps its own per-side calibrated scale. Model-agnostic: no name regex, no model-type lookup. Cache miss falls through to existing behavior, so non-tied models are unaffected. Empirical result on DiffusionGemma4 26B at nvfp4_experts_only: safetensors 28.43 GB -> 16.47 GB (-42%) shards 4 -> 2 decoder weight/weight_scale/weight_scale_2 entries: 3840 -> 0 each decoder input_scale entries: 3840 -> 3840 (kept per-side, as intended) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Symmetric companion to the data_ptr-cache alias added to
_export_fused_experts in the prior commit. Catches plain nn.Linear
modules whose .weight Parameters are tied via HF tie_weights() (e.g.
encoder and decoder attention QKV/O, MLP gate/up/down, router proj of
an encoder-decoder LLM like DiffusionGemma4 / northbloom) when they
are quantized under recipes that route through _export_quantized_weight.
Mechanism: capture weight.data_ptr() at the top of the function,
before the setattr further down wraps the packed bytes in a fresh
nn.Parameter (which destroys the tie). At the end of the function,
consult a function-local cache keyed by that captured data_ptr. On
cache miss: register this sub_module as the canonical owner. On cache
hit (a previously-processed module shared the same source weight
memory): alias .weight, weight_scale, weight_scale_2 to the prior
module's tensors so downstream data_ptr-based dedup in
postprocess_state_dict drops the duplicates. input_scale is
intentionally NOT aliased — calibration legitimately diverges per-side
(verified in Q2 analysis: down_proj_input ratio up to 18x across 60
tied pairs on the northbloom DiffusionGemma4 model).
Recipe-agnostic: under nvfp4_experts_only this is a true no-op (dense
Linears early-return at QUANTIZATION_NONE; per-expert wrappers reach
this function but have fresh data_ptrs from upstream slice+contiguous
so cache misses always). Under full nvfp4 it fires for every tied
dense Linear pair.
Same safety guarantees as the existing data_ptr-based dedup: cannot
false-positive because it only aliases when source memory was already
shared by the model author (via tie_weights or equivalent). No name
regex, no _tied_weights_keys lookup, no model introspection.
Empirical results on DiffusionGemma4 26B (calib_size=8 smoke):
nvfp4_experts_only: 16.47 GB -> 16.47 GB (byte-identical with prior
experts-only export; this
patch a no-op as expected)
nvfp4 (full): ~27 GB est -> 14.24 GB (-12.7 GB; both patches
firing on disjoint
module sets; per-name
dedup verified in index)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Adds an opt-in pass (--canonical_tied_naming, default off) that reorders
the state_dict before postprocess_state_dict so that keys matching the
canonical side of HF's _tied_weights_keys declaration iterate before
their aliases. The existing first-wins data_ptr dedup at
quant_utils.py:1148-1163 then drops the alias names, leaving the
canonical names in the exported safetensors.
Motivation: for models like DiffusionGemma4 (northbloom), HF declares
{alias: canonical} via _tied_weights_keys, where the encoder side is
the alias and the decoder side is canonical. The original HF
safetensors index uses decoder-prefixed names for all tied weights
(661/691 keys vs 30/691 encoder-only layer_scalar keys). The
single-backbone vLLM mockup loader strips both prefixes to model.* and
relies on this canonical naming.
The default modelopt export today walks the model in registration
order (encoder before decoder, per the model's __init__ order), so
encoder names win in the first-wins dedup. The exported checkpoint
thus uses 46 677 encoder-prefixed keys for tied tensors -- backwards
relative to the upstream HF naming and to what downstream consumers
expect.
Implementation. _tied_weights_keys is declared per model class with
paths relative to that class. In nested models (e.g. DiffusionGemma4)
multiple submodules declare their own ties: the outer wrapper at
DiffusionGemma4ModelForBlockDiffusion ties lm_head.weight to
model.decoder.embed_tokens.weight, while the inner DiffusionGemma4Model
at model.model declares the much larger encoder<->decoder dict with
paths relative to itself.
_collect_canonical_tied_patterns walks model.named_modules() and
collects every dict-style _tied_weights_keys declaration, prefixing
each pattern with the submodule's qualified path so the regexes match
against root-level state_dict keys. Without the prefix, the inner
dict's patterns (which lack a "model." prefix) silently fail to match
keys like "model.decoder.layers.0.self_attn.q_proj.weight" -- a bug
that would cause only the outer dict's single entry (embed_tokens) to
flip in the dedup.
_reorder_canonical_first then partitions the state_dict into head
(canonical-pattern matches) and tail (everything else), preserving
original order within each partition. head.update(tail) yields a
single dict with canonical keys first. The downstream dedup loop
iterates this in insertion order and records the canonical names in
seen_tensors; alias names then arrive as duplicates and are dropped.
Scope of behavior change:
- Models with no _tied_weights_keys, or only legacy list-of-strings
declarations: _collect returns an empty pattern list, helper
short-circuits, state_dict returned unchanged. Zero effect on any
existing modelopt user.
- Models with dict-style declarations (e.g. DiffusionGemma4): when
the flag is set, canonical-side names win. When the flag is unset
(default), behavior is identical to before this commit.
No changes to dedup logic itself, to the existing tied-weight alias
patches in _export_quantized_weight and _export_fused_experts, or to
_process_quantized_modules iteration. Strictly additive.
Verified on DiffusionGemma4 26B / nvfp4_experts_only / v4 / calib_size
32: 35 127 encoder keys removed (decoder kept), 0 decoder keys
removed; layer_scalar (30 encoder-only keys) and per-side input_scale
(11 520 keys, intentionally not deduped) unaffected; total safetensors
bytes 17.68 GB matching the prior export to within 500 KB of
safetensors-metadata-ordering noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Companion piece to the existing tied-weight alias patches in _export_fused_experts (commit 8b00e85) and _export_quantized_weight (commit e8c36b0), which already alias bit-identical weight / weight_scale / weight_scale_2 between tied modules but leave input_scale per-side. This commit closes the loop on input_scale so consumers that load a single canonical scale per Linear (e.g. vLLM's single-backbone DiffusionGemma4 mockup) see a value consistent across all tied sides. Implementation has two parts. 1. New sync_tied_input_amax(model) helper. Walks named_modules(), groups by source weight data_ptr (same signature our existing dedup patches use), and max-merges input_quantizer.amax across each group. Uses the canonical 4-line idiom shared with preprocess_linear_fusion (quant_utils.py:1394-1401) and sync_moe_gate_up_amax (layer_utils.py:1197): merged = torch.max(torch.stack([q.amax for q in qs])) for q in qs: q.amax = merged.clone() Handles both dense Linears (keyed by weight.data_ptr) and fused MoE modules (keyed by (gate_up_proj, down_proj) data_ptr tuple, merging gate_up_proj_input_quantizer and down_proj_input_quantizer independently across the group). Scalar-only, matching preprocess_linear_fusion's contract. Called unconditionally from _export_transformers_checkpoint after sync_moe_gate_up_amax, BEFORE _process_quantized_modules so the merged amax flows into _export_quantized_weight's input_scale derivation. Mirrors sync_moe_gate_up_amax's "no-flag, fires when applicable, no-op otherwise" convention. 2. Extend the existing tied-weight alias loops in _export_quantized_weight and _export_fused_experts to include input_scale alongside weight_scale / weight_scale_2. Before this commit those loops intentionally skipped input_scale because encoder/decoder amaxes legitimately differed (Q2 analysis showed up to 18x divergence for down_proj_input on v1). With sync_tied_input_amax in place, both sides now derive bit-identical input_scale values; aliasing the buffers is safe and lets the existing data_ptr dedup in postprocess_state_dict collapse them so only one canonical entry per Linear survives in the exported safetensors. Also extends the Q-B canonical-side reorder pass added in commit 837768f with an auto-derived side-substring matcher. HF's _tied_weights_keys regex patterns target the pre-export module structure (fused gate_up_proj), but after _export_fused_experts unpacks them into per-expert gate_proj/up_proj/down_proj submodules, post-export keys like ...experts.Y.gate_proj.input_scale are not covered by HF's regex. Without the substring fallback, those keys fell through Q-B to the "alias-first" partition, so when the new input_scale alias step shared data_ptrs, the encoder name won the dedup instead of the decoder name. _collect_canonical_tied_patterns now returns (patterns, side_substrings). The side_substrings list is auto-derived from each _tied_weights_keys entry as the set of dot-separated tokens that appear in canonical patterns but not in alias patterns. For DiffusionGemma4 this resolves to ["decoder"]: every canonical pattern contains "decoder", no alias pattern does. _reorder_canonical_first treats a key as canonical if it matches a regex pattern OR contains a side substring as a proper path component (bordered by "." or at start/end). The path-component requirement avoids false positives from accidental name collisions. Net effect for DiffusionGemma4 nvfp4_experts_only / v4 / calib_size 32: the 11 520 encoder.X.gate_proj/up_proj/down_proj.input_scale entries that the prior export carried are removed; the 11 520 decoder-side entries remain with the merged amax-derived value. Total bytes drops by ~1 MB (scalar entries). Other tied-tensor entries (weight, weight_scale, weight_scale_2) and encoder-only entries (layer_scalar, 30 keys) are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
The diffusion self-conditioning network (block-diffusion models like DiffusionGemma) is text-only and not exercised by typical calibration data. Without exclusion its TensorQuantizers never see input, never set _amax, and export crashes at _export_quantized_weight: AttributeError: 'TensorQuantizer' object has no attribute '_amax' Companion to the upstream vision-tower / visual / embed_vision excludes already in this unit (PR NVIDIA#1691). Pattern is a no-op for non-diffusion models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Adds the tied-modules test fixture plus 10 unit tests covering the
tied-weight machinery introduced earlier in this series.
tests/_test_utils/torch/quantization/tied_modules.py (new):
Three small factory helpers shared by the unit tests:
- make_tied_linear_pair() -- two nn.Linears whose .weight Parameter is
shared via setattr (mimics HF tie_weights() after __init__).
- tie_fused_experts_3d_params(enc, dec) -- in-place tie of
gate_up_proj / down_proj between two fused-experts modules (paired
with the existing _SyntheticFusedExperts fixture).
- wrap_in_parent_with_tied_keys(enc, dec, ...) -- builds a parent
nn.Module with HF-style _tied_weights_keys (dict-style for the
canonical case, list-style for the legacy negative case).
Each factory asserts post-conditions on the tie so a misuse fails
loudly at construction.
tests/unit/torch/export/test_unified_export_hf.py (new): 8 tests
Commit f3e9543ab -- canonical-side reorder:
- dict-style _tied_weights_keys yields patterns + canonical
substrings
- list-style yields no canonical info (reorder becomes a no-op)
- _reorder_canonical_first puts decoder-side keys ahead of
encoder-side keys
Commit 3fb3ba053 -- sync_tied_input_amax:
- tied Linears with divergent amaxes (2.0 vs 5.0) get both sides
overwritten with the elementwise max (5.0)
- untied Linears keep per-side amaxes (no-op when there's no tie)
Commit 29674a7e1 -- dense Linear tied-weight dedup:
- tied Linears share data_ptr for packed .weight + scale buffers
- untied Linears keep independent data_ptrs
- asymmetric quant: unquantized side early-returns at
QUANTIZATION_NONE, stays at the original shared Parameter
tests/unit/torch/quantization/plugins/test_fused_experts.py (extended):
2 tests
Commit 10a8fdbd5 -- MoE experts dedup:
- two _SyntheticSparseMoeBlock instances with tied 3-D source
params share data_ptr across every per-expert buffer
- untied counterparts keep independent per-expert data_ptrs
Pure-Python; CPU-only; ~1s wall total.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Adds a New Features bullet under 0.46 covering the tied-weight dedup, canonical-side reorder, sync_tied_input_amax helper, and the *self_conditioning* default exclude introduced earlier in this series. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Six fixes responding to comments on PR NVIDIA#1707 from coderabbitai, cjluo-nv, and Edwardf0t1. CACHE LIFECYCLE — scope tied-weight dedup caches to one export invocation (flagged by all three reviewers). The function-static caches _tied_weight_alias_cache and _tied_unpacked_cache persisted for the lifetime of the Python interpreter, never cleared in production — only the tests cleared them. Two failure modes: (a) cross-export contamination across multiple export_hf_checkpoint calls in one process, (b) recycled data_ptr() collisions causing silent false-positive aliasing. Replaced with explicit _tied_cache (int keys, per-weight dedup) and _moe_tied_cache (tuple keys, MoE-module dedup) parameters owned by the caller. _process_quantized_modules creates fresh dicts at the top of every invocation and threads them through to _export_quantized_weight and _export_fused_experts. Both dicts are GC'd when the function returns — no cross-export contamination, no stale data_ptr lookups. FORWARD **kwargs (coderabbitai, unified_export_hf.py): export_hf_checkpoint was dropping **kwargs when calling _export_transformers_checkpoint, silently ignoring options like `accelerator` consumed downstream. Fixed. MoE EARLY-EXIT ON CACHE HIT (Edwardf0t1, moe_utils.py): _export_fused_experts now checks _moe_tied_cache at the top of the function. On cache hit, builds the per-expert subtree by aliasing from _prior directly — skipping the unpack-then-pack-then-discard pattern. Saves ~10-12 GB transient memory and ~30-60 sec on the second tied side of a 26B MoE. DROP DEAD `_attr is None` CHECK (Edwardf0t1, unified_export_hf.py): quantizer_attrs.weight_scale / weight_scale_2 / input_scale are always non-None strings; the hasattr(_prior, _attr) guard below was doing all the work. Removed the misleading dead branch. DOCUMENT IN-PLACE MUTATION (Edwardf0t1, sync_tied_input_amax): Added a paragraph at the top of the docstring noting that affected .amax buffers are overwritten in place and the model is not reusable afterwards. CROSS-REFERENCE THE TWO `is_enc_dec`s (Edwardf0t1, dataset_utils.py): Added a comment in model_type_is_enc_dec explaining the intentional divergence from is_enc_dec in example_utils.py (one routes calibration to .generate(); the other gates preview-decode prompt-prefix slicing). TEST UPDATES: Removed the _clear_export_quantized_weight_cache() and _clear_fused_experts_caches() helpers — tests now construct and pass their own local cache dicts, mirroring the production pattern. 10/10 unit tests pass. NOT ADDRESSED (deferred to follow-up): Broad substring matching in _has_side_substring (Edwardf0t1 marked it as "harmless to correctness"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Mirrors the phi4mm / nemotron_vl model-specific recipe pattern under
modelopt_recipes/huggingface/diffusiongemma/ptq/. Moves the
DiffusionGemma-only *self_conditioning* exclude out of the shared
default_disabled_quantizers unit and into a model-specific
disabled_quantizers unit that splices the standard default plus
*self_conditioning*. The shared default no longer carries a
diffusion-specific exclude.
New files:
- modelopt_recipes/huggingface/diffusiongemma/ptq/nvfp4_experts_only.yaml
Three imports -- base_disable_all, experts_nvfp4, disabled_quantizers.
Skips the block_sparse_moe_nvfp4 unit because Gemma4's MoE container
is named `experts` directly with no `block_sparse_moe` wrapper, so
that pattern matched zero modules on DiffusionGemma. Numerically
identical to the general nvfp4_experts_only preset with the
additional *self_conditioning* exclude (verified by re-export:
bit-identical safetensors + index.json vs the same model exported
against the general preset plus the prior in-default
*self_conditioning* line).
- modelopt_recipes/huggingface/diffusiongemma/ptq/disabled_quantizers.yaml
Reusable QuantizerCfgList unit. Imports the standard
default_disabled_quantizers and appends the model-specific
*self_conditioning* exclude.
- modelopt_recipes/huggingface/diffusiongemma/ptq/README.md
Documents the recipe family.
Modified:
- modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
Removes the *self_conditioning* exclude and its comment block.
- CHANGELOG.rst
Rewrites the *self_conditioning* sentence to describe the
model-specific recipe pattern.
Invocation:
General preset (no diffusion-specific exclude):
python hf_ptq.py --qformat nvfp4_experts_only ...
DiffusionGemma-specific:
python hf_ptq.py --recipe huggingface/diffusiongemma/ptq/nvfp4_experts_only ...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of tied-weight PTQ/HF-export support for dLLM encoder-decoder models (DiffusionGemma). +880/-13, 13 files, 6 directories.
Design review (settled). The protocol applies (6 directories), but the design question is explicitly addressed: the PR reuses the existing postprocess_state_dict data_ptr first-wins dedup rather than introducing a competing dedup system — sync_tied_input_amax, the alias loops, and _reorder_canonical_first all feed that existing mechanism. Two maintainer reviewers (cjluo-nv, Edwardf0t1) independently confirmed this in prior comments. Proceeding to correctness.
Prior critical comments — all resolved:
- 💬 Process-global alias caches (flagged "High" by both cjluo-nv and Edwardf0t1; cross-export contamination + module leak). Author replied "addressed in 9f5e0e1" — verified: caches are now caller-owned (
_tied_cache/_moe_tied_cachecreated fresh in_process_quantized_moduleson every call) and threaded through_export_quantized_weight/_export_fused_experts. Function-static state eliminated. - 💬
**kwargsdropped on_export_transformers_checkpointcall (CodeRabbit Major). Verified forwarded. - 💬 MoE second-tied-module fully unpacked then discarded (Edwardf0t1 Medium). Verified: the MoE tied fast-path now short-circuits to pure aliasing before the unpack loop, avoiding the transient memory spike.
- Docstring
input_scalecontradiction, divergingis_enc_deccross-reference, in-place mutation note — all addressed.
New CodeRabbit comment (moe_utils.py:224, "breaks monkeypatched shims") is not valid — the in-tree spy functions in test_fused_experts.py were updated to def _spy_export(wrapper, dtype, **_kwargs), so the unconditional _tied_cache= forwarding does not break them.
Why nudge rather than approve (needs export-owner sign-off):
- E2E validated manually on DiffusionGemma v10 (not in CI); the large tied-MoE path and the
accelerator-gathered distributed export path have no automated coverage. 10 CPU unit tests cover the dedup/alias/reorder/amax-sync logic well. - 💬
_reorder_canonical_firstsubstring breadth (cjluo-nv:786, Edwardf0t1:798) — author did not directly reply; acknowledged by reviewers as harmless-to-correctness and opt-in/default-off, and_has_side_substringnow requires dot-separated path components. Still coarser than the docstring implies on asymmetrically-named models; worth a human confirming the DiffusionGemma naming is the only intended consumer. - The
numel != 1(non-scalar amax) branch insync_tied_input_amaxsilently skips the merge and is not exercised by the CPU tests; confirm the real recipe uses per-tensor input quantizers.
Licensing clean: new recipe YAMLs carry the standard NVIDIA Apache header (matches LICENSE_HEADER), no license-file or third-party-code changes.
| ) | ||
| parser.add_argument("--export_path", default="exported_model") | ||
| parser.add_argument( | ||
| "--canonical_tied_naming", |
There was a problem hiding this comment.
What does this do? Can we just detect difussion gemma and apply this automatically instead of introducing a new flag?
There was a problem hiding this comment.
yes let me move that, agreed
| n = module.num_experts | ||
| expert_dim = _get_fused_expert_intermediate_dim(module) | ||
|
|
||
| # Capture source tensor identities BEFORE unpacking (the source |
There was a problem hiding this comment.
move these into a helper method for readability
| if weight_scale is not None: | ||
| sub_module.register_buffer(quantizer_attrs.weight_scale, weight_scale) | ||
|
|
||
| # Tied-weight dedup: if a previously-processed module shared the same |
There was a problem hiding this comment.
do we dedupe the quantized the weights or we also need to have them tied?
There was a problem hiding this comment.
if the module was originally weight tied, we add the quantizers to both of them irrespective (in this case the expert layers of encoder and decoder) and then we calibrate the quantizers. When exporting if we don't do the de-duplication, we will end up with duplicate entries in ckpt.
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of tied-weight PTQ/HF-export support for dLLM encoder-decoder models (DiffusionGemma). +880/-13, 13 files, 6 dirs. Design-review protocol applied; design is settled (reuses the existing postprocess_state_dict data_ptr first-wins dedup — sync_tied_input_amax, the alias loops and _reorder_canonical_first all feed that one mechanism rather than introducing a competing dedup system; confirmed by maintainers cjluo-nv + Edwardf0t1). Licensing clean (recipe YAMLs carry the standard NVIDIA Apache header). No prompt-injection in the untrusted content.
Prior critical comments — verified resolved in current diff:
- 💬 Process-global alias caches (flagged High by cjluo-nv + Edwardf0t1). Author: "addressed in 9f5e0e1" — confirmed:
_tied_cache/_moe_tied_cacheare now created fresh in_process_quantized_moduleson every call and threaded through_export_quantized_weight/_export_fused_experts. Function-static state eliminated. - 💬
**kwargsdropped on_export_transformers_checkpointcall — verified forwarded (soacceleratoris preserved). - 💬 MoE second-tied-module fully unpacked then discarded — verified: the tied fast path now short-circuits to pure aliasing before the unpack loop.
- 💬 Docstring
input_scalecontradiction /is_enc_deccross-reference — verified addressed. - CodeRabbit "moe_utils.py:224 breaks monkeypatched shims" — not valid; the in-tree spy fns were updated to
(wrapper, dtype, **_kwargs).
Why nudge rather than approve (needs export-owner sign-off):
- Operator's complexity question is unaddressed. The operator asks whether this can be simplified — "can we dedupe before PTQ and postprocess?" The PR re-aliases packed buffers post-quantization so the existing dedup catches them; it does not discuss the alternative of deduping the shared source weight before/around PTQ (which could avoid
sync_tied_input_amax, the per-expert alias loop, and_export_quantized_weight's alias step entirely). Worth the author justifying in the PR body why post-pack aliasing is preferred over a pre-PTQ approach. - cjluo-nv's flag question (hf_ptq.py:1262) is still open: "Can we just detect diffusion gemma and apply this automatically instead of introducing a new flag?"
--canonical_tied_namingadds public surface; auto-detection from_tied_weights_keyswould avoid it. Author has not replied. - Minor regression introduced by the cache fix: in
_process_quantized_modulesthe two cache-init statements now precede the docstring, so the triple-quoted block is no longer the function docstring (__doc__becomesNone). Move the_tied_cache = {}/_moe_tied_cache = {}lines below the docstring. Private fn, low impact. - Coverage gaps: E2E validated manually on DiffusionGemma v10 (not in CI). The tied-MoE path and the distributed
accelerator-gathered export path have no automated coverage; thenumel != 1(non-scalar amax) branch insync_tied_input_amaxsilently skips and is untested — confirm the real recipe uses per-tensor input quantizers. 10 CPU unit tests cover the dedup/alias/reorder/scalar-amax logic well. - 💬
_reorder_canonical_firstsubstring breadth (cjluo-nv:786, Edwardf0t1:798) —_has_side_substringnow requires dot-separated path components, but it's still coarser than the docstring on asymmetrically-named models; opt-in/default-off limits blast radius. Worth a human confirming DiffusionGemma is the only intended consumer.
What does this PR do?
Type of change: new feature
Adds end-to-end PTQ + HF-checkpoint export support for block-diffusion encoder-decoder LLMs (e.g. DiffusionGemma) whose encoder/decoder stacks share parameters via HF
_tied_weights_keys. Six source commits + one test commit + one CHANGELOG entry — purely additive for existing modelopt users (non-tied models see no behavioral change).Source commits:
Onboard DiffusionGemma to
hf_ptq.py— substring-list additions inmodel_type_is_enc_decandMODEL_NAME_TO_TYPE(so calibration routes through.generate()), and a.sequencesunwrap in the preview decode forModelOutput-returning.generate()s.MoE experts dedup in
_export_fused_experts— when two fused-expert modules share their 3-D source params, alias the per-expert packed weight + scales on cache hit so downstreampostprocess_state_dictdedup catches them. ~42% storage reduction onnvfp4_experts_onlyfor tied 26B MoE checkpoints.Dense Linear dedup in
_export_quantized_weight— symmetric to (2) for dense Linears; no-op fornvfp4_experts_only(dense early-returns atQUANTIZATION_NONE).Opt-in canonical-side reorder (
--canonical_tied_naming, default off) — partitions state_dict so canonical-side tied keys iterate before alias-side, letting first-wins dedup keep canonical names.sync_tied_input_amax— max-merges per-sideinput_quantizer.amaxacross tied modules BEFORE export, so single-backbone consumers (vLLM) that load oneinput_scaleper parameter don't clip on either side. Extends commits 2+3's alias loops to includeinput_scale.Default
*self_conditioning*exclude — adds the diffusion-model self-conditioning wildcard todefault_disabled_quantizers.yaml. Companion to PR Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762) #1691 which already added the vision-module excludes.Test commit:
tests/_test_utils/torch/quantization/tied_modules.py) with three factories (make_tied_linear_pair,tie_fused_experts_3d_params,wrap_in_parent_with_tied_keys) + 10 unit tests covering commits 2–5 acrosstests/unit/torch/export/test_unified_export_hf.py(new) andtests/unit/torch/quantization/plugins/test_fused_experts.py(extended). Pure-Python, CPU-only, ~1s wall total.Docs commit:
Usage
Testing
nvfp4_experts_onlyrecipe +--canonical_tied_naming true— produces a 2-shard, ~18 GB safetensors checkpoint with decoder-canonical naming.Before your PR is "Ready for review"
--canonical_tied_namingis opt-in default off; the dedup/alias logic is cache-based ondata_ptr()and no-ops for non-tied models;sync_tied_input_amaxno-ops when no two modules share a weightdata_ptr; the*self_conditioning*wildcard is a no-op for models without matching module names._test_utils/claude reviewAdditional Information
Validated locally against DiffusionGemma v10 (
DiffusionGemmaForBlockDiffusion,model_type: diffusion_gemma). Substring patterns are also chosen to match the olderDiffusionGemma4ModelForBlockDiffusion/diffusion_gemma4spelling, so the same code path works on earlier and current model versions.Summary by CodeRabbit
New Features
--canonical_tied_namingflag for optimized state dictionary ordering during export.Documentation