fastgen DMD2: make the Qwen-Image example self-contained on stock nemo_automodel#1688
fastgen DMD2: make the Qwen-Image example self-contained on stock nemo_automodel#1688jingyu-ml wants to merge 16 commits into
Conversation
…n-time self-contained) Make the published DMD2 Qwen-Image example run against STOCK upstream nemo_automodel by removing its dependence on local AutoModel modifications (staged patches #1-5): - Vendor the patched data files into examples/diffusers/fastgen/fastgen_data/ (collate_fns, text_to_image_dataset, mock_dataloader), importing the UNPATCHED upstream helpers (sampler, base_dataset, text_to_video_dataset) from the stock nemo_automodel package. Provenance headers per CONTRIBUTING (source + commit e42584e3, Apache-2.0). - Repoint the published configs' dataloader _target_ to fastgen_data.build_* and add an explicit sys.path seam + startup resolution logging in dmd2_finetune.py (source-checkout). - Add fastgen_checkpoint.PartialLoadCheckpointer (overrides only load_optimizer with allow_partial_load=True) and inject it in dmd2_recipe.load_checkpoint, so both the student and fake-score optimizer restores tolerate FSDP2 partial shards without a patched AutoModel. No change to modelopt/torch/fastgen/ (DMD2 math). Preprocessing vendoring + tests + the removability audit follow in subsequent commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Bound nemo_automodel[diffusion] to the tested range (>=0.4.0,<1.0; 0.4.0 == the validated commit e42584e3) and add a runtime soft-guard in fastgen_data/__init__.py that converts a missing unpatched-upstream-helper ImportError into an actionable message naming the supported range. Supports AC-7 (bounded version) and AC-1 (clear failure on drift). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…ined, no tools/ tree)
Vendor the AutoModel preprocessing stack needed to build the VAE+text-embed cache from raw
images, trimmed to the Qwen-Image path:
- preprocess/processors/{base,base_video,registry,caption_loaders,qwen_image}.py + a trimmed
processors/__init__.py that drops flux/wan/hunyuan (and their deps, e.g.
nemo_automodel.shared.transformers_patches). The Qwen-Image processor self-registers.
- preprocess/preprocessing_multiprocess.py (driver): the `from tools.diffusion.processors`
import is rewritten to the vendored `.processors` package; MultiTierBucketCalculator is
imported from the stock nemo_automodel package (it IS packaged). Image subcommand
--processor defaults to qwen_image.
- preprocess_qwen_image.py launcher mirroring dmd2_finetune.py's sys.path seam.
Removes the example's dependence on AutoModel staged patches #7-8 and on the un-packaged
AutoModel tools/ tree. Provenance headers per CONTRIBUTING. No tools.* imports remain.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…per CONTRIBUTING)
The 10 vendored files (fastgen_data/{collate_fns,text_to_image_dataset,mock_dataloader}.py and
preprocess/{preprocessing_multiprocess,processors/*}.py) carry the original NeMo-AutoModel
Apache-2.0 header plus a "Vendored from ... @ e42584e3" provenance note. Add them to the
insert-license-py exclude so the hook does not prepend a second NVIDIA header on top.
LICENSE Third-Party Software Notices is intentionally NOT modified: every vendored file is
NVIDIA-copyright Apache-2.0 (same entity and license as this repo, from the sibling
NVIDIA-NeMo/Automodel repo), so there is no external third-party copyright holder to list;
the per-file provenance headers are the applicable record. OSRB sign-off remains a maintainer
release gate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Add examples/diffusers/fastgen/tests/test_vendored_migration.py with two tiers: - Environment-independent (run anywhere): published configs repointed to fastgen_data.*, no tools.* imports in vendored code, the AC-8 removability coverage audit (all nine staged AutoModel files vendored-or-excluded), and provenance headers on the 10 vendored files. - Dependency-guarded (importorskip nemo_automodel/torch, run in the container): data builders accept negative_prompt_embedding_path, collate emits the batch contract + broadcasts the negative embed, PartialLoadCheckpointer overrides only load_optimizer (model load stays strict) + the recipe injects it, and the qwen_image processor self-registers. The four environment-independent tests pass on CPU; the GPU/container end-to-end positive checks (smoke train, FSDP2 resume, preprocessing on real images) run via SLURM. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Add a "Requirements & self-contained data path" section: runs on stock nemo_automodel (>=0.4.0,<1.0) from a source checkout; data loading (fastgen_data/) and preprocessing (preprocess/) are vendored so no nemo_automodel modifications are needed; how to build the cache via preprocess_qwen_image.py. Completes the train-time + preprocessing migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…style Round 1 closes the login-node-actionable AC-7 gaps Codex flagged: - LICENSE: record the NeMo-AutoModel vendoring under Third-Party Software Notices (Apache-2.0). - Vendored files: add the NVIDIA SPDX header after the original AutoModel license, per CONTRIBUTING "Copying code from other sources" (matches modelopt/.../calib_utils.py) — all 10. - Runtime guard: fastgen_data/__init__.py soft-warns when installed nemo_automodel is outside the tested >=0.4.0,<1.0 range; preprocess/__init__.py guards the multi_tier_bucketing import with an actionable message. - Tests: drop plan workflow markers (AC-/Phase/Step) from comments per the plan's code-style note. Not changed (deliberate): the collate prompt_embeds_mask -> text_embeddings_mask mapping is byte-identical to the patched AutoModel (faithful vendor; changing it would alter current training behavior). GPU/SLURM end-to-end validation (smoke/resume/preprocessing/revert-all-nine) remains deferred to the container environment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
… tests
Address the Round 1 review's actionable gaps:
- Add make_negative_prompt_embedding.py: generates the CFG negative_prompt_embedding.pt the
canonical config requires, reusing the vendored QwenImageProcessor's text encoder (default
negative prompt ""), saved in the dataloader's payload format {"embed", "mask"}. Closes the
AC-5/AC-8 self-containment gap (preprocessing previously emitted only per-image caches, so a
from-scratch user had no negative-prompt embedding to point the config at). Documented in README.
- Move migration tests to tests/examples/diffusers/fastgen/ (the repo's example CI lane) and fix
the example-dir resolution from the new location.
- Strengthen the provenance test to assert the NVIDIA SPDX header presence + ordering (after the
original license), not just the "Vendored from" note.
- Update stale "python -m tools.diffusion..." invocation strings in the vendored driver docs to the
preprocess_qwen_image.py launcher.
GPU/SLURM end-to-end execution remains environment-deferred. No modelopt/torch/fastgen changes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…ig test Address the Round 2 review: - Every real-data config on disk now targets fastgen_data.build_text_to_image_multiresolution_dataloader. The 2 committed configs (dmd2_qwen_image.yaml, smoke) were already repointed; the 6 local-only experiment configs (1M_shift3, df0.7, 320k, 320k_shift3, 8steps, 8steps_320k — git-excluded, with hardcoded /lustre data paths) are repointed on disk so the user's own runs survive deleting the AutoModel patches. (They remain excluded from the published diff.) - The config test now enumerates configs/*.yaml and fails on ANY upstream dataloader target, so a new config cannot silently reintroduce the upstream dependence. Robust to whatever set ships (2 in a fresh clone, 8 on the user's disk). - Assert the builder's negative_prompt_embedding_path defaults to None (CFG-less is optional). GPU/SLURM end-to-end execution remains environment-deferred. No modelopt/torch/fastgen changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…ntract
The stock-upstream SLURM validation (8/9 pass) surfaced a test-fixture bug, not a migration bug:
test_collate_emits_contract_keys_and_broadcasts_negative_prompt built a fake sample with only
{latent, prompt_embeds, prompt_embeds_mask}, but collate_fn_production also requires
crop_resolution / original_resolution / crop_offset / prompt / image_path / bucket_id /
aspect_ratio (the keys TextToImageDataset emits). Add them so the test exercises the real
collate path. Vendored code unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Apply the repo ruff config to the vendored data path and preprocessing files: ruff-format + autofixes (pyupgrade py310, isort) plus manual fixes for E402 (logger placement), E501 (provenance comment wrap), N806 (lowercase dims in the video-only 5D shape validator), SIM103, RUF022 (sorted __all__), C401, PERF401. No behavioral change: correctness-critical imports (recipe checkpointer injection, re-exports, # noqa: F401 side-effect imports, MultiTierBucketCalculator, DefaultLoadPlanner) and all provenance/SPDX headers preserved. ruff check + format --check clean; env-independent migration tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughVendored NeMo-AutoModel preprocessing and data loaders into the fastgen example, rewired configs/scripts to use local builders, added FSDP2-tolerant optimizer restore, implemented preprocessing and negative-prompt tooling, and added tests validating the migration. ChangesFastGen DMD2 NeMo-AutoModel Vendoring & Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1688 +/- ##
===========================================
+ Coverage 56.59% 75.57% +18.97%
===========================================
Files 507 511 +4
Lines 55794 57337 +1543
===========================================
+ Hits 31579 43333 +11754
+ Misses 24215 14004 -10211
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…collate) Reduce the migration's footprint per review: (1) delete the vendored mock_dataloader.py and the mock-data smoke config + its README/recipe references (the published quick-start was broken on stock anyway; real training does not use it); (2) delete the vendored base_video.py (video code, dead in an image-only example) and make the --list_processors display image-only; (3) rewrite collate_fns.py as thin wrappers over stock nemo_automodel (import the unpatched collate + sampler; re-add only the DMD2 delta: text_embeddings_mask + broadcast negative-prompt embedding) instead of a full ~440-line vendored copy, also dropping the dead video collate/builder. text_to_image_dataset.py stays a faithful vendored copy (its prompt_embeds_mask change is interleaved with cache loading). Default config + tests/licensing updated; ruff + env-independent checks pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…quality) The preprocess/ tree is vendored verbatim from NVIDIA-NeMo/Automodel (Apache-2.0) and uses the upstream worker-global pattern (module-level 'X | None' set lazily in _init_worker), which trips mypy's strict union-attr/arg-type under examples.*. Add a scoped mypy override (ignore_errors, like tests.*) for examples.diffusers.fastgen.preprocess.* so the faithful copy is not annotated/diverged from upstream. The authored fastgen_data/ wrapper is unaffected (already mypy-clean). 36 errors in 2 files -> 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Assign modelopt/torch/fastgen (DMD2 library) and /examples/diffusers/fastgen (this example) to @NVIDIA/modelopt-torch-fastgen-codeowners. Both are more-specific entries placed after their parent rules (modelopt/torch and /examples/diffusers), so GitHub's last-match-wins routes fastgen review to the fastgen team, matching the repo convention (e.g. modelopt/torch/quantization carving out of modelopt/torch). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/diffusers/fastgen/dmd2_recipe.py (1)
671-671:⚠️ Potential issue | 🔴 CriticalFix unsafe checkpoint deserialization (
weights_only=False) in diffusers fastgen restore paths
examples/diffusers/fastgen/dmd2_recipe.pyrestores sidecar checkpoint payloads usingtorch.load(..., weights_only=False)at lines 671, 676, 689, and 698 with no inline justification. Update these loads toweights_only=True(or add an inline comment explaining whyweights_only=Falseis safe).Suggested patch
- ema_state = torch.load(ema_path, map_location="cpu", weights_only=False) + ema_state = torch.load(ema_path, map_location="cpu", weights_only=True) ... - state = torch.load(state_path, map_location="cpu", weights_only=False) + state = torch.load(state_path, map_location="cpu", weights_only=True) ... - disc_state = torch.load(disc_path, map_location="cpu", weights_only=False) + disc_state = torch.load(disc_path, map_location="cpu", weights_only=True) ... - disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=False) + disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=True)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/diffusers/fastgen/dmd2_recipe.py` at line 671, Summary: Unsafe checkpoint deserialization uses torch.load(..., weights_only=False); change to weights_only=True or justify why full deserialization is required. Locate the torch.load calls that load sidecar checkpoints (e.g., the ema_state load referenced by the symbol ema_state and variable ema_path in dmd2_recipe.py) and update the argument weights_only=False to weights_only=True for all similar restore paths (the other torch.load calls flagged in the review). If full object deserialization truly is required, instead leave weights_only=False but add an inline comment explaining why it is safe and unavoidable; otherwise prefer weights_only=True to avoid executing arbitrary pickle payloads.Source: Coding guidelines
🧹 Nitpick comments (1)
examples/diffusers/fastgen/preprocess/processors/base.py (1)
187-190: ⚡ Quick winMove the NumPy import to module scope (or document why it must be deferred).
Line 187 imports
numpyinsidepreprocess_imagewithout an explicit circular/optional/heavy-import justification, so import errors surface at runtime instead of module load.As per coding guidelines, imports should stay at file top unless there is a documented exception.
Proposed change
from abc import ABC, abstractmethod from typing import Any +import numpy as np import torch from PIL import Image @@ def preprocess_image(self, image: Image.Image) -> torch.Tensor: @@ - import numpy as np - image_tensor = torch.from_numpy(np.array(image)).float() / 255.0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/diffusers/fastgen/preprocess/processors/base.py` around lines 187 - 190, The numpy import is done inside the preprocessing code (where image_tensor is created in preprocess_image), which causes import errors to appear at runtime; move "import numpy as np" to module scope near the other imports at the top of examples/diffusers/fastgen/preprocess/processors/base.py (or add a short comment explaining a deliberate deferred import) so that preprocess_image (and the image_tensor creation) uses a top-level np import; update any references in preprocess_image and related functions to use the module-scoped np.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/diffusers/fastgen/fastgen_data/collate_fns.py`:
- Line 103: The code uses torch.load(path, map_location="cpu",
weights_only=False) to load a user-configurable negative_prompt_embedding_path,
which allows unsafe pickle loading; change the load to a safe mode and validate
the input: ensure the loader uses weights_only=True (i.e., torch.load(path,
map_location="cpu", weights_only=True)) or otherwise deserialize with a safe
format (e.g., torch.jit.load or a JSON/npz reader) and validate
negative_prompt_embedding_path exists and is from a trusted source before
loading; update the line creating payload (payload = torch.load(...)) and any
code that assumes pickled objects accordingly.
In `@examples/diffusers/fastgen/preprocess/processors/base.py`:
- Around line 208-210: The code currently dereferences
models["vae"].config.scaling_factor without ensuring the attribute exists;
update the getter so that after confirming "vae" is in models and models["vae"]
has a config, you also check that the config has a scaling_factor (e.g.,
hasattr(models["vae"].config, "scaling_factor") or getattr with default) and
only return it when present, otherwise return the fallback 0.18215; reference
the models dict, the "vae" entry, its config attribute, and the scaling_factor
property when implementing this guard.
In `@examples/diffusers/fastgen/preprocess/processors/caption_loaders.py`:
- Line 425: The import "from collections import defaultdict" is placed inside a
function in caption_loaders.py—move this import to the module top-level and
remove the in-function import; search for the in-function occurrence of "from
collections import defaultdict" (and any use of defaultdict) and add a single
top-of-file import statement instead so import errors surface at module load and
the function body no longer contains the import.
In `@examples/diffusers/fastgen/preprocess/processors/qwen_image.py`:
- Around line 184-191: The second value returned by pipeline.encode_prompt (the
prompt_embeds_mask) is being ignored in encode_prompt calls and therefore not
serialized; update the encode_prompt usage in qwen_image.py to capture both
outputs (e.g., prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(...))
and include prompt_embeds_mask in the returned/persisted dict alongside
prompt_embeds (apply the same .detach().cpu().to(torch.bfloat16) or appropriate
dtype/shape handling as done for prompt_embeds) so downstream TextToImageDataset
receives the real mask instead of falling back to an all-ones mask.
- Line 217: The autocast call inside verify_latent is using an invalid dtype for
CUDA (torch.float32); update the autocast usage in verify_latent (qwen_image.py)
to use a valid reduced-precision dtype such as torch.float16 or torch.bfloat16,
or remove the dtype argument so autocast picks the default, ensuring the with
torch.no_grad(), autocast(...) context no longer raises and causes false
verification failures.
In `@LICENSE`:
- Around line 226-228: Add a top-of-file vendored header to each Python file
listed (dmd2_finetune.py, dmd2_recipe.py, export_diffusers_qwen_image.py,
fastgen_checkpoint.py, fastgen_data/__init__.py, fastgen_data/collate_fns.py,
inference_dmd2_qwen_image.py, make_negative_prompt_embedding.py,
preprocess_qwen_image.py) that matches the LICENSE expectation: a single-line or
block comment beginning with "Vendored from" and including the original
NVIDIA-NeMo/Automodel source path and commit SHA (e.g., "Vendored from
NVIDIA-NeMo/Automodel: <original/path> @ <commit>"); place it at the very top of
each file before imports and include any short attribution or license snippet
required by the original file so each file contains the "Vendored from" string
referenced in LICENSE.
In `@tests/examples/diffusers/fastgen/test_vendored_migration.py`:
- Around line 157-163: Move the local import of inspect out of the test function
to a top-level import; update the module imports to include "import inspect" at
the module scope and remove the "import inspect" line from inside
test_data_builders_importable_and_accept_negative_prompt_path so the test uses
the module-level inspect symbol.
---
Outside diff comments:
In `@examples/diffusers/fastgen/dmd2_recipe.py`:
- Line 671: Summary: Unsafe checkpoint deserialization uses torch.load(...,
weights_only=False); change to weights_only=True or justify why full
deserialization is required. Locate the torch.load calls that load sidecar
checkpoints (e.g., the ema_state load referenced by the symbol ema_state and
variable ema_path in dmd2_recipe.py) and update the argument weights_only=False
to weights_only=True for all similar restore paths (the other torch.load calls
flagged in the review). If full object deserialization truly is required,
instead leave weights_only=False but add an inline comment explaining why it is
safe and unavoidable; otherwise prefer weights_only=True to avoid executing
arbitrary pickle payloads.
---
Nitpick comments:
In `@examples/diffusers/fastgen/preprocess/processors/base.py`:
- Around line 187-190: The numpy import is done inside the preprocessing code
(where image_tensor is created in preprocess_image), which causes import errors
to appear at runtime; move "import numpy as np" to module scope near the other
imports at the top of examples/diffusers/fastgen/preprocess/processors/base.py
(or add a short comment explaining a deliberate deferred import) so that
preprocess_image (and the image_tensor creation) uses a top-level np import;
update any references in preprocess_image and related functions to use the
module-scoped np.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5a5a2507-bada-45a1-9617-9f3d3b0b7710
📒 Files selected for processing (23)
.pre-commit-config.yamlLICENSEexamples/diffusers/fastgen/README.mdexamples/diffusers/fastgen/configs/dmd2_qwen_image.yamlexamples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yamlexamples/diffusers/fastgen/dmd2_finetune.pyexamples/diffusers/fastgen/dmd2_recipe.pyexamples/diffusers/fastgen/fastgen_checkpoint.pyexamples/diffusers/fastgen/fastgen_data/__init__.pyexamples/diffusers/fastgen/fastgen_data/collate_fns.pyexamples/diffusers/fastgen/fastgen_data/text_to_image_dataset.pyexamples/diffusers/fastgen/make_negative_prompt_embedding.pyexamples/diffusers/fastgen/preprocess/__init__.pyexamples/diffusers/fastgen/preprocess/preprocessing_multiprocess.pyexamples/diffusers/fastgen/preprocess/processors/__init__.pyexamples/diffusers/fastgen/preprocess/processors/base.pyexamples/diffusers/fastgen/preprocess/processors/caption_loaders.pyexamples/diffusers/fastgen/preprocess/processors/qwen_image.pyexamples/diffusers/fastgen/preprocess/processors/registry.pyexamples/diffusers/fastgen/preprocess_qwen_image.pyexamples/diffusers/fastgen/requirements.txtpyproject.tomltests/examples/diffusers/fastgen/test_vendored_migration.py
💤 Files with no reviewable changes (1)
- examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml
| if "vae" in models and hasattr(models["vae"], "config"): | ||
| return models["vae"].config.scaling_factor | ||
| return 0.18215 # Default for most models |
There was a problem hiding this comment.
Guard scaling_factor existence before dereference.
Line 209 assumes models["vae"].config.scaling_factor always exists once config exists. If scaling_factor is missing, this raises instead of using the default fallback.
Proposed change
def get_vae_scaling_factor(self, models: dict[str, Any]) -> float:
@@
- if "vae" in models and hasattr(models["vae"], "config"):
- return models["vae"].config.scaling_factor
+ if "vae" in models and hasattr(models["vae"], "config"):
+ scaling_factor = getattr(models["vae"].config, "scaling_factor", None)
+ if scaling_factor is not None:
+ return scaling_factor
return 0.18215 # Default for most models🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/diffusers/fastgen/preprocess/processors/base.py` around lines 208 -
210, The code currently dereferences models["vae"].config.scaling_factor without
ensuring the attribute exists; update the getter so that after confirming "vae"
is in models and models["vae"] has a config, you also check that the config has
a scaling_factor (e.g., hasattr(models["vae"].config, "scaling_factor") or
getattr with default) and only return it when present, otherwise return the
fallback 0.18215; reference the models dict, the "vae" entry, its config
attribute, and the scaling_factor property when implementing this guard.
| prompt_embeds, _ = pipeline.encode_prompt( | ||
| prompt=prompt, | ||
| device=device, | ||
| ) | ||
|
|
||
| return { | ||
| "prompt_embeds": prompt_embeds.detach().cpu().to(torch.bfloat16), | ||
| } |
There was a problem hiding this comment.
prompt_embeds_mask is dropped before cache serialization.
Line 184 receives a second output from encode_prompt but discards it, and Lines 261-266 persist only prompt_embeds. Downstream, TextToImageDataset falls back to an all-ones mask when the mask key is absent, so the true sequence mask contract is lost.
Proposed change
with torch.no_grad():
- prompt_embeds, _ = pipeline.encode_prompt(
+ prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(
prompt=prompt,
device=device,
)
- return {
+ result = {
"prompt_embeds": prompt_embeds.detach().cpu().to(torch.bfloat16),
}
+ if prompt_embeds_mask is not None:
+ result["prompt_embeds_mask"] = prompt_embeds_mask.detach().cpu().to(torch.long)
+ return result
@@
return {
# Image latent
"latent": latent,
# Text embeddings
"prompt_embeds": text_encodings["prompt_embeds"],
+ "prompt_embeds_mask": text_encodings.get("prompt_embeds_mask"),
# Metadata
"original_resolution": metadata["original_resolution"],Also applies to: 261-266
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/diffusers/fastgen/preprocess/processors/qwen_image.py` around lines
184 - 191, The second value returned by pipeline.encode_prompt (the
prompt_embeds_mask) is being ignored in encode_prompt calls and therefore not
serialized; update the encode_prompt usage in qwen_image.py to capture both
outputs (e.g., prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(...))
and include prompt_embeds_mask in the returned/persisted dict alongside
prompt_embeds (apply the same .detach().cpu().to(torch.bfloat16) or appropriate
dtype/shape handling as done for prompt_embeds) so downstream TextToImageDataset
receives the real mask instead of falling back to an all-ones mask.
| Copyright (c) 2025 NVIDIA CORPORATION — portions of examples/diffusers/fastgen/ are | ||
| vendored from NVIDIA-NeMo/Automodel (https://github.com/NVIDIA-NeMo/Automodel); see the | ||
| per-file "Vendored from" headers for the exact source paths and commit. |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check that vendored files have "Vendored from" headers.
# Expected: Each file starts with a comment containing "Vendored from" + source URL.
vendored_files=(
"examples/diffusers/fastgen/fastgen_data/text_to_image_dataset.py"
"examples/diffusers/fastgen/preprocess/preprocessing_multiprocess.py"
"examples/diffusers/fastgen/preprocess/processors/__init__.py"
"examples/diffusers/fastgen/preprocess/processors/base.py"
"examples/diffusers/fastgen/preprocess/processors/registry.py"
"examples/diffusers/fastgen/preprocess/processors/caption_loaders.py"
"examples/diffusers/fastgen/preprocess/processors/qwen_image.py"
)
echo "Checking for 'Vendored from' headers in vendored files..."
for file in "${vendored_files[@]}"; do
if [ -f "$file" ]; then
if head -20 "$file" | grep -q "Vendored from"; then
echo "✓ $file: has Vendored from header"
else
echo "✗ $file: MISSING Vendored from header"
fi
else
echo "⚠ $file: file not found"
fi
doneRepository: NVIDIA/Model-Optimizer
Length of output: 763
🏁 Script executed:
#!/bin/bash
set -euo pipefail
dir="examples/diffusers/fastgen"
echo "Scanning for Python files under: $dir"
mapfile -t files < <(find "$dir" -type f -name "*.py" | sort)
echo "Found ${`#files`[@]} .py files."
missing=()
for f in "${files[@]}"; do
if head -30 "$f" | rg -q "Vendored from"; then
echo "✓ $f"
else
echo "✗ $f (missing 'Vendored from' in first 30 lines)"
missing+=("$f")
fi
done
if [ ${`#missing`[@]} -gt 0 ]; then
echo
echo "Missing count: ${`#missing`[@]}"
printf '%s\n' "${missing[@]}"
exit 1
else
echo
echo "All scanned Python files have 'Vendored from' headers."
fiRepository: NVIDIA/Model-Optimizer
Length of output: 201
🏁 Script executed:
#!/bin/bash
set -euo pipefail
dir="examples/diffusers/fastgen"
echo "Scanning for Python files under: $dir"
tmp="$(mktemp)"
find "$dir" -type f -name "*.py" | sort > "$tmp"
mapfile -t files < "$tmp"
rm -f "$tmp"
echo "Found ${`#files`[@]} .py files."
missing=()
for f in "${files[@]}"; do
if head -40 "$f" | rg -q "Vendored from"; then
echo "✓ $f"
else
echo "✗ $f (missing 'Vendored from' in first 40 lines)"
missing+=("$f")
fi
done
if [ ${`#missing`[@]} -gt 0 ]; then
echo
echo "Missing count: ${`#missing`[@]}"
printf '%s\n' "${missing[@]}"
exit 1
else
echo
echo "All scanned Python files have 'Vendored from' headers."
fiRepository: NVIDIA/Model-Optimizer
Length of output: 2025
🏁 Script executed:
#!/bin/bash
set -euo pipefail
dir="examples/diffusers/fastgen"
tmp="$(mktemp)"
find "$dir" -type f -name "*.py" | sort > "$tmp"
mapfile -t files < "$tmp"
rm -f "$tmp"
echo "Checking where 'Vendored from' appears (first 200 lines + anywhere in file):"
for f in "${files[@]}"; do
first200_has=0
any_has=0
if head -200 "$f" | rg -q "Vendored from"; then
first200_has=1
fi
if rg -q "Vendored from" "$f"; then
any_has=1
fi
if [ "$any_has" -eq 1 ]; then
if [ "$first200_has" -eq 1 ]; then
echo "✓ $f (Vendored from in first 200 lines)"
else
echo "⚠ $f (Vendored from exists but after first 200 lines)"
fi
else
echo "✗ $f (no 'Vendored from' anywhere)"
fi
doneRepository: NVIDIA/Model-Optimizer
Length of output: 1689
Add missing # Vendored from ... headers to vendored fastgen Python files
LICENSE points to per-file “Vendored from” headers, but these vendored files under examples/diffusers/fastgen/ contain no “Vendored from” string anywhere:
examples/diffusers/fastgen/dmd2_finetune.pyexamples/diffusers/fastgen/dmd2_recipe.pyexamples/diffusers/fastgen/export_diffusers_qwen_image.pyexamples/diffusers/fastgen/fastgen_checkpoint.pyexamples/diffusers/fastgen/fastgen_data/__init__.pyexamples/diffusers/fastgen/fastgen_data/collate_fns.pyexamples/diffusers/fastgen/inference_dmd2_qwen_image.pyexamples/diffusers/fastgen/make_negative_prompt_embedding.pyexamples/diffusers/fastgen/preprocess_qwen_image.py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@LICENSE` around lines 226 - 228, Add a top-of-file vendored header to each
Python file listed (dmd2_finetune.py, dmd2_recipe.py,
export_diffusers_qwen_image.py, fastgen_checkpoint.py, fastgen_data/__init__.py,
fastgen_data/collate_fns.py, inference_dmd2_qwen_image.py,
make_negative_prompt_embedding.py, preprocess_qwen_image.py) that matches the
LICENSE expectation: a single-line or block comment beginning with "Vendored
from" and including the original NVIDIA-NeMo/Automodel source path and commit
SHA (e.g., "Vendored from NVIDIA-NeMo/Automodel: <original/path> @ <commit>");
place it at the very top of each file before imports and include any short
attribution or license snippet required by the original file so each file
contains the "Vendored from" string referenced in LICENSE.
…e diffusers team Drop the /examples/diffusers/fastgen carve-out; @NVIDIA/modelopt-torch-fastgen-codeowners owns modelopt/torch/fastgen (the DMD2 library). The example stays under /examples/diffusers (@NVIDIA/modelopt-examples-diffusers-codeowners). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Apply the safe, sensible CodeRabbit findings: (1) collate_fns.py loads the negative-prompt embedding with weights_only=True (it is a dict of plain tensors saved by make_negative_prompt_embedding.py, so it loads cleanly) — removes the pickle/RCE vector per SECURITY.md; (2) move 'from collections import defaultdict' (caption_loaders) and 'import inspect' (test) to module top; (3) drop the no-op autocast(dtype=torch.float32) in QwenImageProcessor.verify_latent — the latent is already .float(), so this is behavior-identical and removes the invalid-CUDA-autocast-dtype warning. Preprocessing/data only; no change to DMD2 training math. Intentionally not changed (would alter validated behavior or are false positives): not persisting prompt_embeds_mask in the cache (would switch the text-attention mask from the all-ones fallback the existing caches + running jobs use to a real mask — changes training math); the speculative scaling_factor guard (Qwen-Image VAE always defines it; vendored verbatim); 'Vendored from' headers on authored files (those correctly carry the standard NVIDIA SPDX header — LICENSE states only 'portions' are vendored). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
|
/ok to test 85ae40c |
There was a problem hiding this comment.
These are also nvidia authored and apache 2 license files so wr dont need to exclude in pre commit or have duplicate nvidia license. Let license hook auto add correct license everywhere
What does this PR do?
Type of change: new example / refactor (example self-containment)
The published
examples/diffusers/fastgenDMD2 Qwen-Image distillation example previously relied onlocal, unpublished modifications to the sibling
nemo_automodelpackage (data path, collate, apartial-load checkpointer, and Qwen-Image preprocessing). An external user running the published
Model-Optimizer example against official stock
nemo_automodelwould hit import/attribute errors.This PR makes the example self-contained on stock
nemo_automodel>=0.4.0— with the changes keptas small as possible: the team's actual AutoModel delta is only ~430 lines, so wherever the
upstream module is importable, the delta is expressed as a thin subclass/wrapper rather than a
full-file copy.
fastgen_data/— the DMD2 data path:collate_fns.py— thin wrappers over stock: import the unpatched upstream collate +bucket sampler and re-add only the DMD2 delta (
text_embeddings_mask, stacked from theper-item mask the stock collate omits; and an optional broadcast
negative_text_embeddingsfor CFG). The builder loads an optional
negative_prompt_embedding_path.text_to_image_dataset.py— a faithful vendored copy of the upstream reader (itsprompt_embeds_maskemission is interleaved with cache loading, so wrapping it would force aredundant per-item
torch.load; carried verbatim instead).fastgen_checkpoint.py—PartialLoadCheckpointer(Checkpointer)that overrides onlyload_optimizer(FSDP2DefaultLoadPlanner(allow_partial_load=True)) so optimizer resume workswithout patching upstream. Injected via an in-place re-bless of
self.checkpointerin therecipe's
load_checkpoint(model-state load stays strict).preprocess/— Qwen-Image preprocessing (preprocessing_multiprocess.py+processors/),trimmed to the image path (drops the flux/wan/hunyuan processors and the video base class).
It lives in AutoModel's top-level
tools/tree, which is not shipped in the pip package, soit cannot be wrapped and is vendored;
MultiTierBucketCalculatoris imported from stock upstream.make_negative_prompt_embedding.py— generates the optional CFG negative-prompt embedding.configs/*.yamltargetfastgen_data.build_*(a test enumerates every config).CONTRIBUTING.md(source link + commit, original Apache-2.0 header, thenNVIDIA SPDX header on verbatim-vendored files; a NeMo-AutoModel note in
LICENSE;insert-licenseexcludes for the verbatim files;
nemo_automodel[diffusion]version bound inrequirements.txt).The DMD2 math in
modelopt/torch/fastgen/is unchanged — only example/training-time glue moved.Usage
See
examples/diffusers/fastgen/README.md→ "Requirements & self-contained data path".Testing
tests/examples/diffusers/fastgen/test_vendored_migration.py: environment-independentinvariants (every config targets a vendored builder; no
tools.*imports; each former AutoModelpatch is vendored / wrapped / a documented exclusion; provenance + SPDX headers present) plus
dependency-guarded structural tests (the collate wrapper emits the batch contract + broadcasts the
negative embedding; the builder accepts
negative_prompt_embedding_path; the checkpointeroverrides only
load_optimizer; the Qwen-Image processor self-registers).nemo_automodel0.4.0 worktree (none of the local patchespresent) via SLURM — re-run after this slim-down.
through the vendored
PartialLoadCheckpointer.ruff check+ruff format --checkclean on all changed files.Before your PR is "Ready for review"
CONTRIBUTING.md: ✅ (NeMo-AutoModel @e42584e3, Apache-2.0 — provenance + original license retained, NVIDIA SPDX header added after;LICENSEthird-party note;nemo_automodelwas already a dependency)Additional Information
Opened as a draft pending: OSRB review of the vendored NeMo-AutoModel (Apache-2.0) code, CI green,
and (optional) a multi-GPU smoke-train + resume on stock upstream. Vendored from
NVIDIA-NeMo/Automodel at commit
e42584e3.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests
Chores