Conversation
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
/ok to test 6aed164
JiwaniZakir left a comment
In examples/diffusion/finetune/ltx_t2v_flow.yaml, `num_workers: 0` disables parallel data loading entirely, which will significantly bottleneck training throughput since data preprocessing runs on the main process. This should match the pattern used in other model configs, or at minimum include a comment explaining why it's set to zero (e.g., debugging purposes).

Additionally, `build_video_multiresolution_dataloader` is used with `dynamic_batch_size: false` — the multi-resolution dataloader's primary advantage is dynamic batching across resolutions, so disabling it here means you're getting none of that benefit while still paying the overhead of the multi-resolution infrastructure; a standard fixed-resolution dataloader would be more honest about the actual behavior.

The docs in dataset.md state the LTX processor "automatically adjusts frame counts to the nearest valid value," but it's worth clarifying whether this truncates or pads — silent truncation could silently discard the last few frames of a video without the user noticing.
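The "truncates or pads" question can be made concrete. The sketch below is illustrative only; `snap_frames` and `truncate_frames` are hypothetical names, not the actual LTX processor API. It shows how "nearest valid 8n+1 value" differs from round-down truncation:

```python
def snap_frames(num_frames: int) -> int:
    """Snap to the NEAREST valid LTX frame count of the form 8n + 1
    (9, 17, 25, ...). Rounding up implies padding or interpolating
    frames; rounding down implies dropping them. (Hypothetical sketch,
    not the real processor; Python's round() rounds ties to even.)"""
    n = max(round((num_frames - 1) / 8), 1)
    return 8 * n + 1

def truncate_frames(num_frames: int) -> int:
    """Always round DOWN to a valid value: silently drops up to 7
    trailing frames, which is the failure mode the review flags."""
    n = max((num_frames - 1) // 8, 1)
    return 8 * n + 1
```

For a 24-frame clip, nearest-value snapping goes up to 25 (one frame must be invented), while truncation falls back to 17 (seven frames are discarded) — behavior the docs should state explicitly.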
jgerh left a comment
Completed tech pubs review and provided a few copyedits.
Suggested change:
- **Frames mode** (extracts evenly-spaced frames, each becomes a separate sample):
+ **Frames mode** (extracts evenly-spaced frames, and each becomes a separate sample):
Suggested change (inside the `:::{note}` block):
- LTX-Video has a frame count constraint of **8n+1** (9, 17, 25, 33, ...). The LTX processor automatically adjusts frame counts to the nearest valid value during preprocessing. Resolution must be divisible by 32.
+ LTX-Video has a frame count constraint of **8n+1** (9, 17, 25, 33, ...). The LTX processor automatically adjusts frame counts to the nearest valid value during preprocessing. The resolution must be divisible by 32.
Suggested change (in the `| Step | Section | What You Do |` table):
- | **1. Install** | [Install NeMo AutoModel](#install-nemo-automodel) | Install the package via pip or Docker |
+ | **1. Install** | [Install NeMo AutoModel](#install-nemo-automodel) | Install the package using pip or Docker |
Suggested change:
- Diffusion models operate in latent space — a compressed representation of visual data — rather than directly on raw images or videos. To avoid re-encoding data on every training step, the preprocessing pipeline encodes all inputs ahead of time and saves them as .meta files.
+ Diffusion models operate in latent space — a compressed representation of visual data — rather than directly on raw images or videos. To avoid re-encoding data on every training step, the preprocessing pipeline encodes all inputs ahead of time and saves them as `.meta` files.
Suggested change:
- Each .meta file contains:
+ Each `.meta` file contains:
  - Latent representations produced by a VAE (Variational Autoencoder) from the raw visual data
  - Text embeddings produced by a text encoder from the associated captions/prompts
Suggested change:
- Fine-tuning then operates entirely on these pre-encoded .meta files, which is significantly faster than encoding on the fly.
+ Fine-tuning then operates entirely on these pre-encoded `.meta` files, which is significantly faster than encoding on the fly.
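The precompute-once pattern described in this passage can be sketched in plain Python. Everything below is a stand-in: the real pipeline uses the model's VAE and text encoder and its own `.meta` serialization, which this sketch does not reproduce — `fake_vae_encode`, `fake_text_encode`, and the pickle layout are all hypothetical:

```python
import os
import pickle
import tempfile

def fake_vae_encode(frames):
    """Stand-in for a real VAE: 'compresses' each frame by keeping
    every 8th value. (Hypothetical; the real pipeline uses the
    model's VAE on pixel tensors.)"""
    return [f[::8] for f in frames]

def fake_text_encode(caption):
    """Stand-in for a real text encoder (hypothetical): maps the
    caption to a fixed-length vector of floats."""
    return [float(ord(c)) for c in caption][:16]

def preprocess_sample(frames, caption, out_path):
    """Encode once and persist, so training never re-encodes."""
    record = {
        "latents": fake_vae_encode(frames),
        "text_emb": fake_text_encode(caption),
        "num_frames": len(frames),
    }
    with open(out_path, "wb") as fh:
        pickle.dump(record, fh)

# Precompute ahead of time; training then loads the saved file.
out_path = os.path.join(tempfile.mkdtemp(), "sample_0000.meta")
frames = [list(range(64)) for _ in range(9)]  # 9 fake frames (a valid 8n+1 count)
preprocess_sample(frames, "a cat on a skateboard", out_path)
with open(out_path, "rb") as fh:
    record = pickle.load(fh)
```

The payoff is exactly what the doc claims: the expensive encoders run once per sample during preprocessing, and every training step afterward is a cheap deserialization.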
## Generation / Inference
Suggested change:
- Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference — as opposed to training, where the model learns from data, inference is where it produces new outputs.
+ Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference — unlike training, where the model learns from data, inference is when the model produces new outputs.
Diffusion models are a class of generative models that learn to produce images or videos by iteratively denoising samples from a noise distribution. NeMo AutoModel supports training diffusion models using **flow matching**, a framework that regresses velocity fields along straight interpolation paths between noise and data.
Suggested change:
- NeMo AutoModel integrates with [Hugging Face Diffusers](https://huggingface.co/docs/diffusers) for model loading and generation, while providing its own distributed training infrastructure via the `TrainDiffusionRecipe`. This recipe handles FSDP2 parallelization, flow matching loss computation, multiresolution bucketed dataloading, and checkpoint management.
+ NeMo AutoModel integrates with [Hugging Face Diffusers](https://huggingface.co/docs/diffusers) for model loading and generation, while providing its own distributed training infrastructure through the `TrainDiffusionRecipe`. This recipe handles FSDP2 parallelization, flow matching loss computation, multiresolution bucketed dataloading, and checkpoint management.
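The flow matching objective described above admits a compact illustration: along the straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1, the true velocity is the constant x1 - x0, and the model is regressed against it. A minimal pure-Python sketch, under the assumption that the recipe's loss is a velocity-field MSE (the real implementation applies this to latent tensors with a transformer backbone):

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Along a straight interpolation path the target velocity is constant: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(model, x0, x1, t):
    """MSE between the model's predicted velocity at (x_t, t) and the target."""
    xt = interpolate(x0, x1, t)
    pred = model(xt, t)
    target = velocity_target(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

# Sanity check: an oracle that already knows the constant velocity has zero loss.
x0 = [random.gauss(0, 1) for _ in range(4)]  # noise sample
x1 = [0.5, -1.0, 2.0, 0.0]                   # "data" sample (stand-in for a latent)
oracle = lambda xt, t: velocity_target(x0, x1)
loss = flow_matching_loss(oracle, x0, x1, t=0.3)
```

Training simply samples (x0, x1, t) triples per batch and descends this loss; at inference, integrating the learned velocity field from t = 0 to t = 1 carries a noise sample to a data sample.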
What does this PR do?
Add LTX-Video finetuning support and a Claude Code skill to automate future diffusion model onboarding.
Changelog
normalization, 8n+1 frame constraint, and 128 latent channels
support, and num_frames/height/width for RoPE positional embeddings
adapter kwargs), generation example, and generation configs table
Before your PR is "Ready for review"
Pre checks:
Additional Information
adapter works as-is — only the processor's text encoder loading needs updating for Gemma 3