
feat: Add ltx 2.3 finetuning support #1612

Open
pthombre wants to merge 4 commits into main from pranav/ltx_2.3

Conversation

@pthombre
Contributor

What does this PR do?

Add LTX-Video finetuning support and a Claude Code skill to automate future diffusion model onboarding.

Changelog

  • Add LTXProcessor data processor (tools/diffusion/processors/ltx.py) with T5 text encoding, AutoencoderKLLTXVideo VAE using latents_mean/latents_std
    normalization, 8n+1 frame constraint, and 128 latent channels
  • Add LTXAdapter flow matching adapter (nemo_automodel/components/flow_matching/adapters/ltx.py) with 3D latent packing/unpacking, encoder_attention_mask
    support, and num_frames/height/width for RoPE positional embeddings
  • Register LTXAdapter in create_adapter() factory in pipeline.py and update adapter init.py
  • Register LTXProcessor in ProcessorRegistry and update processor init.py
  • Add "ltx" and "ltx-video" to video processor CLI choices in preprocessing_multiprocess.py
  • Add finetune config examples/diffusion/finetune/ltx_t2v_flow.yaml for 8-GPU training
  • Add generation config examples/diffusion/generate/configs/generate_ltx.yaml
  • Add 29 adapter unit tests covering init, pack/unpack roundtrip, prepare_inputs, forward pass, CFG dropout, and end-to-end workflow
  • Add 28 processor unit tests covering properties, frame handling, encode video/text (mocked), cache data structure, and registry aliases
  • Update docs/model-coverage/diffusion.md with LTX-Video in supported models table
  • Update docs/guides/diffusion/finetune.md with LTX-Video in supported models, use-case table, model-specific notes (8n+1 frames, 32-pixel resolution,
    adapter kwargs), generation example, and generation configs table
  • Update docs/guides/diffusion/dataset.md with LTX processor in available processors table, preprocessing example, and frame constraint note
  • Add /onboard-diffusion-model Claude Code skill (.claude/skills/onboard-diffusion-model/SKILL.md) to automate future diffusion model onboarding end-to-end
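The 8n+1 frame constraint mentioned in the changelog can be sketched as a small helper. This is an illustrative assumption, not the PR's actual processor code; whether the real implementation rounds, truncates, or pads is not stated here:

```python
def nearest_valid_frame_count(num_frames: int) -> int:
    # LTX-Video requires frame counts of the form 8n + 1 (9, 17, 25, 33, ...).
    # Round to the nearest valid value (a sketch; the PR's processor may
    # behave differently -- this helper is an assumption, not its code).
    n = max(1, round((num_frames - 1) / 8))
    return 8 * n + 1
```

For example, a 20-frame clip would map to 17 frames and a 24-frame clip to 25 under this rounding scheme.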

Before your PR is "Ready for review"

Pre checks:

Additional Information

  • Validated end-to-end: preprocessing (5 videos from OpenVID), 100-epoch training on 8x H100 (loss stable ~1.1–1.8), and generation from trained checkpoint
  • Uses Lightricks/LTX-Video (diffusers-compatible, T5 encoder) for testing. When diffusers merges LTX-2.3 support (huggingface/diffusers#13217, "Add Support for
    LTX-2.3 Models"), the adapter works as-is; only the processor's text encoder loading needs updating for Gemma 3
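The latents_mean/latents_std normalization the changelog mentions amounts to a per-channel affine transform. A toy sketch under that assumption, with plain scalars standing in for the torch tensors the real AutoencoderKLLTXVideo path operates on:

```python
def normalize_latents(latents, latents_mean, latents_std):
    # Per-channel normalization in the style the PR describes for
    # AutoencoderKLLTXVideo: (latent - mean) / std. Toy Python floats
    # stand in for per-channel tensor statistics.
    return [(x - m) / s for x, m, s in zip(latents, latents_mean, latents_std)]
```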

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@pthombre pthombre marked this pull request as ready for review March 27, 2026 03:07
@pthombre
Contributor Author

/ok to test 6aed164

Contributor

@JiwaniZakir JiwaniZakir left a comment


In examples/diffusion/finetune/ltx_t2v_flow.yaml, num_workers: 0 disables parallel data loading entirely, which will significantly bottleneck training throughput, since data preprocessing then runs on the main process. This should match the pattern used in other model configs, or at minimum include a comment explaining why it is set to zero (e.g., for debugging).

Additionally, build_video_multiresolution_dataloader is used with dynamic_batch_size: false. The multi-resolution dataloader's primary advantage is dynamic batching across resolutions, so disabling it means you get none of that benefit while still paying the overhead of the multi-resolution infrastructure; a standard fixed-resolution dataloader would be more honest about the actual behavior.

Finally, the docs in dataset.md state the LTX processor "automatically adjusts frame counts to the nearest valid value," but it is worth clarifying whether this truncates or pads: silent truncation could discard the last few frames of a video without the user noticing.
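The truncate-versus-pad ambiguity raised above can be made concrete with a toy sketch. This is a hypothetical helper, not the processor's actual code; it just shows the two behaviors the docs should distinguish:

```python
def adjust_to_valid_length(frames, target):
    # Two ways to reach a valid 8n+1 frame count -- the docs should state
    # which one the processor uses, since truncation silently drops
    # trailing frames while padding repeats the last frame.
    truncated = frames[:target]
    padded = frames + [frames[-1]] * max(0, target - len(frames))
    return truncated, padded
```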

Contributor

@jgerh jgerh left a comment


Completed tech pubs review and provided a few copyedits.

--caption_format sidecar
```

**Frames mode** (extracts evenly-spaced frames, each becomes a separate sample):
Contributor


Suggested change
**Frames mode** (extracts evenly-spaced frames, and each becomes a separate sample):

```

:::{note}
LTX-Video has a frame count constraint of **8n+1** (9, 17, 25, 33, ...). The LTX processor automatically adjusts frame counts to the nearest valid value during preprocessing. Resolution must be divisible by 32.
Contributor


Suggested change
LTX-Video has a frame count constraint of **8n+1** (9, 17, 25, 33, ...). The LTX processor automatically adjusts frame counts to the nearest valid value during preprocessing. Resolution must be divisible by 32.
LTX-Video has a frame count constraint of **8n+1** (9, 17, 25, 33, ...). The LTX processor automatically adjusts frame counts to the nearest valid value during preprocessing. The resolution must be divisible by 32.


| Step | Section | What You Do |
|------|---------|-------------|
| **1. Install** | [Install NeMo AutoModel](#install-nemo-automodel) | Install the package via pip or Docker |
Contributor


Suggested change
| **1. Install** | [Install NeMo AutoModel](#install-nemo-automodel) | Install the package using pip or Docker |

Comment on lines 65 to 66
Diffusion models operate in latent space — a compressed representation of visual data — rather than directly on raw images or videos. To avoid re-encoding data on every training step, the preprocessing
pipeline encodes all inputs ahead of time and saves them as .meta files.
Contributor


Suggested change
Diffusion models operate in latent space — a compressed representation of visual data — rather than directly on raw images or videos. To avoid re-encoding data on every training step, the preprocessing pipeline encodes all inputs ahead of time and saves them as `.meta` files.

Comment on lines 68 to 70
Each .meta file contains:
- Latent representations produced by a VAE (Variational Autoencoder) from the raw visual data
- Text embeddings produced by a text encoder from the associated captions/prompts
Contributor


Suggested change
Each `.meta` file contains:
- Latent representations produced by a VAE (Variational Autoencoder) from the raw visual data
- Text embeddings produced by a text encoder from the associated captions/prompts


Fine-tuning then operates entirely on these pre-encoded .meta files, which is significantly faster than encoding on the fly.
Contributor


Suggested change
Fine-tuning then operates entirely on these pre-encoded `.meta` files, which is significantly faster than encoding on the fly.
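To make the excerpt above concrete: a preprocessed sample bundles the VAE latents and text embeddings so training never touches the raw media. The field names and the pickle serialization below are assumptions for illustration, not the repo's actual on-disk .meta format:

```python
import os
import pickle
import tempfile

# Hypothetical sketch of a preprocessed .meta sample; field names and the
# pickle serialization are assumptions, not the repo's actual schema.
sample = {
    "latents": [[0.1, -0.2], [0.3, 0.05]],  # toy stand-in for VAE latents
    "text_embeddings": [0.7, 0.1, -0.4],    # toy stand-in for T5 embeddings
    "num_frames": 17,                        # satisfies the 8n+1 constraint
}

path = os.path.join(tempfile.mkdtemp(), "sample.meta")
with open(path, "wb") as f:
    pickle.dump(sample, f)

# Training-time loading then reads the cached encodings directly.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```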


## Generation / Inference

Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference — as opposed to training, where the model learns from data, inference is where it produces new outputs.
Contributor


Suggested change
Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference — as opposed to training, where the model learns from data, inference is where it produces new outputs.
Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference — unlike training, where the model learns from data, inference is when the model produces new outputs.


Diffusion models are a class of generative models that learn to produce images or videos by iteratively denoising samples from a noise distribution. NeMo AutoModel supports training diffusion models using **flow matching**, a framework that regresses velocity fields along straight interpolation paths between noise and data.

NeMo AutoModel integrates with [Hugging Face Diffusers](https://huggingface.co/docs/diffusers) for model loading and generation, while providing its own distributed training infrastructure via the `TrainDiffusionRecipe`. This recipe handles FSDP2 parallelization, flow matching loss computation, multiresolution bucketed dataloading, and checkpoint management.
Contributor


Suggested change
NeMo AutoModel integrates with [Hugging Face Diffusers](https://huggingface.co/docs/diffusers) for model loading and generation, while providing its own distributed training infrastructure through the `TrainDiffusionRecipe`. This recipe handles FSDP2 parallelization, flow matching loss computation, multiresolution bucketed dataloading, and checkpoint management.
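The flow matching framing in the excerpt above ("regresses velocity fields along straight interpolation paths") can be sketched in a few lines. A toy illustration of the general technique, not the recipe's actual loss code:

```python
def flow_matching_pair(noise, data, t):
    # Straight-line interpolation path: x_t = (1 - t) * noise + t * data.
    # The network's regression target (the velocity field) is data - noise,
    # which is constant along the path. Plain lists stand in for tensors.
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity
```

At t = 0 the sample is pure noise and at t = 1 it is the data point; the model is trained to predict the constant velocity at any intermediate t.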
