Conversation
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
/ok to test 6aed164
JiwaniZakir left a comment
In examples/diffusion/finetune/ltx_t2v_flow.yaml, `num_workers: 0` disables parallel data loading entirely, which will significantly bottleneck training throughput since data preprocessing runs on the main process. This should match the pattern used in other model configs, or at minimum include a comment explaining why it's set to zero (e.g., debugging purposes).

Additionally, `build_video_multiresolution_dataloader` is used with `dynamic_batch_size: false` — the multi-resolution dataloader's primary advantage is dynamic batching across resolutions, so disabling it here means you're getting none of that benefit while still paying the overhead of the multi-resolution infrastructure; a standard fixed-resolution dataloader would be more honest about the actual behavior.

The docs in dataset.md state the LTX processor "automatically adjusts frame counts to the nearest valid value," but it's worth clarifying whether this truncates or pads — silent truncation could silently discard the last few frames of a video without the user noticing.
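The "truncates or pads" question can be made concrete. The sketch below is illustrative only; `snap_frames` and `truncate_frames` are hypothetical names, not the actual LTX processor API. It shows how "nearest valid 8n+1 value" differs from round-down truncation:

```python
def snap_frames(num_frames: int) -> int:
    """Snap to the NEAREST valid LTX frame count of the form 8n + 1
    (9, 17, 25, ...). Rounding up implies padding or interpolating
    frames; rounding down implies dropping them. (Hypothetical sketch,
    not the real processor; Python's round() rounds ties to even.)"""
    n = max(round((num_frames - 1) / 8), 1)
    return 8 * n + 1

def truncate_frames(num_frames: int) -> int:
    """Always round DOWN to a valid value: silently drops up to 7
    trailing frames, which is the failure mode the review flags."""
    n = max((num_frames - 1) // 8, 1)
    return 8 * n + 1
```

For a 24-frame clip, nearest-value snapping goes up to 25 (one frame must be invented), while truncation falls back to 17 (seven frames are discarded) — behavior the docs should state explicitly.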
jgerh left a comment
Completed tech pubs review and provided a few copyedits.
Suggested change:
- **Frames mode** (extracts evenly-spaced frames, each becomes a separate sample):
+ **Frames mode** (extracts evenly-spaced frames, and each becomes a separate sample):
Suggested change (inside the `:::{note}` block):
- LTX-Video has a frame count constraint of **8n+1** (9, 17, 25, 33, ...). The LTX processor automatically adjusts frame counts to the nearest valid value during preprocessing. Resolution must be divisible by 32.
+ LTX-Video has a frame count constraint of **8n+1** (9, 17, 25, 33, ...). The LTX processor automatically adjusts frame counts to the nearest valid value during preprocessing. The resolution must be divisible by 32.
Suggested change (in the `| Step | Section | What You Do |` table):
- | **1. Install** | [Install NeMo AutoModel](#install-nemo-automodel) | Install the package via pip or Docker |
+ | **1. Install** | [Install NeMo AutoModel](#install-nemo-automodel) | Install the package using pip or Docker |
Suggested change:
- Diffusion models operate in latent space — a compressed representation of visual data — rather than directly on raw images or videos. To avoid re-encoding data on every training step, the preprocessing pipeline encodes all inputs ahead of time and saves them as .meta files.
+ Diffusion models operate in latent space — a compressed representation of visual data — rather than directly on raw images or videos. To avoid re-encoding data on every training step, the preprocessing pipeline encodes all inputs ahead of time and saves them as `.meta` files.
Suggested change:
- Each .meta file contains:
+ Each `.meta` file contains:
  - Latent representations produced by a VAE (Variational Autoencoder) from the raw visual data
  - Text embeddings produced by a text encoder from the associated captions/prompts
Suggested change:
- Fine-tuning then operates entirely on these pre-encoded .meta files, which is significantly faster than encoding on the fly.
+ Fine-tuning then operates entirely on these pre-encoded `.meta` files, which is significantly faster than encoding on the fly.
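The precompute-once pattern described in this passage can be sketched in plain Python. Everything below is a stand-in: the real pipeline uses the model's VAE and text encoder and its own `.meta` serialization, which this sketch does not reproduce — `fake_vae_encode`, `fake_text_encode`, and the pickle layout are all hypothetical:

```python
import os
import pickle
import tempfile

def fake_vae_encode(frames):
    """Stand-in for a real VAE: 'compresses' each frame by keeping
    every 8th value. (Hypothetical; the real pipeline uses the
    model's VAE on pixel tensors.)"""
    return [f[::8] for f in frames]

def fake_text_encode(caption):
    """Stand-in for a real text encoder (hypothetical): maps the
    caption to a fixed-length vector of floats."""
    return [float(ord(c)) for c in caption][:16]

def preprocess_sample(frames, caption, out_path):
    """Encode once and persist, so training never re-encodes."""
    record = {
        "latents": fake_vae_encode(frames),
        "text_emb": fake_text_encode(caption),
        "num_frames": len(frames),
    }
    with open(out_path, "wb") as fh:
        pickle.dump(record, fh)

# Precompute ahead of time; training then loads the saved file.
out_path = os.path.join(tempfile.mkdtemp(), "sample_0000.meta")
frames = [list(range(64)) for _ in range(9)]  # 9 fake frames (a valid 8n+1 count)
preprocess_sample(frames, "a cat on a skateboard", out_path)
with open(out_path, "rb") as fh:
    record = pickle.load(fh)
```

The payoff is exactly what the doc claims: the expensive encoders run once per sample during preprocessing, and every training step afterward is a cheap deserialization.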
## Generation / Inference
Suggested change:
- Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference — as opposed to training, where the model learns from data, inference is where it produces new outputs.
+ Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference — unlike training, where the model learns from data, inference is when the model produces new outputs.
Diffusion models are a class of generative models that learn to produce images or videos by iteratively denoising samples from a noise distribution. NeMo AutoModel supports training diffusion models using **flow matching**, a framework that regresses velocity fields along straight interpolation paths between noise and data.
Suggested change:
- NeMo AutoModel integrates with [Hugging Face Diffusers](https://huggingface.co/docs/diffusers) for model loading and generation, while providing its own distributed training infrastructure via the `TrainDiffusionRecipe`. This recipe handles FSDP2 parallelization, flow matching loss computation, multiresolution bucketed dataloading, and checkpoint management.
+ NeMo AutoModel integrates with [Hugging Face Diffusers](https://huggingface.co/docs/diffusers) for model loading and generation, while providing its own distributed training infrastructure through the `TrainDiffusionRecipe`. This recipe handles FSDP2 parallelization, flow matching loss computation, multiresolution bucketed dataloading, and checkpoint management.
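The flow matching objective described above admits a compact illustration: along the straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1, the true velocity is the constant x1 - x0, and the model is regressed against it. A minimal pure-Python sketch, under the assumption that the recipe's loss is a velocity-field MSE (the real implementation applies this to latent tensors with a transformer backbone):

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Along a straight interpolation path the target velocity is constant: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(model, x0, x1, t):
    """MSE between the model's predicted velocity at (x_t, t) and the target."""
    xt = interpolate(x0, x1, t)
    pred = model(xt, t)
    target = velocity_target(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

# Sanity check: an oracle that already knows the constant velocity has zero loss.
x0 = [random.gauss(0, 1) for _ in range(4)]  # noise sample
x1 = [0.5, -1.0, 2.0, 0.0]                   # "data" sample (stand-in for a latent)
oracle = lambda xt, t: velocity_target(x0, x1)
loss = flow_matching_loss(oracle, x0, x1, t=0.3)
```

Training simply samples (x0, x1, t) triples per batch and descends this loss; at inference, integrating the learned velocity field from t = 0 to t = 1 carries a noise sample to a data sample.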
What does this PR do?
Add LTX-Video finetuning support and a Claude Code skill to automate future diffusion model onboarding.
Changelog
normalization, 8n+1 frame constraint, and 128 latent channels
support, and num_frames/height/width for RoPE positional embeddings
adapter kwargs), generation example, and generation configs table
Before your PR is "Ready for review"
Pre checks:
Additional Information
adapter works as-is — only the processor's text encoder loading needs updating for Gemma 3