4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
@@ -388,6 +388,8 @@
  title: PriorTransformer
- local: api/models/qwenimage_transformer2d
  title: QwenImageTransformer2DModel
- local: api/models/rae_dit_transformer2d
  title: RAEDiT2DModel
- local: api/models/sana_transformer2d
  title: SanaTransformer2DModel
- local: api/models/sana_video_transformer3d
@@ -604,6 +606,8 @@
  title: PRX
- local: api/pipelines/qwenimage
  title: QwenImage
- local: api/pipelines/rae_dit
  title: RAE DiT
- local: api/pipelines/sana
  title: Sana
- local: api/pipelines/sana_sprint
32 changes: 32 additions & 0 deletions docs/source/en/api/models/rae_dit_transformer2d.md
@@ -0,0 +1,32 @@
<!-- Copyright 2026 The NYU Vision-X and HuggingFace Teams. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# RAEDiT2DModel

The `RAEDiT2DModel` is the Stage-2 latent diffusion transformer introduced in
[Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690).

Unlike DiT variants that operate on VAE latents, this transformer denoises samples in the latent space learned by
[`AutoencoderRAE`](./autoencoder_rae). It is designed to be used with [`FlowMatchEulerDiscreteScheduler`], and its
outputs are decoded back to RGB with [`AutoencoderRAE`].

## Loading a pretrained transformer

```python
from diffusers import RAEDiT2DModel

transformer = RAEDiT2DModel.from_pretrained("path/to/converted-stage2-transformer")
```
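
To make the scheduler's role concrete, here is a minimal numeric sketch of one flow-matching Euler step. It is written against plain NumPy rather than the diffusers API, and the latent, velocity, and sigma values are all illustrative stand-ins for what the transformer and scheduler would actually produce:

```python
import numpy as np

# Toy illustration of the update applied between two points on the sigma
# schedule: x_next = x + (sigma_next - sigma) * v, where v stands in for the
# velocity the transformer predicts. All values here are illustrative.
rng = np.random.default_rng(0)

x = rng.standard_normal((1, 4, 4))  # stand-in for a noisy RAE latent
clean = np.zeros_like(x)            # stand-in for the clean latent

sigma, sigma_next = 1.0, 0.9        # two adjacent points on the schedule

# Under the linear-interpolation flow, the ideal velocity is (noise - clean);
# x was drawn at sigma = 1.0, so x itself plays the role of the noise sample.
v = x - clean

x_next = x + (sigma_next - sigma) * v  # one Euler step toward sigma = 0
```

Stepping all the way from sigma = 1 to sigma = 0 with this ideal velocity lands exactly on the clean latent, which is why the final sample can be decoded with [`AutoencoderRAE`].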

## RAEDiT2DModel

[[autodoc]] RAEDiT2DModel
59 changes: 59 additions & 0 deletions docs/source/en/api/pipelines/rae_dit.md
@@ -0,0 +1,59 @@
<!-- Copyright 2026 The NYU Vision-X and HuggingFace Teams. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# RAE DiT

[Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690) introduces a
two-stage recipe: first train a representation autoencoder (RAE), then train a diffusion transformer on the resulting
latent space.

[`RAEDiTPipeline`] implements the Stage-2 class-conditional generator in Diffusers. It combines:

- [`RAEDiT2DModel`] for latent denoising
- [`FlowMatchEulerDiscreteScheduler`] for the denoising trajectory
- [`AutoencoderRAE`] for decoding latent samples to RGB images

> [!TIP]
> [`RAEDiTPipeline`] expects a Stage-2 checkpoint converted to Diffusers format together with a compatible
> [`AutoencoderRAE`] checkpoint.

## Loading a converted pipeline

```python
import torch
from diffusers import RAEDiTPipeline

pipe = RAEDiTPipeline.from_pretrained(
    "path/to/converted-rae-dit-imagenet256",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(class_labels=[207], num_inference_steps=25).images[0]
image.save("golden_retriever.png")
```

If the converted pipeline includes an `id2label` mapping, you can also look up class ids by name:

```python
class_id = pipe.get_label_ids("golden retriever")[0]
image = pipe(class_labels=[class_id], num_inference_steps=25).images[0]
```
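
For intuition, this lookup amounts to inverting an `id2label` dictionary. The sketch below is illustrative only — the real mapping comes from the converted checkpoint, and the entries and helper name here are made up for the example:

```python
# Illustrative sketch of the name-to-id lookup: invert an `id2label`
# mapping from integer class ids to label strings. The entries are
# examples, not read from a real checkpoint.
id2label = {207: "golden retriever", 208: "Labrador retriever"}

def label_to_ids(label: str) -> list[int]:
    """Return every class id whose label matches, case-insensitively."""
    return [i for i, name in id2label.items() if name.lower() == label.lower()]

class_ids = label_to_ids("Golden Retriever")  # [207]
```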

## RAEDiTPipeline

[[autodoc]] RAEDiTPipeline
- all
- __call__

## RAEDiTPipelineOutput

[[autodoc]] RAEDiTPipelineOutput
91 changes: 91 additions & 0 deletions examples/research_projects/rae_dit/README.md
@@ -0,0 +1,91 @@
# Training RAEDiT Stage 2

This folder contains the minimal Stage-2 follow-up for the RAE integration: training `RAEDiT2DModel` on top of a frozen `AutoencoderRAE`.

It is intentionally placed under `examples/research_projects/rae_dit/` rather than alongside the top-level `examples/` trainers because it is still an experimental follow-up to the new RAE support.

## Current scope

This is a minimal full-finetuning scaffold, not a paper-complete training stack. It currently does the following:

- loads a frozen pretrained `AutoencoderRAE`
- encodes RGB images to normalized Stage-1 latents on the fly
- trains only the Stage-2 `RAEDiT2DModel`
- uses `FlowMatchEulerDiscreteScheduler` with the same shifted-sigma schedule shape already used elsewhere in `diffusers`
- consumes ImageFolder class ids as `class_labels`
- can generate validation samples through `RAEDiTPipeline` during training
- saves the trained transformer under `output_dir/transformer`
- saves the scheduler config under `output_dir/scheduler`
- writes `id2label.json` from the ImageFolder class mapping

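The flow-matching objective behind those steps can be sketched in a few lines. This is a NumPy toy, not the training script: it assumes the interpolation convention used elsewhere in diffusers (`noisy = sigma * noise + (1 - sigma) * latents`, velocity target `noise - latents`), and the arrays stand in for real RAE latents and model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: the real script encodes images with the frozen AutoencoderRAE.
latents = rng.standard_normal((2, 8, 4, 4))
noise = rng.standard_normal(latents.shape)
sigma = rng.uniform(size=(2, 1, 1, 1))  # one sampled timestep per example

noisy = sigma * noise + (1.0 - sigma) * latents  # input to the transformer
target = noise - latents                         # velocity regression target

pred = target + 0.1 * rng.standard_normal(latents.shape)  # fake model output
loss = float(np.mean((pred - target) ** 2))
```
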
It intentionally does not yet include:

- a latent-caching path
- autoguidance or the broader upstream transport stack
- exact upstream distributed training/runtime features

## Dataset format

The script expects an `ImageFolder`-compatible dataset:

```text
train_data_dir/
  n01440764/
    img_0001.jpeg
  n01443537/
    img_0002.jpeg
```

The folder names define the class labels used during Stage-2 training.
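
As a point of reference, `ImageFolder`-style loaders in the `datasets` library derive integer ids from the sorted folder names. The snippet below sketches the mapping the example tree above would produce; the convention is the usual `ImageFolder` behavior, not something this script redefines:

```python
# Sketch of the usual ImageFolder convention: sort the class folder names
# and number them in order. Matches the two synset folders in the tree above.
folders = ["n01443537", "n01440764"]

id2label = {i: name for i, name in enumerate(sorted(folders))}
# id2label == {0: "n01440764", 1: "n01443537"}
```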

## Quickstart

```bash
accelerate launch examples/research_projects/rae_dit/train_rae_dit.py \
  --pretrained_rae_model_name_or_path nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08 \
  --train_data_dir /path/to/imagenet_like_folder \
  --output_dir /tmp/rae-dit \
  --resolution 256 \
  --train_batch_size 8 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --learning_rate 1e-4 \
  --lr_scheduler cosine \
  --lr_warmup_steps 1000 \
  --max_train_steps 200000 \
  --mixed_precision bf16 \
  --report_to wandb \
  --allow_tf32
```

To emit validation samples during training, add:

```bash
  --validation_steps 1000 \
  --validation_class_label 207 \
  --num_validation_images 4 \
  --validation_num_inference_steps 25 \
  --validation_guidance_scale 1.0
```

Validation images are written to `output_dir/validation/step-<global_step>/`.

If you already have a converted or partially trained Stage-2 checkpoint, resume from it with:

```bash
accelerate launch examples/research_projects/rae_dit/train_rae_dit.py \
  --pretrained_rae_model_name_or_path nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08 \
  --pretrained_transformer_model_name_or_path /path/to/previous/transformer \
  --train_data_dir /path/to/imagenet_like_folder \
  --output_dir /tmp/rae-dit-finetune \
  --resolution 256 \
  --train_batch_size 8 \
  --max_train_steps 50000
```

## Notes

- The script derives a default flow shift from the latent dimensionality as `sqrt(latent_dim / time_shift_base)`, matching the upstream Stage-2 heuristic at a high level.
- The trainer assumes the selected `AutoencoderRAE` uses `reshape_to_2d=True`, because `RAEDiT2DModel` operates on 2D latent feature maps.
- Validation sampling uses a fresh scheduler cloned from the training config so sampling does not mutate the in-flight training scheduler state.
- This example is meant to land first as a training scaffold that matches the new Stage-2 model and export layout. A later follow-up can add cached latents and other training conveniences.
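
The flow-shift heuristic from the first note above can be written out directly. The numbers below are placeholders chosen for illustration, not the script's defaults:

```python
import math

def default_flow_shift(latent_dim: int, time_shift_base: int) -> float:
    """sqrt(latent_dim / time_shift_base), as described in the note above."""
    return math.sqrt(latent_dim / time_shift_base)

# With illustrative values: a 1024-dim latent and a base of 256 give shift 2.0.
shift = default_flow_shift(1024, 256)  # 2.0
```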