Add RAE Diffusion Transformer inference/preliminary training pipelines#13231
plugyawn wants to merge 12 commits into huggingface:main from
Conversation
@kashif @sayakpaul it would be great if you could review. Please note the `no_init_weights()` fix (details in the PR body); if you prefer, that could be a separate PR, but considering
Thanks for the PR. To keep the scope manageable, could we break it down into separate PRs? For example,
could be a separate PR.
sayakpaul
left a comment
Thanks!
I left some initial comments, let me know if they make sense.
```
- `examples/dreambooth/train_dreambooth_flux.py`
  for the flow-matching training loop structure, checkpoint resume flow, and `accelerate.save_state(...)` hooks.
- `examples/flux-control/train_control_flux.py`
  for the transformer-only save layout and SD3-style flow-matching timestep weighting helpers.
```
```python
# Preserve the `torch.nn.init.*` return contract so third-party model
# constructors that chain on the returned tensor still work under
# `no_init_weights()`.
return args[0] if len(args) > 0 else None
```
```python
super().test_effective_gradient_checkpointing(loss_tolerance=1e-4)
```

```python
@unittest.skip(
    "RAEDiT initializes the output head to zeros, so cosine-based layerwise casting checks are uninformative."
)
```
I don't think this is the case? We can always skip layerwise casting for certain layers or layer groups here:
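For instance, assuming diffusers' layerwise-casting hook honors a `_skip_layerwise_casting_patterns` class attribute on `ModelMixin` subclasses (the pattern names below are illustrative, not from this PR):

```python
# Plain-class stand-in for the real ModelMixin subclass: modules whose
# qualified name matches any listed pattern are kept in their original
# dtype during layerwise casting, so the zero-initialized output head
# does not make the cosine-similarity check degenerate.
class RAEDiT2DModel:
    _skip_layerwise_casting_patterns = ["final_layer", "pos_embed"]
```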
```python
model.final_layer.linear.bias.data.normal_(mean=0.0, std=0.02)
```

```python
class RAEDiT2DModelTests(ModelTesterMixin, unittest.TestCase):
```
The tests should use the newly added model tester mixins. You can find an example in #13046.
```python
if shift is None:
    shift = torch.zeros_like(scale)
```
This is a small function; could it just be inlined at the call sites?
We also probably don't need `_repeat_to_length()`.
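Inlined at a call site, the quoted default could look like this (sketch; the tensor shapes are illustrative, only the `scale`/`shift` names come from the quoted hunk):

```python
import torch

hidden_states = torch.randn(2, 16, 8)
scale = torch.randn(2, 1, 8)
shift = None  # the modulation shift is optional

# Inline the tiny helper: default the shift to zeros, then apply the
# usual AdaLN-style scale/shift modulation.
if shift is None:
    shift = torch.zeros_like(scale)
hidden_states = hidden_states * (1 + scale) + shift
```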
```python
if self.use_pos_embed:
    pos_embed = get_2d_sincos_pos_embed(
        self.pos_embed.shape[-1], int(sqrt(self.pos_embed.shape[1])), output_type="pt"
    )
    self.pos_embed.data.copy_(pos_embed.float().unsqueeze(0))
```
Can we initialize the position embeddings the way #13046 does?
Yeah, that makes sense, will do that.
```python
    )
    return hidden_states

def _run_block(
```
We don't need this. Let's instead follow this pattern:
```python
    return class_labels

def _prepare_latents(
```
It should be called `prepare_latents()`, similar to other pipelines.
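A minimal sketch of the renamed method (argument names assumed from other diffusers pipelines, not taken from this PR):

```python
import torch

def prepare_latents(batch_size, num_channels_latents, height, width, dtype, device, generator=None):
    # Draw the initial Gaussian latents that the sampler will denoise.
    shape = (batch_size, num_channels_latents, height, width)
    return torch.randn(shape, generator=generator, device=device, dtype=dtype)

latents = prepare_latents(2, 4, 16, 16, torch.float32, "cpu")
```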
```python
if output_type == "pt":
    output = images
else:
    output = images.cpu().permute(0, 2, 3, 1).float().numpy()
    if output_type == "pil":
        output = self.numpy_to_pil(output)
```
We should use an image processor instead here. See:
```python
if not return_dict:
    return (output,)

return ImagePipelineOutput(images=output)
```
Let's give this pipeline a separate output class: `RAEDiTPipelineOutput`.
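The output class could be a sketch like this (in-tree it would subclass `diffusers.utils.BaseOutput`; a plain dataclass is used here so the snippet is self-contained):

```python
from dataclasses import dataclass
from typing import Any, List, Union

@dataclass
class RAEDiTPipelineOutput:
    """
    Output of RAEDiTPipeline (sketch).

    images: a list of PIL images, or a numpy array / torch tensor batch,
    depending on the requested `output_type`.
    """
    images: Union[List[Any], Any]
```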
What does this PR do?
This PR adds support for Diffusion Transformers with Representation Autoencoders. As the authors say, "Representation Autoencoders (RAEs) reuse pretrained, frozen representation encoders together with lightweight trained decoders to provide high-fidelity, semantically rich latents for diffusion transformers."
This addresses #13225: it solves (a) inference, and adds a training example for (b), tested on an NVIDIA A100 SXM4 40 GB GPU.
Reference implementation: byteriper's repository
Implemented:
`RAEDiT2DModel` core model and `RAEDiTPipeline`, a checkpoint conversion script for the published upstream checkpoints, documentation, and a small training example. Inference output looks correct on visual inspection, and numerical parity with the official implementation is high:

`max_abs_error=0.00001717`, `mean_abs_error=0.00000122` for sampling with the same class, noise schedule, and initial latent noise.
Left to right: byteriper's RAE DiT implementation vs. the converted checkpoint in diffusers, using the same published Stage-2 checkpoint, the same class label, the same initial latent noise, and the same 25-step shifted Euler sampling schedule.

Inference is actually faster in `diffusers`; on a 40 GB A100, the timings are:

Note: there is also a change to `no_init_weights()`. It makes diffusers' skip-weight-init behave more like stock PyTorch: currently, while `no_init_weights()` is active, the patched `torch.nn.init.*` functions stop returning the tensor they were called on (stock PyTorch does return it). Most models never notice this, but the RAE-DiT implementation relies on the return value during construction, which can make otherwise valid checkpoints fail to load through the standard `from_pretrained()` path.

Before submitting
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.