fix: compute image_seq_len from spatial dims in Lumina2 pipeline by gambletan · Pull Request #13272 · huggingface/diffusers

gambletan · 2026-03-16T14:10:56Z

Summary

Bug: image_seq_len = latents.shape[1] takes the channel dimension (e.g. 16) instead of the spatial sequence length. Lumina2 latents have shape (batch, channels, height, width) and are NOT packed before this point.
Impact: The wrong image_seq_len feeds into calculate_shift(), which computes mu for the flow-matching scheduler. Passing the channel count (e.g. 16) instead of the token count (e.g. 4096 for 1024x1024 images) yields a shift value for a far shorter sequence than the one actually being denoised, degrading generation quality.
Fix: Compute image_seq_len as (latents.shape[-2] // patch_size) * (latents.shape[-1] // patch_size), reading patch_size from self.transformer.config.patch_size. This matches how the Lumina2 transformer internally patchifies its input.
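The difference between the two values can be sketched in plain Python. This is an illustrative sketch, not the pipeline code: `spatial_seq_len` and `linear_shift` are hypothetical helpers, `linear_shift` is a linear interpolation modeled on what `calculate_shift()` is described as doing, and its default anchor values (256, 4096, 0.5, 1.15) are assumptions rather than values read from the Lumina2 scheduler config. The latent shape assumes 8x VAE downsampling of a 1024x1024 image.

```python
def spatial_seq_len(latent_shape, patch_size):
    """Token count for unpacked latents shaped (batch, channels, height, width)."""
    height, width = latent_shape[-2], latent_shape[-1]
    return (height // patch_size) * (width // patch_size)


def linear_shift(image_seq_len, base_seq_len=256, max_seq_len=4096,
                 base_shift=0.5, max_shift=1.15):
    """Linearly interpolate mu between two anchor points (defaults are assumed)."""
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    intercept = base_shift - slope * base_seq_len
    return image_seq_len * slope + intercept


# Assumed shape: 1024x1024 image, 8x VAE downsampling, 16 latent channels.
latent_shape = (1, 16, 128, 128)

buggy_len = latent_shape[1]                   # channel dimension
fixed_len = spatial_seq_len(latent_shape, 2)  # (128 // 2) * (128 // 2) tokens

print("buggy:", buggy_len, linear_shift(buggy_len))
print("fixed:", fixed_len, linear_shift(fixed_len))
```

Under these assumed anchors, the channel-based value falls below the base sequence length, so mu lands near the bottom of the shift range regardless of the actual image resolution.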

Why Flux uses `latents.shape[1]` but Lumina2 cannot

The Flux pipeline correctly uses latents.shape[1] because Flux latents are pre-packed into (batch, seq_len, channels) before image_seq_len is computed. Lumina2 does not pre-pack its latents; the transformer handles patchification internally. At the point where image_seq_len is computed, the latents are therefore still (batch, channels, height, width), and shape[1] gives channels, not sequence length.
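A shape-only sketch makes the contrast concrete. The helper `flux_style_packed_shape` is hypothetical and assumes a packing scheme that folds each 2x2 latent patch into a single token. It only does shape arithmetic and does not reproduce the real packing code.

```python
def flux_style_packed_shape(batch, channels, height, width):
    """Shape after folding each 2x2 latent patch into one token (assumed layout)."""
    return (batch, (height // 2) * (width // 2), channels * 4)


unpacked = (1, 16, 128, 128)                 # Lumina2-style: (batch, channels, H, W)
packed = flux_style_packed_shape(*unpacked)  # Flux-style: (batch, seq_len, channels)

print("packed shape[1]:", packed[1])      # sequence length
print("unpacked shape[1]:", unpacked[1])  # channel count
```

The same index, `shape[1]`, refers to different axes in the two layouts, which is why the expression cannot be copied from one pipeline to the other.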

Changes

src/diffusers/pipelines/lumina2/pipeline_lumina2.py: Replace latents.shape[1] with spatial sequence length computation using patch_size from transformer config
tests/pipelines/lumina2/test_pipeline_lumina2.py: Add test verifying mu is computed from spatial dimensions (not channel dim), using dimensions where channel count != spatial seq_len to catch regressions
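The point about choosing test dimensions where channel count != spatial seq_len can be illustrated with a small check. This is a sketch of the idea only, not the test added in this PR; `image_seq_len` and `can_detect_channel_bug` are hypothetical helper names.

```python
def image_seq_len(latent_shape, patch_size):
    """Spatial token count for latents shaped (batch, channels, height, width)."""
    return (latent_shape[-2] // patch_size) * (latent_shape[-1] // patch_size)


def can_detect_channel_bug(latent_shape, patch_size):
    """True when shape[1] differs from the token count, so a regression is visible."""
    return latent_shape[1] != image_seq_len(latent_shape, patch_size)


# 4 channels vs. 16 * 16 = 256 tokens: a test with this shape would catch the bug.
good_shape = (1, 4, 32, 32)
# 16 channels vs. 4 * 4 = 16 tokens: the buggy and fixed values coincide.
blind_shape = (1, 16, 8, 8)

print(can_detect_channel_bug(good_shape, 2), can_detect_channel_bug(blind_shape, 2))
```

A test built on a shape like `blind_shape` would pass with or without the fix, which is why the dimensions matter.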

…na2 pipeline

Fixes huggingface#12913

`image_seq_len` was computed as `latents.shape[1]`, which gives the channel dimension (e.g. 16) since Lumina2 latents have shape `(batch, channels, height, width)` and are NOT packed/reshaped before this point. The Lumina2 transformer internally patchifies the latents with `patch_size=2`, so the correct spatial sequence length is `(H // patch_size) * (W // patch_size)`.

This incorrect value was passed to `calculate_shift()`, which computes the `mu` parameter for the flow-matching scheduler. Using channel count instead of token count produces a completely wrong shift, degrading generation quality.

The fix reads `patch_size` from `self.transformer.config.patch_size` and computes `image_seq_len` from the last two (spatial) dimensions of the latents tensor, matching how the transformer itself computes its input sequence length.

For reference, the Flux pipeline correctly uses `latents.shape[1]` because Flux latents are pre-packed into `(batch, seq_len, channels)` before this computation. Lumina2 does not pre-pack, so the same indexing does not apply.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gambletan wants to merge 1 commit into huggingface:main from gambletan:fix/lumina2-image-seq-len-wrong-dimension