fix: compute image_seq_len from spatial dims in Lumina2 pipeline #13272

Open
gambletan wants to merge 1 commit into huggingface:main from gambletan:fix/lumina2-image-seq-len-wrong-dimension


Summary

Fixes #12913

  • Bug: image_seq_len = latents.shape[1] takes the channel dimension (e.g. 16) instead of the spatial sequence length. Lumina2 latents have shape (batch, channels, height, width) and are NOT packed before this point.
  • Impact: The wrong image_seq_len feeds into calculate_shift() which computes mu for the flow-matching scheduler. Using channel count (~16) instead of token count (e.g. 4096 for 1024x1024 images) produces a completely wrong shift value, degrading generation quality.
  • Fix: Compute image_seq_len as (latents.shape[-2] // patch_size) * (latents.shape[-1] // patch_size), reading patch_size from self.transformer.config.patch_size. This matches how the Lumina2 transformer internally patchifies its input.
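As a rough illustration of the difference between the two computations, here is a sketch on plain shape tuples rather than the pipeline code itself. The helper name `lumina2_image_seq_len` is hypothetical, and the 128x128 latent for a 1024x1024 image assumes an 8x VAE downscale:

```python
def lumina2_image_seq_len(latent_shape, patch_size=2):
    """Token count for unpacked latents shaped (batch, channels, height, width)."""
    height, width = latent_shape[-2], latent_shape[-1]
    return (height // patch_size) * (width // patch_size)


# Assumed latent shape for a 1024x1024 image: 16 channels, 128x128 spatial grid.
latent_shape = (1, 16, 128, 128)

buggy_seq_len = latent_shape[1]  # channel dimension, what the old code read
fixed_seq_len = lumina2_image_seq_len(latent_shape, patch_size=2)

print(buggy_seq_len, fixed_seq_len)  # 16 4096
```

In the pipeline the `patch_size` value would come from `self.transformer.config.patch_size` rather than a hard-coded default.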

Why Flux uses latents.shape[1] but Lumina2 cannot

The Flux pipeline correctly uses latents.shape[1] because Flux latents are pre-packed into (batch, seq_len, channels) before image_seq_len is computed. Lumina2 does not pre-pack its latents — the transformer handles patchification internally — so shape[1] gives channels, not sequence length.
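The shape arithmetic behind that contrast can be sketched as follows. This is an illustration only: `flux_style_packed_shape` is a hypothetical helper, and it assumes Flux-style packing folds each 2x2 spatial patch into the channel axis, which is my reading of the Flux pipeline rather than something stated in this PR:

```python
def flux_style_packed_shape(latent_shape, pack=2):
    """Shape after folding pack x pack spatial patches into the channel axis.

    Assumption: (B, C, H, W) -> (B, (H // pack) * (W // pack), C * pack * pack).
    """
    batch, channels, height, width = latent_shape
    seq_len = (height // pack) * (width // pack)
    return (batch, seq_len, channels * pack * pack)


unpacked = (1, 16, 128, 128)                # Lumina2-style latents at this point
packed = flux_style_packed_shape(unpacked)  # Flux-style latents at this point

# shape[1] means different things depending on the layout.
print(unpacked[1])  # 16   (channels)
print(packed[1])    # 4096 (sequence length)
```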

Changes

  • src/diffusers/pipelines/lumina2/pipeline_lumina2.py: Replace latents.shape[1] with spatial sequence length computation using patch_size from transformer config
  • tests/pipelines/lumina2/test_pipeline_lumina2.py: Add test verifying mu is computed from spatial dimensions (not channel dim), using dimensions where channel count != spatial seq_len to catch regressions

…na2 pipeline

Fixes huggingface#12913

`image_seq_len` was computed as `latents.shape[1]`, which gives the
channel dimension (e.g. 16) since Lumina2 latents have shape
`(batch, channels, height, width)` and are NOT packed/reshaped before
this point. The Lumina2 transformer internally patchifies the latents
with `patch_size=2`, so the correct spatial sequence length is
`(H // patch_size) * (W // patch_size)`.

This incorrect value was passed to `calculate_shift()`, which computes
the `mu` parameter for the flow-matching scheduler. Using channel count
instead of token count produces a completely wrong shift, degrading
generation quality.
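To give a sense of how far off the shift can be, here is a minimal sketch of a linear sequence-length-to-`mu` interpolation. The function name `calculate_shift_sketch` and the default values (256 and 4096 tokens, shifts 0.5 and 1.15) are assumptions for illustration; the real values come from the scheduler config and may differ:

```python
def calculate_shift_sketch(
    image_seq_len,
    base_seq_len=256,    # assumed default
    max_seq_len=4096,    # assumed default
    base_shift=0.5,      # assumed default
    max_shift=1.15,      # assumed default
):
    """Linearly interpolate mu between (base_seq_len, base_shift) and (max_seq_len, max_shift)."""
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    intercept = base_shift - slope * base_seq_len
    return image_seq_len * slope + intercept


mu_from_channels = calculate_shift_sketch(16)    # buggy input: channel count
mu_from_tokens = calculate_shift_sketch(4096)    # fixed input: token count

print(mu_from_channels, mu_from_tokens)
```

Under these assumed defaults, the channel count falls below `base_seq_len`, so the resulting `mu` is even smaller than the base shift, while the token count for a 1024x1024 image lands at the maximum shift.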

The fix reads `patch_size` from `self.transformer.config.patch_size` and
computes `image_seq_len` from the last two (spatial) dimensions of the
latents tensor, matching how the transformer itself computes its input
sequence length.

For reference, the Flux pipeline correctly uses `latents.shape[1]`
because Flux latents are pre-packed into `(batch, seq_len, channels)`
before this computation. Lumina2 does not pre-pack, so the same indexing
does not apply.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Is Lumina2Pipeline's mu calculation correct?