
[UniLLaDA] Add UniLLaDA multimodal discrete diffusion pipeline#13686

Open
ChinChyi wants to merge 1 commit into huggingface:main from ChinChyi:add-unillada-pipeline

Conversation


@ChinChyi ChinChyi commented May 6, 2026

What does this PR do?

Adds support for LLaDA 2.0-Uni, a unified multimodal discrete diffusion language model that supports text understanding, image understanding, and image generation in a single framework.

Paper: LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

New Components

  • LLaDA2UniImageTransformer2DModel — Image diffusion transformer for decoding VQ tokens to images
  • UniLLaDaPipeline — Unified pipeline supporting three modes:
    • Text-to-image generation
    • Image understanding (VQA, captioning)
    • Image editing
  • LLaDA2UniFlowMatchEulerScheduler — Flow matching scheduler with Euler ODE integration
  • Image tokenizer utilities — SigVQ-based image encoding/decoding
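The flow-matching scheduler listed above integrates the probability-flow ODE with plain Euler steps. As a rough sketch of what one such update does (illustrative only; the function name, the sigma schedule, and the stand-in velocity are not the actual `LLaDA2UniFlowMatchEulerScheduler` API):

```python
import torch

def euler_flow_match_step(sample, velocity, sigma, sigma_next):
    # One Euler step of the flow-matching ODE:
    # x_{next} = x + (sigma_next - sigma) * v(x, sigma)
    return sample + (sigma_next - sigma) * velocity

# Toy rollout over a linear sigma schedule (8 steps, turbo-mode-sized).
sigmas = torch.linspace(1.0, 0.0, 9)
x = torch.randn(1, 4)
for s, s_next in zip(sigmas[:-1], sigmas[1:]):
    # Stand-in velocity: exact field when the target distribution
    # collapses to a point at zero (x_t = sigma * noise => v = x / sigma).
    v = x / s
    x = euler_flow_match_step(x, v, s, s_next)
# x has been integrated to the target (zero) by the final step
```

In the real pipeline the velocity would come from the transformer's prediction at each sigma; the Euler update itself is the only part sketched faithfully here.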

Key Features

  • Multimodal capabilities: Single model handles both vision and language tasks
  • Discrete diffusion: Block-wise iterative refinement for token generation
  • FP8 quantization support: Efficient inference with quantized weights
  • Flexible decoding: Supports both quality mode (50 steps) and turbo mode (8 steps)
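To make "block-wise iterative refinement" concrete: masked token positions inside a block are filled over several passes, committing the most confident predictions first. A minimal sketch under that assumption (the mask id, vocab size, commit rule, and toy random "model" are illustrative, not the pipeline's actual decoding logic):

```python
import torch

MASK_ID = 10  # mask token id, chosen outside the toy vocab [0, 10)
VOCAB = 10

def refine_block(logits_fn, tokens, steps):
    # Iteratively unmask a block: each pass commits the highest-confidence
    # predictions among still-masked positions, finishing within `steps`.
    tokens = tokens.clone()
    for step in range(steps):
        masked = tokens == MASK_ID
        remaining = int(masked.sum())
        if remaining == 0:
            break
        conf, pred = logits_fn(tokens).softmax(-1).max(-1)
        # Only still-masked positions compete for commitment.
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        k = -(-remaining // (steps - step))  # ceil division
        idx = conf.topk(k).indices
        tokens[idx] = pred[idx]
    return tokens

# Toy "model": random logits stand in for the transformer's predictions.
block = torch.full((8,), MASK_ID, dtype=torch.long)
out = refine_block(lambda t: torch.randn(t.shape[0], VOCAB), block, steps=3)
```

The quality/turbo distinction above maps onto `steps`: more passes commit fewer tokens per pass, trading latency for refinement.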

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import UniLLaDaPipeline, BlockRefinementScheduler
from diffusers.pipelines.unillada.image_tokenizer import ImageTokenizer

model_id = "inclusionAI/LLaDA2.0-Uni"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
scheduler = BlockRefinementScheduler()
image_tokenizer = ImageTokenizer(model_path=model_id)

pipe = UniLLaDaPipeline(
    transformer=model,
    tokenizer=tokenizer,
    scheduler=scheduler,
    image_tokenizer=image_tokenizer,
)

# Text-to-Image
result = pipe(prompt="A cat sitting on a windowsill at sunset")
result.images[0].save("output.png")

# Image Understanding
from PIL import Image
img = Image.open("photo.jpg")
result = pipe(image=img, question="Describe this image in detail.")
print(result.text)

# Image Editing
result = pipe(image=img, instruction="Change the background to a beach.")
result.images[0].save("edited.png")

Testing

  • Added unit tests in tests/pipelines/unillada/test_unillada.py
  • Tests cover all three modes (generation, understanding, editing)
  • Mock components for CI compatibility
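For reviewers unfamiliar with the mock-component pattern, the idea is roughly this (illustrative only; `run_pipeline` is a stand-in, and the real tests live in `tests/pipelines/unillada/test_unillada.py`):

```python
from unittest.mock import MagicMock

# Replace the heavy transformer with a mock returning fixed output, so
# the test exercises pipeline plumbing on CPU-only CI without weights.
transformer = MagicMock(return_value="fixed-logits")

def run_pipeline(transformer, prompt):
    # Stand-in for the pipeline's __call__ dispatch.
    return transformer(prompt)

result = run_pipeline(transformer, "A cat sitting on a windowsill")
```

The mock both supplies deterministic output and records the call, so the test can assert the pipeline invoked the component with the expected arguments.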

Model Weights

Official weights available at: https://huggingface.co/inclusionAI/LLaDA2.0-Uni

Before submitting

  • Did you read the contributor guideline?
  • Did you read our philosophy doc?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@yiyixuxu @a-r-r-o-w @DN6

Add UniLLaDA pipeline supporting text-to-image, image understanding,
and image editing via block-wise iterative discrete diffusion.

New components:
- UniLLaDaPipeline: main pipeline (DiffusionPipeline subclass)
- LLaDA2UniImageTransformer2DModel: image transformer model
- LLaDA2UniFlowMatchEulerScheduler: flow matching scheduler
- ImageTokenizer: VQ image encoder helper
- Documentation and tests
