diff --git a/docs/en/Pipeline_Usage/Accelerated_Inference.md b/docs/en/Pipeline_Usage/Accelerated_Inference.md new file mode 100644 index 00000000..ac04c750 --- /dev/null +++ b/docs/en/Pipeline_Usage/Accelerated_Inference.md @@ -0,0 +1,94 @@ +# Inference Acceleration + +The denoising process of diffusion models is typically time-consuming. To improve inference speed, various acceleration techniques can be applied, including lossless acceleration solutions such as multi-GPU parallel inference and computation graph compilation, as well as lossy acceleration solutions like Cache and quantization. + +Currently, most diffusion models are built on Diffusion Transformers (DiT), and efficient attention mechanisms are also a common acceleration method. DiffSynth-Studio currently supports certain lossless acceleration inference features. This section focuses on introducing acceleration methods from two dimensions: multi-GPU parallel inference and computation graph compilation. + +## Efficient Attention Mechanisms + +For details on the acceleration of attention mechanisms, please refer to [Attention Mechanism Implementation](../API_Reference/core/attention.md). + +## Multi-GPU Parallel Inference + +DiffSynth-Studio adopts a multi-GPU inference solution using Unified Sequence Parallel (USP). It splits the token sequence in the DiT across multiple GPUs for parallel processing. The underlying implementation is based on [xDiT](https://github.com/xdit-project/xDiT). Please note that unified sequence parallelism introduces additional communication overhead, so the actual speedup ratio is usually lower than the number of GPUs. + +Currently, DiffSynth-Studio supports unified sequence parallel acceleration for the [Wan](../Model_Details/Wan.md) and [MOVA](../Model_Details/Wan.md) models. + +First, install the `xDiT` dependency. + +```bash +pip install "xfuser[flash-attn]>=0.4.3" +``` + +Then, use `torchrun` to launch multi-GPU inference. + +```bash +torchrun --standalone --nproc_per_node=8 examples/wanvideo/acceleration/unified_sequence_parallel.py +``` + +When building the pipeline, simply configure `use_usp=True` to enable USP parallel inference. A code example is shown below. + +```python +import torch +from PIL import Image +from diffsynth.utils.data import save_video +from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig +import torch.distributed as dist + +pipe = WanVideoPipeline.from_pretrained( + torch_dtype=torch.bfloat16, + device="cuda", + use_usp=True, + model_configs=[ + ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="diffusion_pytorch_model*.safetensors"), + ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth"), + ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="Wan2.1_VAE.pth"), + ], + tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"), +) + +# Text-to-video +video = pipe( + prompt="一名宇航员身穿太空服,面朝镜头骑着一匹机械马在火星表面驰骋。红色的荒凉地表延伸至远方,点缀着巨大的陨石坑和奇特的岩石结构。机械马的步伐稳健,扬起微弱的尘埃,展现出未来科技与原始探索的完美结合。宇航员手持操控装置,目光坚定,仿佛正在开辟人类的新疆域。背景是深邃的宇宙和蔚蓝的地球,画面既科幻又充满希望,让人不禁畅想未来的星际生活。", + negative_prompt="色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走", + seed=0, tiled=True, +) +if dist.get_rank() == 0: + save_video(video, "video1.mp4", fps=15, quality=5) +``` + +## Computation Graph Compilation + +PyTorch 2.0 provides an automatic computation graph compilation interface, [torch.compile](https://docs.pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), which can just-in-time (JIT) compile PyTorch code into optimized kernels, thereby improving execution speed. Since the inference time of diffusion models is concentrated in the multi-step denoising phase of the DiT, and the DiT is primarily stacked with basic blocks, DiffSynth's compile feature uses a [regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) strategy targeting only the basic Transformer blocks to reduce compilation time. + +### Compile Usage Example + +Compared to standard inference, you simply need to execute `pipe.compile_pipeline()` before calling the pipeline to enable compilation acceleration. For the specific function definition, please refer to the [source code](https://github.com/modelscope/DiffSynth-Studio/blob/166e6d2d38764209f66a74dd0fe468226536ad0f/diffsynth/diffusion/base_pipeline.py#L342). + +The input parameters for `compile_pipeline` consist mainly of two types. + +The first type is the compiled model parameters, `compile_models`. Taking the Qwen-Image Pipeline as an example, if you only want to compile the DiT model, you can keep this parameter empty. If you need to additionally compile models like the VAE, you can pass `compile_models=["vae", "dit"]`. Aside from DiT, all other models use a full-graph compilation strategy, meaning the model's forward function is completely compiled into a computation graph. + +The second type is the compilation strategy parameters. This covers `mode`, `dynamic`, `fullgraph`, and other custom options. These parameters are directly passed to the `torch.compile` interface. If you are not deeply familiar with the specific mechanics of these parameters, it is recommended to keep the default settings. + + * `mode` specifies the compilation mode, including `"default"`, `"reduce-overhead"`, `"max-autotune"`, and `"max-autotune-no-cudagraphs"`. Because cudagraph has stricter requirements on computation graphs (for example, it might need to be used in conjunction with `torch.compiler.cudagraph_mark_step_begin()`), the `"reduce-overhead"` and `"max-autotune"` modes might fail to compile. + * `dynamic` determines whether to enable dynamic shapes. For most generative models, modifying the prompt, enabling CFG, or adjusting the resolution will change the shape of the input tensors to the computation graph. Setting `dynamic=True` will increase the compilation time of the first run, but it supports dynamic shapes, meaning no recompilation is needed when shapes change. When set to `dynamic=False`, the first compilation is faster, but any operation that alters the input shape will trigger a recompilation. For most scenarios, setting it to `dynamic=True` is recommended. + * `fullgraph`, when set to `True`, makes the underlying system attempt to compile the target model into a single computation graph, throwing an error if it fails. When set to `False`, the underlying system will set breakpoints where connections cannot be made, compiling the model into multiple independent computation graphs. Developers can set it to `True` to optimize compilation performance, but regular users are advised to only use `False`. + * For other parameter configurations, please consult the [API documentation](https://docs.pytorch.org/docs/stable/generated/torch.compile.html). + +### Compile Feature Developer Documentation + +If you need to provide compile support for a newly integrated pipeline, you should configure the `compilable_models` attribute in the pipeline to specify the default models to compile. For the DiT model class of that pipeline, you also need to configure `_repeated_blocks` to specify the types of basic blocks that will participate in regional compilation. + +Taking Qwen-Image as an example, its pipeline configuration is as follows: + +```python +self.compilable_models = ["dit"] +``` + +Its DiT configuration is as follows: + +```python +class QwenImageDiT(torch.nn.Module): + _repeated_blocks = ["QwenImageTransformerBlock"] +``` diff --git a/docs/en/index.rst b/docs/en/index.rst index 4b933cac..22dfc0ad 100644 --- a/docs/en/index.rst +++ b/docs/en/index.rst @@ -13,6 +13,7 @@ Welcome to DiffSynth-Studio's Documentation Pipeline_Usage/Setup Pipeline_Usage/Model_Inference + Pipeline_Usage/Accelerated_Inference Pipeline_Usage/VRAM_management Pipeline_Usage/Model_Training Pipeline_Usage/Environment_Variables diff --git a/docs/zh/Pipeline_Usage/Accelerated_Inference.md b/docs/zh/Pipeline_Usage/Accelerated_Inference.md new file mode 100644 index 00000000..f6d4f19e --- /dev/null +++ b/docs/zh/Pipeline_Usage/Accelerated_Inference.md @@ -0,0 +1,84 @@ +# 推理加速 + +扩散模型的去噪过程通常耗时较长。为提升推理速度,可采用多种加速技术,包含多卡并行推理、计算图编译等无损加速方案,以及 Cache、量化等有损加速方案。 + +当前扩散模型大多基于 Diffusion Transformer 构建,高效注意力机制同样是常用的加速手段。DiffSynth-Studio 目前已支持部分无损加速推理功能。本节重点从多卡并行推理和计算图编译两个维度介绍加速方法。 + +## 高效注意力机制 +注意力机制的加速细节请参考 [注意力机制实现](../API_Reference/core/attention.md)。 + +## 多卡并行推理 +DiffSynth-Studio 采用统一序列并行的多卡推理方案。在 DiT 中将 token 序列拆分至多张显卡进行并行处理。底层基于 [xDiT](https://github.com/xdit-project/xDiT) 实现。需要注意,统一序列并行会引入额外通信开销,实际加速比通常低于显卡数量。 + +目前 DiffSynth-Studio 已支持 [Wan](../Model_Details/Wan.md) 和 [MOVA](../Model_Details/Wan.md) 模型的统一序列并行加速。 + +首先安装 `xDiT` 依赖。 +```bash +pip install "xfuser[flash-attn]>=0.4.3" +``` + +然后使用 `torchrun` 启动多卡推理。 +```bash +torchrun --standalone --nproc_per_node=8 examples/wanvideo/acceleration/unified_sequence_parallel.py +``` + +构建 pipeline 时配置 `usp=True` 即可实现 USP 并行推理。代码示例如下。 +```python +import torch +from PIL import Image +from diffsynth.utils.data import save_video +from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig +import torch.distributed as dist + +pipe = WanVideoPipeline.from_pretrained( + torch_dtype=torch.bfloat16, + device="cuda", + use_usp=True, + model_configs=[ + ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="diffusion_pytorch_model*.safetensors"), + ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth"), + ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="Wan2.1_VAE.pth"), + ], + tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"), +) + +# Text-to-video +video = pipe( + prompt="一名宇航员身穿太空服,面朝镜头骑着一匹机械马在火星表面驰骋。红色的荒凉地表延伸至远方,点缀着巨大的陨石坑和奇特的岩石结构。机械马的步伐稳健,扬起微弱的尘埃,展现出未来科技与原始探索的完美结合。宇航员手持操控装置,目光坚定,仿佛正在开辟人类的新疆域。背景是深邃的宇宙和蔚蓝的地球,画面既科幻又充满希望,让人不禁畅想未来的星际生活。", + negative_prompt="色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走", + seed=0, tiled=True, +) +if dist.get_rank() == 0: + save_video(video, "video1.mp4", fps=15, quality=5) +``` + +## 计算图编译 +PyTorch 2.0 提供了自动计算图编译接口 [torch.compile](https://docs.pytorch.org/tutorials/intermediate/torch_compile_tutorial.html),能够将 PyTorch 代码即时编译为优化内核,从而提升运行速度。由于扩散模型的推理耗时集中在 DiT 的多步去噪阶段,且 DiT 主要由基础模块堆叠而成,为缩短编译时间,DiffSynth 的 compile 功能采用仅针对基础 Transformer 模块的 [区域编译](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) 策略。 + +### Compile 使用示例 +相比常规推理,只需在调用 pipeline 前执行 `pipe.compile_pipeline()` 即可开启编译加速。具体函数定义请参阅[源代码](https://github.com/modelscope/DiffSynth-Studio/blob/166e6d2d38764209f66a74dd0fe468226536ad0f/diffsynth/diffusion/base_pipeline.py#L342)。 + +`compile_pipeline` 的输入参数主要包含两类。 + +第一类是编译模型参数 `compile_models`。以 Qwen-Image Pipeline 为例,若仅编译 DiT 模型,保持该参数为空即可。若需额外编译 VAE 等模型,可传入 `compile_models=["vae", "dit"]`。除 DiT 外,其余模型均采用整体编译策略,即把模型的 forward 函数完整编译为计算图。 + +第二类是编译策略参数。涵盖 `mode`, `dynamic`, `fullgraph` 及其他自定义选项。这些参数会直接传递给 `torch.compile` 接口。若未深入了解这些参数的具体机制,建议保持默认设置。 + +- `mode` 指定编译模式,包含 `"default"`, `"reduce-overhead"`, `"max-autotune"` 和 `"max-autotune-no-cudagraphs"`。由于 cudagraph 对计算图要求较为严格(例如可能需要配合 `torch.compiler.cudagraph_mark_step_begin()` 使用),`"reduce-overhead"` 和 `"max-autotune"` 模式可能编译失败。 +- `dynamic` 决定是否启用动态形状。对于多数生成模型,修改 prompt、开启 CFG 或调整分辨率都会改变计算图的输入张量形状。设置为 `dynamic=True` 会增加首次运行的编译时长,但支持动态形状,形状改变时无需重编译。设置为 `dynamic=False` 时首次编译较快,但任何改变输入形状的操作都会触发重新编译。对大部分场景,建议设定为 `dynamic=True`。 +- `fullgraph` 设为 `True` 时,底层会尝试将目标模型编译为单一计算图,若失败则报错。设为 `False` 时,底层会在无法连接处设置断点,将模型编译为多个独立计算图。开发者可开启 `True` 来优化编译性能,普通用户建议仅使用 `False`。 +- 其他参数配置请查阅 [api 文档](https://docs.pytorch.org/docs/stable/generated/torch.compile.html)。 + +### Compile 功能开发者文档 +若需为新接入的 pipeline 提供 compile 支持,应在 pipeline 中配置 `compilable_models` 属性以指定默认编译模型。针对该 pipeline 的 DiT 模型类,还需配置 `_repeated_blocks` 以指定参与区域编译的基础模块类型。 + +以 Qwen-Image 为例,其 pipeline 配置如下。 +```python +self.compilable_models = ["dit"] +``` + +其 DiT 配置如下。 +```python +class QwenImageDiT(torch.nn.Module): + _repeated_blocks = ["QwenImageTransformerBlock"] +``` diff --git a/docs/zh/index.rst b/docs/zh/index.rst index 42256b3b..e41e0872 100644 --- a/docs/zh/index.rst +++ b/docs/zh/index.rst @@ -13,6 +13,7 @@ Pipeline_Usage/Setup Pipeline_Usage/Model_Inference + Pipeline_Usage/Accelerated_Inference Pipeline_Usage/VRAM_management Pipeline_Usage/Model_Training Pipeline_Usage/Environment_Variables