[Cherry-Pick][XPU] Enable CudaGraph capture for MTP draft model(#7864) by Clarity256 · Pull Request #7941 · PaddlePaddle/FastDeploy

Clarity256 · 2026-05-27T09:13:00Z

Motivation

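PR #7864 enabled CUDAGraph for the target model in MTP speculative decoding on XPU, but CUDAGraph capture was still not enabled for the draft model. Building on #7864, this PR adds CUDAGraph support for the MTP draft model. The main changes are:

Enable the step_use_cudagraph gating logic in the draft model forward pass, and capture only the first step of multi-step execution.
Pass forward_meta and use_cudagraph to xpu_pre_process on the draft model inference path, so that cu_seqlens_q_output / batch_id_per_token_output are updated in place with copy_ in cudagraph mode, which keeps tensor addresses stable.
Add a padding_cudagraph_inputs() method that pads the draft model buffers, and slice the model output by real_token_num on graph replay.
Adapt the target model speculative decoding warmup flow (capture size computation, passing the accept_all_drafts parameter).
Replace padding_sampling_params (a Python-side CPU implementation) with the build_sampling_params XPU custom operator, which updates infer_seed in place inside the operator and avoids extra operations outside the cudagraph.
Tie increment_value to the number of speculative decoding tokens ((num_speculative_tokens + 1) * 4).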
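
The address-stability and padding points above can be illustrated with a minimal pure-Python sketch. `StaticBuffer`, `run_padded`, and the doubling "model" are hypothetical stand-ins, not FastDeploy APIs. The sketch only assumes that a captured graph keeps referring to the same storage, so inputs have to be overwritten in place and padded to the captured size, and the output is then sliced back to real_token_num.

```python
from typing import List


class StaticBuffer:
    """Hypothetical fixed-size buffer standing in for a graph-captured tensor."""

    def __init__(self, capacity: int) -> None:
        self.data: List[int] = [0] * capacity

    def copy_(self, values: List[int]) -> None:
        """Overwrite the buffer in place and zero-pad the unused tail."""
        if len(values) > len(self.data):
            raise ValueError("input exceeds the captured buffer size")
        padding = [0] * (len(self.data) - len(values))
        # Slice assignment mutates the existing list instead of rebinding it,
        # so anything holding a reference to self.data still sees the update.
        self.data[:] = values + padding


def run_padded(real_tokens: List[int], buf: StaticBuffer) -> List[int]:
    """Pad inputs to the captured size, 'replay', then drop padded outputs."""
    real_token_num = len(real_tokens)
    buf.copy_(real_tokens)
    padded_output = [x * 2 for x in buf.data]  # stand-in for graph replay
    return padded_output[:real_token_num]


buf = StaticBuffer(capacity=8)
storage_before = id(buf.data)
result = run_padded([1, 2, 3], buf)
```

Rebinding `buf.data` to a new list would leave a captured reference pointing at stale storage, which is the failure mode the in-place update is meant to avoid.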

Modifications

custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc: adds the Paddle custom operator entry point and registers the build_sampling_params op.
custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h: declares the build_sampling_params C interface.
custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu: adds the XPU3 kernel implementation.
custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp: adds the CPU wrapper and the XPU3 wrapper.
custom_ops/xpu_ops/test/test_build_sampling_params.py: adds unit tests covering decoder-only, encoder-only, mixed, single-request, and seed wrap-around scenarios.
fastdeploy/model_executor/layers/sample/sampler.py: forward_xpu now uses the build_sampling_params XPU operator instead of padding_sampling_params; adds the increment_value parameter.
fastdeploy/model_executor/xpu_pre_and_post_process.py: in cudagraph mode, cu_seqlens_q_output and batch_id_per_token_output are updated in place with copy_, so the tensor addresses captured by the graph stay stable.
fastdeploy/spec_decode/mtp_xpu.py: enables step_use_cudagraph gating for the draft model; _propose gains cudagraph padding logic and output slicing; _initialize_forward_meta passes the cudagraph parameters.
fastdeploy/worker/xpu_model_runner.py: ties increment_value to the number of speculative decoding tokens; adapts the warmup capture flow to speculative decoding; moves the infer_seed update into the build_sampling_params operator; passes step_use_cudagraph to the draft model propose call.
tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py → test_mtp_cudagraph.py: renames the test script to follow the CI naming convention.
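
The seed handling described above can be sketched in plain Python. Only the formula `(num_speculative_tokens + 1) * 4` comes from this PR; the helper names and the wrap-around bound `MAX_INFER_SEED` are assumptions made for illustration, and the real update happens inside the build_sampling_params operator rather than in Python.

```python
from typing import List

# Assumed wrap-around bound, chosen for illustration; the PR does not state it.
MAX_INFER_SEED = 2**63 - 2


def increment_value(num_speculative_tokens: int) -> int:
    """Seed step tied to the speculative token count, as stated in the PR."""
    return (num_speculative_tokens + 1) * 4


def advance_seeds(
    seeds: List[int],
    num_speculative_tokens: int,
    max_seed: int = MAX_INFER_SEED,
) -> None:
    """Advance every per-request seed in place, wrapping at max_seed."""
    step = increment_value(num_speculative_tokens)
    for i in range(len(seeds)):
        seeds[i] = (seeds[i] + step) % max_seed


seeds = [0, 98]
# A small bound makes the wrap-around visible: 98 + 8 = 106 wraps to 6.
advance_seeds(seeds, num_speculative_tokens=1, max_seed=100)
```

Mutating the list in place mirrors the in-place infer_seed update, which keeps the seed buffer at a fixed location between steps.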

Usage or Command

Accuracy Tests

- MTP with CUDAGraph: output matches the reference results (see the PR screenshots)
- MTP without CUDAGraph: output matches the reference results (see the PR screenshots)

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-27T09:13:12Z

Thanks for your contribution!

CLAassistant · 2026-05-27T09:13:18Z

All committers have signed the CLA.

Based on PR PaddlePaddle#7864, this adds draft model CudaGraph support:
- Enable step_use_cudagraph for draft model with proper gating logic
- Pass forward_meta and use_cudagraph to xpu_pre_process in draft path
- Add padding_cudagraph_inputs() for draft model buffer management
- Slice model output by real_token_num when graph is active
- Include target model cudagraph changes (xpu_model_runner, xpu_pre_and_post_process)
- Add build_sampling_params XPU kernel for MTP

Verified stable with benchmark_serving (100 requests, concurrency=16, 0 failures).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

PaddlePaddle-bot · 2026-05-27T10:14:28Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-28 18:33:55

The CI report is generated from the following code (updated every 30 minutes):

PR commit: 00299e4
Merge base: d0a9661 (branch: develop)
View full diff
CI details

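1 Task overview

There are currently no failed required tasks, but 7 workflows are in the action_required state and will only run after manual approval. CI has therefore not finished; it is recommended to approve them and wait for the key workflows to produce results. Of the 2 optional tasks that have produced results so far, 1 passed and 1 failed.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
2(0)	2	1	1	0	0	0

⚠️ Note: the following 7 workflows are in the action_required state (they run only after approval): Approval, Codestyle-Check, Check PR Template, CI_HPU, PR Build and Test, CI_XPU, ILUVATAR-CI. These workflows must be triggered by manual approval.

Note: action_required workflows are not counted in the task statistics in the table above.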

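2 Task status summary

About the Log column: failed tasks link directly to the tool-generated log; running tasks link to the Job page.

2.1 Required tasks: 0/0 passed

Required tasks block merging, and their failures should be handled first. No required job has been triggered and counted yet; several key workflows are still in the action_required state, so please complete the approval first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
⏸️	`action_required workflows`	-	Approval required	Approve manually	CI details	-

2.2 Optional tasks: 1/2 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Trigger Jenkins for PR`	17s	Job	-
✅	The remaining 1 optional task passed	-	-	-

3 Failure details (required only)

No required tasks failed. Optional failures are not analyzed in depth in this round; Trigger Jenkins for PR is optional and is only listed in the optional tasks section above.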

codecov-commenter · 2026-05-27T12:39:51Z

Codecov Report

❌ Patch coverage is 0% with 38 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d0a9661). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/worker/xpu_model_runner.py	0.00%	17 Missing ⚠️
fastdeploy/spec_decode/mtp_xpu.py	0.00%	14 Missing ⚠️
...tdeploy/model_executor/xpu_pre_and_post_process.py	0.00%	5 Missing ⚠️
fastdeploy/model_executor/layers/sample/sampler.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7941   +/-   ##
==========================================
  Coverage           ?   64.06%           
==========================================
  Files              ?      467           
  Lines              ?    65065           
  Branches           ?     9977           
==========================================
  Hits               ?    41687           
  Misses             ?    20556           
  Partials           ?     2822

Flag	Coverage Δ
GPU	`73.18% <0.00%> (?)`
XPU	`7.06% <0.00%> (?)`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-28 14:46:48

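📋 Review summary

PR overview: adds CUDAGraph support for the draft model in MTP speculative decoding on XPU, and adds the build_sampling_params XPU operator to replace the Python-side implementation
Scope of changes: custom_ops/xpu_ops (new operator), fastdeploy/spec_decode/mtp_xpu.py, fastdeploy/worker/xpu_model_runner.py, sampler.py, xpu_pre_and_post_process.py
Impact tags: [XPU] [Speculative Decoding] [OP]

Issues

Severity	File	Summary
🟡 Suggestion	`xpu_model_runner.py:868`	The refactor removed the proposer's MoE phase sync logic; the MoE + EP + speculative decoding scenario may regress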

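Status of historical findings

Finding	Issue	Status
F1	`substep` parameter is not declared in the method signature	⚠️ Still present
F2	`infer_seed` is no longer updated on the non-speculative-decoding path	⚠️ Still present
F3	Same as F1: `substep` parameter issue	⚠️ Still present
F4	`capture_size` integer-division truncation	⚠️ Still present
F5	O(bs²) performance issue with pad_start in the kernel	⚠️ Still present
F6	Inconsistent semantics of `sampling_metadata.topp_seed`	⚠️ Still present
F7	Large amount of commented-out code left in mtp_xpu.py	⚠️ Still present
F8	Same as F4: `capture_size` integer-division truncation	⚠️ Still present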
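
The table does not show the expression behind F4 and F8. As a generic illustration of how integer-division truncation can arise, the sketch below assumes a capture size counted in tokens is divided by a per-request token count; both function names and that interpretation are hypothetical and are not taken from the PR diff.

```python
def requests_floor(capture_size: int, tokens_per_request: int) -> int:
    """Floor division: silently drops a partial request's worth of tokens."""
    return capture_size // tokens_per_request


def requests_ceil(capture_size: int, tokens_per_request: int) -> int:
    """Ceiling division: rounds up so no tokens are left uncovered."""
    return -(-capture_size // tokens_per_request)


# With 10 tokens and 4 tokens per request, floor covers only 8 tokens.
floor_result = requests_floor(10, 4)
ceil_result = requests_ceil(10, 4)
covered_by_floor = floor_result * 4
```

When the capture size is an exact multiple of the divisor, both forms agree, so the truncation only shows up for sizes that do not divide evenly.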

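📝 PR convention check

The title format follows the Cherry-Pick convention [Cherry-Pick][XPU] ... ✅. However, the ## Usage or Command section is empty (template comment only), and no Checklist item is ticked, even though the unit test test_build_sampling_params.py has been added and accuracy comparison screenshots are provided. Suggested completion:

Suggested title (can be copied directly):

[Cherry-Pick][XPU][Speculative Decoding] Enable CUDAGraph capture for MTP draft model (#7864)

Suggested PR description (can be copied directly):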

## Motivation
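PR #7864 enabled CUDAGraph for the target model in MTP speculative decoding on XPU, but CUDAGraph capture was still not enabled for the draft model. Building on #7864, this PR adds CUDAGraph support for the MTP draft model. The main changes are:

1. Enable the `step_use_cudagraph` gating logic in the draft model forward pass, and capture only the first step of multi-step execution.
2. Pass `forward_meta` and `use_cudagraph` to `xpu_pre_process` on the draft model inference path, so that `cu_seqlens_q_output` / `batch_id_per_token_output` are updated in place with `copy_` in cudagraph mode, which keeps tensor addresses stable.
3. Add a `padding_cudagraph_inputs()` method that pads the draft model buffers, and slice the model output by `real_token_num` on graph replay.
4. Adapt the target model speculative decoding warmup flow (capture size computation, passing the `accept_all_drafts` parameter).
5. Replace `padding_sampling_params` (a Python-side CPU implementation) with the `build_sampling_params` XPU custom operator, which updates `infer_seed` in place inside the operator and avoids extra operations outside the cudagraph.
6. Tie `increment_value` to the number of speculative decoding tokens (`(num_speculative_tokens + 1) * 4`).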

## Modifications
- **`custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`**: adds the Paddle custom operator entry point and registers the `build_sampling_params` op.
- **`custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`**: declares the `build_sampling_params` C interface.
- **`custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`**: adds the XPU3 kernel implementation.
- **`custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`**: adds the CPU wrapper and the XPU3 wrapper.
- **`custom_ops/xpu_ops/test/test_build_sampling_params.py`**: adds unit tests covering decoder-only, encoder-only, mixed, single-request, and seed wrap-around scenarios.
- **`fastdeploy/model_executor/layers/sample/sampler.py`**: `forward_xpu` now uses the `build_sampling_params` XPU operator instead of `padding_sampling_params`; adds the `increment_value` parameter.
- **`fastdeploy/model_executor/xpu_pre_and_post_process.py`**: in cudagraph mode, `cu_seqlens_q_output` and `batch_id_per_token_output` are updated in place with `copy_`, so the tensor addresses captured by the graph stay stable.
- **`fastdeploy/spec_decode/mtp_xpu.py`**: enables `step_use_cudagraph` gating for the draft model; `_propose` gains cudagraph padding logic and output slicing; `_initialize_forward_meta` passes the cudagraph parameters.
- **`fastdeploy/worker/xpu_model_runner.py`**: ties `increment_value` to the number of speculative decoding tokens; adapts the warmup capture flow to speculative decoding; moves the `infer_seed` update into the `build_sampling_params` operator; passes `step_use_cudagraph` to the draft model propose call.
- **`tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py → test_mtp_cudagraph.py`**: renames the test script to follow the CI naming convention.

## Usage or Command
N/A

## Accuracy Tests
- MTP with CUDAGraph: output matches the reference results (see the PR screenshots)
- MTP without CUDAGraph: output matches the reference results (see the PR screenshots)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

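Overall assessment

The overall approach is clear, and both the CUDAGraph address-stability handling (in-place copy_ updates, output slicing) and the replacement of the Python-side implementation with a new XPU operator are reasonable. The several unresolved historical findings deserve attention (especially F2, the missing infer_seed update, and F4, the integer-division truncation), as does the proposer MoE phase sync omitted in this change.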

PaddlePaddle-bot · 2026-05-28T06:50:13Z

-        if (
-            self.fd_config.scheduler_config.splitwise_role == "mixed" and envs.FD_XPU_ENABLE_MIXED_EP_MODE
-        ):  # Centralized scenario: the phase is initialized as "prefill" by default. During inference runtime, different types of batches can achieve phase switching at this point.
+        if self.fd_config.parallel_config.use_ep and self.fd_config.scheduler_config.splitwise_role == "mixed":


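🟡 Suggestion: the runtime update of self.proposer.fd_config.parallel_config.moe_phase.phase was removed

Inside the use_cudagraph branch, the original code also synchronized the proposer's MoE phase:

    if self.speculative_decoding:
        self.proposer.fd_config.parallel_config.moe_phase.phase = "decode" if if_only_decode else "prefill"

After this refactor the logic is deleted. If the draft model uses MoE + EP, its phase will always keep its initial value, which may make the EP dispatch/combine path incorrect.

Please confirm whether the draft model involves MoE; if it does, the proposer phase sync logic needs to be restored here.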
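
One way the requested sync could look is sketched below with stand-in dataclasses. The attribute path `fd_config.parallel_config.moe_phase.phase` and the `if_only_decode` flag come from the review comment; the classes, the `sync_moe_phase` helper, and the assumption that the target model's phase is set in the same place are hypothetical and do not reproduce the FastDeploy code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MoEPhase:
    phase: str = "prefill"  # initial value, as described in the diff comment


@dataclass
class ParallelConfig:
    use_ep: bool = True
    moe_phase: MoEPhase = field(default_factory=MoEPhase)


@dataclass
class FDConfig:
    parallel_config: ParallelConfig = field(default_factory=ParallelConfig)


@dataclass
class Proposer:
    """Stand-in for the draft model proposer, which owns its own config."""

    fd_config: FDConfig = field(default_factory=FDConfig)


def sync_moe_phase(
    target_config: FDConfig, proposer: Optional[Proposer], if_only_decode: bool
) -> str:
    """Set the same MoE phase on the target config and, if present, the proposer."""
    phase = "decode" if if_only_decode else "prefill"
    target_config.parallel_config.moe_phase.phase = phase
    if proposer is not None:
        # Without this step the proposer would keep its initial phase forever.
        proposer.fd_config.parallel_config.moe_phase.phase = phase
    return phase


target = FDConfig()
draft = Proposer()
sync_moe_phase(target, draft, if_only_decode=True)
```

Keeping both assignments in one helper makes it harder for a later refactor to update one config and forget the other.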

Clarity256 had a problem deploying to Metax_ci May 27, 2026 09:13 — with GitHub Actions Failure

paddle-bot Bot added the contributor External developers label May 27, 2026

Clarity256 force-pushed the feature/mtp-cudagraph-draft-model-xpu branch from 5a6882b to bf150a4 Compare May 27, 2026 09:19

Clarity256 had a problem deploying to Metax_ci May 27, 2026 09:19 — with GitHub Actions Failure

Clarity256 had a problem deploying to Metax_ci May 28, 2026 03:26 — with GitHub Actions Failure

[XPU] Fix missing mtp_step_paddle import in mtp_xpu.py

00299e4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Clarity256 force-pushed the feature/mtp-cudagraph-draft-model-xpu branch from 4ef7a1f to 00299e4 Compare May 28, 2026 06:39

Clarity256 had a problem deploying to Metax_ci May 28, 2026 06:39 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 28, 2026

Clarity256 wants to merge 2 commits into PaddlePaddle:develop from Clarity256:feature/mtp-cudagraph-draft-model-xpu

4 participants
