Fix score calculation and support neox rope for fleet-gqa-latent by chang-wenbin · Pull Request #7952 · PaddlePaddle/FastDeploy

chang-wenbin · 2026-05-28T13:22:27Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

- Add at least a tag in the PR title.
  - Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run pre-commit before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-28T13:22:35Z

Thanks for your contribution!

codecov-commenter · 2026-05-28T14:20:08Z

Codecov Report

❌ Patch coverage is 22.22222% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@91ca3d1). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...l_executor/layers/moe/fused_moe_cutlass_backend.py	0.00%	5 Missing ⚠️
fastdeploy/worker/input_batch.py	50.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7952   +/-   ##
==========================================
  Coverage           ?   20.53%           
==========================================
  Files              ?      467           
  Lines              ?    65181           
  Branches           ?    10007           
==========================================
  Hits               ?    13383           
  Misses             ?    51021           
  Partials           ?      777

Flag	Coverage Δ
XPU	`20.53% <22.22%> (?)`


PaddlePaddle-bot · 2026-05-28T14:45:44Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-29 04:11:00

This CI report was generated from the following code (updated every 30 minutes):

PR commit: e546850
Merge base: 91ca3d1 (branch: develop)
View full diff
CI details

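1 Task Overview

Not all Required tasks have passed: 4 Required tasks failed and 1 Required task was canceled (the main test task). Fix the failures first, then rerun CI.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Canceled	Skipped
41(0)	41	32	8	0	0	1	0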

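2 Task Status Summary

2.1 Required tasks: 5/10 passed

Required tasks block merging, so their failures must be handled first. All 4 required failures in this round hit the historical analysis cache and share the same root cause.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run Four Cards Tests / run_4_cards_tests`	20m35s	PR issue: ProposerInputBatch is missing rotary_dim	Complete ProposerInputBatch initialization	Job	-
❌	`xpu_4cards_case_test / run_xpu_4cards_cases`	41m21s	PR issue: ProposerInputBatch is missing rotary_dim	Complete ProposerInputBatch initialization	Job	-
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	17m22s	PR issue: ProposerInputBatch is missing rotary_dim	Complete ProposerInputBatch initialization	Job	-
❌	`Extracted partial CE model tasks to run in CI. / run_ce_cases`	15m50s	PR issue: ProposerInputBatch is missing rotary_dim	Complete ProposerInputBatch initialization	Job	-
🚫	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Canceled: the main test task did not finish	Rerun after the fix to obtain coverage results	-	-
✅	The other 5 required tasks passed	-	-	-	-	-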

2.2 Optional tasks: 27/31 passed

Optional tasks do not block merging; their failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	1m59s	Job	-
❌	`Check PR Template`	20s	Job	-
❌	`CI_HPU`	1h4m	Job	-
❌	`Trigger Jenkins for PR`	16s	Job	-
✅	The other 27 optional tasks passed	-	-	-

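3 Failure Details (required only)

Run Four Cards Tests / run_4_cards_tests: PR issue (confidence: high)

Status: ❌ Failed
Error type: test case failure
Confidence: high
Root cause summary: ProposerInputBatch is missing rotary_dim
Analyzer: generic analysis (fallback)

Root cause details:
The failing files in the 4-card GPU e2e run are all MTP scenarios. The log shows that gpu_model_runner.py initializes the speculative proposer and then enters MTPProposerCUDA. As in the other failures, ProposerInputBatch.init_share_inputs() accesses the uninitialized rotary_dim while creating the rope embedding.

Code check:
In fastdeploy/worker/input_batch.py, this PR adds self.rotary_dim only to InputBatch.__init__, but ProposerInputBatch.__init__ overrides initialization and does not call InputBatch.__init__. ProposerInputBatch.init_share_inputs() also calls get_rope(rotary_dim=self.rotary_dim, ...), so the MTP/speculative proposer path raises AttributeError.

Suggested fix:

- Initialize rotary_dim in ProposerInputBatch as well, so that both the CUDA and XPU proposer paths have the attribute.
- Add or extend unit tests for the speculative decoding/MTP initialization path, so that neox rope is not covered only through the plain InputBatch.

Fix summary: Complete ProposerInputBatch initialization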
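
The failure mode described in the code check can be reproduced in isolation. The sketch below is a simplified stand-in, not the FastDeploy source: the two classes only model the `rotary_dim` attribute, `init_share_inputs()` just reads the attribute instead of calling `get_rope`, and the `SimpleNamespace` config is an assumption made for illustration.

```python
from types import SimpleNamespace


class InputBatch:
    """Simplified stand-in: only the attribute relevant to the failure is modeled."""

    def __init__(self, model_config):
        self.model_config = model_config
        rotary_percent = getattr(model_config, "rotary_percent", 1)
        self.rotary_dim = int(rotary_percent * model_config.head_dim)

    def init_share_inputs(self):
        # The real method passes rotary_dim to get_rope(); here we only read it.
        return self.rotary_dim


class ProposerInputBatch(InputBatch):
    def __init__(self, model_config):
        # Overrides __init__ without calling super().__init__(),
        # so self.rotary_dim is never assigned.
        self.model_config = model_config


config = SimpleNamespace(head_dim=128)
main_dim = InputBatch(config).init_share_inputs()

try:
    ProposerInputBatch(config).init_share_inputs()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__
```

Because `init_share_inputs()` is inherited while `__init__` is overridden, the subclass reaches the attribute read without ever having set it, which matches the AttributeError reported for the MTP path.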

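xpu_4cards_case_test / run_xpu_4cards_cases: PR issue (confidence: high)

Status: ❌ Failed
Error type: test case failure
Confidence: high
Root cause summary: ProposerInputBatch is missing rotary_dim
Analyzer: generic analysis (fallback)

Root cause details:
In fastdeploy/worker/input_batch.py, this PR adds self.rotary_dim only to InputBatch.__init__, but ProposerInputBatch.__init__ overrides initialization, does not call super().__init__(), and does not set rotary_dim itself. The MTP path creates a ProposerInputBatch and then runs init_share_inputs(), which crashes at get_rope(rotary_dim=self.rotary_dim, ...).

Additional notes:
The log summary also shows a GitHub Artifact storage quota notice, which may affect log/artifact uploads. However, both the cached in-depth analysis and the code check point to the missing rotary_dim in the proposer batch as the root cause of the functional failure.

Suggested fix:

- In ProposerInputBatch.__init__, add the same rotary_dim initialization as in InputBatch.__init__, or extract the common initialization logic so that both batch classes can reuse it.
- After the fix, rerun the MTP / speculative decoding XPU 4-card cases.

Fix summary: Complete ProposerInputBatch initialization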
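
The second option in the suggested fix, extracting the common initialization logic, could look roughly like the sketch below. The helper name `_init_rotary_dim` is hypothetical and the classes are reduced stand-ins; only the two-line calculation comes from the report.

```python
from types import SimpleNamespace


class InputBatch:
    def __init__(self, model_config):
        self.model_config = model_config
        self._init_rotary_dim()

    def _init_rotary_dim(self):
        # Hypothetical shared helper holding the calculation quoted in the report.
        rotary_percent = getattr(self.model_config, "rotary_percent", 1)
        self.rotary_dim = int(rotary_percent * self.model_config.head_dim)


class ProposerInputBatch(InputBatch):
    def __init__(self, model_config):
        # Still bypasses InputBatch.__init__, as described in the report,
        # but reuses the shared helper so the attribute is always present.
        self.model_config = model_config
        self._init_rotary_dim()


config = SimpleNamespace(head_dim=128, rotary_percent=0.5)
main_batch = InputBatch(config)
proposer_batch = ProposerInputBatch(config)
```

Keeping the calculation in one place means a later change to how `rotary_dim` is derived cannot drift between the main and proposer batches.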

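xpu_8cards_case_test / run_xpu_8cards_cases: PR issue (confidence: high)

Status: ❌ Failed
Error type: test case failure
Confidence: high
Root cause summary: ProposerInputBatch is missing rotary_dim
Analyzer: generic analysis (fallback)

Root cause details:
The 8-card XPU log shows that the worker enters _init_speculative_proposer() while loading the model, then creates MTPProposerXPU, and fails when accessing self.rotary_dim in ProposerInputBatch.init_share_inputs(). This PR added the field for neox rope but did not also initialize it in the proposer-specific batch class.

Additional notes:
The log summary also shows a GitHub Artifact storage quota notice, which may affect log/artifact uploads. However, both the cached in-depth analysis and the code check point to the missing rotary_dim in the proposer batch as the root cause of the functional failure.

Suggested fix:

- In ProposerInputBatch.__init__, set rotary_percent = getattr(self.model_config, "rotary_percent", 1) and self.rotary_dim = int(rotary_percent * self.model_config.head_dim).
- Cover the XPU MTP + PD disaggregation scenario so that the fields of the proposer input batch and the main model input batch do not diverge.

Fix summary: Complete ProposerInputBatch initialization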
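
For context on why `rotary_dim` can differ from `head_dim`: with a `rotary_percent` below 1, only the leading part of each head is rotated and the rest passes through unchanged. The sketch below follows the commonly used NeoX-style convention (index `i` is paired with `i + rotary_dim // 2`). It is an illustration written for this discussion, not the FastDeploy kernel; the function name and the `base` default are assumptions.

```python
import math


def apply_partial_neox_rope(x, position, rotary_dim, base=10000.0):
    """Rotate the first rotary_dim entries of one head vector; pass the rest through."""
    half = rotary_dim // 2
    out = list(x)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]  # NeoX style pairs index i with i + half
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out


head_dim = 8
rotary_dim = int(0.5 * head_dim)  # rotary_percent = 0.5, so 4 of 8 dims rotate
x = [float(v) for v in range(1, head_dim + 1)]

rotated = apply_partial_neox_rope(x, position=3, rotary_dim=rotary_dim)
at_origin = apply_partial_neox_rope(x, position=0, rotary_dim=rotary_dim)
```

Passing `head_dim` where `rotary_dim` is expected would rotate every dimension, which is why the PR replaces the hard-coded value in the `get_rope` calls.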

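Extracted partial CE model tasks to run in CI. / run_ce_cases: PR issue (confidence: high)

Status: ❌ Failed
Error type: test case failure
Confidence: high
Root cause summary: ProposerInputBatch is missing rotary_dim
Analyzer: generic analysis (fallback)

Root cause details:
The CE log shows that worker process initialization fails after weight loading completes. The error chain matches the 4-card GPU e2e failure: gpu_model_runner.py -> _init_speculative_proposer() -> MTPProposerCUDA -> ProposerInputBatch.init_share_inputs(). This is therefore a batch of failures caused by the same PR code path, not a separate CE environment problem.

Suggested fix:

- Fix the missing rotary_dim initialization in ProposerInputBatch, then rerun run_ce_cases.
- If CE failures remain afterwards, continue the analysis from the new logs; the current blocker is the same AttributeError.

Fix summary: Complete ProposerInputBatch initialization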

4 Suggested Next Steps

- In ProposerInputBatch.__init__ in fastdeploy/worker/input_batch.py, add the rotary_dim initialization, ideally with the same calculation as InputBatch.__init__: rotary_percent = getattr(self.model_config, "rotary_percent", 1), self.rotary_dim = int(rotary_percent * self.model_config.head_dim).
- After the fix, rerun the required MTP/speculative tasks and re-trigger the canceled main test task Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage.
- The optional Check PR Template failure indicates that the PR description still contains only the template; fill in Motivation, Modifications, and the test notes.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-29 09:58:01

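📋 Review Summary

PR overview: Fixes a bug in the MoE cutlass backend where learnable scaling did not take effect, and adds neox rotary style support to input_batch
Scope of changes: MoE layers, Worker input_batch
Impact tags: [OP] [Engine]

Issues

Severity	File	Summary
🔴 Bug	`input_batch.py:817`	`ProposerInputBatch` does not set `self.rotary_dim`, so using it raises AttributeError

📝 PR Convention Check

The title lacks an official tag, and every section of the description is empty (only the template placeholders remain).

Suggested title (can be copied directly):

[BugFix] Fix score calculation and support neox rope for fleet-gqa-latent

Suggested PR description (can be copied directly)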

## Motivation

Fix the issue in the MoE cutlass backend where the per-expert scale of `routed_scaling_factor_learnable` did not actually take effect. The original code scaled `topk_weights` before `moe_expert_dispatch`, but the dispatch kernel recomputes topk_weights, so the scale was overwritten. Also add neox rotary style (`rotary_percent`) support to input_batch.

## Modifications

- `fused_moe_cutlass_backend.py`: move the application of the learnable per-expert scale from before `moe_expert_dispatch` to after it, so that the scale applies to the topk_weights that are actually used
- `input_batch.py`: add a `rotary_dim` attribute (based on `rotary_percent * head_dim`) and use it in place of the hard-coded `head_dim` in all `get_rope` calls

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

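Overall Assessment

The MoE score fix is logically correct and resolves a real bug where learnable scaling did not take effect. However, the ProposerInputBatch subclass lacks rotary_dim initialization, which causes a runtime crash in speculative decoding scenarios and needs to be fixed.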
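
The ordering problem behind the MoE fix can be shown with a toy model. In the sketch below, `fake_dispatch` is a hypothetical stand-in for `moe_expert_dispatch` that, as the review describes, returns freshly computed top-k weights; its signature and the numeric values are assumptions for illustration, not the real kernel interface.

```python
def fake_dispatch(gate_scores, top_k):
    """Hypothetical stand-in for the dispatch kernel: selects the top-k experts
    and returns freshly computed weights, ignoring anything scaled earlier."""
    ranked = sorted(range(len(gate_scores)), key=lambda e: gate_scores[e], reverse=True)
    topk_ids = ranked[:top_k]
    topk_weights = [gate_scores[e] for e in topk_ids]
    return topk_ids, topk_weights


gate_scores = [0.125, 0.5, 0.25]
expert_scale = [2.0, 0.5, 4.0]  # learnable per-expert scale, one value per expert

# Before the fix: weights are scaled first, then dispatch recomputes them,
# so the scaled values are discarded.
early_ids, early_weights = fake_dispatch(gate_scores, top_k=2)
discarded = [w * expert_scale[e] for e, w in zip(early_ids, early_weights)]
topk_ids, topk_weights = fake_dispatch(gate_scores, top_k=2)
weights_before_fix = topk_weights

# After the fix: the scale is applied to the weights that dispatch returned.
weights_after_fix = [w * expert_scale[e] for e, w in zip(topk_ids, topk_weights)]
```

In this toy setup `weights_before_fix` equals the unscaled gate scores of the selected experts, while `weights_after_fix` carries the per-expert scale, which is the behavior the PR aims for.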

PaddlePaddle-bot · 2026-05-29T01:59:52Z

@@ -813,7 +817,7 @@ def init_share_inputs(self):
        tmp_position_ids = paddle.arange(self.model_config.max_model_len).reshape((1, -1))


🔴 Bug: ProposerInputBatch inherits from InputBatch, but its __init__ neither calls super().__init__() nor sets self.rotary_dim itself. When a speculative decoding scenario calls ProposerInputBatch.init_share_inputs(), the access to self.rotary_dim here raises AttributeError.

Suggested fix: compute rotary_dim in ProposerInputBatch.__init__, consistent with the parent class:

rotary_percent = getattr(self.model_config, "rotary_percent", 1)
self.rotary_dim = int(rotary_percent * self.model_config.head_dim)

chang-wenbin and others added 3 commits May 24, 2026 17:01

fix_moe_learable-score111

d3fec5d

support neox rope for fleet-gqa-latent

8095f3f

update code

e546850

chang-wenbin had a problem deploying to Metax_ci May 28, 2026 13:22 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes May 29, 2026


chang-wenbin wants to merge 3 commits into PaddlePaddle:develop from chang-wenbin:fleet-gqa-latent


3 participants
