[Feature] Support new blackwell decode attention by freeliuzc · Pull Request #7949 · PaddlePaddle/FastDeploy

freeliuzc · 2026-05-28T03:47:12Z

Motivation

Support the new SM100 decode attention.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-28T03:47:18Z

Thanks for your contribution!

Copilot

Pull request overview

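This PR adds support on the CUDA platform for a Blackwell (SM100) decode attention path. It also adjusts the shape/dtype of the KV cache scale and the memory/byte calculations on the cache transfer side to match the new implementation.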

Changes:

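Adds the attention backend enum value FLASH_MASK_ATTN_BLACKWELL and routes it on the CUDA platform to the existing FlashMaskAttentionBackend (the Blackwell branch is triggered by an env switch).
Adds a seq_lens_kv input buffer for the Blackwell path and passes it through forward_meta, the runner, and the MTP path.
For the block_wise_fp8 KV cache scale: lets the backend define its own scale shape/dtype (Blackwell uses [num_blocks, kv_heads, 4] with float32), and updates the memory calculation and allocation in the cache manager / messager accordingly.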
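The effect of the scale shape/dtype change on buffer size can be sketched as follows. This is a minimal illustration, not the PR's code: `kv_cache_scale_spec` is a hypothetical helper, the non-Blackwell shape `[num_blocks, kv_heads, block_size]` is taken from the CI report's fix suggestion, and the non-Blackwell element size (`default_itemsize`) is an assumption because this page does not state the original dtype.

```python
from math import prod


def kv_cache_scale_spec(num_blocks, kv_heads, block_size, use_blackwell, default_itemsize=2):
    """Return (shape, itemsize_bytes, total_bytes) for a block_wise_fp8 scale buffer.

    Hypothetical helper for illustration only.
    """
    if use_blackwell:
        # Blackwell path described in the review: last dim fixed at 4, float32 (4 bytes).
        shape = [num_blocks, kv_heads, 4]
        itemsize = 4
    else:
        # Non-Blackwell path: last dim equals block_size.
        # The element size is an assumption; the page does not state the dtype.
        shape = [num_blocks, kv_heads, block_size]
        itemsize = default_itemsize
    return shape, itemsize, prod(shape) * itemsize


blackwell = kv_cache_scale_spec(1024, 8, 64, use_blackwell=True)
legacy = kv_cache_scale_spec(1024, 8, 64, use_blackwell=False)
print(blackwell)
print(legacy)
```

Any component that allocates or transfers the scale buffer has to agree on both the last dimension and the element size, which is why the cache manager and messager change together with the backend.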

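PR title and description check (per the repository template)

The title follows the [Feature] ... format.
The "Modifications / Usage or Command / Accuracy Tests" sections of the PR description are still empty. Because this PR changes the attention kernel path and the cache structure, please add usage instructions (how to enable this backend) and the necessary accuracy/alignment test results.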

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
fastdeploy/worker/input_batch.py	Adds the shared input buffer `seq_lens_kv`.
fastdeploy/worker/gpu_model_runner.py	Passes `seq_lens_kv` through ForwardMeta; KV cache scale supports a backend-defined shape/dtype.
fastdeploy/spec_decode/mtp.py	KV cache scale on the MTP path supports a backend-defined shape/dtype; passes `seq_lens_kv` through ForwardMeta.
fastdeploy/platforms/cuda.py	Adds the backend route `FLASH_MASK_ATTN_BLACKWELL` on the CUDA platform.
fastdeploy/platforms/base.py	Adds the backend enum value `FLASH_MASK_ATTN_BLACKWELL`.
fastdeploy/model_executor/layers/attention/flash_mask_attn_backend.py	Implements the Blackwell attention branch, customizes the scale shape/dtype, and introduces `send_cache` and `blackwell_ops`.
fastdeploy/model_executor/forward_meta.py	Adds the `seq_lens_kv` field to ForwardMeta.
fastdeploy/cache_manager/cache_transfer_manager.py	Adapts the cache scale shape/dtype/byte calculation for Blackwell (last_dim=4, float32).
fastdeploy/cache_manager/cache_messager.py	Adapts the cache scale byte calculation for float32; passes block_size through in main.
custom_ops/setup_ops.py	Adds `send_cache.cu` to the CUDA custom op build list.
custom_ops/gpu_ops/send_cache.cu	Adds the static op `send_cache`, which triggers the layerwise cache completion signal under specific conditions.

        self.mask_rollback = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")
        self.preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
        self.last_preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
+        self.seq_lens_kv = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")



            [self.scheduler_config.max_num_seqs, 1], self.num_model_steps - 1, dtype="int32"
        )
+
+        self.seq_lens_kv = paddle.full(shape=[self.scheduler_config.max_num_seqs, 1], fill_value=0, dtype="int32")
+


            kv_num_blocks_x_cpu=self.share_inputs["kv_num_blocks_x_cpu"],
            attn_mask_offsets=self.share_inputs["attn_mask_offsets"] if self.enable_mm else None,
            routing_replay_table=routing_replay_table,
+            seq_lens_kv=self.share_inputs["seq_lens_kv"],


+_use_blackwell_attn = envs.FD_ATTENTION_BACKEND == "FLASH_MASK_ATTN_BLACKWELL"
+
+if _use_blackwell_attn:
+    try:
+        import blackwell_ops
+    except:
+        assert False, "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."
+else:
+    blackwell_ops = None


+        elif selected_backend == _Backend.FLASH_MASK_ATTN_BLACKWELL:
+            logger.info("Using FLASH MASK ATTN BLACKWELL backend.")
+            return "fastdeploy.model_executor.layers.attention.FlashMaskAttentionBackend"
        else:
            raise ValueError(


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-28 11:57:14

📋 Review summary

PR overview: adds SM100 (Blackwell) decode attention support, adds the FLASH_MASK_ATTN_BLACKWELL backend, and adapts the FP8 KV cache scale shape/dtype
Scope of changes: custom_ops/gpu_ops/, fastdeploy/model_executor/layers/attention/, fastdeploy/cache_manager/, fastdeploy/worker/, fastdeploy/spec_decode/
Affected tags: [OP] [KVCache] [Feature] [Speculative Decoding]

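Issues

Level	File	Summary
🟡 Suggestion	`flash_mask_attn_backend.py:68`	Bare `except:` plus `assert False` used for a runtime error; the assertion is stripped under Python `-O`
🟡 Suggestion	`custom_ops/gpu_ops/send_cache.cu`	The new custom op has no unit test under `tests/operators/`
❓ Question	`fastdeploy/worker/input_batch.py:388`	`seq_lens_kv` is initialized to all zeros and no update logic appears in the diff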

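📝 PR convention check

The PR targets release/online/20260415 (not develop), but the title lacks the [Cherry-Pick] prefix and the original PR ID. The Modifications, Usage or Command, and Accuracy Tests sections are empty, and no Checklist item is ticked.

Suggested title (can be copied directly):

[Cherry-Pick][Feature] Support new blackwell decode attention(#original PR ID)

Suggested PR description (can be copied directly)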

## Motivation
Support the new decode attention on the SM100 (Blackwell) architecture. Adds the `FLASH_MASK_ATTN_BLACKWELL` backend and adapts the scale shape (last dimension fixed at 4) and dtype (float32) of the `block_wise_fp8` KV cache on Blackwell.

## Modifications
- `custom_ops/gpu_ops/send_cache.cu`: adds the `send_cache` custom op, which uses `cudaLaunchHostFunc` to signal completion of layerwise KV cache writes
- `custom_ops/setup_ops.py`: adds `send_cache.cu` to the list of compiled sources
- `fastdeploy/platforms/base.py`: adds the `FLASH_MASK_ATTN_BLACKWELL` backend enum value
- `fastdeploy/platforms/cuda.py`: registers the `FLASH_MASK_ATTN_BLACKWELL` backend and maps it to `FlashMaskAttentionBackend`
- `fastdeploy/model_executor/layers/attention/flash_mask_attn_backend.py`: adds the `forward_mixed_blackwell`, `init_blackwell_attention_metadata`, and `get_kv_cache_scale_shape` methods, supporting a Blackwell attention forward pass with separate encoder/decoder paths
- `fastdeploy/model_executor/forward_meta.py`: adds the `seq_lens_kv` field
- `fastdeploy/cache_manager/cache_messager.py` / `cache_transfer_manager.py`: adds conditional branches for the scale shape/dtype on Blackwell
- `fastdeploy/spec_decode/mtp.py` / `fastdeploy/worker/gpu_model_runner.py`: both now call `get_kv_cache_scale_shape` to obtain the correct scale shape/dtype
- `fastdeploy/worker/input_batch.py`: adds the shared input `seq_lens_kv`

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

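Overall assessment

The implementation approach is clear, the separate encoder/decoder forward logic for Blackwell attention is structurally complete, and the scale shape/dtype adaptation across components is fairly thorough. Before merging, please fix the except clause and the way the runtime error is raised, confirm the update path for seq_lens_kv, and add a unit test for the custom op.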

PaddlePaddle-bot · 2026-05-28T04:01:17Z

+    try:
+        import blackwell_ops
+    except:
+        assert False, "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."


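🟡 Suggestion: bare except: plus assert False used as a runtime error check

The current code has two problems:

A bare except: also swallows non-import exceptions such as SystemExit and KeyboardInterrupt. Use except ImportError: instead.

assert False is stripped in Python's optimized mode (python -O), so a failed import passes silently and blackwell_ops is left undefined, which surfaces later as a hard-to-diagnose error (a NameError at the first use rather than a clear import failure). Use an explicit raise ImportError(...) instead.

Suggested fix:

    if _use_blackwell_attn:
        try:
            import blackwell_ops
        except ImportError as e:
            raise ImportError(
                "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."
            ) from e
    else:
        blackwell_ops = None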

PaddlePaddle-bot · 2026-05-28T04:01:17Z

        self.mask_rollback = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")
        self.preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
        self.last_preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
+        self.seq_lens_kv = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")


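❓ Question: seq_lens_kv is always initialized to 0 and no update logic appears in the diff

seq_lens_kv is passed to flash_decoder_attn_fwd (in forward_mixed_blackwell) as the KV sequence length argument. If this field is never updated in the inference loop, the decode stage will use all-zero KV lengths and produce incorrect attention results.

Please confirm:

Is seq_lens_kv updated somewhere such as make_inputs / update_share_inputs (outside this diff)?

If so, please say so in the PR description; if not, the update logic needs to be added.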
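Why a stale all-zero length matters can be illustrated with a toy single-query attention in plain Python. This is a sketch for intuition only: `decode_attention` is a hypothetical function, not `flash_decoder_attn_fwd`, and returning zeros for an empty KV range is a choice made for this toy; the behavior of the real kernel with a zero length is not described on this page.

```python
import math


def decode_attention(q, keys, values, kv_len):
    """Toy single-query attention over the first kv_len cached key/value rows."""
    if kv_len == 0:
        # Nothing to attend to. Returning zeros is this sketch's own choice.
        return [0.0] * len(values[0])
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[:kv_len]]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values[:kv_len])) / total
        for d in range(dim)
    ]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

print(decode_attention(q, keys, values, kv_len=2))  # uses both cached rows
print(decode_attention(q, keys, values, kv_len=0))  # ignores the cache entirely
```

With the correct length the output mixes the cached values; with a length of 0 the cache contents are never read, so the result no longer depends on the context at all.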

PaddlePaddle-bot · 2026-05-28T04:25:28Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-28 13:54:41

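This CI report is generated from the following code (refreshed every 30 minutes):

PR commit: b221eff
Merge base: 0d7fccd (branch: release/online/20260415)
View full diff
CI details

1 Job overview

Not all Required CI jobs have passed: 1 Required job failed and 0 Required jobs are pending. The failure is a unit test regression and should be fixed before merging.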

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
19(0)	19	18	1	0	0	0

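2 Job status summary

Log column note: failed jobs use the pre-generated log_links_markdown field directly; for running jobs, [Job]({html_url}) is assembled manually

2.1 Required jobs: 6/7 passed

Required jobs block merging; failures must be handled first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h16m	PR issue: new fields are not compatible with existing tests / Optional semantics	Initialize the scale shape and switch to an Optional annotation	Job	-
✅	The other 6 required jobs passed	-	-	-	-	-

2.2 Optional jobs: 12/12 passed

Optional jobs do not block merging; failures are for reference only.

Status	Job	Duration	Log	Rerun
✅	The other 12 optional jobs passed	-	-	-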

3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: test failure
Confidence: high
Root cause summary: new fields are not compatible with existing tests / Optional semantics
Analyzer: ci_analyze_unittest_fastdeploy

Failed test cases:

Test	Error	Root cause
`tests/cache_manager/test_cache_transfer_manager.py::TestCacheTransferManager::test_init_storage_buffer_registers_scale_buffers`	AttributeError: no attribute `cache_scale_shape`	The PR changes the scale buffer calculation in `_init_storage_buffer()` to depend on `self.cache_scale_shape[-1]`, but this test only sets `has_cache_scale=True` by hand and never initializes the new field.
`tests/graph_optimization/test_graph_opt_backend.py::TestGraphOptBackend::test_static_graph`	TypeError: `forward_meta.seq_lens_kv` got `NoneType`	The PR adds `seq_lens_kv: paddle.Tensor = None` to `ForwardMeta`. The graph optimization dynamic-dimension resolver treats a non-Optional Tensor as a required Tensor, and the field is None in the `ForwardMeta` built by the test.

Root cause details:
This PR modifies fastdeploy/cache_manager/cache_transfer_manager.py and changes the stride of the block-wise fp8 scale buffer from self.block_size to self.cache_scale_shape[-1] to account for the different scale shape of the Blackwell backend. The existing unit test constructs the manager directly and sets has_cache_scale=True without also setting cache_scale_shape, so _init_storage_buffer() raises AttributeError.
The PR also adds seq_lens_kv: paddle.Tensor = None to fastdeploy/model_executor/forward_meta.py. The static graph optimization code in dynamic_dims_marker.py implicitly marks dynamic dimensions for dataclass fields annotated as paddle.Tensor and requires the runtime value to be a Tensor. The test's ForwardMeta(ids_remove_padding=..., step_use_cudagraph=True) does not pass seq_lens_kv, so the resolver finds None and raises TypeError.
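One way to keep directly constructed managers working, in the spirit of the compatibility default the report proposes, is to fall back to `block_size` when `cache_scale_shape` is missing. The sketch below is illustrative only: `ManagerStub` and `scale_last_dim` are hypothetical names, not code from `CacheTransferManager`.

```python
class ManagerStub:
    """Hypothetical stand-in for a manager that a test builds by hand."""

    def __init__(self, block_size, cache_scale_shape=None):
        self.block_size = block_size
        if cache_scale_shape is not None:
            # Only set when the caller provides it, mimicking the old test fixture
            # that never assigns this attribute.
            self.cache_scale_shape = cache_scale_shape


def scale_last_dim(manager):
    """Return the last dim of the scale buffer, defaulting to block_size."""
    shape = getattr(manager, "cache_scale_shape", None)
    return shape[-1] if shape else manager.block_size


print(scale_last_dim(ManagerStub(block_size=64)))                # falls back to block_size
print(scale_last_dim(ManagerStub(64, cache_scale_shape=[1024, 8, 4])))  # Blackwell-style shape
```

The alternative is to leave the production code strict and set `cache_scale_shape` in the test fixture; which is preferable depends on whether a missing shape should ever be tolerated at runtime.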

Key log lines:

E AttributeError: 'CacheTransferManager' object has no attribute 'cache_scale_shape'
fastdeploy/cache_manager/cache_transfer_manager.py:426
tests/cache_manager/test_cache_transfer_manager.py:668

E TypeError: data forward_meta.seq_lens_kv has type annotation Tensor but got type <class 'NoneType'>
fastdeploy/model_executor/graph_optimization/dynamic_dims_marker.py:185
tests/graph_optimization/test_graph_opt_backend.py:202

Suggested fixes:

tests/cache_manager/test_cache_transfer_manager.py L658-L665: set self.manager.cache_scale_shape in the test fixture that sets has_cache_scale=True, or give CacheTransferManager._init_storage_buffer() a compatible default when cache_scale_shape is missing (for example, follow the non-Blackwell path and use [num_gpu_blocks, head_num, block_size] / block_size).
fastdeploy/model_executor/forward_meta.py L167: if seq_lens_kv may be empty, change the annotation to Optional[paddle.Tensor] = None; if the Blackwell path requires it, update the ForwardMeta fixture at tests/graph_optimization/test_graph_opt_backend.py L114 to pass a valid seq_lens_kv Tensor.
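The difference between the two annotations can be shown with a small checker that, like the behavior described for dynamic_dims_marker.py, rejects None for a field annotated as a plain tensor. This is a simplified stand-in under stated assumptions: `Tensor` replaces `paddle.Tensor`, and `validate`, `allows_none`, `MetaStrict`, and `MetaOptional` are hypothetical names, not FastDeploy code.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union, get_args, get_origin


class Tensor:
    """Stand-in for paddle.Tensor so the sketch runs without Paddle installed."""


def allows_none(annotation):
    """True if the annotation is Optional[...] (a Union that includes NoneType)."""
    return get_origin(annotation) is Union and type(None) in get_args(annotation)


def validate(obj):
    """Raise TypeError when a field holds None but its annotation does not allow it."""
    for f in fields(obj):
        if getattr(obj, f.name) is None and not allows_none(f.type):
            name = getattr(f.type, "__name__", f.type)
            raise TypeError(f"{f.name} has type annotation {name} but got NoneType")


@dataclass
class MetaStrict:
    seq_lens_kv: Tensor = None  # annotation claims a tensor is always present


@dataclass
class MetaOptional:
    seq_lens_kv: Optional[Tensor] = None  # annotation admits the None default


validate(MetaOptional())  # passes
try:
    validate(MetaStrict())
except TypeError as exc:
    print(exc)
```

If the Blackwell path truly requires the field, keeping the strict annotation and updating the test fixture to pass a tensor is the other consistent option.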

Fix summary: initialize the scale shape and switch to an Optional annotation

Related changes: fastdeploy/cache_manager/cache_transfer_manager.py L175-L186, L424-L427; fastdeploy/model_executor/forward_meta.py L167
Link: view log

codecov-commenter · 2026-05-28T05:12:54Z

Codecov Report

❌ Patch coverage is 29.41176% with 60 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/online/20260415@0d7fccd). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...ecutor/layers/attention/flash_mask_attn_backend.py	16.27%	34 Missing and 2 partials ⚠️
fastdeploy/cache_manager/cache_messager.py	14.28%	6 Missing ⚠️
fastdeploy/cache_manager/cache_transfer_manager.py	72.22%	4 Missing and 1 partial ⚠️
fastdeploy/spec_decode/mtp.py	0.00%	5 Missing ⚠️
fastdeploy/worker/gpu_model_runner.py	0.00%	5 Missing ⚠️
fastdeploy/platforms/cuda.py	0.00%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@                    Coverage Diff                     @@
##             release/online/20260415    #7949   +/-   ##
==========================================================
  Coverage                           ?   72.83%           
==========================================================
  Files                              ?      387           
  Lines                              ?    54099           
  Branches                           ?     8480           
==========================================================
  Hits                               ?    39401           
  Misses                             ?    11984           
  Partials                           ?     2714

Flag	Coverage Δ
GPU	`72.83% <29.41%> (?)`
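The percentages in this report can be checked with a few lines of arithmetic, taking coverage as hits divided by hits plus misses plus partials. The project totals come straight from the report. The patch hit count of 25 is an inference, not a reported figure: it is the value that makes 60 uncovered lines (56 missing plus 4 partials, summed from the per-file table) produce 29.41%.

```python
def coverage_pct(hits, misses, partials):
    """Coverage as a percentage of all tracked lines."""
    return 100.0 * hits / (hits + misses + partials)


# Project totals as listed in the coverage diff.
project = coverage_pct(hits=39401, misses=11984, partials=2714)

# Patch: 56 missing + 4 partials from the per-file table; hits=25 is inferred.
patch = coverage_pct(hits=25, misses=56, partials=4)

print(round(project, 2), round(patch, 5))
```

The per-file rows account for 81 of the inferred 85 patch lines, which suggests the remaining lines sit in files with full patch coverage that the table omits.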

support new blackwell decode attention

b221eff

Copilot AI review requested due to automatic review settings May 28, 2026 03:47

Copilot started reviewing on behalf of freeliuzc May 28, 2026 03:47 View session

Copilot AI reviewed May 28, 2026

PaddlePaddle-bot reviewed May 28, 2026

[Feature] Support new blackwell decode attention #7949
freeliuzc wants to merge 1 commit into PaddlePaddle:release/online/20260415 from freeliuzc:merge_blackwell_ops

4 participants
