[Feature] Support new blackwell decode attention #7949
Pull request overview
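This PR adds support for the Blackwell (SM100) decode attention path on the CUDA platform. To fit the new implementation, it also adjusts the shape and dtype of the KV cache scale and the memory/byte calculations on the cache transfer side.
Changes:
- Adds the attention backend enum value `FLASH_MASK_ATTN_BLACKWELL` and routes it on the CUDA platform to the existing `FlashMaskAttentionBackend` (the Blackwell branch is triggered by an env switch).
- Adds a `seq_lens_kv` input buffer for the Blackwell path and passes it through the forward_meta / runner / MTP paths.
- For the `block_wise_fp8` KV cache scale: lets the backend customize the scale shape/dtype (`[num_blocks, kv_heads, 4]` + `float32` under Blackwell), and updates the memory calculation and allocation in the cache manager / messager accordingly.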
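The byte calculation behind the last change can be illustrated with a minimal sketch. It assumes only what the overview states (a scale tensor of shape `[num_blocks, kv_heads, 4]` in `float32`); the helper name `scale_cache_bytes`, the `DTYPE_BYTES` table, and the sizes are hypothetical and are not taken from the PR.

```python
from math import prod

# Bytes per element for the dtypes used in this illustration.
DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2}


def scale_cache_bytes(shape, dtype):
    """Return the bytes needed for one scale tensor of the given shape and dtype."""
    return prod(shape) * DTYPE_BYTES[dtype]


# Hypothetical sizes, chosen only for illustration.
num_blocks, kv_heads = 1024, 8

# Blackwell path as described in the overview: last dim fixed to 4, float32.
blackwell_bytes = scale_cache_bytes([num_blocks, kv_heads, 4], "float32")

# The same shape counted with a 2-byte dtype gives half the size, which is
# why the transfer side has to agree with the allocator on the dtype.
two_byte_bytes = scale_cache_bytes([num_blocks, kv_heads, 4], "bfloat16")

print(blackwell_bytes, two_byte_bytes)
```

If the allocating side and the transfer side disagree on either the last dimension or the element size, the transferred byte count no longer matches the allocated buffer, which is presumably why the PR touches both.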
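PR title and description check (per the repository template)
- The title follows the `[Feature] ...` format.
- The "Modifications / Usage or Command / Accuracy Tests" sections of the PR description are still empty. Because this PR changes the attention kernel path and the cache structure, please add usage instructions (how to enable this backend) and the necessary accuracy/alignment test results.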
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| fastdeploy/worker/input_batch.py | Adds the seq_lens_kv shared input buffer. |
| fastdeploy/worker/gpu_model_runner.py | Passes seq_lens_kv through ForwardMeta; the KV cache scale supports a backend-defined shape/dtype. |
| fastdeploy/spec_decode/mtp.py | The KV cache scale in the MTP path supports a backend-defined shape/dtype; passes seq_lens_kv through ForwardMeta. |
| fastdeploy/platforms/cuda.py | Adds backend routing for FLASH_MASK_ATTN_BLACKWELL on the CUDA platform. |
| fastdeploy/platforms/base.py | Adds the backend enum value FLASH_MASK_ATTN_BLACKWELL. |
| fastdeploy/model_executor/layers/attention/flash_mask_attn_backend.py | Implements the Blackwell attention branch, customizes the scale shape/dtype, and introduces send_cache and blackwell_ops. |
| fastdeploy/model_executor/forward_meta.py | Adds the seq_lens_kv field to ForwardMeta. |
| fastdeploy/cache_manager/cache_transfer_manager.py | Adapts the cache scale shape/dtype/byte calculation to Blackwell (last_dim=4, float32). |
| fastdeploy/cache_manager/cache_messager.py | Adapts the cache scale byte calculation to float32; passes block_size through in main. |
| custom_ops/setup_ops.py | Adds send_cache.cu to the CUDA custom op build list. |
| custom_ops/gpu_ops/send_cache.cu | Adds the send_cache static op, which triggers the layerwise cache completion signal under specific conditions. |
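The "backend-defined shape/dtype" rows in the table describe a hook pattern, which might look roughly like the sketch below. The method name `get_kv_cache_scale_shape` appears in this PR's review; its signature, the two stand-in classes, the `resolve_scale_layout` helper, and the default values passed in the usage lines are all assumptions made for illustration.

```python
class DefaultBackend:
    """Stand-in backend that does not customize the scale layout."""


class BlackwellLikeBackend:
    """Stand-in backend that overrides the scale layout (hypothetical signature)."""

    def get_kv_cache_scale_shape(self, num_blocks, kv_heads):
        # Layout stated in the PR overview for the Blackwell path.
        return [num_blocks, kv_heads, 4], "float32"


def resolve_scale_layout(backend, num_blocks, kv_heads, default_last_dim, default_dtype):
    """Ask the backend for a scale layout, falling back to the caller's default."""
    hook = getattr(backend, "get_kv_cache_scale_shape", None)
    if hook is not None:
        return hook(num_blocks, kv_heads)
    return [num_blocks, kv_heads, default_last_dim], default_dtype


# The default last dim and dtype below are placeholders, not FastDeploy's values.
print(resolve_scale_layout(BlackwellLikeBackend(), 16, 2, 64, "bfloat16"))
print(resolve_scale_layout(DefaultBackend(), 16, 2, 64, "bfloat16"))
```

Routing every allocation site (runner, MTP, cache manager) through one such hook keeps the layout decision in a single place, so the components cannot drift apart.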
self.mask_rollback = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")
self.preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
self.last_preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
self.seq_lens_kv = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")

    [self.scheduler_config.max_num_seqs, 1], self.num_model_steps - 1, dtype="int32"
)

self.seq_lens_kv = paddle.full(shape=[self.scheduler_config.max_num_seqs, 1], fill_value=0, dtype="int32")

kv_num_blocks_x_cpu=self.share_inputs["kv_num_blocks_x_cpu"],
attn_mask_offsets=self.share_inputs["attn_mask_offsets"] if self.enable_mm else None,
routing_replay_table=routing_replay_table,
seq_lens_kv=self.share_inputs["seq_lens_kv"],

_use_blackwell_attn = envs.FD_ATTENTION_BACKEND == "FLASH_MASK_ATTN_BLACKWELL"

if _use_blackwell_attn:
    try:
        import blackwell_ops
    except:
        assert False, "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."
else:
    blackwell_ops = None

elif selected_backend == _Backend.FLASH_MASK_ATTN_BLACKWELL:
    logger.info("Using FLASH MASK ATTN BLACKWELL backend.")
    return "fastdeploy.model_executor.layers.attention.FlashMaskAttentionBackend"
else:
    raise ValueError(
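The routing shown in the snippets above (a new enum value mapped to the existing backend class, with the Blackwell branch selected by an environment variable) can be sketched as follows. This is a simplified stand-in, not the FastDeploy code: the `Backend` enum, its `FLASH_MASK_ATTN` and `UNSUPPORTED` members, and both function names are hypothetical; only the `FD_ATTENTION_BACKEND` variable, the `FLASH_MASK_ATTN_BLACKWELL` value, and the class path string come from the diff.

```python
import enum
import os


class Backend(enum.Enum):
    """Hypothetical stand-in for the platform's backend enum."""

    FLASH_MASK_ATTN = enum.auto()
    FLASH_MASK_ATTN_BLACKWELL = enum.auto()
    UNSUPPORTED = enum.auto()  # placeholder used to exercise the error path


FLASH_MASK_CLS = "fastdeploy.model_executor.layers.attention.FlashMaskAttentionBackend"


def get_attention_backend_cls(selected):
    """Both enum values resolve to the same class; the branch is chosen later."""
    if selected in (Backend.FLASH_MASK_ATTN, Backend.FLASH_MASK_ATTN_BLACKWELL):
        return FLASH_MASK_CLS
    raise ValueError(f"Unsupported attention backend: {selected}")


def use_blackwell_attn(environ=os.environ):
    """Mirror of the env switch: compare FD_ATTENTION_BACKEND to the new name."""
    return environ.get("FD_ATTENTION_BACKEND") == "FLASH_MASK_ATTN_BLACKWELL"


print(get_attention_backend_cls(Backend.FLASH_MASK_ATTN_BLACKWELL))
print(use_blackwell_attn({"FD_ATTENTION_BACKEND": "FLASH_MASK_ATTN_BLACKWELL"}))
```

Because the class is shared, the enum value alone does not change behavior; the environment switch is what selects the Blackwell branch inside the backend.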
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-28 11:57:14
📋 Review Summary
PR overview: Adds SM100 (Blackwell) decode attention support, introduces the FLASH_MASK_ATTN_BLACKWELL backend, and adapts the FP8 KV cache scale shape/dtype.
Scope of changes: custom_ops/gpu_ops/, fastdeploy/model_executor/layers/attention/, fastdeploy/cache_manager/, fastdeploy/worker/, fastdeploy/spec_decode/
Impact tags: [OP] [KVCache] [Feature] [Speculative Decoding]
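Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | flash_mask_attn_backend.py:68 | A bare except: plus assert False is used for a runtime error; the assert is stripped under Python -O |
| 🟡 Suggestion | custom_ops/gpu_ops/send_cache.cu | The new custom op has no unit test under tests/operators/ |
| ❓ Question | fastdeploy/worker/input_batch.py:388 | seq_lens_kv is initialized to all zeros, and the diff shows no update logic |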
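📝 PR convention check
The PR targets the release/online/20260415 branch (not develop), but the title lacks the [Cherry-Pick] prefix and the original PR number. The Modifications, Usage or Command, and Accuracy Tests sections are empty, and no Checklist items are checked.
Suggested title (can be copied directly):
[Cherry-Pick][Feature] Support new blackwell decode attention(#<original PR number>)
Suggested PR description (click to expand, can be copied directly)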
## Motivation
Support decode attention on the new SM100 (Blackwell) architecture: add the `FLASH_MASK_ATTN_BLACKWELL` backend and adapt the scale shape (last dimension fixed to 4) and dtype (float32) of the `block_wise_fp8` KV cache on Blackwell.
## Modifications
- `custom_ops/gpu_ops/send_cache.cu`: add the `send_cache` custom op, which uses `cudaLaunchHostFunc` to signal completion of layerwise KV cache writes
- `custom_ops/setup_ops.py`: add `send_cache.cu` to the list of sources to compile
- `fastdeploy/platforms/base.py`: add the `FLASH_MASK_ATTN_BLACKWELL` backend enum value
- `fastdeploy/platforms/cuda.py`: register the `FLASH_MASK_ATTN_BLACKWELL` backend, mapped to `FlashMaskAttentionBackend`
- `fastdeploy/model_executor/layers/attention/flash_mask_attn_backend.py`: add the `forward_mixed_blackwell`, `init_blackwell_attention_metadata`, and `get_kv_cache_scale_shape` methods, supporting a Blackwell attention forward pass with separate encoder/decoder paths
- `fastdeploy/model_executor/forward_meta.py`: add the `seq_lens_kv` field
- `fastdeploy/cache_manager/cache_messager.py` / `cache_transfer_manager.py`: add conditional branches for the scale shape/dtype on Blackwell
- `fastdeploy/spec_decode/mtp.py` / `fastdeploy/worker/gpu_model_runner.py`: both use `get_kv_cache_scale_shape` to obtain the correct scale shape/dtype
- `fastdeploy/worker/input_batch.py`: add the `seq_lens_kv` shared input
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
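- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The overall approach is clear: the Blackwell attention forward logic with separate encoder/decoder paths is structurally complete, and the scale shape/dtype adaptation across components is fairly thorough. Before merging, please fix the except clause and the way the runtime error is raised, confirm the update path for seq_lens_kv, and add a unit test for the custom op.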
try:
    import blackwell_ops
except:
    assert False, "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."
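🟡 Suggestion: a bare except: plus assert False is used to check a runtime error
The current code has two problems:
- The bare `except:` also swallows non-import exceptions such as `SystemExit` and `KeyboardInterrupt`; it should be `except ImportError:`.
- `assert False` is skipped in Python's optimized mode (`python -O`), so the import failure passes silently and the first later use of `blackwell_ops` fails with a hard-to-diagnose error; raise explicitly with `raise ImportError(...)` instead.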
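The second problem can be reproduced without restarting the interpreter. The sketch below compiles the same try/except pattern at two optimization levels using the built-in `compile()`; the module name `module_that_does_not_exist_xyz` and the helper `runs_silently` are made up for the demonstration.

```python
# Same pattern as in the PR, with a module name that cannot be imported.
SOURCE = '''
try:
    import module_that_does_not_exist_xyz
except:
    assert False, "import failed"
'''


def runs_silently(optimize):
    """Return True if the snippet finishes without raising AssertionError."""
    # optimize=0 keeps assert statements; optimize=1 strips them, like python -O.
    code = compile(SOURCE, "<demo>", "exec", optimize=optimize)
    try:
        exec(code, {})
    except AssertionError:
        return False
    return True


print(runs_silently(0))  # assertion fires, so the failure is visible
print(runs_silently(1))  # assertion is stripped, so the failure is hidden
```

With assertions stripped, the except block becomes a no-op and the failed import leaves no trace at import time.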
Suggested fix:
if _use_blackwell_attn:
    try:
        import blackwell_ops
    except ImportError as e:
        raise ImportError(
            "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."
        ) from e
else:
    blackwell_ops = None

self.mask_rollback = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")
self.preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
self.last_preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
self.seq_lens_kv = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")
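❓ Question: seq_lens_kv is always initialized to 0, and the diff shows no update logic
seq_lens_kv is passed to flash_decoder_attn_fwd (in forward_mixed_blackwell) as the KV sequence length argument. If this field is never updated in the inference loop, the decode phase will use all-zero KV lengths and produce incorrect attention results.
Please confirm:
- Is seq_lens_kv updated in paths such as make_inputs / update_share_inputs (which do not appear in this diff)?
- If so, please note it in the PR description; if not, the update logic needs to be added.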
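For reference, the kind of per-step update the question asks about might look like the sketch below. It is purely illustrative: the PR does not show how `seq_lens_kv` is defined, so the rule used here (tokens already cached plus tokens added this step) and the names `refresh_seq_lens_kv`, `cached_lens`, and `new_tokens` are assumptions. Plain nested lists stand in for the `[max_num_seqs, 1]` int32 tensor.

```python
def refresh_seq_lens_kv(seq_lens_kv, cached_lens, new_tokens):
    """Update the buffer in place; assumed rule: cached tokens + tokens added this step."""
    for i, (cached, new) in enumerate(zip(cached_lens, new_tokens)):
        seq_lens_kv[i][0] = cached + new


max_num_seqs = 4
# Mirrors the shape [max_num_seqs, 1] with fill_value=0 from the diff.
seq_lens_kv = [[0] for _ in range(max_num_seqs)]

# Hypothetical step: slots 0 and 2 are decoding one token each, slots 1 and 3 are idle.
cached_lens = [12, 0, 7, 0]
new_tokens = [1, 0, 1, 0]
refresh_seq_lens_kv(seq_lens_kv, cached_lens, new_tokens)

print(seq_lens_kv)
```

Updating in place matters for a shared input buffer: consumers hold a reference to the same storage, so rebinding the name to a new object would leave them reading the stale zeros.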
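The CI report is generated from the following code (updated every 30 minutes):
1 Task overview
Required CI has not fully passed: 1 Required task failed and 0 Required tasks are pending. The failure is a unit test regression and should be fixed before merging.
2 Task status summary
Log column note: failed tasks directly use
2.1 Required tasks: 6/7 passed
2.2 Optional tasks: 12/12 passed
3 Failure details (required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)
Failed cases:
Root cause details:
Key logs:
Fix suggestions:
Fix suggestion summary: initialize the scale shape and change the annotation to Optional
Related changes: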
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/online/20260415 #7949 +/- ##
==========================================================
Coverage ? 72.83%
==========================================================
Files ? 387
Lines ? 54099
Branches ? 8480
==========================================================
Hits ? 39401
Misses ? 11984
Partials ? 2714
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
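- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.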