[Feature] Support new blackwell decode attention#7949

Open
freeliuzc wants to merge 1 commit into PaddlePaddle:release/online/20260415 from freeliuzc:merge_blackwell_ops

@freeliuzc
Collaborator

Motivation

  1. Support the new SM100 decode attention

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 28, 2026 03:47
@paddle-bot
paddle-bot Bot commented May 28, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment

Pull request overview

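This PR adds support on the CUDA platform for a Blackwell (SM100) decode attention path. It also adjusts the shape and dtype of the KV cache scale, along with the memory and byte-size calculations on the cache transfer side, to match what the new implementation requires.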

Changes:

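  • Adds the attention backend enum value FLASH_MASK_ATTN_BLACKWELL and routes it on the CUDA platform to the existing FlashMaskAttentionBackend (the Blackwell branch is triggered by an env switch).
  • Adds a seq_lens_kv input buffer for the Blackwell path and passes it through forward_meta, the runner, and the MTP path.
  • For the block_wise_fp8 KV cache scale: lets the backend define the scale shape and dtype (Blackwell uses [num_blocks, kv_heads, 4] with float32), and updates the memory calculation and allocation in the cache manager and messager to match.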
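
To make the size impact of the scale layout change concrete, here is a minimal sketch of the byte calculation. The Blackwell shape and dtype come from the summary above; the helper name, the example sizes, and the dtype used for the non-Blackwell layout are assumptions for illustration only, not code from this PR.

```python
from math import prod

# Bytes per element for the dtypes discussed here (standard sizes).
DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2}


def scale_buffer_bytes(shape, dtype):
    """Return the byte size of one dense scale tensor with this shape and dtype."""
    return prod(shape) * DTYPE_BYTES[dtype]


# Hypothetical sizes, chosen only for illustration.
num_blocks, kv_heads, block_size = 1024, 8, 64

# Blackwell path per the review summary: last dim fixed to 4, float32.
blackwell = scale_buffer_bytes([num_blocks, kv_heads, 4], "float32")

# Layout whose last dim is block_size; the bfloat16 dtype is an assumption.
default = scale_buffer_bytes([num_blocks, kv_heads, block_size], "bfloat16")

print(blackwell, default)
```

Any component that sizes or registers the scale buffer (cache manager, messager, transfer manager) has to agree on both the last dimension and the element size, which is why the PR touches all of them.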

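PR title and description check (per the repository template)

  • The title follows the [Feature] ... format.
  • The "Modifications / Usage or Command / Accuracy Tests" sections of the PR description are still empty. Because this PR changes the attention kernel path and the cache structure, consider adding usage instructions (how to enable this backend) and the necessary accuracy/alignment test results.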

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| fastdeploy/worker/input_batch.py | Adds the seq_lens_kv shared input buffer. |
| fastdeploy/worker/gpu_model_runner.py | Passes seq_lens_kv through ForwardMeta; lets the backend define the KV cache scale shape/dtype. |
| fastdeploy/spec_decode/mtp.py | Lets the backend define the KV cache scale shape/dtype on the MTP path; passes seq_lens_kv through ForwardMeta. |
| fastdeploy/platforms/cuda.py | Adds backend routing for FLASH_MASK_ATTN_BLACKWELL on the CUDA platform. |
| fastdeploy/platforms/base.py | Adds the backend enum value FLASH_MASK_ATTN_BLACKWELL. |
| fastdeploy/model_executor/layers/attention/flash_mask_attn_backend.py | Implements the Blackwell attention branch, customizes the scale shape/dtype, and introduces send_cache and blackwell_ops. |
| fastdeploy/model_executor/forward_meta.py | Adds the seq_lens_kv field to ForwardMeta. |
| fastdeploy/cache_manager/cache_transfer_manager.py | Adapts the cache scale shape/dtype/byte calculation to Blackwell (last_dim=4, float32). |
| fastdeploy/cache_manager/cache_messager.py | Adapts the cache scale byte calculation to float32; passes block_size through in main. |
| custom_ops/setup_ops.py | Adds send_cache.cu to the CUDA custom op build list. |
| custom_ops/gpu_ops/send_cache.cu | Adds the send_cache static op, used to trigger the layerwise cache completion signal under specific conditions. |

Comment on lines 385 to 389
self.mask_rollback = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")
self.preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
self.last_preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
self.seq_lens_kv = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")

Comment on lines 911 to +915
[self.scheduler_config.max_num_seqs, 1], self.num_model_steps - 1, dtype="int32"
)

self.seq_lens_kv = paddle.full(shape=[self.scheduler_config.max_num_seqs, 1], fill_value=0, dtype="int32")

kv_num_blocks_x_cpu=self.share_inputs["kv_num_blocks_x_cpu"],
attn_mask_offsets=self.share_inputs["attn_mask_offsets"] if self.enable_mm else None,
routing_replay_table=routing_replay_table,
seq_lens_kv=self.share_inputs["seq_lens_kv"],
Comment on lines +62 to +70
_use_blackwell_attn = envs.FD_ATTENTION_BACKEND == "FLASH_MASK_ATTN_BLACKWELL"

if _use_blackwell_attn:
    try:
        import blackwell_ops
    except:
        assert False, "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."
else:
    blackwell_ops = None
Comment on lines +79 to 83
elif selected_backend == _Backend.FLASH_MASK_ATTN_BLACKWELL:
    logger.info("Using FLASH MASK ATTN BLACKWELL backend.")
    return "fastdeploy.model_executor.layers.attention.FlashMaskAttentionBackend"
else:
    raise ValueError(

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-28 11:57:14

📋 Review Summary

PR overview: Adds SM100 (Blackwell) decode attention support, adds the FLASH_MASK_ATTN_BLACKWELL backend, and adapts the FP8 KV cache scale shape/dtype.
Scope of changes: custom_ops/gpu_ops/, fastdeploy/model_executor/layers/attention/, fastdeploy/cache_manager/, fastdeploy/worker/, fastdeploy/spec_decode/
Affected tags: [OP] [KVCache] [Feature] [Speculative Decoding]

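Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | flash_mask_attn_backend.py:68 | except: + assert False is used for a runtime error; the assert is disabled under Python -O |
| 🟡 Suggestion | custom_ops/gpu_ops/send_cache.cu | The new custom op has no unit test under tests/operators/ |
| ❓ Question | fastdeploy/worker/input_batch.py:388 | seq_lens_kv is initialized to all zeros; no update logic appears in the diff |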

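📝 PR Convention Check

The PR targets release/online/20260415 (not develop), and the title lacks the [Cherry-Pick] prefix and the original PR number. The Modifications, Usage or Command, and Accuracy Tests sections are empty, and no Checklist item is checked.

Suggested title (can be copied directly):

  • [Cherry-Pick][Feature] Support new blackwell decode attention(#original PR number)

Suggested PR description (can be copied directly):
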
## Motivation
Support decode attention on the new SM100 (Blackwell) architecture: add the `FLASH_MASK_ATTN_BLACKWELL` backend and adapt the `block_wise_fp8` KV cache scale shape (last dimension fixed at 4) and dtype (float32) for Blackwell.

## Modifications
- `custom_ops/gpu_ops/send_cache.cu`: add the `send_cache` custom op, which uses `cudaLaunchHostFunc` to signal completion of layerwise KV cache writes
- `custom_ops/setup_ops.py`: add `send_cache.cu` to the list of sources to compile
- `fastdeploy/platforms/base.py`: add the `FLASH_MASK_ATTN_BLACKWELL` backend enum value
- `fastdeploy/platforms/cuda.py`: register the `FLASH_MASK_ATTN_BLACKWELL` backend and map it to `FlashMaskAttentionBackend`
- `fastdeploy/model_executor/layers/attention/flash_mask_attn_backend.py`: add the `forward_mixed_blackwell`, `init_blackwell_attention_metadata`, and `get_kv_cache_scale_shape` methods to support separate encoder/decoder attention forward passes on Blackwell
- `fastdeploy/model_executor/forward_meta.py`: add the `seq_lens_kv` field
- `fastdeploy/cache_manager/cache_messager.py` / `cache_transfer_manager.py`: add conditional branches for the Blackwell scale shape/dtype
- `fastdeploy/spec_decode/mtp.py` / `fastdeploy/worker/gpu_model_runner.py`: use `get_kv_cache_scale_shape` to obtain the correct scale shape/dtype
- `fastdeploy/worker/input_batch.py`: add the `seq_lens_kv` shared input

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

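Overall Assessment

The implementation approach is clear, the separate encoder/decoder forward logic for Blackwell attention is structurally complete, and the scale shape/dtype adaptation across components is fairly thorough. Before merging, consider fixing the except clause and the way the runtime error is raised, confirming the seq_lens_kv update path, and adding unit tests for the custom op.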

    try:
        import blackwell_ops
    except:
        assert False, "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."
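🟡 Suggestion: except: + assert False is used for runtime error checking

The current code has two problems:

  1. A bare except: also swallows non-import exceptions such as SystemExit and KeyboardInterrupt; use except ImportError: instead.
  2. assert False is skipped in Python's optimized mode (python -O), so the import failure passes silently and a later use of blackwell_ops raises a hard-to-diagnose error (a NameError, since the name is never bound). Raise ImportError(...) explicitly instead.

Suggested fix: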

if _use_blackwell_attn:
    try:
        import blackwell_ops
    except ImportError as e:
        raise ImportError(
            "FLASH_MASK_ATTN_BLACKWELL requires blackwell_ops, but import failed."
        ) from e
else:
    blackwell_ops = None
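
The `python -O` point can be checked without restarting the interpreter. The sketch below compiles the same `assert False` statement at optimization level 0 and at level 1 (the level `-O` selects) using the built-in `compile`; the helper name and the source string are made up for illustration.

```python
# The same statement compiled without optimization and with asserts stripped.
SRC = 'assert False, "blackwell_ops import failed"'


def raises_assertion(optimize):
    """Compile SRC at the given optimization level and report whether the assert fires."""
    code = compile(SRC, "<sketch>", "exec", optimize=optimize)
    try:
        exec(code, {})
    except AssertionError:
        return True
    return False


normal = raises_assertion(0)      # level 0: assert statements are kept
optimized = raises_assertion(1)   # level 1: assert statements are removed

print(normal, optimized)
```

An explicit `raise ImportError(...)` is unaffected by the optimization level, which is the reason the review prefers it for a runtime check.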

self.mask_rollback = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")
self.preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
self.last_preempted_idx = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32", device="cpu")
self.seq_lens_kv = paddle.full(shape=[max_num_seqs, 1], fill_value=0, dtype="int32")
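❓ Question: seq_lens_kv is always initialized to 0, and no update logic appears in the diff

seq_lens_kv is passed to flash_decoder_attn_fwd (in forward_mixed_blackwell) as the KV sequence length argument. If the field is never updated in the inference loop, the decode phase will use all-zero KV lengths, producing incorrect attention results.

Please confirm:

  1. Is seq_lens_kv updated in paths such as make_inputs / update_share_inputs (not shown in this diff)?
  2. If so, consider noting it in the PR description; if not, the update logic needs to be added.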

@PaddlePaddle-bot
PaddlePaddle-bot commented May 28, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-28 13:54:41

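This CI report was generated from the following code (refreshed every 30 minutes):


1 Task Overview

Not all Required CI jobs have passed: 1 Required job failed and 0 Required jobs are pending. The failure is a unit-test regression and should be fixed before merging.

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 19(0) | 19 | 18 | 1 | 0 | 0 | 0 |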

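2 Job Status Summary

Note on the log column: failed jobs use the log_links_markdown field directly (pre-generated); for running jobs the link is assembled manually as [Job]({html_url})

2.1 Required jobs: 6/7 passed

Required jobs block merging; failures must be handled first.

| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h16m | PR issue: new fields are incompatible with existing tests / Optional semantics | Initialize the scale shape and change the annotation to Optional | Job | - |
| ✅ | The remaining 6 required jobs passed | - | - | - | - | - |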

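2.2 Optional jobs: 12/12 passed

Optional jobs do not block merging; failures are for reference only.

| Status | Job | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ✅ | The remaining 12 optional jobs passed | - | - | - |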

3 Failure Details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

  • Status: ❌ Failed
  • Error type: test failure
  • Confidence: high
  • Root cause summary: new fields are incompatible with existing tests / Optional semantics
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed test cases:

| Test | Error | Root cause |
| --- | --- | --- |
| tests/cache_manager/test_cache_transfer_manager.py::TestCacheTransferManager::test_init_storage_buffer_registers_scale_buffers | AttributeError: no attribute cache_scale_shape | The PR changes the scale buffer calculation in _init_storage_buffer() to depend on self.cache_scale_shape[-1], but this test only sets has_cache_scale=True by hand and never initializes the new field. |
| tests/graph_optimization/test_graph_opt_backend.py::TestGraphOptBackend::test_static_graph | TypeError: forward_meta.seq_lens_kv got NoneType | The PR adds seq_lens_kv: paddle.Tensor = None to ForwardMeta. The dynamic-dimension resolution in graph optimization treats a non-Optional Tensor as required, and the ForwardMeta built by the test leaves this field as None. |

Root cause details:
This PR modifies fastdeploy/cache_manager/cache_transfer_manager.py, changing the stride of the block-wise fp8 scale buffer from self.block_size to self.cache_scale_shape[-1] to accommodate the different scale shape of the Blackwell backend. An existing unit test constructs the manager directly and sets has_cache_scale=True without also setting cache_scale_shape, so _init_storage_buffer() raises AttributeError.
The PR also adds seq_lens_kv: paddle.Tensor = None to fastdeploy/model_executor/forward_meta.py. For dataclass fields annotated as paddle.Tensor, dynamic_dims_marker.py in static graph optimization implicitly marks dynamic dimensions and requires the runtime value to be a Tensor. The test builds ForwardMeta(ids_remove_padding=..., step_use_cudagraph=True) without passing seq_lens_kv, so the resolver finds None and raises TypeError.
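
The difference between `paddle.Tensor = None` and `Optional[paddle.Tensor] = None` can be illustrated with a small annotation-driven check. This is a hedged sketch of the general technique only: `Tensor`, `MetaStrict`, `MetaOptional`, `check`, and `passes` are stand-ins invented here, and the real logic in dynamic_dims_marker.py may differ.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union, get_args, get_origin


class Tensor:
    """Stand-in for paddle.Tensor so the sketch runs without Paddle."""


@dataclass
class MetaStrict:
    seq_lens_kv: Tensor = None  # annotated as a plain Tensor


@dataclass
class MetaOptional:
    seq_lens_kv: Optional[Tensor] = None  # None is explicitly allowed


def check(meta):
    """Raise TypeError when a field annotated as a plain Tensor holds a non-Tensor."""
    for f in fields(meta):
        hint, value = f.type, getattr(meta, f.name)
        is_optional = get_origin(hint) is Union and type(None) in get_args(hint)
        if is_optional and value is None:
            continue  # Optional field left unset: nothing to validate
        if hint is Tensor and not isinstance(value, Tensor):
            raise TypeError(f"{f.name} has type annotation Tensor but got {type(value)}")


def passes(meta):
    try:
        check(meta)
    except TypeError:
        return False
    return True


strict_ok = passes(MetaStrict())
optional_ok = passes(MetaOptional())
print(strict_ok, optional_ok)
```

Under this kind of check the default value alone does not make a field optional; only the annotation does, which matches the TypeError reported for the test fixture.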

Key logs:

E AttributeError: 'CacheTransferManager' object has no attribute 'cache_scale_shape'
fastdeploy/cache_manager/cache_transfer_manager.py:426
tests/cache_manager/test_cache_transfer_manager.py:668

E TypeError: data forward_meta.seq_lens_kv has type annotation Tensor but got type <class 'NoneType'>
fastdeploy/model_executor/graph_optimization/dynamic_dims_marker.py:185
tests/graph_optimization/test_graph_opt_backend.py:202

Suggested fixes:

  1. tests/cache_manager/test_cache_transfer_manager.py L658-L665: in the test fixture that sets has_cache_scale=True, also set self.manager.cache_scale_shape, or have CacheTransferManager._init_storage_buffer() supply a compatible default when cache_scale_shape is missing (for example, follow the non-Blackwell path and use [num_gpu_blocks, head_num, block_size] / block_size).
  2. fastdeploy/model_executor/forward_meta.py L167: if seq_lens_kv may be empty, change the annotation to Optional[paddle.Tensor] = None; if the Blackwell path requires it, also update the ForwardMeta fixture at tests/graph_optimization/test_graph_opt_backend.py L114 to pass a valid seq_lens_kv Tensor.

Fix summary: initialize the scale shape and change the annotation to Optional

Related changes: fastdeploy/cache_manager/cache_transfer_manager.py L175-L186, L424-L427; fastdeploy/model_executor/forward_meta.py L167
Link: View log
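
The compatibility default mentioned in the first fix could look roughly like the following. `ManagerSketch` and `scale_last_dim` are hypothetical names used only to show the `getattr` fallback; they are not taken from CacheTransferManager.

```python
class ManagerSketch:
    """Hypothetical stand-in for CacheTransferManager; only the fallback is shown."""

    def __init__(self, num_gpu_blocks, head_num, block_size):
        self.num_gpu_blocks = num_gpu_blocks
        self.head_num = head_num
        self.block_size = block_size

    def scale_last_dim(self):
        # Fall back to the non-Blackwell layout when cache_scale_shape was never
        # set, as happens when a test builds the object by hand.
        default_shape = [self.num_gpu_blocks, self.head_num, self.block_size]
        shape = getattr(self, "cache_scale_shape", default_shape)
        return shape[-1]


legacy = ManagerSketch(num_gpu_blocks=16, head_num=8, block_size=64)
blackwell = ManagerSketch(num_gpu_blocks=16, head_num=8, block_size=64)
blackwell.cache_scale_shape = [16, 8, 4]  # Blackwell layout from the review summary

print(legacy.scale_last_dim(), blackwell.scale_last_dim())
```

Whether to put the default in production code or to fix the fixture instead is a judgment call; a silent fallback keeps old tests passing but can also hide a missing initialization.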

@codecov-commenter

Codecov Report

❌ Patch coverage is 29.41176% with 60 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/online/20260415@0d7fccd). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ecutor/layers/attention/flash_mask_attn_backend.py | 16.27% | 34 Missing and 2 partials ⚠️ |
| fastdeploy/cache_manager/cache_messager.py | 14.28% | 6 Missing ⚠️ |
| fastdeploy/cache_manager/cache_transfer_manager.py | 72.22% | 4 Missing and 1 partial ⚠️ |
| fastdeploy/spec_decode/mtp.py | 0.00% | 5 Missing ⚠️ |
| fastdeploy/worker/gpu_model_runner.py | 0.00% | 5 Missing ⚠️ |
| fastdeploy/platforms/cuda.py | 0.00% | 2 Missing and 1 partial ⚠️ |
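
As a sanity check on the figures in this report, the per-file rows can be summed and compared with the headline numbers. The sketch assumes patch coverage is covered lines divided by total changed lines, with partial lines counted as uncovered; that reading is an assumption about how the report is computed, not something stated in it.

```python
# (missing, partial) line counts per file, copied from the Codecov table above.
rows = {
    "flash_mask_attn_backend.py": (34, 2),
    "cache_messager.py": (6, 0),
    "cache_transfer_manager.py": (4, 1),
    "mtp.py": (5, 0),
    "gpu_model_runner.py": (5, 0),
    "cuda.py": (2, 1),
}

uncovered = sum(missing + partial for missing, partial in rows.values())

# Headline patch coverage from the report, as a fraction.
patch_coverage = 0.2941176
total = round(uncovered / (1 - patch_coverage))
covered = total - uncovered

print(uncovered, total, covered)
```

Under that assumption the rows add up to the 60 uncovered lines quoted in the headline, and the stated percentage is consistent with 25 covered lines out of 85 changed.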
Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20260415    #7949   +/-   ##
==========================================================
  Coverage                           ?   72.83%           
==========================================================
  Files                              ?      387           
  Lines                              ?    54099           
  Branches                           ?     8480           
==========================================================
  Hits                               ?    39401           
  Misses                             ?    11984           
  Partials                           ?     2714           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 72.83% <29.41%> (?) |
