Conversation

@mattheliu (Collaborator) commented Feb 2, 2026

Motivation

Add support for NVIDIA V100 GPUs (SM70 architecture) to FastDeploy so that development and testing can be done on older GPUs.

V100 (Volta) does not support the features below, so both the build system and the runtime logic need to be adapted:

| Feature | Minimum requirement | Supported on V100 |
| --- | --- | --- |
| BF16 data type | SM80+ (Ampere) | No |
| FP8 quantization | SM89+ (Ada Lovelace) | No |
| cp.async instruction | SM80+ (Ampere) | No |
| tanh.approx.f32 PTX | SM75+ (Turing) | No |

Modifications

Build system

  • setup_ops.py: support building for SM70+, separating SM70-specific code from SM80+-specific code
  • cpp_extensions.cc: add ENABLE_APPEND_ATTENTION and ENABLE_BF16 macros to control conditional compilation

CUDA kernels

  • gelu_tanh.cu: fix the tanh.approx.f32 PTX instruction so it compiles on SM70, and add MetaX compatibility
  • moe_wna16_marlin_*.cu/h: fix Marlin GEMM template compilation compatibility on SM70
  • moe_deepgemm_depermute.cu: add SM70/SM75 conditional compilation; BF16 arithmetic is implemented via float conversion
  • sampling.cuh: add the missing <limits> header

Python runtime

  • fastdeploy/platforms/cuda.py:

    • add an SM version detection method (get_sm_version())
    • add hardware capability checks (supports_bf16(), supports_fp8(), supports_async_copy(), supports_marlin())
    • automatic attention backend fallback (APPEND_ATTN/MLA_ATTN → FLASH_ATTN)
  • fastdeploy/config.py: automatic BF16 → FP16 dtype downgrade

  • fastdeploy/model_executor/layers/moe/moe.py:

    • Marlin MoE backend → CUTLASS fallback (SM < 80)
    • Triton MoE backend → CUTLASS fallback (SM < 80)
  • fastdeploy/model_executor/layers/quantization/__init__.py:

    • automatic fallback of FP8 quantization methods (block_wise_fp8 → wint8, w4afp8 → wint4)
  • attention/ops/*.py: wrap SM80+-only ops in try-except import guards (a capability-check sketch follows this list)
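
The capability checks above live on the CUDA platform class in fastdeploy/platforms/cuda.py; the snippet below is only a minimal standalone sketch of the same idea, with thresholds taken from the tables in this description (the free-function form and the use of paddle.device.cuda.get_device_capability() are illustrative assumptions, not the PR's exact code).

```python
import paddle


def get_sm_version() -> int:
    """Return the compute capability as an integer, e.g. 70 for V100."""
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor


def supports_bf16() -> bool:
    return get_sm_version() >= 80   # BF16 needs Ampere (SM80+)


def supports_fp8() -> bool:
    return get_sm_version() >= 89   # FP8 needs Ada Lovelace (SM89+)


def supports_async_copy() -> bool:
    return get_sm_version() >= 80   # cp.async needs SM80+


def supports_marlin() -> bool:
    return get_sm_version() >= 80   # Marlin GEMM kernels need SM80+
```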

Tests

  • test_attention_layer.py: add an FP8 SM89+ skip decorator
  • test_fusedmoe.py: add an FP8 SM89+ skip decorator
  • test_w4afp8.py: add an FP8 SM89+ skip decorator
  • test_ffn.py: automatically select dtype and quantization config based on the SM version (a skip-decorator sketch follows this list)
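
For reference, a minimal sketch of the SM89+ skip pattern used by the FP8 tests (the decorator name and test class shown here are illustrative; the actual decorators live in the individual test files):

```python
import unittest

import paddle


def _sm_version() -> int:
    if not paddle.is_compiled_with_cuda():
        return 0
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor


# Skip FP8 cases on anything older than Ada Lovelace (SM89), e.g. V100 (SM70).
skip_if_no_fp8 = unittest.skipIf(_sm_version() < 89, "FP8 requires SM89+")


@skip_if_no_fp8
class TestW4AFP8(unittest.TestCase):
    def test_gemm(self):
        ...  # FP8 GEMM checks would go here
```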

SM70/SM75 fallback strategy

| Feature | Original | SM70/SM75 fallback | Reason |
| --- | --- | --- | --- |
| Data type | BF16 | FP16 | BF16 requires SM80+ |
| Attention backend | APPEND_ATTN | FLASH_ATTN | cp.async requires SM80+ |
| Attention backend | MLA_ATTN | FLASH_ATTN | cp.async requires SM80+ |
| MoE backend | Marlin | CUTLASS | Marlin requires SM80+ |
| MoE backend | Triton | CUTLASS | tritonmoe_preprocess requires SM80+ |
| Quantization | block_wise_fp8 | wint8 | FP8 requires SM89+ |
| Quantization | w4afp8 | wint4 | FP8 requires SM89+ |
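
A compact sketch of the decision logic this table describes (illustrative only; the real logic is split across fastdeploy/config.py, moe.py, and quantization/__init__.py, and the function below is not part of the PR):

```python
def resolve_for_sm(sm: int, dtype: str, attn: str, moe: str, quant: str):
    """Map requested settings to what the detected SM version can actually run."""
    if sm < 80:
        if dtype == "bfloat16":
            dtype = "float16"                  # BF16 needs SM80+
        if attn in ("APPEND_ATTN", "MLA_ATTN"):
            attn = "FLASH_ATTN"                # cp.async needs SM80+
        if moe in ("marlin", "triton"):
            moe = "cutlass"                    # Marlin / tritonmoe_preprocess need SM80+
    if sm < 89:
        # FP8 needs SM89+
        quant = {"block_wise_fp8": "wint8", "w4afp8": "wint4"}.get(quant, quant)
    return dtype, attn, moe, quant


# On V100: ('float16', 'FLASH_ATTN', 'cutlass', 'wint8')
print(resolve_for_sm(70, "bfloat16", "APPEND_ATTN", "marlin", "block_wise_fp8"))
```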

Usage or Command

# Build (targeting the SM70 architecture)
MAX_JOBS=8 bash build.sh 1 python false [70]

# Or use setup_ops.py
cd custom_ops && python setup_ops.py install
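
Optionally, a quick sanity check after the build (a sketch assuming a CUDA-enabled Paddle install; the op module path follows the imports used elsewhere in this PR):

```python
import paddle

major, minor = paddle.device.cuda.get_device_capability()
print(f"Compute capability: SM{major}{minor}")  # expect SM70 on V100

# The custom ops module should import without errors once setup_ops.py has run.
from fastdeploy.model_executor.ops import gpu as fd_gpu_ops
print("custom GPU ops module loaded:", fd_gpu_ops.__name__)
```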

Accuracy Tests

This PR only adds hardware compatibility support and does not change the model computation logic. On V100 the code automatically falls back to compatible alternatives such as FP16 and CUTLASS, so accuracy matches those alternatives.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
    • Skip decorators based on SM version detection have been added so that tests are skipped correctly on unsupported hardware.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot (bot) commented Feb 2, 2026

Thanks for your contribution!

@mattheliu (Collaborator, Author) commented:
/Re-run failed jobs

@mattheliu mattheliu marked this pull request as ready for review February 4, 2026 05:47
codecov-commenter commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 40.09434% with 127 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d60daca). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...loy/model_executor/layers/quantization/__init__.py | 28.57% | 23 Missing and 2 partials ⚠️ |
| ...oy/model_executor/layers/quantization/mix_quant.py | 25.00% | 20 Missing and 1 partial ⚠️ |
| ..._executor/layers/moe/fused_moe_deepgemm_backend.py | 18.18% | 17 Missing and 1 partial ⚠️ |
| fastdeploy/platforms/cuda.py | 68.29% | 11 Missing and 2 partials ⚠️ |
| .../model_executor/layers/quantization/weight_only.py | 21.42% | 7 Missing and 4 partials ⚠️ |
| ...del_executor/layers/quantization/block_wise_fp8.py | 42.85% | 6 Missing and 2 partials ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 22.22% | 4 Missing and 3 partials ⚠️ |
| ...l_executor/layers/moe/fused_moe_cutlass_backend.py | 40.00% | 6 Missing ⚠️ |
| fastdeploy/config.py | 62.50% | 1 Missing and 2 partials ⚠️ |
| ...cutor/layers/attention/ops/flash_mask_attention.py | 0.00% | 3 Missing ⚠️ |
| ... and 5 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6306   +/-   ##
==========================================
  Coverage           ?   68.35%           
==========================================
  Files              ?      391           
  Lines              ?    52702           
  Branches           ?     8229           
==========================================
  Hits               ?    36025           
  Misses             ?    14050           
  Partials           ?     2627           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 68.35% <40.09%> (?) |



Support FP16 inference on V100 by adding SM70 compilation flags, disabling BF16/FP8 quantization, and graceful fallback for SM80+ only ops.
mattheliu and others added 2 commits February 9, 2026 13:03
The tritonmoe_preprocess_func and count_tokens_per_expert_func do not
depend on BF16 features and should be available on all GPU architectures.
Moving them out of the #ifdef ENABLE_BF16 block fixes the CI failure
where ernie-21b model failed to load due to missing tritonmoe_preprocess_func.

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
…upport

- Restore #ifndef PADDLE_WITH_CUSTOM_DEVICE_METAX_GPU guard for tanh_ptx
  function since MetaX compiler doesn't support NVIDIA PTX asm syntax
- Restore MetaX-specific kernel launch using standard <<<>>> syntax
  instead of cudaLaunchKernelEx which is not supported on MetaX
- Keep SM70 (V100) support by using tanhf() fallback when __CUDA_ARCH__ < 750

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…e#6358

PR PaddlePaddle#6358 removed speculate_get_output_padding_offset.cu and reuses
non-MTP get_output_padding_offset. Keep cpp_extensions.cc consistent
with upstream develop.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…6358

PR PaddlePaddle#6358 renamed parameters and changed signature:
- output_padding_offset -> batch_id_per_token_output
- Added cu_seqlens_q_output parameter
- Removed max_input_length parameter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add get_attn_mask_q function that was lost during merge conflict resolution.
This function was introduced in PR PaddlePaddle#6354 and is required for FA2/FA3/FA4
with attn_mask_q support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds compatibility support for NVIDIA V100 (SM70/Volta) GPUs to FastDeploy. The build system and runtime apply conditional compilation and automatic fallback based on SM architecture capability, so that SM80+/SM89+ features such as BF16/FP8/cp.async do not cause build or runtime failures on older GPUs.

Changes:

  • Build and operators: extend the custom_ops build logic to cover SM70 and adapt several CUDA kernels per SM version (e.g. tanh.approx.f32, BF16 arithmetic compatibility, missing headers).
  • Runtime: add CUDA SM capability detection and capability-based fallback (dtype, attention backend, MoE backend, FP8 quant methods).
  • Tests: add SM89+ skip logic to FP8-related tests and auto-select dtype/quantization config by SM version in some cases.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 7 comments.

Summary per file:
| File | Description |
| --- | --- |
| tests/quantization/test_w4afp8.py | Add FP8 (SM89+) capability detection and skip the affected cases. |
| tests/layers/test_fusedmoe.py | Add an SM89+ skip to the FP8 quantization test classes. |
| tests/layers/test_ffn.py | Switch the default dtype/quantization config by SM version to support SM70. |
| tests/layers/test_attention_layer.py | Add an SM89+ skip to the FP8 quantization tests. |
| fastdeploy/platforms/cuda.py | Add SM version lookup and BF16/FP8/cp.async/Marlin support checks, plus automatic attention backend fallback. |
| fastdeploy/model_executor/layers/quantization/weight_only.py | Automatically fall back the MoE backend (marlin/triton) to cutlass by SM version. |
| fastdeploy/model_executor/layers/quantization/mix_quant.py | Automatically downgrade FP8 quant types on unsupported hardware (fall back to wint8/wint4). |
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Restrict the deep_gemm import to SM89+ and improve version compatibility of fp8_quant_blockwise. |
| fastdeploy/model_executor/layers/quantization/__init__.py | Globally adjust/downgrade FP8-related quant methods in parse_quant_config based on hardware capability. |
| fastdeploy/model_executor/layers/moe/moe.py | Automatically fall back the MoE backend to cutlass on SM70/SM < 80. |
| fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py | Wrap deep_gemm imports and fp8_quant_blockwise parameters for compatibility. |
| fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py | Add try/except around some op imports to tolerate older architectures. |
| fastdeploy/model_executor/layers/attention/ops/pre_cache_len_concat.py | Add try/except around the CUDA op import; raise NotImplementedError when unavailable. |
| fastdeploy/model_executor/layers/attention/ops/gqa_rope_write_cache.py | Same as above: raise NotImplementedError when the CUDA op is unavailable. |
| fastdeploy/model_executor/layers/attention/ops/get_block_shape_and_split_kv_block.py | Guard missing append_attn CUDA ops and raise NotImplementedError when unavailable. |
| fastdeploy/model_executor/layers/attention/ops/flash_mask_attention.py | Add try/except around the CUDA op import; raise NotImplementedError when unavailable. |
| fastdeploy/model_executor/layers/attention/ops/append_attention.py | Guard the append_attention CUDA op import by architecture availability with an explicit error message. |
| fastdeploy/model_executor/layers/attention/mla_attention_backend.py | Make MLA CUDA op imports fault-tolerant (ImportError no longer breaks module import). |
| fastdeploy/config.py | Automatically downgrade BF16 to FP16 in config post-init based on hardware capability. |
| custom_ops/setup_ops.py | Extend build sources/flags to cover SM70 and adjust compilation selection for append_attention/marlin/moe/spec_decode. |
| custom_ops/gpu_ops/sample_kernels/sampling.cuh | Add the <limits> header. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_utils/marlin_template.h | Adjust marlin template/macro parameters for SM70 compatibility and readability. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_utils/kernel.h | Fix macro indentation/formatting and tidy comment layout. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_gemm.cu | Provide explicit unsupported behavior for SM < 80 and adjust includes/formatting. |
| custom_ops/gpu_ops/moe/moe_deepgemm_depermute.cu | Add a float-conversion compatibility path for BF16 arithmetic on SM70/SM75. |
| custom_ops/gpu_ops/gelu_tanh.cu | Remove the tanh.approx.f32 dependency for SM70 and fix the block-dimension upper bound. |
| custom_ops/gpu_ops/cpp_extensions.cc | Control append_attention/BF16 exports via macros and adjust some speculate/reasoning interface parameter names. |
Comments suppressed due to low confidence (1)

custom_ops/gpu_ops/cpp_extensions.cc:1275

  • cpp_extensions.cc no longer exports per_token_quant / per_token_quant_padding to the fastdeploy_ops module, but the repository still contains calls to fastdeploy.model_executor.ops.gpu.per_token_quant(…) and unit tests covering it (e.g. fused_moe_triton_backend.py, tests/operators/test_per_token_quant.py). This will lead to ImportError/AttributeError and break the FP8 paths and the tests. Suggestion: restore these two bindings (and compile the corresponding CUDA source files), or update all callers and remove/update the related tests.
  m.def("fused_mask_swiglu_fp8_quant",
        &FusedMaskSwigluFP8Quant,
        py::arg("input"),
        py::arg("token_nums_per_expert"),
        py::arg("block_size"),
        py::arg("use_ue8m0") = false,
        "fused mask swiglu and fp8 quant");

Comment on lines +44 to +57
_sm_version = cuda_device.get_device_capability()[0]
print(f"[DEBUG] Detected SM version: {_sm_version}")
if _sm_version >= 8:
    paddle.set_default_dtype("bfloat16")
    _default_dtype = paddle.bfloat16
    # BlockWiseFP8Config requires bfloat16, only available on SM80+
    _quant_config = BlockWiseFP8Config(weight_block_size=[128, 128])
    print(f"[DEBUG] Using BlockWiseFP8Config for SM{_sm_version}0")
else:
    paddle.set_default_dtype("float16")
    _default_dtype = paddle.float16
    # V100 (SM70) doesn't support FP8 quantization, use None
    _quant_config = None
    print(f"[DEBUG] Disabling quantization for V100 (SM{_sm_version}0), _quant_config = None")
Copilot AI commented Feb 9, 2026:

test_ffn.py calls cuda_device.get_device_capability() / paddle.device.cuda.get_device_name() at module import time and prints several [DEBUG] lines. If the tests are collected in an environment without CUDA (or where CUDA is unavailable), collection will fail outright instead of the tests being skipped. Suggest moving the SM detection into setUp/test methods, combining paddle.is_compiled_with_cuda() with unittest.skipUnless/skipIf, and removing the prints or switching to a logger (disabled by default) to avoid polluting test output.
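
One possible shape for this suggestion (a sketch, not the PR's code; the class name mirrors test_ffn.py but is assumed):

```python
import unittest

import paddle


def _sm_version_or_none():
    # Only touch the CUDA runtime when a CUDA build with a visible device is present.
    if not (paddle.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0):
        return None
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor


@unittest.skipUnless(paddle.is_compiled_with_cuda(), "requires a CUDA build of Paddle")
class TestFFN(unittest.TestCase):
    def setUp(self):
        sm = _sm_version_or_none()
        # Pick BF16 on SM80+, FP16 without quantization otherwise.
        if sm is not None and sm >= 80:
            paddle.set_default_dtype("bfloat16")
        else:
            paddle.set_default_dtype("float16")
```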

Comment on lines 94 to 98
  DISPATCH_FLOAT_FP6_DTYPE(input.dtype(), scalar_t, {
    uint32_t vec_size = 16 / sizeof(scalar_t);
    dim3 grid(num_tokens);
-   dim3 block(std::max(d / vec_size, 1024U));
+   dim3 block(std::min(d / vec_size, 1024U));

Copilot AI commented Feb 9, 2026:

The kernel launch in gelu_tanh.cu uses std::min(d / vec_size, 1024U) for the block dimension. When d < vec_size, d / vec_size == 0, which produces an invalid launch configuration with blockDim = 0 and fails at runtime. Suggest clamping the lower bound (to at least 1, or to the warp size / a minimum parallelism if desired).

Comment on lines +30 to +41
  # These ops may not be available on older GPU architectures (V100/SM70)
  moe_expert_dispatch = None
  moe_expert_reduce = None

  if current_platform.is_cuda():
-     from fastdeploy.model_executor.ops.gpu import moe_expert_dispatch, moe_expert_reduce
+     try:
+         from fastdeploy.model_executor.ops.gpu import (
+             moe_expert_dispatch,
+             moe_expert_reduce,
+         )
+     except ImportError:
+         pass
Copilot AI commented Feb 9, 2026:

In fused_moe_cutlass_backend.py, moe_expert_dispatch/moe_expert_reduce are set to None on ImportError, but apply_tp still calls them unconditionally, which eventually fails with a hard-to-diagnose "NoneType is not callable". Suggest either raising a NotImplementedError with architecture information right where the import fails when these ops are unavailable on the current architecture/build, or checking explicitly before the call and falling back to an available backend.
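
A small sketch of the guard the reviewer suggests (an illustrative helper, not existing code):

```python
def _require_op(op, name: str):
    """Fail loudly with architecture context instead of calling None later."""
    if op is None:
        raise NotImplementedError(
            f"{name} is not available in this build; it requires SM80+ and is "
            "not compiled for V100 (SM70)."
        )
    return op


# e.g. at the top of apply_tp:
# dispatch_op = _require_op(moe_expert_dispatch, "moe_expert_dispatch")
```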

Comment on lines 296 to 300
"gpu_ops/cpp_extensions.cc",
"gpu_ops/share_external_data.cu",
"gpu_ops/fused_mask_swiglu_fp8_quant_kernel.cu",
"gpu_ops/per_token_quant_fp8.cu",
"gpu_ops/update_split_fuse_input.cu",
"gpu_ops/text_image_index_out.cu",
Copilot AI commented Feb 9, 2026:

gpu_ops/per_token_quant_fp8.cu was removed from the CUDA sources list in setup_ops.py, but the repository still has Python code and unit tests importing/calling fastdeploy.model_executor.ops.gpu.per_token_quant and per_token_quant_padding (e.g. fused_moe_triton_backend.py, tests/operators/test_per_token_quant.py). The build output will then be missing those ops and fail at run/test time. Suggest either restoring the .cu file to the sources and re-exposing the pybind in cpp_extensions.cc, or updating all callers to the new quant wrapper/ops and deleting the corresponding unit tests.

Comment on lines 35 to 48
  if current_platform.is_cuda():
-     try:
-         m_grouped_fp8_gemm_nt_contiguous = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_fp8_gemm_nt_contiguous
-         )
-         m_grouped_fp8_gemm_nt_masked = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_fp8_gemm_nt_masked
-         )
-     except:
-         m_grouped_fp8_gemm_nt_contiguous = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous
-         )
-         m_grouped_fp8_gemm_nt_masked = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked
-         )
+     if get_sm_version() == 100:
+         paddle.compat.enable_torch_proxy(scope={"deep_gemm"})
+         from deep_gemm import (
+             m_grouped_fp8_gemm_nt_contiguous,
+             m_grouped_fp8_gemm_nt_masked,
+         )
+     else:
+         from fastdeploy.model_executor.ops.gpu.deep_gemm import (
+             m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as m_grouped_fp8_gemm_nt_contiguous,
+         )
+         from fastdeploy.model_executor.ops.gpu.deep_gemm import (
+             m_grouped_gemm_fp8_fp8_bf16_nt_masked as m_grouped_fp8_gemm_nt_masked,
+         )
Copilot AI commented Feb 9, 2026:

When current_platform.is_cuda() is true and SM != 100, fused_moe_deepgemm_backend.py unconditionally imports symbols from fastdeploy.model_executor.ops.gpu.deep_gemm. If deep_gemm is only generated on SM89+ or under specific compile switches, machines such as SM70/SM80 will raise ImportError at module import time, so the process cannot start even though the runtime backend fallback is already in place. Suggest making the import conditional on SM/compile macros (e.g. only import for SM >= 89), or keeping a try/except and raising a clear NotImplementedError only when the DeepGemm backend is actually selected.
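
A sketch of the fault-tolerant variant the reviewer describes (the alias names follow the diff above; the deferred-error helper is an assumption, not the PR's code):

```python
_deep_gemm_import_error = None
try:
    from fastdeploy.model_executor.ops.gpu.deep_gemm import (
        m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as m_grouped_fp8_gemm_nt_contiguous,
        m_grouped_gemm_fp8_fp8_bf16_nt_masked as m_grouped_fp8_gemm_nt_masked,
    )
except ImportError as err:  # deep_gemm may not be built on SM70/SM80
    m_grouped_fp8_gemm_nt_contiguous = None
    m_grouped_fp8_gemm_nt_masked = None
    _deep_gemm_import_error = err


def _ensure_deep_gemm_available():
    """Call this only when the DeepGemm MoE backend is actually selected."""
    if m_grouped_fp8_gemm_nt_contiguous is None:
        raise NotImplementedError(
            f"DeepGemm backend unavailable on this build/architecture: {_deep_gemm_import_error}"
        )
```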

Comment on lines 62 to 63


Copilot AI commented Feb 9, 2026:

In block_wise_fp8.py, fp8_gemm_nt stays None when SM >= 89 but deep_gemm is not compiled or fails to import, yet deep_gemm_fp8_gemm_nt() still calls fp8_gemm_nt((x, scale), …) unconditionally. This triggers "NoneType is not callable" at runtime with an unhelpful message. Suggest explicitly validating that fp8_gemm_nt is available before registering the op/calling it (and combining the SM version with the compile switches to raise a clear error or fall back automatically).

Suggested change

# Validate fp8_gemm_nt availability when deep_gemm is explicitly enabled.
# This avoids confusing 'NoneType is not callable' errors at runtime.
if current_platform.is_cuda():
    _sm_version_check = get_sm_version()
    if _sm_version_check >= 89 and bool(envs.FD_USE_DEEP_GEMM) and fp8_gemm_nt is None:
        raise RuntimeError(
            "FD_USE_DEEP_GEMM is enabled on CUDA SM"
            f"{_sm_version_check}, but fp8_gemm_nt could not be imported. "
            "Please ensure deep_gemm is compiled for this architecture or "
            "disable FD_USE_DEEP_GEMM to fall back to a non-deep_gemm backend."
        )

            moe_expert_dispatch,
            moe_expert_reduce,
        )
    except ImportError:
Copilot AI commented Feb 9, 2026:

'except' clause does nothing but pass and there is no explanatory comment.

- Move W8A8 quantization, MOE GEMM, and related features from cc >= 75/80 to cc >= 70
- Add --skip-fp8 flag to auto_gen_template_instantiation.py for SM70 compatibility
- Keep append_attention, MLA, gptq_marlin_repack, winx_unzip at cc >= 80 (require cp.async/ldmatrix)
- Tested: FFN/MOE tests pass, quantization (non-FP8) tests pass on V100

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
    from fastdeploy.model_executor.ops.gpu import (
        append_attention_with_output as append_attention_with_output_gpu,
    )
except ImportError:
A Collaborator commented:

What attention approach is used on SM70? Has it been run end to end?

mattheliu (Collaborator, Author) replied:

At the moment Append Attention is skipped entirely; the implementation plan for SM70 still needs to be worked out.
