[Feature][OP] Add V100 (SM70) GPU Support #6306
base: develop
Conversation
Thanks for your contribution!
/Re-run failed jobs
Codecov Report
❌ Patch coverage report. Coverage diff against develop:

| Metric | develop | #6306 |
|---|---|---|
| Coverage | ? | 68.35% |
| Files | ? | 391 |
| Lines | ? | 52702 |
| Branches | ? | 8229 |
| Hits | ? | 36025 |
| Misses | ? | 14050 |
| Partials | ? | 2627 |
Support FP16 inference on V100 by adding SM70 compilation flags, disabling BF16/FP8 quantization, and adding a graceful fallback for SM80+-only ops.
The tritonmoe_preprocess_func and count_tokens_per_expert_func do not depend on BF16 features and should be available on all GPU architectures. Moving them out of the #ifdef ENABLE_BF16 block fixes the CI failure where ernie-21b model failed to load due to missing tritonmoe_preprocess_func. Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
…upport
- Restore the #ifndef PADDLE_WITH_CUSTOM_DEVICE_METAX_GPU guard for the tanh_ptx function, since the MetaX compiler doesn't support NVIDIA PTX asm syntax
- Restore the MetaX-specific kernel launch using standard <<<>>> syntax instead of cudaLaunchKernelEx, which is not supported on MetaX
- Keep SM70 (V100) support by using the tanhf() fallback when __CUDA_ARCH__ < 750
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…e#6358
PR PaddlePaddle#6358 removed speculate_get_output_padding_offset.cu and reuses the non-MTP get_output_padding_offset. Keep cpp_extensions.cc consistent with upstream develop.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…6358
PR PaddlePaddle#6358 renamed parameters and changed the signature:
- output_padding_offset -> batch_id_per_token_output
- Added cu_seqlens_q_output parameter
- Removed max_input_length parameter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add get_attn_mask_q function that was lost during merge conflict resolution. This function was introduced in PR PaddlePaddle#6354 and is required for FA2/FA3/FA4 with attn_mask_q support. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR adds compatibility support for NVIDIA V100 (SM70/Volta) GPUs to FastDeploy. It applies SM-aware conditional compilation in the build system and capability-based automatic fallback at runtime, so that SM80+/SM89+ features such as BF16/FP8/cp.async no longer cause compile-time or runtime failures on older GPUs.
Changes:
- Build/operator side: extend the custom_ops build logic to cover SM70 and adapt some CUDA kernels per SM capability (e.g. tanh.approx.f32, BF16 arithmetic compatibility, missing headers).
- Runtime side: add CUDA SM capability detection and capability-based fallback (dtype, attention backend, MoE backend, FP8 quant method).
- Test side: add SM89+ skip logic to FP8-related tests and automatically select dtype/quantization config by SM version in some cases.
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/quantization/test_w4afp8.py | Add FP8 (SM89+) capability detection and skip the affected cases. |
| tests/layers/test_fusedmoe.py | Add an SM89+ skip to the FP8 quantization test classes. |
| tests/layers/test_ffn.py | Switch the default dtype/quantization config by SM version to support SM70. |
| tests/layers/test_attention_layer.py | Add an SM89+ skip to the FP8 quantization tests. |
| fastdeploy/platforms/cuda.py | Add SM version detection and BF16/FP8/cp.async/Marlin support checks, plus automatic attention backend fallback. |
| fastdeploy/model_executor/layers/quantization/weight_only.py | Fall back the MoE backend (marlin/triton) to cutlass based on SM version. |
| fastdeploy/model_executor/layers/quantization/mix_quant.py | Automatically downgrade the FP8 quant type on unsupported hardware (fallback to wint8/wint4). |
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Restrict the deep_gemm import to SM89+ and improve fp8_quant_blockwise version compatibility. |
| fastdeploy/model_executor/layers/quantization/__init__.py | Globally adjust/downgrade FP8-related quant methods in parse_quant_config based on hardware capability. |
| fastdeploy/model_executor/layers/moe/moe.py | Automatically fall back the MoE backend to cutlass on SM70/SM<80. |
| fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py | Wrap deep_gemm imports and fp8_quant_blockwise parameters for compatibility. |
| fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py | Add try/except around some op imports to tolerate older architectures. |
| fastdeploy/model_executor/layers/attention/ops/pre_cache_len_concat.py | Wrap the CUDA op import in try/except and raise NotImplementedError when it is unavailable. |
| fastdeploy/model_executor/layers/attention/ops/gqa_rope_write_cache.py | Same as above: raise NotImplementedError when the CUDA op is unavailable. |
| fastdeploy/model_executor/layers/attention/ops/get_block_shape_and_split_kv_block.py | Guard the append_attn-related CUDA ops against missing builds and raise NotImplementedError when unavailable. |
| fastdeploy/model_executor/layers/attention/ops/flash_mask_attention.py | Wrap the CUDA op import in try/except and raise NotImplementedError when it is unavailable. |
| fastdeploy/model_executor/layers/attention/ops/append_attention.py | Guard the append_attention CUDA op import by architecture availability with an explicit error message. |
| fastdeploy/model_executor/layers/attention/mla_attention_backend.py | Make MLA-related CUDA op imports fault tolerant (an ImportError no longer breaks module import). |
| fastdeploy/config.py | Automatically downgrade BF16 to FP16 in config post-init based on hardware capability. |
| custom_ops/setup_ops.py | Extend build sources/flags to cover SM70 and adjust compilation selection for append_attention/marlin/moe/spec_decode. |
| custom_ops/gpu_ops/sample_kernels/sampling.cuh | Add the missing <limits> header. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_utils/marlin_template.h | Adjust marlin template/macro parameters for SM70 compatibility and readability. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_utils/kernel.h | Fix macro definition indentation/formatting and tidy comments. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_gemm.cu | Give explicit "unsupported" behavior on SM<80 and adjust includes/formatting. |
| custom_ops/gpu_ops/moe/moe_deepgemm_depermute.cu | Add a float-conversion compatibility path for BF16 arithmetic on SM70/SM75. |
| custom_ops/gpu_ops/gelu_tanh.cu | Remove the tanh.approx.f32 dependency for SM70 and fix the block-dimension upper bound. |
| custom_ops/gpu_ops/cpp_extensions.cc | Gate append_attention/BF16-related exports behind macros and adjust some speculate/reasoning interface parameter names. |
Comments suppressed due to low confidence (1)
custom_ops/gpu_ops/cpp_extensions.cc:1275
- cpp_extensions.cc no longer exports per_token_quant / per_token_quant_padding from the fastdeploy_ops module, but the repository still contains calls to fastdeploy.model_executor.ops.gpu.per_token_quant(…) and unit-test coverage of them (e.g. fused_moe_triton_backend.py, tests/operators/test_per_token_quant.py). This will cause ImportError/AttributeError and break FP8-related paths and tests. Suggest restoring these two bindings (and compiling the corresponding CUDA sources), or updating all callers and removing/updating the related tests.
m.def("fused_mask_swiglu_fp8_quant",
&FusedMaskSwigluFP8Quant,
py::arg("input"),
py::arg("token_nums_per_expert"),
py::arg("block_size"),
py::arg("use_ue8m0") = false,
"fused mask swiglu and fp8 quant");
_sm_version = cuda_device.get_device_capability()[0]
print(f"[DEBUG] Detected SM version: {_sm_version}")
if _sm_version >= 8:
    paddle.set_default_dtype("bfloat16")
    _default_dtype = paddle.bfloat16
    # BlockWiseFP8Config requires bfloat16, only available on SM80+
    _quant_config = BlockWiseFP8Config(weight_block_size=[128, 128])
    print(f"[DEBUG] Using BlockWiseFP8Config for SM{_sm_version}0")
else:
    paddle.set_default_dtype("float16")
    _default_dtype = paddle.float16
    # V100 (SM70) doesn't support FP8 quantization, use None
    _quant_config = None
    print(f"[DEBUG] Disabling quantization for V100 (SM{_sm_version}0), _quant_config = None")
Copilot AI (Feb 9, 2026):
test_ffn.py calls cuda_device.get_device_capability() / paddle.device.cuda.get_device_name() directly at module import time and emits several [DEBUG] print statements. When tests are collected in an environment without CUDA (or where CUDA is unavailable), this fails during collection instead of being skipped. Suggest moving the SM detection into setUp or the tests themselves, gating it with paddle.is_compiled_with_cuda() / paddle.device.is_compiled_with_cuda plus unittest.skipUnless/skipIf, and removing the prints or switching to a logger (disabled by default) to avoid polluting test output.
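A minimal sketch of that suggestion, assuming unittest-style tests and Paddle's device query APIs; the helper and class names here are illustrative, not the actual test_ffn.py code:

```python
# Hypothetical sketch: defer SM detection to setUp so test collection works
# without CUDA, and skip instead of crashing when no GPU is present.
import unittest

import paddle


def _sm_major():
    # Query the device only when CUDA is actually available.
    if not paddle.is_compiled_with_cuda():
        return None
    return paddle.device.cuda.get_device_capability()[0]


@unittest.skipUnless(paddle.is_compiled_with_cuda(), "requires CUDA")
class TestFFN(unittest.TestCase):
    def setUp(self):
        major = _sm_major()
        if major is not None and major >= 8:
            # SM80+: BF16 (and a block-wise FP8 config) can be used.
            paddle.set_default_dtype("bfloat16")
            self.quant_enabled = True
        else:
            # V100 (SM70): no BF16/FP8; fall back to FP16 and disable quantization.
            paddle.set_default_dtype("float16")
            self.quant_enabled = False
```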
  DISPATCH_FLOAT_FP6_DTYPE(input.dtype(), scalar_t, {
    uint32_t vec_size = 16 / sizeof(scalar_t);
    dim3 grid(num_tokens);
-   dim3 block(std::max(d / vec_size, 1024U));
+   dim3 block(std::min(d / vec_size, 1024U));
Copilot AI (Feb 9, 2026):
The kernel launch in gelu_tanh.cu now uses std::min(d / vec_size, 1024U) for the block dimension. When d < vec_size, d / vec_size is 0, which produces an illegal launch configuration with blockDim = 0 and a runtime failure. Suggest clamping the lower bound (to at least 1, or to the warp size / a minimum parallelism if needed).
# These ops may not be available on older GPU architectures (V100/SM70)
moe_expert_dispatch = None
moe_expert_reduce = None

if current_platform.is_cuda():
-   from fastdeploy.model_executor.ops.gpu import moe_expert_dispatch, moe_expert_reduce
+   try:
+       from fastdeploy.model_executor.ops.gpu import (
+           moe_expert_dispatch,
+           moe_expert_reduce,
+       )
+   except ImportError:
+       pass
Copilot AI (Feb 9, 2026):
In fused_moe_cutlass_backend.py, moe_expert_dispatch/moe_expert_reduce are set to None on ImportError, but apply_tp still calls these symbols unconditionally, which eventually surfaces as a hard-to-trace "'NoneType' is not callable". Suggestion: if these ops are unavailable for the current architecture/build, raise a NotImplementedError with architecture information right where the import fails; or check explicitly before calling and fall back to an available backend.
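One possible shape for the first option, sketched with illustrative stub names (the import path and op names come from the diff above):

```python
# Hypothetical sketch: fail loudly with architecture info instead of leaving
# the symbols as None and hitting "'NoneType' is not callable" in apply_tp.
try:
    from fastdeploy.model_executor.ops.gpu import (
        moe_expert_dispatch,
        moe_expert_reduce,
    )
except ImportError as err:
    _import_error = err

    def _unavailable(*args, **kwargs):
        raise NotImplementedError(
            "moe_expert_dispatch/moe_expert_reduce are not built for this GPU "
            "architecture (e.g. V100/SM70); select a MoE backend supported on "
            "this device instead."
        ) from _import_error

    moe_expert_dispatch = _unavailable
    moe_expert_reduce = _unavailable
```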
| "gpu_ops/cpp_extensions.cc", | ||
| "gpu_ops/share_external_data.cu", | ||
| "gpu_ops/fused_mask_swiglu_fp8_quant_kernel.cu", | ||
| "gpu_ops/per_token_quant_fp8.cu", | ||
| "gpu_ops/update_split_fuse_input.cu", | ||
| "gpu_ops/text_image_index_out.cu", |
Copilot AI (Feb 9, 2026):
setup_ops.py removed gpu_ops/per_token_quant_fp8.cu from the CUDA sources list, but the repository still has Python code and unit tests that import/call fastdeploy.model_executor.ops.gpu.per_token_quant and per_token_quant_padding (e.g. fused_moe_triton_backend.py, tests/operators/test_per_token_quant.py). The compiled extension will then lack these ops and fail at runtime/test time. Suggestion: either restore the .cu file to sources and re-expose the pybind bindings in cpp_extensions.cc, or update all callers to the new quant wrapper/ops and delete the corresponding tests.
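If the first option is taken, a hedged sketch of the sources list could look like this; the surrounding list handling is an assumption, not the actual setup_ops.py structure:

```python
# Hypothetical sketch: keep per_token_quant_fp8.cu compiled so existing callers
# (e.g. fused_moe_triton_backend.py) and tests keep working; FP8 execution is
# still gated at runtime by the SM89+ capability checks added elsewhere in this PR.
sources = [
    "gpu_ops/cpp_extensions.cc",
    "gpu_ops/share_external_data.cu",
    "gpu_ops/fused_mask_swiglu_fp8_quant_kernel.cu",
    "gpu_ops/per_token_quant_fp8.cu",  # restored so the pybind symbol exists
    "gpu_ops/update_split_fuse_input.cu",
    "gpu_ops/text_image_index_out.cu",
]
# Alternative: drop FP8-only kernels entirely, but then every caller and test
# that imports per_token_quant must be updated in the same PR.
```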
  if current_platform.is_cuda():
-     try:
-         m_grouped_fp8_gemm_nt_contiguous = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_fp8_gemm_nt_contiguous
-         )
-         m_grouped_fp8_gemm_nt_masked = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_fp8_gemm_nt_masked
-         )
-     except:
-         m_grouped_fp8_gemm_nt_contiguous = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous
-         )
-         m_grouped_fp8_gemm_nt_masked = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked
-         )
+     if get_sm_version() == 100:
+         paddle.compat.enable_torch_proxy(scope={"deep_gemm"})
+         from deep_gemm import (
+             m_grouped_fp8_gemm_nt_contiguous,
+             m_grouped_fp8_gemm_nt_masked,
+         )
+     else:
+         from fastdeploy.model_executor.ops.gpu.deep_gemm import (
+             m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as m_grouped_fp8_gemm_nt_contiguous,
+         )
+         from fastdeploy.model_executor.ops.gpu.deep_gemm import (
+             m_grouped_gemm_fp8_fp8_bf16_nt_masked as m_grouped_fp8_gemm_nt_masked,
+         )
Copilot AI (Feb 9, 2026):
When current_platform.is_cuda() and SM != 100, fused_moe_deepgemm_backend.py unconditionally imports symbols from fastdeploy.model_executor.ops.gpu.deep_gemm. If deep_gemm is only generated on SM89+ or behind specific build switches, SM70/SM80 machines will raise ImportError at module import time, so the process cannot even start despite the runtime backend fallback. Suggest making the import conditional on SM version/build macros (e.g. only import on SM >= 89), or keeping a try/except and raising a clear NotImplementedError only when the DeepGemm backend is actually selected.
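A hedged sketch of that suggestion; get_sm_version(), paddle.compat.enable_torch_proxy, and the module paths come from the diff above, while the SM >= 89 gating threshold is an assumption:

```python
# Hypothetical sketch: import deep_gemm symbols only where they can exist,
# and keep a clear failure mode everywhere else.
m_grouped_fp8_gemm_nt_contiguous = None
m_grouped_fp8_gemm_nt_masked = None

if current_platform.is_cuda():
    sm = get_sm_version()
    if sm == 100:
        paddle.compat.enable_torch_proxy(scope={"deep_gemm"})
        from deep_gemm import (
            m_grouped_fp8_gemm_nt_contiguous,
            m_grouped_fp8_gemm_nt_masked,
        )
    elif sm >= 89:
        from fastdeploy.model_executor.ops.gpu.deep_gemm import (
            m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as m_grouped_fp8_gemm_nt_contiguous,
            m_grouped_gemm_fp8_fp8_bf16_nt_masked as m_grouped_fp8_gemm_nt_masked,
        )
    # On SM70/SM80 both symbols stay None; selecting the DeepGEMM backend there
    # should raise NotImplementedError instead of failing at module import.
```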
Copilot AI (Feb 9, 2026):
In block_wise_fp8.py, fp8_gemm_nt stays None when SM >= 89 but deep_gemm is not compiled or fails to import, yet deep_gemm_fp8_gemm_nt() still calls fp8_gemm_nt((x, scale), …) unconditionally. This triggers a "'NoneType' is not callable" at runtime with an unhelpful message. Suggest explicitly validating that fp8_gemm_nt is available before registering/calling the op, combining the SM version and build switches to give a clear error or an automatic fallback.
# Validate fp8_gemm_nt availability when deep_gemm is explicitly enabled.
# This avoids confusing 'NoneType is not callable' errors at runtime.
if current_platform.is_cuda():
    _sm_version_check = get_sm_version()
    if _sm_version_check >= 89 and bool(envs.FD_USE_DEEP_GEMM) and fp8_gemm_nt is None:
        raise RuntimeError(
            "FD_USE_DEEP_GEMM is enabled on CUDA SM"
            f"{_sm_version_check}, but fp8_gemm_nt could not be imported. "
            "Please ensure deep_gemm is compiled for this architecture or "
            "disable FD_USE_DEEP_GEMM to fall back to a non-deep_gemm backend."
        )
            moe_expert_dispatch,
            moe_expert_reduce,
        )
    except ImportError:
Copilot AI (Feb 9, 2026):
'except' clause does nothing but pass and there is no explanatory comment.
- Move W8A8 quantization, MoE GEMM, and related features from cc >= 75/80 to cc >= 70
- Add a --skip-fp8 flag to auto_gen_template_instantiation.py for SM70 compatibility
- Keep append_attention, MLA, gptq_marlin_repack, winx_unzip at cc >= 80 (they require cp.async/ldmatrix)
- Tested: FFN/MoE tests pass, quantization (non-FP8) tests pass on V100
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
    from fastdeploy.model_executor.ops.gpu import (
        append_attention_with_output as append_attention_with_output_gpu,
    )
except ImportError:
What attention scheme is used on SM70? Has it been run end to end?
Currently Append Attention is skipped entirely; we still need to work out an implementation plan for it on SM70.
Motivation
Add NVIDIA V100 GPU (SM70 architecture) support to FastDeploy so that development and testing can be done on older GPUs.
V100 (Volta architecture) does not support SM80+/SM89+ features such as BF16, FP8, and cp.async, so both the build system and the runtime logic need to be adapted.
Modifications
Build system
- setup_ops.py: support SM70+ compilation; split SM70-specific and SM80+-specific sources
- cpp_extensions.cc: add ENABLE_APPEND_ATTENTION and ENABLE_BF16 macros for conditional compilation

CUDA kernels
- gelu_tanh.cu: fix the tanh.approx.f32 PTX instruction compile failure on SM70; add MetaX compatibility
- moe_wna16_marlin_*.cu/h: fix Marlin GEMM template compile compatibility on SM70
- moe_deepgemm_depermute.cu: add SM70/SM75 conditional compilation; BF16 arithmetic is implemented via float conversion
- sampling.cuh: add the missing <limits> header

Python runtime
- fastdeploy/platforms/cuda.py: add get_sm_version() and capability checks (supports_bf16(), supports_fp8(), supports_async_copy(), supports_marlin())
- fastdeploy/config.py: automatic BF16 → FP16 dtype downgrade
- fastdeploy/model_executor/layers/moe/moe.py: automatic MoE backend fallback to cutlass on SM<80
- fastdeploy/model_executor/layers/quantization/__init__.py: downgrade FP8 quant methods on unsupported hardware (block_wise_fp8 → wint8, w4afp8 → wint4)
- attention/ops/*.py: add try/except protection for SM80+-only ops

Tests
- test_attention_layer.py: add an FP8 SM89+ skip decorator
- test_fusedmoe.py: add an FP8 SM89+ skip decorator
- test_w4afp8.py: add an FP8 SM89+ skip decorator
- test_ffn.py: automatically select dtype and quantization config by SM version

SM70/SM75 Fallback Strategy
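Below is a minimal sketch of how the capability checks and fallbacks listed above fit together. The helper names come from fastdeploy/platforms/cuda.py in this PR; the thresholds and downgrade mappings shown are a simplified summary under that assumption, not the exact implementation:

```python
# Hypothetical summary sketch of the SM70/SM75 fallback strategy.
import paddle


def get_sm_version() -> int:
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor


def supports_bf16() -> bool:
    return get_sm_version() >= 80  # Ampere and newer


def supports_fp8() -> bool:
    return get_sm_version() >= 89  # Ada/Hopper and newer


def pick_dtype(requested: str) -> str:
    # fastdeploy/config.py: BF16 is downgraded to FP16 when unsupported.
    if requested == "bfloat16" and not supports_bf16():
        return "float16"
    return requested


def pick_quant_method(requested: str) -> str:
    # quantization/__init__.py: FP8 methods degrade to weight-only int on old GPUs.
    if not supports_fp8():
        return {"block_wise_fp8": "wint8", "w4afp8": "wint4"}.get(requested, requested)
    return requested


def pick_moe_backend(requested: str) -> str:
    # moe.py / weight_only.py: marlin and triton fall back to cutlass below SM80.
    if requested in ("marlin", "triton") and get_sm_version() < 80:
        return "cutlass"
    return requested
```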
Usage or Command
Accuracy Tests
This PR is a hardware-compatibility change and does not affect model computation logic. On V100 it automatically falls back to compatible paths such as FP16/CUTLASS, and accuracy matches those paths.
Checklist
- Run pre-commit before commit.
- If the change targets a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.