Conversation

@mattheliu (Collaborator) commented Feb 2, 2026

Motivation

Add support for NVIDIA V100 GPUs (SM70 architecture) to FastDeploy so that development and testing can be done on older GPUs.

V100 (Volta) does not support the features below, so both the build system and the runtime logic need to be adapted:

| Feature | Minimum requirement | Supported on V100 |
| --- | --- | --- |
| BF16 data type | SM80+ (Ampere) | No |
| FP8 quantization | SM89+ (Ada Lovelace) | No |
| cp.async instruction | SM80+ (Ampere) | No |
| tanh.approx.f32 PTX | SM75+ (Turing) | No |

Modifications

Build system

  • setup_ops.py: support building for SM70+, separating SM70-specific code from SM80+-specific code
  • cpp_extensions.cc: add ENABLE_APPEND_ATTENTION and ENABLE_BF16 macros to control conditional compilation

CUDA kernels

  • gelu_tanh.cu: fix the tanh.approx.f32 PTX instruction so it compiles on SM70, and add MetaX compatibility
  • moe_wna16_marlin_*.cu/h: fix Marlin GEMM template compilation compatibility on SM70
  • moe_deepgemm_depermute.cu: add SM70/SM75 conditional compilation; BF16 arithmetic is implemented via float conversion
  • sampling.cuh: add the missing <limits> header

Python runtime

  • fastdeploy/platforms/cuda.py:

    • add an SM version detection method (get_sm_version())
    • add hardware capability checks (supports_bf16(), supports_fp8(), supports_async_copy(), supports_marlin())
    • automatic attention backend fallback (APPEND_ATTN/MLA_ATTN → FLASH_ATTN)
  • fastdeploy/config.py: automatic BF16 → FP16 dtype downgrade

  • fastdeploy/model_executor/layers/moe/moe.py:

    • Marlin MoE backend → CUTLASS fallback (SM < 80)
    • Triton MoE backend → CUTLASS fallback (SM < 80)
  • fastdeploy/model_executor/layers/quantization/__init__.py:

    • automatic fallback of FP8 quantization methods (block_wise_fp8 → wint8, w4afp8 → wint4)
  • attention/ops/*.py: wrap SM80+-only ops in try-except import guards (a capability-check sketch follows this list)
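
The capability checks above live on the CUDA platform class in fastdeploy/platforms/cuda.py; the snippet below is only a minimal standalone sketch of the same idea, with thresholds taken from the tables in this description (the free-function form and the use of paddle.device.cuda.get_device_capability() are illustrative assumptions, not the PR's exact code).

```python
import paddle


def get_sm_version() -> int:
    """Return the compute capability as an integer, e.g. 70 for V100."""
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor


def supports_bf16() -> bool:
    return get_sm_version() >= 80   # BF16 needs Ampere (SM80+)


def supports_fp8() -> bool:
    return get_sm_version() >= 89   # FP8 needs Ada Lovelace (SM89+)


def supports_async_copy() -> bool:
    return get_sm_version() >= 80   # cp.async needs SM80+


def supports_marlin() -> bool:
    return get_sm_version() >= 80   # Marlin GEMM kernels need SM80+
```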

Tests

  • test_attention_layer.py: add an FP8 SM89+ skip decorator
  • test_fusedmoe.py: add an FP8 SM89+ skip decorator
  • test_w4afp8.py: add an FP8 SM89+ skip decorator
  • test_ffn.py: automatically select dtype and quantization config based on the SM version (a skip-decorator sketch follows this list)
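
For reference, a minimal sketch of the SM89+ skip pattern used by the FP8 tests (the decorator name and test class shown here are illustrative; the actual decorators live in the individual test files):

```python
import unittest

import paddle


def _sm_version() -> int:
    if not paddle.is_compiled_with_cuda():
        return 0
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor


# Skip FP8 cases on anything older than Ada Lovelace (SM89), e.g. V100 (SM70).
skip_if_no_fp8 = unittest.skipIf(_sm_version() < 89, "FP8 requires SM89+")


@skip_if_no_fp8
class TestW4AFP8(unittest.TestCase):
    def test_gemm(self):
        ...  # FP8 GEMM checks would go here
```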

SM70/SM75 fallback strategy

| Feature | Original | SM70/SM75 fallback | Reason |
| --- | --- | --- | --- |
| Data type | BF16 | FP16 | BF16 requires SM80+ |
| Attention backend | APPEND_ATTN | FLASH_ATTN | cp.async requires SM80+ |
| Attention backend | MLA_ATTN | FLASH_ATTN | cp.async requires SM80+ |
| MoE backend | Marlin | CUTLASS | Marlin requires SM80+ |
| MoE backend | Triton | CUTLASS | tritonmoe_preprocess requires SM80+ |
| Quantization | block_wise_fp8 | wint8 | FP8 requires SM89+ |
| Quantization | w4afp8 | wint4 | FP8 requires SM89+ |
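
A compact sketch of the decision logic this table describes (illustrative only; the real logic is split across fastdeploy/config.py, moe.py, and quantization/__init__.py, and the function below is not part of the PR):

```python
def resolve_for_sm(sm: int, dtype: str, attn: str, moe: str, quant: str):
    """Map requested settings to what the detected SM version can actually run."""
    if sm < 80:
        if dtype == "bfloat16":
            dtype = "float16"                  # BF16 needs SM80+
        if attn in ("APPEND_ATTN", "MLA_ATTN"):
            attn = "FLASH_ATTN"                # cp.async needs SM80+
        if moe in ("marlin", "triton"):
            moe = "cutlass"                    # Marlin / tritonmoe_preprocess need SM80+
    if sm < 89:
        # FP8 needs SM89+
        quant = {"block_wise_fp8": "wint8", "w4afp8": "wint4"}.get(quant, quant)
    return dtype, attn, moe, quant


# On V100: ('float16', 'FLASH_ATTN', 'cutlass', 'wint8')
print(resolve_for_sm(70, "bfloat16", "APPEND_ATTN", "marlin", "block_wise_fp8"))
```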

Usage or Command

# Build (targeting the SM70 architecture)
MAX_JOBS=8 bash build.sh 1 python false [70]

# Or use setup_ops.py
cd custom_ops && python setup_ops.py install
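
Optionally, a quick sanity check after the build (a sketch assuming a CUDA-enabled Paddle install; the op module path follows the imports used elsewhere in this PR):

```python
import paddle

major, minor = paddle.device.cuda.get_device_capability()
print(f"Compute capability: SM{major}{minor}")  # expect SM70 on V100

# The custom ops module should import without errors once setup_ops.py has run.
from fastdeploy.model_executor.ops import gpu as fd_gpu_ops
print("custom GPU ops module loaded:", fd_gpu_ops.__name__)
```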

Accuracy Tests

This PR only adds hardware compatibility support and does not change the model computation logic. On V100 the code automatically falls back to compatible alternatives such as FP16 and CUTLASS, so accuracy matches those alternatives.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
    • Skip decorators based on SM version detection have been added so that tests are skipped correctly on unsupported hardware.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot (bot) commented Feb 2, 2026

Thanks for your contribution!

@mattheliu (Collaborator, Author) commented:
/Re-run failed jobs

@mattheliu mattheliu marked this pull request as ready for review February 4, 2026 05:47
codecov-commenter commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 40.09434% with 127 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d60daca). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...loy/model_executor/layers/quantization/__init__.py | 28.57% | 23 Missing and 2 partials ⚠️ |
| ...oy/model_executor/layers/quantization/mix_quant.py | 25.00% | 20 Missing and 1 partial ⚠️ |
| ..._executor/layers/moe/fused_moe_deepgemm_backend.py | 18.18% | 17 Missing and 1 partial ⚠️ |
| fastdeploy/platforms/cuda.py | 68.29% | 11 Missing and 2 partials ⚠️ |
| .../model_executor/layers/quantization/weight_only.py | 21.42% | 7 Missing and 4 partials ⚠️ |
| ...del_executor/layers/quantization/block_wise_fp8.py | 42.85% | 6 Missing and 2 partials ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 22.22% | 4 Missing and 3 partials ⚠️ |
| ...l_executor/layers/moe/fused_moe_cutlass_backend.py | 40.00% | 6 Missing ⚠️ |
| fastdeploy/config.py | 62.50% | 1 Missing and 2 partials ⚠️ |
| ...cutor/layers/attention/ops/flash_mask_attention.py | 0.00% | 3 Missing ⚠️ |
| ... and 5 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6306   +/-   ##
==========================================
  Coverage           ?   68.35%           
==========================================
  Files              ?      391           
  Lines              ?    52702           
  Branches           ?     8229           
==========================================
  Hits               ?    36025           
  Misses             ?    14050           
  Partials           ?     2627           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 68.35% <40.09%> (?) |



Support FP16 inference on V100 by adding SM70 compilation flags, disabling BF16/FP8 quantization, and graceful fallback for SM80+ only ops.
mattheliu and others added 2 commits February 9, 2026 13:03
The tritonmoe_preprocess_func and count_tokens_per_expert_func do not
depend on BF16 features and should be available on all GPU architectures.
Moving them out of the #ifdef ENABLE_BF16 block fixes the CI failure
where ernie-21b model failed to load due to missing tritonmoe_preprocess_func.

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
…upport

- Restore #ifndef PADDLE_WITH_CUSTOM_DEVICE_METAX_GPU guard for tanh_ptx
  function since MetaX compiler doesn't support NVIDIA PTX asm syntax
- Restore MetaX-specific kernel launch using standard <<<>>> syntax
  instead of cudaLaunchKernelEx which is not supported on MetaX
- Keep SM70 (V100) support by using tanhf() fallback when __CUDA_ARCH__ < 750

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…e#6358

PR PaddlePaddle#6358 removed speculate_get_output_padding_offset.cu and reuses
non-MTP get_output_padding_offset. Keep cpp_extensions.cc consistent
with upstream develop.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…6358

PR PaddlePaddle#6358 renamed parameters and changed signature:
- output_padding_offset -> batch_id_per_token_output
- Added cu_seqlens_q_output parameter
- Removed max_input_length parameter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add get_attn_mask_q function that was lost during merge conflict resolution.
This function was introduced in PR PaddlePaddle#6354 and is required for FA2/FA3/FA4
with attn_mask_q support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds compatibility support for NVIDIA V100 (SM70/Volta) GPUs to FastDeploy. The build system and runtime apply conditional compilation and automatic fallback based on SM architecture capability, so that SM80+/SM89+ features such as BF16/FP8/cp.async do not cause build or runtime failures on older GPUs.

Changes:

  • Build and operators: extend the custom_ops build logic to cover SM70 and adapt several CUDA kernels per SM version (e.g. tanh.approx.f32, BF16 arithmetic compatibility, missing headers).
  • Runtime: add CUDA SM capability detection and capability-based fallback (dtype, attention backend, MoE backend, FP8 quant methods).
  • Tests: add SM89+ skip logic to FP8-related tests and auto-select dtype/quantization config by SM version in some cases.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 7 comments.

Summary per file:
| File | Description |
| --- | --- |
| tests/quantization/test_w4afp8.py | Add FP8 (SM89+) capability detection and skip the affected cases. |
| tests/layers/test_fusedmoe.py | Add an SM89+ skip to the FP8 quantization test classes. |
| tests/layers/test_ffn.py | Switch the default dtype/quantization config by SM version to support SM70. |
| tests/layers/test_attention_layer.py | Add an SM89+ skip to the FP8 quantization tests. |
| fastdeploy/platforms/cuda.py | Add SM version lookup and BF16/FP8/cp.async/Marlin support checks, plus automatic attention backend fallback. |
| fastdeploy/model_executor/layers/quantization/weight_only.py | Automatically fall back the MoE backend (marlin/triton) to cutlass by SM version. |
| fastdeploy/model_executor/layers/quantization/mix_quant.py | Automatically downgrade FP8 quant types on unsupported hardware (fall back to wint8/wint4). |
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Restrict the deep_gemm import to SM89+ and improve version compatibility of fp8_quant_blockwise. |
| fastdeploy/model_executor/layers/quantization/__init__.py | Globally adjust/downgrade FP8-related quant methods in parse_quant_config based on hardware capability. |
| fastdeploy/model_executor/layers/moe/moe.py | Automatically fall back the MoE backend to cutlass on SM70/SM < 80. |
| fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py | Wrap deep_gemm imports and fp8_quant_blockwise parameters for compatibility. |
| fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py | Add try/except around some op imports to tolerate older architectures. |
| fastdeploy/model_executor/layers/attention/ops/pre_cache_len_concat.py | Add try/except around the CUDA op import; raise NotImplementedError when unavailable. |
| fastdeploy/model_executor/layers/attention/ops/gqa_rope_write_cache.py | Same as above: raise NotImplementedError when the CUDA op is unavailable. |
| fastdeploy/model_executor/layers/attention/ops/get_block_shape_and_split_kv_block.py | Guard missing append_attn CUDA ops and raise NotImplementedError when unavailable. |
| fastdeploy/model_executor/layers/attention/ops/flash_mask_attention.py | Add try/except around the CUDA op import; raise NotImplementedError when unavailable. |
| fastdeploy/model_executor/layers/attention/ops/append_attention.py | Guard the append_attention CUDA op import by architecture availability with an explicit error message. |
| fastdeploy/model_executor/layers/attention/mla_attention_backend.py | Make MLA CUDA op imports fault-tolerant (ImportError no longer breaks module import). |
| fastdeploy/config.py | Automatically downgrade BF16 to FP16 in config post-init based on hardware capability. |
| custom_ops/setup_ops.py | Extend build sources/flags to cover SM70 and adjust compilation selection for append_attention/marlin/moe/spec_decode. |
| custom_ops/gpu_ops/sample_kernels/sampling.cuh | Add the <limits> header. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_utils/marlin_template.h | Adjust marlin template/macro parameters for SM70 compatibility and readability. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_utils/kernel.h | Fix macro indentation/formatting and tidy comment layout. |
| custom_ops/gpu_ops/moe/moe_wna16_marlin_gemm.cu | Provide explicit unsupported behavior for SM < 80 and adjust includes/formatting. |
| custom_ops/gpu_ops/moe/moe_deepgemm_depermute.cu | Add a float-conversion compatibility path for BF16 arithmetic on SM70/SM75. |
| custom_ops/gpu_ops/gelu_tanh.cu | Remove the tanh.approx.f32 dependency for SM70 and fix the block-dimension upper bound. |
| custom_ops/gpu_ops/cpp_extensions.cc | Control append_attention/BF16 exports via macros and adjust some speculate/reasoning interface parameter names. |
Comments suppressed due to low confidence (1)

custom_ops/gpu_ops/cpp_extensions.cc:1275

  • cpp_extensions.cc no longer exports per_token_quant / per_token_quant_padding to the fastdeploy_ops module, but the repository still contains calls to fastdeploy.model_executor.ops.gpu.per_token_quant(…) and unit tests covering it (e.g. fused_moe_triton_backend.py, tests/operators/test_per_token_quant.py). This will lead to ImportError/AttributeError and break the FP8 paths and the tests. Suggestion: restore these two bindings (and compile the corresponding CUDA source files), or update all callers and remove/update the related tests.
  m.def("fused_mask_swiglu_fp8_quant",
        &FusedMaskSwigluFP8Quant,
        py::arg("input"),
        py::arg("token_nums_per_expert"),
        py::arg("block_size"),
        py::arg("use_ue8m0") = false,
        "fused mask swiglu and fp8 quant");

Comment on lines +44 to +57
_sm_version = cuda_device.get_device_capability()[0]
print(f"[DEBUG] Detected SM version: {_sm_version}")
if _sm_version >= 8:
    paddle.set_default_dtype("bfloat16")
    _default_dtype = paddle.bfloat16
    # BlockWiseFP8Config requires bfloat16, only available on SM80+
    _quant_config = BlockWiseFP8Config(weight_block_size=[128, 128])
    print(f"[DEBUG] Using BlockWiseFP8Config for SM{_sm_version}0")
else:
    paddle.set_default_dtype("float16")
    _default_dtype = paddle.float16
    # V100 (SM70) doesn't support FP8 quantization, use None
    _quant_config = None
    print(f"[DEBUG] Disabling quantization for V100 (SM{_sm_version}0), _quant_config = None")
Copilot AI commented Feb 9, 2026:

test_ffn.py calls cuda_device.get_device_capability() / paddle.device.cuda.get_device_name() at module import time and prints several [DEBUG] lines. If the tests are collected in an environment without CUDA (or where CUDA is unavailable), collection will fail outright instead of the tests being skipped. Suggest moving the SM detection into setUp/test methods, combining paddle.is_compiled_with_cuda() with unittest.skipUnless/skipIf, and removing the prints or switching to a logger (disabled by default) to avoid polluting test output.
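
One possible shape for this suggestion (a sketch, not the PR's code; the class name mirrors test_ffn.py but is assumed):

```python
import unittest

import paddle


def _sm_version_or_none():
    # Only touch the CUDA runtime when a CUDA build with a visible device is present.
    if not (paddle.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0):
        return None
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor


@unittest.skipUnless(paddle.is_compiled_with_cuda(), "requires a CUDA build of Paddle")
class TestFFN(unittest.TestCase):
    def setUp(self):
        sm = _sm_version_or_none()
        # Pick BF16 on SM80+, FP16 without quantization otherwise.
        if sm is not None and sm >= 80:
            paddle.set_default_dtype("bfloat16")
        else:
            paddle.set_default_dtype("float16")
```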

Comment on lines 94 to 98
  DISPATCH_FLOAT_FP6_DTYPE(input.dtype(), scalar_t, {
    uint32_t vec_size = 16 / sizeof(scalar_t);
    dim3 grid(num_tokens);
-   dim3 block(std::max(d / vec_size, 1024U));
+   dim3 block(std::min(d / vec_size, 1024U));

Copilot AI commented Feb 9, 2026:

The kernel launch in gelu_tanh.cu uses std::min(d / vec_size, 1024U) for the block dimension. When d < vec_size, d / vec_size == 0, which produces an invalid launch configuration with blockDim = 0 and fails at runtime. Suggest clamping the lower bound (to at least 1, or to the warp size / a minimum parallelism if desired).

Comment on lines +30 to +41
  # These ops may not be available on older GPU architectures (V100/SM70)
  moe_expert_dispatch = None
  moe_expert_reduce = None

  if current_platform.is_cuda():
-     from fastdeploy.model_executor.ops.gpu import moe_expert_dispatch, moe_expert_reduce
+     try:
+         from fastdeploy.model_executor.ops.gpu import (
+             moe_expert_dispatch,
+             moe_expert_reduce,
+         )
+     except ImportError:
+         pass
Copilot AI commented Feb 9, 2026:

In fused_moe_cutlass_backend.py, moe_expert_dispatch/moe_expert_reduce are set to None on ImportError, but apply_tp still calls them unconditionally, which eventually fails with a hard-to-diagnose "NoneType is not callable". Suggest either raising a NotImplementedError with architecture information right where the import fails when these ops are unavailable on the current architecture/build, or checking explicitly before the call and falling back to an available backend.
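
A small sketch of the guard the reviewer suggests (an illustrative helper, not existing code):

```python
def _require_op(op, name: str):
    """Fail loudly with architecture context instead of calling None later."""
    if op is None:
        raise NotImplementedError(
            f"{name} is not available in this build; it requires SM80+ and is "
            "not compiled for V100 (SM70)."
        )
    return op


# e.g. at the top of apply_tp:
# dispatch_op = _require_op(moe_expert_dispatch, "moe_expert_dispatch")
```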

Comment on lines 296 to 300
"gpu_ops/cpp_extensions.cc",
"gpu_ops/share_external_data.cu",
"gpu_ops/fused_mask_swiglu_fp8_quant_kernel.cu",
"gpu_ops/per_token_quant_fp8.cu",
"gpu_ops/update_split_fuse_input.cu",
"gpu_ops/text_image_index_out.cu",
Copilot AI commented Feb 9, 2026:

gpu_ops/per_token_quant_fp8.cu was removed from the CUDA sources list in setup_ops.py, but the repository still has Python code and unit tests importing/calling fastdeploy.model_executor.ops.gpu.per_token_quant and per_token_quant_padding (e.g. fused_moe_triton_backend.py, tests/operators/test_per_token_quant.py). The build output will then be missing those ops and fail at run/test time. Suggest either restoring the .cu file to the sources and re-exposing the pybind in cpp_extensions.cc, or updating all callers to the new quant wrapper/ops and deleting the corresponding unit tests.

Comment on lines 35 to 48
  if current_platform.is_cuda():
-     try:
-         m_grouped_fp8_gemm_nt_contiguous = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_fp8_gemm_nt_contiguous
-         )
-         m_grouped_fp8_gemm_nt_masked = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_fp8_gemm_nt_masked
-         )
-     except:
-         m_grouped_fp8_gemm_nt_contiguous = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous
-         )
-         m_grouped_fp8_gemm_nt_masked = (
-             fastdeploy.model_executor.layers.quantization.fp8_utils.deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked
-         )
+     if get_sm_version() == 100:
+         paddle.compat.enable_torch_proxy(scope={"deep_gemm"})
+         from deep_gemm import (
+             m_grouped_fp8_gemm_nt_contiguous,
+             m_grouped_fp8_gemm_nt_masked,
+         )
+     else:
+         from fastdeploy.model_executor.ops.gpu.deep_gemm import (
+             m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as m_grouped_fp8_gemm_nt_contiguous,
+         )
+         from fastdeploy.model_executor.ops.gpu.deep_gemm import (
+             m_grouped_gemm_fp8_fp8_bf16_nt_masked as m_grouped_fp8_gemm_nt_masked,
+         )
Copilot AI commented Feb 9, 2026:

When current_platform.is_cuda() is true and SM != 100, fused_moe_deepgemm_backend.py unconditionally imports symbols from fastdeploy.model_executor.ops.gpu.deep_gemm. If deep_gemm is only generated on SM89+ or under specific compile switches, machines such as SM70/SM80 will raise ImportError at module import time, so the process cannot start even though the runtime backend fallback is already in place. Suggest making the import conditional on SM/compile macros (e.g. only import for SM >= 89), or keeping a try/except and raising a clear NotImplementedError only when the DeepGemm backend is actually selected.
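
A sketch of the fault-tolerant variant the reviewer describes (the alias names follow the diff above; the deferred-error helper is an assumption, not the PR's code):

```python
_deep_gemm_import_error = None
try:
    from fastdeploy.model_executor.ops.gpu.deep_gemm import (
        m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as m_grouped_fp8_gemm_nt_contiguous,
        m_grouped_gemm_fp8_fp8_bf16_nt_masked as m_grouped_fp8_gemm_nt_masked,
    )
except ImportError as err:  # deep_gemm may not be built on SM70/SM80
    m_grouped_fp8_gemm_nt_contiguous = None
    m_grouped_fp8_gemm_nt_masked = None
    _deep_gemm_import_error = err


def _ensure_deep_gemm_available():
    """Call this only when the DeepGemm MoE backend is actually selected."""
    if m_grouped_fp8_gemm_nt_contiguous is None:
        raise NotImplementedError(
            f"DeepGemm backend unavailable on this build/architecture: {_deep_gemm_import_error}"
        )
```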

Comment on lines 62 to 63


Copilot AI commented Feb 9, 2026:

In block_wise_fp8.py, fp8_gemm_nt stays None when SM >= 89 but deep_gemm is not compiled or fails to import, yet deep_gemm_fp8_gemm_nt() still calls fp8_gemm_nt((x, scale), …) unconditionally. This triggers "NoneType is not callable" at runtime with an unhelpful message. Suggest explicitly validating that fp8_gemm_nt is available before registering the op/calling it (and combining the SM version with the compile switches to raise a clear error or fall back automatically).

Suggested change

# Validate fp8_gemm_nt availability when deep_gemm is explicitly enabled.
# This avoids confusing 'NoneType is not callable' errors at runtime.
if current_platform.is_cuda():
    _sm_version_check = get_sm_version()
    if _sm_version_check >= 89 and bool(envs.FD_USE_DEEP_GEMM) and fp8_gemm_nt is None:
        raise RuntimeError(
            "FD_USE_DEEP_GEMM is enabled on CUDA SM"
            f"{_sm_version_check}, but fp8_gemm_nt could not be imported. "
            "Please ensure deep_gemm is compiled for this architecture or "
            "disable FD_USE_DEEP_GEMM to fall back to a non-deep_gemm backend."
        )

            moe_expert_dispatch,
            moe_expert_reduce,
        )
    except ImportError:
Copilot AI commented Feb 9, 2026:

'except' clause does nothing but pass and there is no explanatory comment.

- Move W8A8 quantization, MOE GEMM, and related features from cc >= 75/80 to cc >= 70
- Add --skip-fp8 flag to auto_gen_template_instantiation.py for SM70 compatibility
- Keep append_attention, MLA, gptq_marlin_repack, winx_unzip at cc >= 80 (require cp.async/ldmatrix)
- Tested: FFN/MOE tests pass, quantization (non-FP8) tests pass on V100

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
    from fastdeploy.model_executor.ops.gpu import (
        append_attention_with_output as append_attention_with_output_gpu,
    )
except ImportError:
A Collaborator commented:

What attention approach is used on SM70? Has it been run end to end?

mattheliu (Collaborator, Author) replied:

At the moment Append Attention is skipped entirely; the implementation plan for SM70 still needs to be worked out.
