[Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 #6963
mpgemm wants to merge 7 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
```python
layer.up_gate_proj_weight.set_value(paddle.concat([b, a], axis=1))
[a, b] = layer.up_gate_proj_weight_scale.split(2, axis=1)
layer.up_gate_proj_weight_scale.set_value(paddle.concat([b, a], axis=1))
# [a, b] = layer.up_gate_proj_weight.split(2, axis=1)
```
FlashInfer CUTLASS expects the input in [up, gate] order, but the checkpoint stores [gate, up], so a swap is needed there. However, cutedsl's grouped_gemm_nt_masked reads the weights directly in [gate, up] (the original checkpoint order), which matches the ordering of its internal silu_and_mul, so no swap is required.
What about the cutlass case? The swap should still be needed there; when we verified the qwen3-30b-a3b checkpoint, it required the swap. Maybe add an if check?
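The swap being discussed can be illustrated with plain NumPy (a minimal sketch; the real code operates on paddle tensors, and the shapes here are made up for illustration):

```python
import numpy as np

# Hypothetical shapes; the real tensors are fused up_gate projection weights.
hidden, inter = 4, 6
gate = np.zeros((hidden, inter))  # stand-in for the gate half
up = np.ones((hidden, inter))     # stand-in for the up half

# The checkpoint stores the fused weight as [gate, up] along axis=1.
w = np.concatenate([gate, up], axis=1)

# The flashinfer-cutlass path expects [up, gate], so split in two and
# re-concatenate in swapped order -- the same split/concat pattern as the
# diff above.
a, b = np.split(w, 2, axis=1)
w_swapped = np.concatenate([b, a], axis=1)

assert (w_swapped[:, :inter] == up).all()    # up half now comes first
assert (w_swapped[:, inter:] == gate).all()  # gate half now comes second
```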
```python
# flashinfer-trtllm
return output
return paddle.empty_like(x)
```
Both of the conditions above already return; what is this return for?
It returns when the backend is neither flashinfer-cutlass nor flashinfer-cutedsl.
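The control flow in question can be sketched as follows (hedged: the kernel functions are stand-ins, not the real FastDeploy APIs; only the backend names and the trailing empty-tensor return come from the diff):

```python
import numpy as np

def run_cutlass(x):
    # Stand-in for the flashinfer-cutlass MoE kernel (hypothetical).
    return x + 1

def run_cutedsl(x):
    # Stand-in for the flashinfer-cutedsl MoE kernel (hypothetical).
    return x + 2

def apply_moe(backend, x):
    # The two supported backends each return explicitly; any other
    # backend falls through to an uninitialized placeholder of the
    # same shape, which is what the final return in the diff does.
    if backend == "flashinfer-cutlass":
        return run_cutlass(x)
    if backend == "flashinfer-cutedsl":
        return run_cutedsl(x)
    return np.empty_like(x)
```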
```python
layer.up_gate_proj_weight.set_value(paddle.concat([b, a], axis=1))
[a, b] = layer.up_gate_proj_weight_scale.split(2, axis=1)
layer.up_gate_proj_weight_scale.set_value(paddle.concat([b, a], axis=1))
if self.backend != "flashinfer-cutedsl":
```
Make this explicitly flashinfer-cutlass; other backends may be added later.
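The suggestion amounts to naming the backend that needs the swap, rather than excluding the one that does not. A hedged sketch (the helper name is ours, not in the PR):

```python
def needs_up_gate_swap(backend: str) -> bool:
    # Original condition: backend != "flashinfer-cutedsl"
    # (any new backend would silently get the swap).
    # Suggested condition: name the backend that needs the swap,
    # so future backends default to no swap.
    return backend == "flashinfer-cutlass"
```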
Codecov Report

❌ Patch coverage is

```
@@           Coverage Diff            @@
##           develop     #6963  +/-  ##
=========================================
  Coverage         ?    73.44%
=========================================
  Files            ?       399
  Lines            ?     56002
  Branches         ?      8846
=========================================
  Hits             ?     41131
  Misses           ?     11951
  Partials         ?      2920
```
Motivation

FastDeploy integrates the flashinfer-cutedsl NVFP4 grouped masked GEMM, supporting apply_ep_prefill and apply_ep_decode.
Modifications

The changes fall into two parts: working around format incompatibilities between the imported flashinfer-cutedsl block-scaled GEMM and Paddle, and integrating it into the FastDeploy framework.

To resolve the three Paddle incompatibilities, patch nvidia-dsl and flashinfer under miniconda/envs/yourname/lib/python3.10/site-packages/:

1. In nvidia_cutlass_dsl/python_packages/cutlass/torch.py, change torch.device to "torch.device" (Ctrl+F search and replace).
2. In flashinfer/utils.py, change get_compute_capability to:

```python
@functools.cache
def get_compute_capability(device: torch.device) -> Tuple[int, int]:
    return torch.cuda.get_device_capability(device)
    # The original body below becomes unreachable:
    if device.type != "cuda":
        raise ValueError("device must be a cuda device")
    return torch.cuda.get_device_capability(device.index)
```
Note: if you run into device-related problems, replacing A.place with A.device resolves most of them.
3. In flashinfer/cute_dsl/blockscaled_gemm.py, first add `import cuda.bindings.driver as cuda`, then replace cutlass_torch.current_stream() with cuda.CUstream(torch.cuda.current_stream().stream_base.raw_stream) (Ctrl+F search and replace).
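The manual edit in step 3 can also be scripted. A hedged sketch (the patch helper is ours; only the import line and the two replacement strings come from the instructions above):

```python
from pathlib import Path
import tempfile

OLD = "cutlass_torch.current_stream()"
NEW = "cuda.CUstream(torch.cuda.current_stream().stream_base.raw_stream)"
IMPORT = "import cuda.bindings.driver as cuda"

def patch_blockscaled_gemm(path: Path) -> str:
    """Prepend the cuda bindings import and swap the stream call."""
    text = path.read_text()
    if IMPORT not in text:
        text = IMPORT + "\n" + text
    text = text.replace(OLD, NEW)
    path.write_text(text)
    return text

# Demonstrate on a throwaway file rather than the real site-packages copy.
with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "blockscaled_gemm.py"
    f.write_text(f"stream = {OLD}\n")
    patched = patch_blockscaled_gemm(f)
```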
On the FastDeploy side, nvfp4.py was modified; apply_ep_prefill and apply_ep_decode are now supported. One remaining issue: the call_depermute_prefill_combine op only supports top-k = 4 or 8, while eb-45-fp4 uses top-k = 6, so for top-k values other than 4 or 8, prefill falls back to a Python implementation with very low performance. Two utils.py files were also modified to load weights correctly.
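The top-k restriction described above can be expressed as a simple dispatch (a hedged sketch; the function and path names are ours, only the supported values and the fallback behavior come from the text):

```python
def prefill_path(top_k: int) -> str:
    # call_depermute_prefill_combine only handles top-k of 4 or 8;
    # other values (e.g. eb-45-fp4's top-k=6) fall back to the slow
    # Python implementation.
    return "fused_op" if top_k in (4, 8) else "python_fallback"
```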
A unit test was added.
Usage or Command

End-to-end test script:

```shell
export PYTHONPATH="/root/paddlejob/workspace/output/dcc/FastDeploy":$PYTHONPATH
export MODEL_PATH="/raid0/ERNIE-4.5-21B-A3B-FP4"
export FD_USE_PFCC_DEEP_EP=1
export FD_MOE_BACKEND="flashinfer-cutedsl"
export CUDA_VISIBLE_DEVICES=4,5,6,7
python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports "9811,9812,9813,9814" --num-servers 4 --args \
    --model "$MODEL_PATH" --ep-prefill-use-worst-num-tokens \
    --disable-custom-all-reduce --tensor-parallel-size 1 \
    --data-parallel-size 4 --no-enable-prefix-caching \
    --max-model-len 65536 --enable-expert-parallel \
    --num-gpu-blocks-override 2048 --max-num-seqs 4 \
    --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 \
    --graph-optimization-config '{"use_cudagraph":false}'
```

Accuracy Tests
Unit test script:

```shell
export PYTHONPATH="/root/paddlejob/workspace/output/dcc/FastDeploy":$PYTHONPATH
export FD_MOE_BACKEND="flashinfer-cutedsl"
export FD_USE_PFCC_DEEP_EP=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
NVFP4_TEST_MODE=decode NVFP4_TEST_ITERS=2 python -m paddle.distributed.launch \
    --gpus 0,1,2,3,4,5,6,7 FastDeploy/tests/layers/test_nvfp4_fusedmoe.py
NVFP4_TEST_MODE=prefill NVFP4_TEST_ITERS=2 python -m paddle.distributed.launch \
    --gpus 0,1,2,3,4,5,6,7 FastDeploy/tests/layers/test_nvfp4_fusedmoe.py
```

Checklist
- Add at least one PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Format your code; run pre-commit before commit.
- If submitting to the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.