
[Model Runner] Support overlap schedule #6259

Merged
zhoutianzi666 merged 11 commits into PaddlePaddle:develop from Sunny-bot1:overlap on Feb 4, 2026


@Sunny-bot1 Sunny-bot1 commented Jan 28, 2026

Motivation

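There are currently large time gaps between FD steps. The main bottleneck is the synchronous DtoH copies introduced in the pre-processing and post-processing stages. These synchronous operations block CPU execution of the upper-level scheduling thread, which delays the launch of subsequent kernels and forces GPU computation and scheduling to run serially.

The goal of this PR and the related changes is to introduce GPU asynchronous scheduling. By using asynchronous copies and event-based synchronization, unnecessary hard CPU-GPU synchronization points are removed, so that model execution runs in parallel with upper-level scheduling and token write-back. This shortens the gap between FD steps and improves overall inference throughput.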


Modifications

Related PRs:

  1. exist_prefill: under the v1 scheduler, tracked with a flag variable: [Model Runner] Add exist_prefill_flag #6172
  2. Move FA3 preprocessing into the CUDA graph: [Model Runner] Prepare token count and move FA3 initialization into the graph #6170
  3. Split execute_model; not_need_stop and sampled_token_ids are copied asynchronously in the _postprocess stage and synchronized with a CUDA event in the save_output stage: [Model Runner] Refactor execute_model for GPU async scheduling #6176
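
The pattern in item 3 (start the copy in _postprocess, wait on an event only in save_output) can be illustrated with a CPU-only analogy. This is a rough sketch, not the FastDeploy implementation: `FakeAsyncCopy`, `postprocess`, `save_output`, and `run_steps` are hypothetical names, a Python thread stands in for the asynchronous DtoH copy, and `threading.Event` stands in for the CUDA event.

```python
import threading
import time


class FakeAsyncCopy:
    """CPU-only stand-in for an async device-to-host copy guarded by an event."""

    def __init__(self, value, delay_s=0.01):
        self._done = threading.Event()  # plays the role of a CUDA event
        self._result = None
        self._thread = threading.Thread(target=self._run, args=(value, delay_s))
        self._thread.start()  # returns immediately, like an async copy

    def _run(self, value, delay_s):
        time.sleep(delay_s)  # pretend the copy takes some time
        self._result = list(value)
        self._done.set()  # signal that the copy has finished

    def synchronize(self):
        self._done.wait()  # block only here, at the point of use
        self._thread.join()
        return self._result


def postprocess(sampled_token_ids):
    # Hypothetical: start the copy and return a handle without waiting.
    return FakeAsyncCopy(sampled_token_ids)


def save_output(copy_handle):
    # Hypothetical: wait for the copy only when the tokens are needed.
    return copy_handle.synchronize()


def run_steps(batches):
    """Start the copy for step N before writing back the tokens of step N-1."""
    outputs = []
    pending = None
    for batch in batches:
        handle = postprocess(batch)  # copy for this step is now in flight
        if pending is not None:
            outputs.append(save_output(pending))  # write back previous step
        pending = handle
    if pending is not None:
        outputs.append(save_output(pending))  # drain the last step
    return outputs
```

Calling `run_steps([[1, 2], [3], [4, 5, 6]])` returns the batches in their original order; the point of the sketch is only that the wait is deferred to the consumer instead of happening right after the copy is issued.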

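This PR:

  • Removes the synchronous copy introduced by computing token_num_cpu: when decode batches are processed back to back, the previous batch's token_num is reused for the launch
  • tp_barrier: uses a CPU barrier
  • Implements execute_model_overlap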
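
The token_num reuse in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `TokenNumCache` and `read_token_num_from_device` are hypothetical names, and the sketch assumes the token count stays valid across consecutive decode batches, which the real implementation may guard with extra conditions.

```python
class TokenNumCache:
    """Reuse the last token count across consecutive decode batches.

    Hypothetical helper: the blocking read is modeled as a callable so the
    number of synchronous reads can be counted.
    """

    def __init__(self):
        self._last_token_num = None
        self._last_was_decode = False
        self.sync_reads = 0  # how many blocking reads were actually issued

    def get(self, is_decode_batch, read_token_num_from_device):
        # Consecutive decode batches: skip the blocking read and reuse the value.
        if is_decode_batch and self._last_was_decode:
            return self._last_token_num
        # Otherwise fall back to the synchronous read and refresh the cache.
        self.sync_reads += 1
        token_num = read_token_num_from_device()
        self._last_token_num = token_num
        self._last_was_decode = is_decode_batch
        return token_num


def simulate(schedule, device_values):
    """Run a list of is_decode flags against a queue of device-side values."""
    cache = TokenNumCache()
    values = iter(device_values)
    seen = [cache.get(is_decode, lambda: next(values)) for is_decode in schedule]
    return seen, cache.sync_reads
```

For a schedule of prefill, three decodes, prefill, decode, the sketch issues four blocking reads instead of six: the second and third decode steps reuse the cached value.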

Results:

About 10% end-to-end improvement

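| Model & config | TP | Concurrency | Input length | Output length | Total time | Decode speed | TPOT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GLM-4.5-Air | 8 | 4 | 500 | 10k | 1868 | 88 | 11.36 |
| GLM-4.5-Air + overlap schedule | 8 | 4 | 500 | 10k | 1683 | 98 | 10.11 |
| ERNIE-4.5-21B-A3B-Paddle | 4 | 64 | 2k | 500 | 683 | 80 | 12.75 |
| ERNIE-4.5-21B-A3B-Paddle + overlap schedule | 4 | 64 | 2k | 500 | 622 | 91 | 11.21 |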
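
To check the "about 10%" figure against the table, the relative change of each metric can be computed directly. The numbers below are copied from the table; the helper names `relative_change` and `summarize` are illustrative only.

```python
def relative_change(before, after):
    """Percentage change from the baseline run to the overlap-schedule run."""
    return (after - before) / before * 100.0


# (baseline, with overlap schedule), taken from the table above.
RESULTS = {
    "GLM-4.5-Air": {
        "total_time": (1868, 1683),
        "decode_speed": (88, 98),
        "tpot": (11.36, 10.11),
    },
    "ERNIE-4.5-21B-A3B-Paddle": {
        "total_time": (683, 622),
        "decode_speed": (80, 91),
        "tpot": (12.75, 11.21),
    },
}


def summarize(results):
    """Return {model: {metric: percent change}} for every table entry."""
    return {
        model: {metric: relative_change(*pair) for metric, pair in metrics.items()}
        for model, metrics in results.items()
    }


if __name__ == "__main__":
    for model, metrics in summarize(RESULTS).items():
        for metric, pct in metrics.items():
            print(f"{model:28s} {metric:13s} {pct:+.1f}%")
```

With these inputs, total time drops by roughly 9 to 10%, decode speed rises by roughly 11 to 14%, and TPOT drops by roughly 11 to 12%.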


GLM, TP8: 2 ms -> 340 us


TODO:

  • Optimize the synchronization introduced when insert_task_v1 occurs
  • Support SAVE_OUTPUT_V1
  • Support logprob
  • Support MTP
  • Enable by default

Usage or Command

--enable-overlap-schedule: enables asynchronous scheduling; off by default. Because MTP is not yet supported, this switch has no effect when MTP is enabled.

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --port 8908 \
    --tensor-parallel-size 4 \
    --enable-overlap-schedule

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot bot commented Jan 28, 2026

Thanks for your contribution!

@Sunny-bot1 Sunny-bot1 marked this pull request as draft February 2, 2026 11:36
@Sunny-bot1 Sunny-bot1 marked this pull request as ready for review February 2, 2026 13:00

codecov-commenter commented Feb 2, 2026

Codecov Report

❌ Patch coverage is 74.57627% with 15 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@c745a22). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/worker/gpu_model_runner.py | 65.85% | 11 Missing and 3 partials ⚠️ |
| fastdeploy/worker/worker_process.py | 80.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6259   +/-   ##
==========================================
  Coverage           ?   67.93%           
==========================================
  Files              ?      389           
  Lines              ?    51886           
  Branches           ?     8077           
==========================================
  Hits               ?    35250           
  Misses             ?    14067           
  Partials           ?     2569           
Flag Coverage Δ
GPU 67.93% <74.57%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


@zhoutianzi666 zhoutianzi666 merged commit 9b0a82c into PaddlePaddle:develop Feb 4, 2026
20 of 24 checks passed
StareAtYou added a commit to StareAtYou/FastDeploy that referenced this pull request Feb 4, 2026
yuanlehome pushed a commit that referenced this pull request Feb 4, 2026
3 participants