[Scheduler] Defer block recycling to accelerate LRU node freeing by liyonghua0910 · Pull Request #7885 · PaddlePaddle/FastDeploy

liyonghua0910 · 2026-05-21T12:55:47Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

In the LRU eviction loop of free_block_ids_async, each iteration calls recycle_gpu_blocks individually, which causes frequent heap operations and slows down the overall freeing process. This PR defers block recycling to a single batch call after the loop completes.

Modifications

Defer block recycling in LRU loop: Introduce blocks_deferred_to_recycle list in free_block_ids_async. Instead of calling recycle_gpu_blocks per node inside the while-loop, blocks are appended to this list and recycled in a single batch call after the loop exits, reducing the overhead of repeated heap operations.
Add defer_recycle parameter to _handle_free_gpu_node_without_cpu: When defer_recycle=True, the method returns the list of block IDs to recycle (reverved_dec_block_ids + [block_id]) without calling recycle_gpu_blocks itself, allowing the caller to batch the recycling. When defer_recycle=False, behavior is unchanged (immediate recycling).
Fix: clear node.reverved_dec_block_ids in defer path: When defer_recycle=True, the node's reverved_dec_block_ids is now cleared after copying to the return list, preventing double recycling if the node object is accessed again before GC.
Fix: restore node.cache_status = CacheStatus.CPU: This line was accidentally removed during the refactor. Restored to keep consistency with _handle_free_gpu_node_with_cpu.
Refactor parent disconnection logic: Extract parent disconnection into a separate parent variable for clarity, and add a dedup check with warning log when the parent is already in gpu_lru_leaf_set.
Add warning log for inconsistent LRU state: When a node popped from the LRU heap has shared_count != 0, log a warning to aid diagnosis.
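
The deferral described above can be illustrated with a toy model. This is a minimal sketch, not the FastDeploy implementation: `ToyNode`, `ToyCacheManager`, `handle_free_node`, and `free_nodes` are hypothetical stand-ins, and the free list is assumed to be a min-heap of block IDs. Only the names `recycle_gpu_blocks`, `defer_recycle`, `blocks_deferred_to_recycle`, and `reverved_dec_block_ids` come from the PR description. The toy counts calls only; where the real savings come from depends on what `recycle_gpu_blocks` does per call, which is not shown here.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a prefix-cache node."""
    node_id: int
    block_id: int
    reverved_dec_block_ids: List[int] = field(default_factory=list)


class ToyCacheManager:
    """Hypothetical manager; the free list is assumed to be a min-heap."""

    def __init__(self) -> None:
        self.gpu_free_block_list: List[int] = []
        self.recycle_calls = 0  # counts calls so the batching is observable

    def recycle_gpu_blocks(self, block_ids: List[int]) -> None:
        self.recycle_calls += 1
        for block_id in block_ids:
            heapq.heappush(self.gpu_free_block_list, block_id)

    def handle_free_node(self, node: ToyNode, defer_recycle: bool = False) -> List[int]:
        # Copy the IDs, then clear them on the node so that a later access
        # to the same node object cannot recycle them a second time.
        blocks_to_recycle = list(node.reverved_dec_block_ids) + [node.block_id]
        node.reverved_dec_block_ids = []
        if defer_recycle:
            return blocks_to_recycle  # the caller batches the recycling
        self.recycle_gpu_blocks(blocks_to_recycle)  # immediate path
        return []

    def free_nodes(self, lru_nodes: List[ToyNode]) -> int:
        blocks_deferred_to_recycle: List[int] = []
        for node in lru_nodes:
            blocks_deferred_to_recycle.extend(
                self.handle_free_node(node, defer_recycle=True)
            )
        # One batch call after the loop instead of one call per node.
        if blocks_deferred_to_recycle:
            self.recycle_gpu_blocks(blocks_deferred_to_recycle)
        return len(blocks_deferred_to_recycle)


manager = ToyCacheManager()
nodes = [ToyNode(0, 10, [11, 12]), ToyNode(1, 20), ToyNode(2, 30, [31])]
freed = manager.free_nodes(nodes)
print(freed, manager.recycle_calls)  # prints "6 1"
```

In this sketch three nodes are freed with a single `recycle_gpu_blocks` call, and each node's `reverved_dec_block_ids` is empty afterwards.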

Usage or Command

No additional configuration required. The optimization takes effect automatically.

Accuracy Tests

Only affects KV Cache block recycling timing, no impact on model output accuracy.

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-21T12:55:58Z

Thanks for your contribution!

codecov-commenter · 2026-05-21T13:36:36Z

Codecov Report

❌ Patch coverage is 68.18182% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@c9c40ee). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/cache_manager/prefix_cache_manager.py	68.18%	4 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7885   +/-   ##
==========================================
  Coverage           ?   67.60%           
==========================================
  Files              ?      467           
  Lines              ?    65105           
  Branches           ?     9983           
==========================================
  Hits               ?    44014           
  Misses             ?    18292           
  Partials           ?     2799

Flag	Coverage Δ
GPU	`77.74% <68.18%> (?)`
XPU	`16.16% <4.54%> (?)`

PaddlePaddle-bot · 2026-05-21T13:58:24Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-29 16:57:37

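This CI report is generated from the following code (refreshed every 30 minutes):

PR commit: 9522e2c
Merge base: c9c40ee (branch: develop)
View full diff
CI details

1 Task overview

Two required tasks have failed and need attention (1 environment issue + 1 awaiting approval); resolve them before merging.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
55(13)	42	35	6	0	0	0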

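2 Task status summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; fix their failures first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	47m27s	Environment issue: a missing symbol in an XPU shared library caused the PD service to time out	Check the XPU image and shared-library versions, then rerun	Job	🔄×1
❌	`Approval`	17s	Approval required	Obtain manual approval	Job	-
✅	The other 7 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 28/32 passed

Optional tasks do not block merging; their failures are informational only.

Status	Task	Duration	Log	Rerun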
❌	`Run iluvatar Tests / run_iluvatar_cases`	16m26s	Job	-
❌	`xpu_unit_test / run_xpu_unit_test`	3m8s	Job	🔄×1
❌	`CI_HPU`	1h52m	Job	-
❌	`Trigger Jenkins for PR`	16s	Job	-
✅	The other 28 optional tasks passed	-	-	-

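3 Failure details (required tasks only)

xpu_8cards_case_test / run_xpu_8cards_cases: Infrastructure (confidence: high)

Status: ❌ Failed
Error type: Infrastructure
Confidence: High
Root cause summary: A missing symbol in an XPU shared library caused the PD service to time out
Analyzer: Generic analysis (fallback)

Failed cases:

Test	Error	Root cause
`tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation`	Failed: PD disaggregation service failed to start	P/D node health checks returned 000 for the full 10 minutes
`tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4.py::test_pd_separation`	Failed: PD disaggregation service failed to start	P/D node health checks returned 000 for the full 10 minutes
`tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	Failed: PD disaggregation service failed to start	P/D node health checks returned 000 for the full 10 minutes
`tests/xpu_ci/8cards_cases/test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	Failed: PD disaggregation service failed to start	P/D node health checks returned 000 for the full 10 minutes

Root cause details:
The logs show that all four XPU 8-card PD disaggregation cases timed out in wait_for_pd_health_check(), with the P-node and D-node HTTP status codes staying at 000. The workerlog printed on failure contains /workspace/FastDeploy/fastdeploy/model_executor/ops/xpu/.../libxft_blocks.so: undefined symbol: ...flash_attention_context_vllm..., which indicates that the service process could not load the XPU shared library or its symbols. This points to a runtime-environment or binary-dependency mismatch. The PR itself only modifies the KV Cache LRU recycling path in fastdeploy/cache_manager/prefix_cache_manager.py, and the current logs contain no stack trace or assertion from that file, so no direct link has been found.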

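Key logs (Chinese log messages translated):

Service health check in progress... waited 591 seconds, P node status code: 000, D node status code: 000
PD disaggregation service startup timed out: the service still had not started after 10 minutes!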
/usr/bin/python: symbol lookup error: /workspace/FastDeploy/fastdeploy/model_executor/ops/xpu/fastdeploy_ops/../libs/libxft_blocks.so: undefined symbol: _ZN5baidu3xpu3xfa28flash_attention_context_vllm...
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4.py::test_pd_separation
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation
FAILED tests/xpu_ci/8cards_cases/test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation

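Suggested fixes:

First check or update the version match between libxft_blocks.so and the underlying XPU runtime/fastdeploy_ops in the XPU CI image, then rerun xpu_8cards_case_test / run_xpu_8cards_cases.
If the image and dependencies are unchanged, rerun directly to rule out a transient machine-environment problem. Also note that this job's artifact upload hit the GitHub Actions storage quota, which may cause log artifact uploads to fail.

Related change: fastdeploy/cache_manager/prefix_cache_manager.py (LRU recycling optimization); the current logs show no direct link.

Approval: manual approval required

This job requires manual approval; CI continues only after approval is granted.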

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-28 10:53:39

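📋 Review summary

PR overview: Defers the recycle_gpu_blocks calls in the LRU eviction loop to a single batch call after the loop ends, reducing heap-operation overhead, and fixes the parent node being pushed onto the heap more than once.
Scope of change: fastdeploy/cache_manager/prefix_cache_manager.py
Impact tag: [KVCache]

Issues

Severity	File	Summary
🟡 Suggestion	`prefix_cache_manager.py:1352`	Confirm the `[node.block_id]` wrapping; `None` could end up in the recycle list

📝 PR convention check

The title tag [Scheduler] does not match the actual change path: the modified file is under fastdeploy/cache_manager/, so according to the impact table the tag should be [KVCache]. In addition, "Add unit tests" is checked in the Checklist, but the diff contains no test files and the PR description does not explain why unit tests are unnecessary.

Suggested title (ready to copy):

[KVCache] Defer block recycling to accelerate LRU node freeing

Suggested PR description (click to expand, ready to copy)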

## Motivation
In the LRU eviction loop of `free_block_ids_async`, each iteration calls `recycle_gpu_blocks` individually, which causes frequent heap operations and slows down the overall freeing process. This PR defers block recycling to a single batch call after the loop completes.

## Modifications
- Defer `recycle_gpu_blocks` calls inside the LRU freeing loop to a single batch call after the loop, reducing the overhead of repeated heap operations.
- Add `defer_recycle` parameter to `_handle_free_gpu_node_without_cpu` to support deferred block recycling.
- Fix the LRU leaf node freeing logic: disconnect the child node from its parent first, then check whether the parent should be added to the LRU heap, avoiding duplicate freeing.
- Add warning logs to help diagnose duplicate node issues in the LRU heap.

## Usage or Command
No additional configuration required. The optimization takes effect automatically.

## Accuracy Tests
Only affects KV Cache block recycling timing, no impact on model output accuracy.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

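Overall assessment

The optimization approach is clear: deferred batch recycling effectively reduces the number of heap operations, and the fix for the parent node being pushed onto the heap repeatedly is a necessary correction. Please confirm that recycle_gpu_blocks accepts a flat list and how node.block_id behaves when it is None, and either add unit tests or explain in the PR why none are needed.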
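
The parent re-push fix mentioned in this assessment can be pictured with a small model. This is a hedged sketch under stated assumptions, not the code in prefix_cache_manager.py: `ToyNode`, `evict_one`, `last_used_time`, and the rule that the root is never pushed are hypothetical. Only the idea of disconnecting the child first and then checking `gpu_lru_leaf_set` before pushing the parent comes from the PR text.

```python
import heapq
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

logger = logging.getLogger("toy_prefix_cache")


@dataclass(eq=False)  # identity-based equality and hashing, so nodes fit in a set
class ToyNode:
    node_id: int
    last_used_time: float
    parent: Optional["ToyNode"] = None
    children: Dict[int, "ToyNode"] = field(default_factory=dict)

    def __lt__(self, other: "ToyNode") -> bool:
        # heapq only needs "<"; the least recently used node sorts first.
        return self.last_used_time < other.last_used_time


def evict_one(gpu_lru_leaf_heap: List[ToyNode], gpu_lru_leaf_set: Set[ToyNode]) -> ToyNode:
    node = heapq.heappop(gpu_lru_leaf_heap)
    gpu_lru_leaf_set.discard(node)
    parent = node.parent
    if parent is not None:
        # Disconnect the child first, then decide whether the parent became a leaf.
        parent.children.pop(node.node_id, None)
        node.parent = None
        is_root = parent.parent is None  # assumption: the root is never evicted
        if not parent.children and not is_root:
            if parent in gpu_lru_leaf_set:
                # Dedup check: do not push a parent that is already tracked.
                logger.warning("node %s is already in gpu_lru_leaf_set", parent.node_id)
            else:
                heapq.heappush(gpu_lru_leaf_heap, parent)
                gpu_lru_leaf_set.add(parent)
    return node


# Tree: root -> q -> c, with q already (inconsistently) tracked as a leaf.
root = ToyNode(0, 0.0)
q = ToyNode(1, 3.0, parent=root)
root.children[q.node_id] = q
c = ToyNode(2, 1.5, parent=q)
q.children[c.node_id] = c

heap: List[ToyNode] = []
leaf_set: Set[ToyNode] = set()
for n in (c, q):
    heapq.heappush(heap, n)
    leaf_set.add(n)

evicted = evict_one(heap, leaf_set)  # pops c, the older entry
print(evicted.node_id, len(heap))  # prints "2 1"
```

Without the membership check, `q` would be pushed a second time and could later be popped and freed twice.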

PaddlePaddle-bot · 2026-05-28T02:56:57Z

+        logger.info(f"_handle_free_gpu_node_without_cpu: free node {node.node_id}")

-        self.recycle_gpu_blocks(node.reverved_dec_block_ids)
+        blocks_to_recycle = list(node.reverved_dec_block_ids) + [node.block_id]


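🟡 Suggestion: confirm the [node.block_id] wrapping

The original code called recycle_gpu_blocks twice: first with a list, then with the single value node.block_id. The new code merges the two into list(...) + [node.block_id], so a list is always passed.

Please confirm:

Whether recycle_gpu_blocks accepts a flat list (the original second call passed a single value)

Whether node.block_id can be None (if a node without a GPU block exists, None would end up in the recycle list)

Suggested guard here:

blocks_to_recycle = list(node.reverved_dec_block_ids)
if node.block_id is not None:
    blocks_to_recycle.append(node.block_id)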
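
The two concerns raised in this comment (a single value versus a flat list, and `None` entering the list) can be checked in isolation. The helpers below are a hypothetical sketch: `to_block_list` and `collect_blocks_to_recycle` are illustrative names that do not appear in the PR, and they make no claim about how `recycle_gpu_blocks` actually treats its argument.

```python
from typing import Iterable, List, Optional, Union


def to_block_list(block_ids: Union[int, Iterable[Optional[int]], None]) -> List[int]:
    """Normalize a single ID, an iterable of IDs, or None into a flat list."""
    if block_ids is None:
        return []
    if isinstance(block_ids, int):
        return [block_ids]
    # Drop None entries so they can never reach the free list.
    return [b for b in block_ids if b is not None]


def collect_blocks_to_recycle(
    reserved_block_ids: Iterable[int], block_id: Optional[int]
) -> List[int]:
    """Mirror of the suggested guard: append block_id only when it is set."""
    blocks_to_recycle = to_block_list(reserved_block_ids)
    # "is not None" rather than truthiness, because block ID 0 is a valid ID.
    if block_id is not None:
        blocks_to_recycle.append(block_id)
    return blocks_to_recycle


print(collect_blocks_to_recycle([7, 8], 3))     # [7, 8, 3]
print(collect_blocks_to_recycle([7, 8], None))  # [7, 8]
print(collect_blocks_to_recycle([], 0))         # [0]
```

The last call shows why the guard tests `is not None` instead of `if node.block_id:`, which would silently drop block 0.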

liyonghua0910 had a problem deploying to Metax_ci May 21, 2026 12:55 — with GitHub Actions Failure

liyonghua0910 changed the title ~~[KVCache] Defer block recycling to accelerate LRU node freeing~~ [Scheduler] Defer block recycling to accelerate LRU node freeing May 21, 2026

liyonghua0910 mentioned this pull request May 21, 2026

[Cherry-Pick][Scheduler] Defer block recycling to accelerate LRU node freeing #7886

Open

5 tasks

liyonghua0910 had a problem deploying to Metax_ci May 22, 2026 02:33 — with GitHub Actions Failure

liyonghua0910 added 3 commits May 28, 2026 02:47

[KVCache] Defer block recycling to accelerate LRU node freeing

f5aca21

[style] precommit

8404edf

[chore] revert some codes

9522e2c

liyonghua0910 force-pushed the develop+20260521_free_blocks branch from 3c777a7 to 9522e2c Compare May 28, 2026 02:47

liyonghua0910 had a problem deploying to Metax_ci May 28, 2026 02:47 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 28, 2026

liyonghua0910 wants to merge 3 commits into PaddlePaddle:develop from liyonghua0910:develop+20260521_free_blocks
