[Scheduler] Defer block recycling to accelerate LRU node freeing #7885

Open
liyonghua0910 wants to merge 3 commits into PaddlePaddle:develop from liyonghua0910:develop+20260521_free_blocks

Conversation

@liyonghua0910
Collaborator

@liyonghua0910 liyonghua0910 commented May 21, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

In the LRU eviction loop of free_block_ids_async, each iteration calls recycle_gpu_blocks individually, which causes frequent heap operations and slows down the overall freeing process. This PR defers block recycling to a single batch call after the loop completes.
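The difference between the two call shapes can be sketched with a toy pool. This is a minimal illustration, not FastDeploy code: `ToyBlockPool`, `free_per_node`, `free_deferred`, and the behaviour of `recycle_gpu_blocks` shown here (accepting one ID or a list and pushing onto a min-heap) are all assumptions. The toy only counts calls; whatever per-call cost the real method carries is what batching would amortize.

```python
import heapq


class ToyBlockPool:
    """Hypothetical stand-in for the cache manager's free-block heap."""

    def __init__(self):
        self.gpu_free_block_list = []  # min-heap of free block IDs
        self.recycle_calls = 0  # how many times recycle_gpu_blocks ran

    def recycle_gpu_blocks(self, block_ids):
        # Assumed behaviour: accept one ID or a list and push onto the heap.
        self.recycle_calls += 1
        if not isinstance(block_ids, list):
            block_ids = [block_ids]
        for block_id in block_ids:
            heapq.heappush(self.gpu_free_block_list, block_id)


def free_per_node(pool, evicted_nodes):
    # Old shape: one recycle call per evicted node, inside the loop.
    for blocks in evicted_nodes:
        pool.recycle_gpu_blocks(blocks)


def free_deferred(pool, evicted_nodes):
    # New shape: collect inside the loop, recycle once after it exits.
    blocks_deferred_to_recycle = []
    for blocks in evicted_nodes:
        blocks_deferred_to_recycle.extend(blocks)
    if blocks_deferred_to_recycle:
        pool.recycle_gpu_blocks(blocks_deferred_to_recycle)


evicted = [[7, 3], [9], [1, 4]]  # block IDs held by three evicted nodes
eager, deferred = ToyBlockPool(), ToyBlockPool()
free_per_node(eager, evicted)
free_deferred(deferred, evicted)
```

Both pools end with the same set of free blocks; only the number of `recycle_gpu_blocks` calls differs.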

Modifications

  • Defer block recycling in LRU loop: Introduce blocks_deferred_to_recycle list in free_block_ids_async. Instead of calling recycle_gpu_blocks per node inside the while-loop, blocks are appended to this list and recycled in a single batch call after the loop exits, reducing the overhead of repeated heap operations.
  • Add defer_recycle parameter to _handle_free_gpu_node_without_cpu: When defer_recycle=True, the method returns the list of block IDs to recycle (reverved_dec_block_ids + [block_id]) without calling recycle_gpu_blocks itself, allowing the caller to batch the recycling. When defer_recycle=False, behavior is unchanged (immediate recycling).
  • Fix: clear node.reverved_dec_block_ids in defer path: When defer_recycle=True, the node's reverved_dec_block_ids is now cleared after copying to the return list, preventing double recycling if the node object is accessed again before GC.
  • Fix: restore node.cache_status = CacheStatus.CPU: This line was accidentally removed during the refactor. Restored to keep consistency with _handle_free_gpu_node_with_cpu.
  • Refactor parent disconnection logic: Extract parent disconnection into a separate parent variable for clarity, and add a dedup check with warning log when the parent is already in gpu_lru_leaf_set.
  • Add warning log for inconsistent LRU state: When a node popped from the LRU heap has shared_count != 0, log a warning to aid diagnosis.
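
The `defer_recycle` contract described in the list above might look roughly like the following. This is a sketch under stated assumptions: `ToyNode`, `ToyManager`, and `handle_free_gpu_node` are hypothetical names, only the two node fields named in the description are modelled, and the empty-list return value on the immediate path is a guess rather than something the PR text states.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    # Hypothetical node: only the fields named in the PR text are modelled.
    block_id: int
    reverved_dec_block_ids: list = field(default_factory=list)


class ToyManager:
    def __init__(self):
        self.recycled = []  # block IDs handed back so far

    def recycle_gpu_blocks(self, block_ids):
        self.recycled.extend(block_ids)

    def handle_free_gpu_node(self, node, defer_recycle=False):
        blocks_to_recycle = list(node.reverved_dec_block_ids) + [node.block_id]
        if defer_recycle:
            # Clear after copying so a later visit cannot return these IDs again.
            node.reverved_dec_block_ids = []
            return blocks_to_recycle
        # Immediate path: recycle now (return value is an assumption).
        self.recycle_gpu_blocks(blocks_to_recycle)
        return []


manager = ToyManager()
node = ToyNode(block_id=5, reverved_dec_block_ids=[11, 12])
deferred_blocks = manager.handle_free_gpu_node(node, defer_recycle=True)
recycled_before_batch = list(manager.recycled)  # nothing recycled yet
manager.recycle_gpu_blocks(deferred_blocks)  # caller performs the batch call
```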

Usage or Command

No additional configuration required. The optimization takes effect automatically.

Accuracy Tests

Only affects KV Cache block recycling timing, no impact on model output accuracy.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 21, 2026

Thanks for your contribution!

@liyonghua0910 liyonghua0910 changed the title from "[KVCache] Defer block recycling to accelerate LRU node freeing" to "[Scheduler] Defer block recycling to accelerate LRU node freeing" on May 21, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 21, 2026

Codecov Report

❌ Patch coverage is 68.18182% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@c9c40ee). Learn more about missing BASE report.

Files with missing lines:

| File | Patch % | Lines |
|---|---|---|
| fastdeploy/cache_manager/prefix_cache_manager.py | 68.18% | 4 Missing and 3 partials ⚠️ |

Additional details and impacted files:
@@            Coverage Diff             @@
##             develop    #7885   +/-   ##
==========================================
  Coverage           ?   67.60%           
==========================================
  Files              ?      467           
  Lines              ?    65105           
  Branches           ?     9983           
==========================================
  Hits               ?    44014           
  Misses             ?    18292           
  Partials           ?     2799           
| Flag | Coverage Δ |
|---|---|
| GPU | 77.74% <68.18%> (?) |
| XPU | 16.16% <4.54%> (?) |


@PaddlePaddle-bot

PaddlePaddle-bot commented May 21, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-29 16:57:37

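This CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

2 required failed tasks need attention (1 environment issue + 1 pending approval); handle them before merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 55 (13) | 42 | 35 | 6 | 0 | 0 | 0 |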

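2 Task status summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; fix their failures first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
|  | xpu_8cards_case_test / run_xpu_8cards_cases | 47m27s | Environment issue: a missing symbol in an XPU shared library caused the PD service to time out | Check the XPU image and shared-library versions, then rerun | Job | 🔄×1 |
|  | Approval | 17s | Approval required | Obtain manual approval | Job | - |
|  | Remaining 7 required tasks passed | - | - | - | - | - |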

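2.2 Optional tasks: 28/32 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
|  | Run iluvatar Tests / run_iluvatar_cases | 16m26s | Job | - |
|  | xpu_unit_test / run_xpu_unit_test | 3m8s | Job | 🔄×1 |
|  | CI_HPU | 1h52m | Job | - |
|  | Trigger Jenkins for PR | 16s | Job | - |
|  | Remaining 28 optional tasks passed | - | - | - |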

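3 Failure details (required only)

xpu_8cards_case_test / run_xpu_8cards_cases: infrastructure (confidence: high)

  • Status: ❌ Failed
  • Error type: infrastructure
  • Confidence: high
  • Root cause summary: a missing symbol in an XPU shared library caused the PD service to time out
  • Analyzer: generic analysis (fallback)

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation | Failed: PD分离服务启动失败 (PD disaggregation service failed to start) | P/D node health checks returned 000 for 10 minutes |
| tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4.py::test_pd_separation | Failed: PD分离服务启动失败 (PD disaggregation service failed to start) | P/D node health checks returned 000 for 10 minutes |
| tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation | Failed: PD分离服务启动失败 (PD disaggregation service failed to start) | P/D node health checks returned 000 for 10 minutes |
| tests/xpu_ci/8cards_cases/test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation | Failed: PD分离服务启动失败 (PD disaggregation service failed to start) | P/D node health checks returned 000 for 10 minutes |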

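Root cause details:
The logs show that all 4 XPU 8-card PD disaggregation cases timed out in wait_for_pd_health_check(), with the HTTP status code of both the P node and the D node staying at 000. The workerlog printed on failure contains /workspace/FastDeploy/fastdeploy/model_executor/ops/xpu/.../libxft_blocks.so: undefined symbol: ...flash_attention_context_vllm..., which indicates that the service process could not load the XPU shared library or its symbols, i.e. a runtime-environment or binary-dependency mismatch. The PR itself only changes the KV Cache LRU recycling path in fastdeploy/cache_manager/prefix_cache_manager.py; the current logs contain no stack trace or assertion involving that file, so no direct link has been found.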

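Key logs:

服务健康检查中... 已等待 591 秒,P节点状态码:000,D节点状态码:000 [Service health check in progress... waited 591 s, P node status code: 000, D node status code: 000]
PD分离服务启动超时:经过 10 分钟服务仍未启动! [PD disaggregation service startup timed out: the service still had not started after 10 minutes!]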
/usr/bin/python: symbol lookup error: /workspace/FastDeploy/fastdeploy/model_executor/ops/xpu/fastdeploy_ops/../libs/libxft_blocks.so: undefined symbol: _ZN5baidu3xpu3xfa28flash_attention_context_vllm...
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4.py::test_pd_separation
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation
FAILED tests/xpu_ci/8cards_cases/test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation

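Suggested fixes:

  1. First check or update the version match between libxft_blocks.so and the underlying XPU runtime/fastdeploy_ops in the XPU CI image, then rerun xpu_8cards_case_test / run_xpu_8cards_cases.
  2. If the image and dependencies are unchanged, rerun directly to rule out a transient machine issue. Note also that this job's artifact upload hit the GitHub Actions storage quota, which may cause log artifact uploads to fail.

Related change: fastdeploy/cache_manager/prefix_cache_manager.py (LRU recycling optimization); the current logs show no direct link.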

Approval: manual approval required

This job requires manual approval; CI continues only after approval is granted.

PaddlePaddle-bot

This comment was marked as outdated.


@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-28 10:53:39

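📋 Review summary

PR overview: Defers the recycle_gpu_blocks calls in the LRU eviction loop to a single batch call after the loop ends, reducing heap-operation overhead, and fixes duplicate insertion of the parent node into the heap.
Change scope: fastdeploy/cache_manager/prefix_cache_manager.py
Impact tag: [KVCache]

Issues

| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | prefix_cache_manager.py:1352 | The [node.block_id] wrapping needs confirmation; None may end up in the recycle list |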

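📝 PR convention check

The title tag [Scheduler] does not match the changed path: the modified file is under fastdeploy/cache_manager/, so according to the impact-area table the tag should be [KVCache]. In addition, "Add unit tests" is checked in the Checklist, but the diff contains no test file and the PR description does not explain why unit tests are unnecessary.

Suggested title (copy-ready):

  • [KVCache] Defer block recycling to accelerate LRU node freeing
Suggested PR description (copy-ready):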
## Motivation
In the LRU eviction loop of `free_block_ids_async`, each iteration calls `recycle_gpu_blocks` individually, which causes frequent heap operations and slows down the overall freeing process. This PR defers block recycling to a single batch call after the loop completes.

## Modifications
- Defer `recycle_gpu_blocks` calls inside the LRU freeing loop to a single batch call after the loop, reducing the overhead of repeated heap operations.
- Add `defer_recycle` parameter to `_handle_free_gpu_node_without_cpu` to support deferred block recycling.
- Fix the LRU leaf node freeing logic: disconnect the child node from its parent first, then check whether the parent should be added to the LRU heap, avoiding duplicate freeing.
- Add warning logs to help diagnose duplicate node issues in the LRU heap.

## Usage or Command
No additional configuration required. The optimization takes effect automatically.

## Accuracy Tests
Only affects KV Cache block recycling timing, no impact on model output accuracy.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

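Overall assessment

The optimization is well reasoned: deferring to a batch recycle reduces how often heap operations run, and the fix for duplicate insertion of the parent node into the heap is a necessary correction. Please confirm that recycle_gpu_blocks accepts a flat list, handle the edge case where node.block_id is None, and either add unit tests or explain in the PR why they are not needed.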
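
The duplicate-insertion fix mentioned in this assessment amounts to checking set membership before pushing onto the heap. A rough sketch follows, assuming a heap paired with gpu_lru_leaf_set; `ToyLeafLRU`, `gpu_lru_leaf_heap`, `add_leaf`, and `pop_oldest` are hypothetical names chosen for illustration, not the PR's actual code.

```python
import heapq
import logging

logger = logging.getLogger("toy_lru")


class ToyLeafLRU:
    """Hypothetical pairing of an LRU heap with a membership set."""

    def __init__(self):
        self.gpu_lru_leaf_heap = []  # (last_used_time, node_id) entries
        self.gpu_lru_leaf_set = set()  # node IDs currently in the heap

    def add_leaf(self, node_id, last_used_time):
        # Dedup check: skip the push and warn if the node is already tracked.
        if node_id in self.gpu_lru_leaf_set:
            logger.warning("node %s already in gpu_lru_leaf_set, skipping", node_id)
            return False
        heapq.heappush(self.gpu_lru_leaf_heap, (last_used_time, node_id))
        self.gpu_lru_leaf_set.add(node_id)
        return True

    def pop_oldest(self):
        # Pop the least recently used leaf and keep the set in sync.
        _, node_id = heapq.heappop(self.gpu_lru_leaf_heap)
        self.gpu_lru_leaf_set.discard(node_id)
        return node_id


lru = ToyLeafLRU()
first_add = lru.add_leaf("parent", 1.0)
second_add = lru.add_leaf("parent", 2.0)  # rejected: already in the set
heap_size = len(lru.gpu_lru_leaf_heap)
```

Without the membership check, the same parent could sit in the heap twice and be popped, and therefore freed, twice.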

logger.info(f"_handle_free_gpu_node_without_cpu: free node {node.node_id}")

self.recycle_gpu_blocks(node.reverved_dec_block_ids)
blocks_to_recycle = list(node.reverved_dec_block_ids) + [node.block_id]

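🟡 Suggestion: the [node.block_id] wrapping needs confirmation

The original code calls recycle_gpu_blocks twice: the first call passes a list, and the second passes the single value node.block_id. The new code merges the two into list(...) + [node.block_id] and always passes a list.

Please confirm:

  1. Whether recycle_gpu_blocks accepts a flat list (the original second call passed a single value)
  2. Whether node.block_id can be None (if nodes without a GPU block exist, None would end up in the recycle list)

Consider adding a guard here: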

blocks_to_recycle = list(node.reverved_dec_block_ids)
if node.block_id is not None:
    blocks_to_recycle.append(node.block_id)
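
Both questions above can be checked in isolation. The helpers below are a hedged sketch: `collect_blocks_to_recycle` and `as_flat_list` are hypothetical names, and whether the real recycle_gpu_blocks needs such normalization is exactly what the review asks the author to confirm.

```python
def collect_blocks_to_recycle(reserved_block_ids, block_id):
    """Hypothetical helper: build one flat list, skipping a missing block ID."""
    blocks_to_recycle = list(reserved_block_ids)
    if block_id is not None:
        blocks_to_recycle.append(block_id)
    return blocks_to_recycle


def as_flat_list(block_ids):
    """Hypothetical normalizer for a callee that may get one ID or a list."""
    if block_ids is None:
        return []
    if isinstance(block_ids, (list, tuple)):
        # Drop None entries so they never reach the free list.
        return [b for b in block_ids if b is not None]
    return [block_ids]


with_block = collect_blocks_to_recycle([11, 12], 5)
without_block = collect_blocks_to_recycle([11], None)
```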
