[None][feat] Add radix tree cache priority boost and store chunked context blocks #12696

Draft
lancelly wants to merge 1 commit into NVIDIA:feat/bench_y from lancelly:feat/bench_y_cherry_pick

@lancelly (Collaborator) commented on Apr 2, 2026

Summary

Cherry-pick Features A and B from PR #12481 (improve-scheduler) onto feat/bench_y.

  • Feature A: Radix tree cache priority boost. Blocks that are in the radix tree have their priority boosted to kRadixTreeBlockPriority (= kDefaultPriority + 10) on release, so truly free blocks are evicted first and cached data is preserved longer. Adds getNumTrulyFreeBlocks() to distinguish empty blocks from cached ones.
  • Feature B: Store chunked context blocks. New storeChunkedContextBlocks() stores the full blocks from a completed (non-final) context chunk immediately, enabling prefix sharing with concurrent requests before the full context finishes processing. Integrated into both the TRT and PyTorch (PyExecutor) backends. SWA windows are skipped because their block boundaries are unstable mid-context.
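
To make the eviction order in Feature A concrete, here is a minimal, self-contained sketch. It is not the code in this PR: `ToyFreePool`, `Block`, and the `inRadixTree` flag are hypothetical names, and the numeric value of `kDefaultPriority` is an assumption chosen for illustration (the real constant comes from `executor::KvCacheRetentionConfig`). Only the `+ 10` offset and the idea behind `getNumTrulyFreeBlocks()` are taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <optional>

// Assumed value for this sketch only; the real constant lives in
// executor::KvCacheRetentionConfig.
constexpr int kDefaultPriority = 35;
// Offset of +10 as stated in the PR description.
constexpr int kRadixTreeBlockPriority = kDefaultPriority + 10;

// Hypothetical block record.
struct Block
{
    int id;
    bool inRadixTree; // true if the block still holds reusable cached tokens
};

// Hypothetical toy pool: released blocks are bucketed by priority and
// eviction always takes from the lowest-priority non-empty bucket.
class ToyFreePool
{
public:
    void release(Block block)
    {
        assert(block.id >= 0 && "block ids are non-negative in this sketch");
        // Cached blocks get the boosted priority; empty blocks keep the default.
        int const priority = block.inRadixTree ? kRadixTreeBlockPriority : kDefaultPriority;
        mQueues[priority].push_back(block);
    }

    std::optional<Block> evict()
    {
        // std::map iterates keys in ascending order, so low priority goes first.
        for (auto& entry : mQueues)
        {
            if (!entry.second.empty())
            {
                Block const victim = entry.second.front();
                entry.second.pop_front();
                return victim;
            }
        }
        return std::nullopt;
    }

    // Counts released blocks that hold no cached data.
    std::size_t getNumTrulyFreeBlocks() const
    {
        std::size_t count = 0;
        for (auto const& entry : mQueues)
        {
            for (auto const& block : entry.second)
            {
                if (!block.inRadixTree)
                {
                    ++count;
                }
            }
        }
        return count;
    }

private:
    std::map<int, std::deque<Block>> mQueues;
};
```

In this toy model, releasing a cached block, then an empty block, then another cached block leaves one truly free block, and the first call to `evict()` returns the empty block even though it was released second. Cached blocks are only evicted once no empty block remains.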

…ntext blocks

Cherry-pick Features A and B from PR NVIDIA#12481 (improve-scheduler) onto feat/bench_y.

Feature A (Radix tree cache priority boost):
- Blocks in the radix tree get priority boosted (kRadixTreeBlockPriority =
  kDefaultPriority + 10) on release, so truly-free blocks are evicted first,
  preserving cached data longer.
- Adds getNumTrulyFreeBlocks() to distinguish empty blocks from cached ones.

Feature B (Store chunked context blocks):
- New storeChunkedContextBlocks() method stores full blocks from a completed
  (non-final) context chunk immediately, enabling prefix sharing with
  concurrent requests before the full context finishes processing.
- Integrated into both TRT and PyTorch (PyExecutor) backends.
- SWA windows are skipped (their block boundaries are unstable mid-context).
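
The behaviour described for Feature B can be sketched as follows. This is an illustrative model only, not the implementation in this PR: `ToyRequest`, `ToyBlockStore`, and all of their fields are hypothetical, and the real `storeChunkedContextBlocks()` signature is not shown in this conversation. The sketch assumes that a block is "full" once `tokensPerBlock` prompt tokens have been processed for it, and that the final chunk is left to the regular end-of-context store path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical request state for this sketch.
struct ToyRequest
{
    std::vector<int> promptTokens;
    std::size_t contextPosition = 0; // prompt tokens processed so far
    std::size_t numStoredBlocks = 0; // full blocks already published for reuse
};

// Hypothetical per-window block store.
struct ToyBlockStore
{
    std::size_t tokensPerBlock;
    bool isSlidingWindow; // SWA windows are skipped, as in the commit message
    std::vector<std::vector<int>> storedBlocks;

    // Publishes every newly completed full block after a context chunk.
    // Returns the number of blocks stored by this call.
    std::size_t storeChunkedContextBlocks(ToyRequest& request)
    {
        assert(tokensPerBlock > 0);
        if (isSlidingWindow)
        {
            return 0; // block boundaries are unstable mid-context
        }
        bool const isFinalChunk = request.contextPosition >= request.promptTokens.size();
        if (isFinalChunk)
        {
            return 0; // assumed: the final chunk is handled by the regular store path
        }
        std::size_t const numFullBlocks = request.contextPosition / tokensPerBlock;
        std::size_t numNewlyStored = 0;
        while (request.numStoredBlocks < numFullBlocks)
        {
            auto const offset = static_cast<std::ptrdiff_t>(request.numStoredBlocks * tokensPerBlock);
            auto const first = request.promptTokens.begin() + offset;
            auto const last = first + static_cast<std::ptrdiff_t>(tokensPerBlock);
            storedBlocks.emplace_back(first, last);
            ++request.numStoredBlocks;
            ++numNewlyStored;
        }
        return numNewlyStored;
    }
};
```

With a 10-token prompt and 4 tokens per block, a chunk ending at position 6 publishes one block (tokens 0 to 3) and leaves the partial block alone. A later chunk ending at position 9 publishes the second block (tokens 4 to 7). A concurrent request sharing that prefix could then reuse those blocks before the first request finishes its context phase.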

Note: Feature C (reservation-based scheduling) from PR NVIDIA#12481 is intentionally
excluded. The pinning infrastructure (boostReservedBlock/unpinAllReservedBlocks)
is also excluded as it has no call site without Feature C.

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly force-pushed the feat/bench_y_cherry_pick branch from e3dd6a4 to 85a4299 on April 2, 2026 15:24
auto const kMaxPriority = executor::KvCacheRetentionConfig::kMaxRetentionPriority;

auto const kDefaultPriority = executor::KvCacheRetentionConfig::kDefaultRetentionPriority;
// No longer used for priority boosting (pinning replaced it), but kept for backward compat.
Collaborator

Can remove this comment later.
