Draft: DO NOT REVIEW : AutoDeploy trtllm attention backend with trtllm's kv cache manager direct operation #11268

Closed

MrGeva wants to merge 9 commits into NVIDIA:main from nv-auto-deploy:eg/trtllm_attn_v2

Conversation

Collaborator

@MrGeva MrGeva commented Feb 4, 2026

Summary by CodeRabbit

  • New Features

    • Added TRT-LLM attention backend integration with advanced cache management and CUDA graph support.
    • Introduced KV cache pool support and multiple cache backend options (Simple and PT-based).
    • Added comprehensive KV Cache Architecture documentation.
    • Enabled Mermaid diagram support in documentation.
  • Bug Fixes & Improvements

    • Simplified cache resource handlers for better maintainability.
    • Consolidated attention transform base classes.
    • Updated AutoDeploy documentation structure and paths.
  • Documentation

    • Reorganized AutoDeploy documentation; removed obsolete guides.
    • Added detailed KV cache architecture documentation.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

lucaslie and others added 3 commits January 30, 2026 15:02
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
The key fix is to use AD's pool_mapping values directly without
multiplying by kv_factor. AD's pool_mapping already provides the
correct layer offsets (0, 1, 2, ...) because each layer takes
exactly one "block" worth of K+V data in the unified pool,
regardless of dtype.

Previously, the code was multiplying layer_idx by kv_factor=2,
causing the kernel to compute incorrect addresses:
- Expected layer 1 at: pool_ptr + 1 * block_size
- Got layer 1 at: pool_ptr + 2 * block_size (wrong!)

This fix enables accurate thop.attention execution in AutoDeploy
using AD's KVCacheManager pool directly, without needing the
PTCacheBackend or intermediate buffers.

Note: CUDA graph support requires use_pt_cache_backend=true
due to host operations in metadata preparation.
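
For illustration only, a minimal sketch of the offset arithmetic described above (function and parameter names here are hypothetical, not the actual kernel code):

def layer_base_address(pool_ptr: int, layer_offset: int, block_size: int) -> int:
    """Correct: use AD's pool_mapping offset directly (one block of K+V per layer)."""
    return pool_ptr + layer_offset * block_size

def buggy_layer_base_address(pool_ptr: int, layer_idx: int, block_size: int, kv_factor: int = 2) -> int:
    """Previous behavior: multiplying by kv_factor doubled the per-layer stride."""
    return pool_ptr + layer_idx * kv_factor * block_size

# Layer 1 should land at pool_ptr + 1 * block_size, not pool_ptr + 2 * block_size.
assert layer_base_address(0, 1, 4096) == 4096
assert buggy_layer_base_address(0, 1, 4096) == 8192  # the incorrect address the kernel computed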

Signed-off-by: Eli Geva <egeva@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…cheBackend)

Enable CUDA graph support for the thop.attention kernel when
use_pt_cache_backend=False. This allows the torch-cudagraph compile
backend to work correctly with thop.attention.

Key changes:
- Pre-allocate kv_cache_block_offsets with max size in TrtllmLayerState
  to ensure stable tensor addresses for CUDA graphs
- Add is_capturing check in _prepare_trtllm_metadata to set host tensors
  to MAX values and skip device operations during capture
- Add create_host_prepare_function() to TrtllmAttentionGlobalState that
  creates a host_prepare_fn running outside the graph to update tensors
  with current batch values before each forward/replay
- Register host_prepare_fn via get_host_prepare_metadata_function() for
  non-PTCacheBackend mode

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
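
The capture/replay pattern described in the commit above can be sketched roughly as follows. All names and shapes are hypothetical (not the actual TrtllmLayerState/TrtllmAttentionGlobalState code); a CUDA device is assumed and the usual warm-up before capture is omitted for brevity:

import torch

MAX_BATCH = 8
NUM_BLOCKS_PER_SEQ = 16

# Pre-allocated, stable-address buffers that the captured kernels read from.
kv_block_offsets = torch.zeros(MAX_BATCH, NUM_BLOCKS_PER_SEQ, dtype=torch.int32, device="cuda")
seq_lens_device = torch.zeros(MAX_BATCH, dtype=torch.int32, device="cuda")

def host_prepare(seq_lens: list, offsets: torch.Tensor) -> None:
    """Run outside the graph before each replay to refresh metadata in-place."""
    seq_lens_device.zero_()
    seq_lens_device[: len(seq_lens)].copy_(torch.tensor(seq_lens, dtype=torch.int32))
    kv_block_offsets.copy_(offsets)

def forward_step() -> torch.Tensor:
    # Stand-in for the attention op that consumes the pre-allocated metadata.
    return kv_block_offsets.float().sum() + seq_lens_device.float().sum()

# Capture once (with placeholder values), then refresh metadata and replay per batch.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = forward_step()

host_prepare([3, 5], torch.randint(0, 100, (MAX_BATCH, NUM_BLOCKS_PER_SEQ), dtype=torch.int32))
graph.replay()  # replays the captured kernels against the same tensor addresses
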
@MrGeva MrGeva requested review from a team as code owners February 4, 2026 08:42
@MrGeva MrGeva self-assigned this Feb 4, 2026
Contributor

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough


This PR restructures the AutoDeploy KV cache management system by introducing typed resource handlers (KVPagedResourceHandler, SSMResourceHandler, CausalConvResourceHandler), removing legacy Triton-based paged KV cache kernels, introducing cache backend abstractions, adding a TRT-LLM attention backend integration, and refactoring cache indexing across attention operators while reorganizing documentation.

Changes

Cohort / File(s) Summary
Documentation restructuring
README.md, examples/auto_deploy/README.md, docs/requirements.txt, docs/source/conf.py, docs/source/features/auto_deploy/..., docs/source/torch/auto_deploy/...
Enabled Mermaid diagram support in Sphinx, migrated AutoDeploy docs from torch path to features path, added KV Cache Architecture documentation, removed legacy documentation pages (benchmarking, example_run, expert_configurations, logging, serving, workflow).
Resource handler abstraction
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/...
Introduced typed resource handlers (KVPagedResourceHandler for paged KV caches, SSMResourceHandler and CausalConvResourceHandler for state resources) with compatibility checking; added CacheConfig for cache dtype management; extended SequenceInfo with KV cache pool integration.
Cache backend abstraction
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py, tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
Introduced CacheBackend interface with SimpleCacheBackend implementation for basic per-layer caching; added comprehensive PTCacheBackend for C++-backed KVCacheManager integration with pool management, metadata tensors, and synchronization utilities for interleaved/contiguous cache layouts (see the interface sketch after this table).
TRT-LLM attention backend
tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py, tensorrt_llm/_torch/auto_deploy/llm_args.py
Introduced TrtllmAttention backend registered with AttentionRegistry; added TrtllmWorkspaceResourceHandler, TrtllmKVResourceHandler, and global state management; refactored LlmArgs and introduced AutoDeployConfig for streamlined configuration.
Attention operator refactoring
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
Unified KV cache representation from separate k_cache/v_cache to single kv_cache with HND layout; renamed cache_loc parameter to slot_idx; updated metadata handling and cache initializers to use new resource handler types.
Low-level Triton kernel changes
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/attention_with_kv_cache.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
Renamed cache_loc_ptr to slot_idx_ptr; added automatic attention scaling computation; removed entirely: triton_attention_internal.py and attention_with_paged_kv_cache.py (paged KV cache Triton kernels).
Cache management integration
tensorrt_llm/_torch/auto_deploy/shim/interface.py, tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py, tensorrt_llm/_torch/auto_deploy/transform/library/...
Enhanced CachedSequenceInterface with multi-layer state management, KV/state resource separation, hybrid cache manager creation, and PTCacheBackend integration; refactored transform base classes to unified _InsertCachedOperator; added cache_config and use_pt_cache_backend configuration.
Test infrastructure
test_trtllm_attention.py, tests/test_trtllm_attention_cuda_graph.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_*, tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/test_lists/test-db/*.yml
Added TRT-LLM attention tests; updated flashinfer/resource handler/cached_sequence tests to use new KVPagedResourceHandler and slot_idx semantics; added parametrization for attention backend selection; removed comprehensive paged KV cache test suite; updated integration test configurations.
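
To make the cache backend row above more concrete, here is an illustrative sketch of what a per-layer cache backend interface could look like. The class and method names are assumptions for illustration, not the actual CacheBackend/SimpleCacheBackend API:

import abc
from typing import Dict

import torch

class CacheBackendSketch(abc.ABC):
    """Allocates and hands out per-layer KV cache tensors."""

    @abc.abstractmethod
    def initialize(self, num_layers: int) -> None: ...

    @abc.abstractmethod
    def get_layer_cache(self, layer_idx: int) -> Dict[str, torch.Tensor]: ...

class SimpleCacheBackendSketch(CacheBackendSketch):
    """Plain per-layer tensor allocation, with no external pool manager."""

    def __init__(self, num_blocks: int, tokens_per_block: int, num_kv_heads: int,
                 head_dim: int, dtype: torch.dtype = torch.float16, device: str = "cuda"):
        # Combined K+V cache per layer, HND-style layout.
        self._shape = (num_blocks, 2, num_kv_heads, tokens_per_block, head_dim)
        self._dtype, self._device = dtype, device
        self._caches: Dict[int, Dict[str, torch.Tensor]] = {}

    def initialize(self, num_layers: int) -> None:
        for layer_idx in range(num_layers):
            self._caches[layer_idx] = {
                "kv_cache": torch.zeros(self._shape, dtype=self._dtype, device=self._device)
            }

    def get_layer_cache(self, layer_idx: int) -> Dict[str, torch.Tensor]:
        return self._caches[layer_idx]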

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • kaiyux
  • Shixiaowei02
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 77.37%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
  • Description check — ⚠️ Warning: The PR description is empty; it contains only the template, with no explanation of what was changed or why. Resolution: add a description explaining the changes, the motivation, and test coverage, and confirm the PR checklist items.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title is specific and describes the main architectural change: adding a TRT-LLM attention backend with KV cache manager integration.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (2)

1-9: ⚠️ Potential issue | 🟠 Major

Add NVIDIA copyright header for this source file

Source files require an NVIDIA header with the year of the latest meaningful modification.

📌 Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 from dataclasses import dataclass, fields
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

462-469: ⚠️ Potential issue | 🟡 Minor

Silence the unused kv_cache argument in the fake op

Ruff reports kv_cache as unused in the fake implementation. Renaming it avoids lint noise without changing behavior.

🧹 Suggested fix
-    kv_cache: torch.Tensor,
+    _kv_cache: torch.Tensor,
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

1-8: ⚠️ Potential issue | 🟠 Major

Add NVIDIA copyright header for this source file

Source files require an NVIDIA header with the year of the latest meaningful modification.

📌 Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 from importlib.resources import files
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)

1-8: ⚠️ Potential issue | 🟠 Major

Add NVIDIA copyright header for this source file

Source files require an NVIDIA header with the year of the latest meaningful modification.

📌 Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 import copy
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 287-319: The field validator CacheConfig._coerce_dtype should not
use assert for validating string dtype names; instead, when value is a str look
up the attribute on torch (as currently done), and if the lookup yields None or
a non-torch.dtype raise a clear exception (e.g., ValueError) with the invalid
input included; update the validator in CacheConfig (decorated with
`@field_validator`("dtype", "mamba_dtype", "delta_dtype", mode="before")) to
replace the assert with an explicit raise that reports the offending value and
expected torch.dtype so invalid dtype strings are rejected reliably in all
runtime modes.
- Around line 434-436: SequenceInfo currently leaves self._num_blocks as None
(set in __init__), causing an assertion in SequenceInfo.num_blocks when code
paths (e.g., PTCacheBackend -> initialize()) access it before
estimate_cache_loc_capacity() runs; fix by initializing self._num_blocks in
__init__ to a sensible default (for example use math.ceil(max_num_tokens /
tokens_per_block) or another conservative estimate derived from constructor args
like max_num_tokens and tokens_per_block) so num_blocks is valid immediately and
later refined inside estimate_cache_loc_capacity(); update __init__ where
self._num_blocks is set and leave estimate_cache_loc_capacity() to overwrite
with the accurate value.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py`:
- Around line 1-2: Update the copyright header in
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py by changing the year
from 2025 to 2026 in the SPDX header lines (the two top-of-file lines starting
with "# SPDX-FileCopyrightText" and "# SPDX-License-Identifier") so the header
reflects the latest modification year.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py`:
- Around line 321-333: The page_size is being taken as kv_cache.shape[3] which
assumes HND layout; update the logic that computes page_size in the function
(where kv_cache, kv_layout and page_size are used) to derive tokens_per_block
based on kv_layout: if kv_layout indicates HND use shape[3], if it indicates NHD
use shape[1]; replace the hardcoded index with this conditional access so plan
creation and downstream uses of page_size are correct for both layouts (see the layout sketch after this list of fixes).

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py`:
- Around line 417-472: get_contiguous_caches currently assumes a single shared
contiguous buffer and raises if per-layer kv head counts differ; update it to
handle per-layer kv head variance by either allocating per-layer contiguous
buffers or moving the uniformity check to initialization with a clear validation
error. Concretely: in PTCacheBackend.get_contiguous_caches, when
self._shared_contiguous_k_cache/_shared_contiguous_v_cache would be created,
branch on whether max(self._config.num_kv_heads_per_layer) ==
self._config.num_kv_heads_per_layer[layer_idx]; if not, allocate per-layer
buffers (e.g., store dict/list keyed by layer_idx instead of the single
_shared_contiguous_k_cache/_shared_contiguous_v_cache) using pool shape from
self._kv_cache_manager.get_primary_pool_data(layer_idx) and the layer-specific
num_kv_heads, or alternatively add validation in the initializer (check
self._config.num_kv_heads_per_layer uniformity while self._initialized is set)
and raise a clear RuntimeError there; ensure logging (ad_logger.info) reflects
per-layer allocation and keep existing use of self._device and dtype.
- Around line 1-2: Update the SPDX header year from 2025 to 2026 at the top of
the file; specifically edit the copyright header lines in
tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (the
SPDX-FileCopyrightText and/or SPDX-License-Identifier header block) to reflect
2026 as the latest meaningful modification year.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py`:
- Around line 475-626: The function _prepare_trtllm_metadata currently raises if
ad_pool_pointers/ad_pool_mapping are missing which breaks the KV cache handler's
fallback allocation path; change the logic so when use_ad_pool is False you do
not raise but instead initialize predictable defaults for pool pointers/mapping
(e.g., zeros or sentinel values) and proceed to compute kv_cache_block_offsets
from cache_loc/pages_per_seq for the fallback cache layout (ensure
state.host_kv_cache_pool_pointers and state.host_kv_cache_pool_mapping are set
to valid defaults and any logging reflects fallback use), or alternatively move
the RuntimeError to an earlier allocation phase so missing AD pool pointers fail
fast during allocation rather than here (update checks around use_ad_pool,
state.host_kv_cache_pool_pointers, state.host_kv_cache_pool_mapping, and the
block-offset filling loop accordingly).
- Around line 222-305: TrtllmLayerState currently hardcodes device="cuda" in
__post_init__, causing allocations on the wrong GPU; add a device field to the
TrtllmLayerState dataclass (e.g., device: torch.device) and use that field
instead of the string "cuda" when allocating device tensors in __post_init__,
keeping host/pinned tensors on cpu as before; update callers (e.g.,
get_or_create_layer_state) to pass the correct device (kv_cache.device or
SequenceInfo.device) when constructing TrtllmLayerState so allocations follow
the model/kv cache device.
- Around line 1-2: Update the copyright header in trtllm_attention.py to show
the latest modification year 2026 by changing the existing copyright line(s)
that currently show 2025 to 2026; specifically edit the top-of-file SPDX header
lines in trtllm_attention.py (the lines beginning with "#
SPDX-FileCopyrightText" and/or "# SPDX-License-Identifier") so they reference
2026.

In `@tensorrt_llm/_torch/auto_deploy/llm_args.py`:
- Around line 458-467: The validator ensure_max_seq_len currently declares an
unused parameter named info which triggers the ARG lint rule; rename that
parameter to _info in the method signature of ensure_max_seq_len (the
`@field_validator`("max_seq_len", mode="before") classmethod) so it becomes
unused-by-convention and the linter warning is silenced, leaving the body
unchanged and preserving the return behavior that falls back to
AutoDeployConfig.model_fields["max_seq_len"].get_default(call_default_factory=True)
when value is None.

In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py`:
- Around line 785-802: The cache-refresh loop in CachedSequenceInterface only
checks for "k_cache" or "v_cache" and therefore misses new combined keys like
"kv_cache_*"; update the condition in the for-loop that iterates
self._caches.keys() to also detect combined kv names (e.g., check for "kv_cache"
or name.startswith("kv_cache_") or a substring match for "kv_cache") so that
entries created from self._cache_initializers are re-invoked and replaced
(preserving the existing logic that calls
self._cache_initializers[name](self.info), compares data_ptrs, increments
regenerated, and logs via ad_logger.info).

In `@tests/test_trtllm_attention_cuda_graph.py`:
- Around line 244-248: The three separate pytest skipif decorators cause
torch.cuda.get_device_capability() to be called at import time even on CPU-only
builds; update the decorators so the CUDA availability and device capability
checks are combined into a single skipif using short-circuit logic (e.g. combine
torch.cuda.is_available() and torch.cuda.get_device_capability()[0] < 8 into one
condition) while keeping the HAS_PT_CACHE_BACKEND check as its own decorator;
ensure the combined decorator uses a clear reason like "CUDA graphs require SM
8.0+ or CUDA not available" so get_device_capability() is only invoked when CUDA
is available.
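
As a concrete illustration of the flashinfer_attention.py page-size item above, a layout-aware lookup might look like the following. The helper name is hypothetical, and the sketch assumes the combined 5-D cache layout shown in the quoted code; for a separate 4-D k/v cache the indices would shift down by one (e.g. NHD would use shape[1], as the item above suggests):

import torch

def tokens_per_block_from_kv_cache(kv_cache: torch.Tensor, kv_layout: str) -> int:
    if kv_layout == "HND":  # [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]
        return kv_cache.shape[3]
    if kv_layout == "NHD":  # [num_blocks, 2, tokens_per_block, num_kv_heads, head_dim]
        return kv_cache.shape[2]
    raise ValueError(f"Unsupported kv_layout: {kv_layout!r}")

# Example with an HND-layout cache of 64 tokens per block:
hnd_cache = torch.empty(4, 2, 8, 64, 128)
assert tokens_per_block_from_kv_cache(hnd_cache, "HND") == 64
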
🧹 Nitpick comments (23)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

32-44: Consider annotating mutable class attribute with ClassVar.

The ATTN_BACKEND_CONFIGS dictionary is a mutable class attribute. While this works correctly, Python best practices and type checkers recommend annotating it with typing.ClassVar to explicitly indicate it's a class-level attribute not meant to be instance-specific.

💡 Suggested fix
+from typing import ClassVar, Dict, Any
+
 class TestLlama3_1_8B(LlmapiAccuracyTestHarness):
     MODEL_NAME = "meta-llama/Llama-3.1-8B"
     MODEL_PATH = hf_id_to_local_model_dir(MODEL_NAME)

     # Configuration presets for different attention backends
-    ATTN_BACKEND_CONFIGS = {
+    ATTN_BACKEND_CONFIGS: ClassVar[Dict[str, Dict[str, Any]]] = {
         "flashinfer": {
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/attention_with_kv_cache.py (1)

1-1: Consider adding NVIDIA copyright header.

This source file appears to be missing the standard NVIDIA copyright header that the coding guidelines require for all TensorRT-LLM source files (.py, .cpp, .cu, etc.).

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py (1)

23-52: Unused function parameters k and v.

The k and v parameters are passed to this function but never used - the reference computation extracts values directly from kv_cache. This appears intentional since the custom op appends k/v to the cache before this reference function is called, but the unused parameters could be removed for clarity.

💡 Suggested fix
 def _attention_with_fp8_kv_cache(
-    q, k, v, kv_cache, k_scale, v_scale, prefill_seq_len, causal, mask
+    q, kv_cache, k_scale, v_scale, prefill_seq_len, causal, mask
 ):
     """Simulates attention for fp8 kv cache with q,k,v outputs of GEMM in fp16"""
-    batch_size, seq_len, _ = k.shape
+    batch_size, seq_len, _ = q.shape

Note: This would also require updating the caller at line 786-787.

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

12-17: Use module-qualified imports per style guide.

The new from typing import ... and from pydantic import ... imports drop the module namespace. Please switch to module-qualified imports to align with the repo guideline (e.g., import typing, import pydantic and reference typing.Dict, pydantic.BaseModel, etc.).
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.


539-546: Initialize KV cache pool members in __init__, not at class scope.

_kv_cache_pool_pointers / _kv_cache_pool_mapping are class attributes; this risks shared state across instances and violates the constructor-initialization guideline. Set them to None in __init__ and keep only type hints at class scope.

Suggested fix
 class SequenceInfo:
     def __init__(...):
         ...
+        self._kv_cache_pool_pointers: Optional[torch.Tensor] = None
+        self._kv_cache_pool_mapping: Optional[torch.Tensor] = None
-    _kv_cache_pool_pointers: Optional[torch.Tensor] = None
-    _kv_cache_pool_mapping: Optional[torch.Tensor] = None
As per coding guidelines: Initialize all externally visible members of a Python class in the constructor.

Also applies to: 1060-1089

tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py (2)

29-31: Use module-qualified imports per style guide.

Please switch to module-qualified imports (e.g., import abc, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.


151-240: Rename unused layer_idx to avoid lint noise.

_allocate_layer_cache doesn’t use layer_idx. Consider renaming it to _layer_idx (or removing it) to silence the lint warning.

Suggested fix
-    def _allocate_layer_cache(self, layer_idx: int) -> Dict[str, torch.Tensor]:
+    def _allocate_layer_cache(self, _layer_idx: int) -> Dict[str, torch.Tensor]:
tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (1)

42-44: Use module-qualified imports per style guide.

Please switch to module-qualified imports (e.g., import dataclasses, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py (2)

46-47: Use module-qualified imports per style guide.

Please switch to module-qualified imports (e.g., import dataclasses, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.


471-472: Rename module globals to follow the G_ prefix rule.

_global_state and _trtllm_config are module-level globals and should use the G_ prefix (e.g., G_TRTLLM_GLOBAL_STATE, G_TRTLLM_CONFIG).
As per coding guidelines: Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL).

Also applies to: 1231-1232

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)

329-331: Consider renaming cache_loc to slot_idx in fused_mla_ref for consistency.

The fused_mla_ref function (lines 256-386) still uses cache_loc as a parameter name (line 264) and passes it to update_kv_cache. While functionally correct, this inconsistency could cause confusion since update_kv_cache now expects slot_idx. The same applies to the fake registration at lines 397-398 and usages at lines 337-338.


1-9: Missing NVIDIA copyright header.

This file should contain the NVIDIA copyright header as required by the coding guidelines for all source files.

Proposed fix
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Torch reference implementations for attention."""

As per coding guidelines: All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_update_kv_cache.py (1)

1-5: Missing NVIDIA copyright header.

This test file should contain the NVIDIA copyright header as required by the coding guidelines.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_attention_with_kv_cache.py (1)

1-6: Missing NVIDIA copyright header.

This test file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)

1-11: Missing NVIDIA copyright header.

This file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (1)

1-10: Missing NVIDIA copyright header.

This file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (1)

1-1: Update copyright year to include 2026.

The copyright header shows years 2022-2025, but this file has been meaningfully modified. Per coding guidelines, the year should be updated to reflect the latest meaningful modification.

Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)

10-11: Comment on line 10 is misleading.

The comment "Initialize resources first" doesn't accurately describe what this import does. The import simply makes KVPagedResourceHandler available for use in tests below - it doesn't initialize any resources.

📝 Suggested comment fix
-# Initialize resources first (KVPagedResourceHandler is used within tests below)
+# Import KVPagedResourceHandler for paged KV cache resource tests
 from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import KVPagedResourceHandler

295-297: Redundant import - KVPagedResourceHandler already imported at module level.

This import is unnecessary since KVPagedResourceHandler is already imported at line 11.

♻️ Remove redundant import
     # Add a resource to verify initialize_resources is called
-    from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import (
-        KVPagedResourceHandler,
-    )
-
     dummy_cached_interface.add_resource(
test_trtllm_attention.py (2)

1-6: Consider moving test file to the tests directory.

This standalone test script is at the repository root. For consistency with the project structure, consider moving it to tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_attention.py or a similar location.


155-165: Prefix unused variable with underscore.

The unpacked variable host_kv_cache_pool_mapping is never used. Prefix it with an underscore to indicate it's intentionally ignored.

📝 Fix unused variable
         (
             sequence_length,
             host_past_key_value_lengths,
             host_total_kv_lens,
             context_lengths,
             host_context_lengths,
             host_request_types,
             kv_cache_block_offsets,
             host_kv_cache_pool_pointers,
-            host_kv_cache_pool_mapping,
+            _host_kv_cache_pool_mapping,
         ) = result
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (1)

120-140: Unused variable num_prefill_tokens should be prefixed or removed.

Line 137 unpacks num_prefill_tokens from q.shape but it's never used in the function. The function uses len(input_pos) for num_prefill instead.

📝 Fix unused variable
 def _prefill_attention(
     ...
 ) -> None:
     """Handle prefill phase - context attention with variable sequence lengths."""
     # NOTE: num_prefill_tokens == sum(seq_len)
-    num_prefill_tokens, n_heads, q_d_head = q.shape
+    _, n_heads, q_d_head = q.shape
     max_cache_seq_len, n_kv_heads = k_cache.shape[1:3]
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)

529-571: Replace debug print statements with logger calls

Raw prints in library code will spam stdout and are hard to control in CI and CUDA-graph capture. Prefer ad_logger.debug(...) or remove the statements entirely.

♻️ Suggested refactor (apply to the whole block)
-        print("[DEBUG CachedSequenceInterface._init_kv_cache_manager]")
-        print(
-            f"  hasattr kv_cache_pool_pointers: {hasattr(self._kv_cache_manager, 'kv_cache_pool_pointers')}"
-        )
+        ad_logger.debug("[CachedSequenceInterface] init_kv_cache_manager")
+        ad_logger.debug(
+            "  hasattr kv_cache_pool_pointers: %s",
+            hasattr(self._kv_cache_manager, "kv_cache_pool_pointers"),
+        )
         if hasattr(self._kv_cache_manager, "kv_cache_pool_pointers"):
             pool_ptrs = self._kv_cache_manager.kv_cache_pool_pointers
             pool_map = self._kv_cache_manager.kv_cache_pool_mapping
-            print(f"  kv_cache_pool_pointers: {pool_ptrs}")
-            print(
-                f"  kv_cache_pool_mapping.shape: {pool_map.shape if pool_map is not None else None}"
-            )
+            ad_logger.debug("  kv_cache_pool_pointers: %s", pool_ptrs)
+            ad_logger.debug(
+                "  kv_cache_pool_mapping.shape: %s",
+                pool_map.shape if pool_map is not None else None,
+            )

             self.info.set_kv_cache_pool_info(pool_ptrs, pool_map)
-            print("  Set pool info on SequenceInfo")
-            print(f"  self.info.kv_cache_pool_pointers: {self.info.kv_cache_pool_pointers}")
+            ad_logger.debug("  Set pool info on SequenceInfo")
+            ad_logger.debug(
+                "  self.info.kv_cache_pool_pointers: %s",
+                self.info.kv_cache_pool_pointers,
+            )

             try:
                 from ..custom_ops.trtllm_attention import _trtllm_config

-                print(f"  _trtllm_config.is_configured: {_trtllm_config.is_configured}")
+                ad_logger.debug(
+                    "  _trtllm_config.is_configured: %s", _trtllm_config.is_configured
+                )
                 if not _trtllm_config.is_configured:
                     _trtllm_config.configure(self.info)
-                    print("  Configured _trtllm_config with SequenceInfo")
-                    print(f"  _trtllm_config._sequence_info: {_trtllm_config._sequence_info}")
+                    ad_logger.debug("  Configured _trtllm_config with SequenceInfo")
+                    ad_logger.debug(
+                        "  _trtllm_config._sequence_info: %s",
+                        _trtllm_config._sequence_info,
+                    )

                 if _trtllm_config._num_layers == 0 and kv_ref is not None:
                     num_kv_heads_list = [h.num_kv_heads for h in kv_managed.values()]
                     _trtllm_config.set_model_config(
                         num_layers=len(kv_managed),
                         num_kv_heads_per_layer=num_kv_heads_list,
                         head_dim=kv_ref.head_dim,
                         dtype=kv_ref.dtype,
                     )
-                    print(
-                        f"  Set model config: num_layers={len(kv_managed)}, "
-                        f"dtype={kv_ref.dtype}, quant_mode={_trtllm_config._quant_mode}"
-                    )
+                    ad_logger.debug(
+                        "  Set model config: num_layers=%s, dtype=%s, quant_mode=%s",
+                        len(kv_managed),
+                        kv_ref.dtype,
+                        _trtllm_config._quant_mode,
+                    )
             except ImportError:
-                print("  TRT-LLM attention import failed")
+                ad_logger.debug("  TRT-LLM attention import failed")
                 pass

Comment on lines +287 to +319
class CacheConfig(BaseModel):
    """Cache configuration for attention-related dtypes."""

    model_config = ConfigDict(
        arbitrary_types_allowed=True,
        extra="forbid",
    )

    dtype: Optional[torch.dtype] = Field(default=None, description="KV cache dtype.")
    mamba_dtype: Optional[torch.dtype] = Field(default=None, description="Mamba cache dtype.")
    delta_dtype: Optional[torch.dtype] = Field(
        default=torch.float32, description="Delta cache dtype. Defaults to float32."
    )

    @field_validator("dtype", "mamba_dtype", "delta_dtype", mode="before")
    @classmethod
    def _coerce_dtype(cls, value):
        if value is None or isinstance(value, torch.dtype):
            return value
        if isinstance(value, str):
            dtype = getattr(torch, value, None)
            assert isinstance(dtype, torch.dtype), f"Invalid {dtype=}"
            return dtype
        return value

    def __or__(self, other: "CacheConfig") -> "CacheConfig":
        """Combine two CacheConfig objects field-wise using Python's `or` semantics."""
        if not isinstance(other, CacheConfig):
            raise NotImplementedError(f"Cannot combine CacheConfig with {type(other)}")
        merged_kwargs = {}
        for field_name in type(self).model_fields.keys():
            merged_kwargs[field_name] = getattr(self, field_name) or getattr(other, field_name)
        return CacheConfig(**merged_kwargs)

⚠️ Potential issue | 🟡 Minor

Avoid assert for user-facing dtype validation.

_coerce_dtype uses assert, which can be stripped in optimized runs and let invalid values through. Prefer an explicit exception.

Suggested fix
-        if isinstance(value, str):
-            dtype = getattr(torch, value, None)
-            assert isinstance(dtype, torch.dtype), f"Invalid {dtype=}"
-            return dtype
+        if isinstance(value, str):
+            dtype = getattr(torch, value, None)
+            if not isinstance(dtype, torch.dtype):
+                raise ValueError(f"Invalid dtype string: {value!r}")
+            return dtype
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 287 - 319, The field validator CacheConfig._coerce_dtype should not use
assert for validating string dtype names; instead, when value is a str look up
the attribute on torch (as currently done), and if the lookup yields None or a
non-torch.dtype raise a clear exception (e.g., ValueError) with the invalid
input included; update the validator in CacheConfig (decorated with
`@field_validator`("dtype", "mamba_dtype", "delta_dtype", mode="before")) to
replace the assert with an explicit raise that reports the offending value and
expected torch.dtype so invalid dtype strings are rejected reliably in all
runtime modes.

Comment on lines +434 to +436
# will store num_blocks later...
self._num_blocks = None


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, check if the file exists and get its basic structure
file_path="tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py"
if [ -f "$file_path" ]; then
  echo "=== File size ===" 
  wc -l "$file_path"
  echo ""
  echo "=== Lines 430-440 (around line 434) ===" 
  sed -n '430,440p' "$file_path"
  echo ""
  echo "=== Lines 610-640 (around lines 617-636) ===" 
  sed -n '610,640p' "$file_path"
else
  echo "File not found"
fi

Repository: NVIDIA/TensorRT-LLM

Length of output: 2110


🏁 Script executed:

#!/bin/bash
# Find SequenceInfo class definition and num_blocks property
rg -n "class SequenceInfo" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 5
echo "---"
rg -n "def num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -B 2 -A 5
echo "---"
rg -n "@property" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 3 | grep -A 3 "num_blocks"

Repository: NVIDIA/TensorRT-LLM

Length of output: 1047


🏁 Script executed:

#!/bin/bash
# Find all usages of num_blocks and estimate_cache_loc_capacity
echo "=== Usages of num_blocks ===" 
rg -n "\.num_blocks" -g'*.py'
echo ""
echo "=== Calls to estimate_cache_loc_capacity ===" 
rg -n "estimate_cache_loc_capacity" -g'*.py'

Repository: NVIDIA/TensorRT-LLM

Length of output: 9127


🏁 Script executed:

#!/bin/bash
# Check the initialization sequence in interface.py around line 525
echo "=== Interface.py context (line 520-535) ===" 
sed -n '520,535p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Lines 1165-1180 in attention_interface.py ===" 
sed -n '1165,1180p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Lines 1215-1225 in trtllm_attention.py ===" 
sed -n '1215,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2011


🏁 Script executed:

#!/bin/bash
echo "=== cache_backend.py context (line 190-200) ===" 
sed -n '190,200p' tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py
echo ""
echo "=== pt_cache_backend.py context (line 220-240) ===" 
sed -n '220,240p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== pt_cache_backend.py context (line 330-345) ===" 
sed -n '330,345p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2773


🏁 Script executed:

#!/bin/bash
echo "=== Test case (test_resource_handlers.py lines 50-60) ===" 
sed -n '50,60p' tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_resource_handlers.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 481


🏁 Script executed:

#!/bin/bash
echo "=== Looking for SequenceInfo instantiation patterns ===" 
rg -n "SequenceInfo(" -B 2 -A 5 tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | head -40
echo ""
echo "=== Check the full __init__ method of SequenceInfo ===" 
sed -n '322,450p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 7826


🏁 Script executed:

#!/bin/bash
echo "=== Check trtllm_attention.py to see initialize call context ===" 
sed -n '1200,1230p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py
echo ""
echo "=== Check the full initialize method in pt_cache_backend ===" 
sed -n '200,230p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2668


🏁 Script executed:

#!/bin/bash
echo "=== Checking the estimate_cache_tokens_per_forward method ===" 
sed -n '622,631p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Full context of lines 617-645 ===" 
sed -n '617,645p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2142


🏁 Script executed:

#!/bin/bash
echo "=== Check where _setup_pt_cache_backend is called ===" 
rg -n "_setup_pt_cache_backend" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 3 -A 1
echo ""
echo "=== Check the full trtllm_attention initialization flow ===" 
sed -n '1185,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py
echo ""
echo "=== Check if there are any early accesses before estimate_cache_loc_capacity ===" 
rg -n "\.num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 5

Repository: NVIDIA/TensorRT-LLM

Length of output: 2848


🏁 Script executed:

#!/bin/bash
echo "=== Search for where initialize() is called on cache backends ===" 
rg -n "\.initialize\(" tensorrt_llm/_torch/auto_deploy/custom_ops/ -B 3 -A 1

Repository: NVIDIA/TensorRT-LLM

Length of output: 1146


🏁 Script executed:

#!/bin/bash
echo "=== Check CachedSequenceInterface to see full initialization flow ===" 
sed -n '450,550p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Look for where set_model_config is called ===" 
rg -n "set_model_config" tensorrt_llm/_torch/auto_deploy/ -B 2 -A 2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 7632


🏁 Script executed:

#!/bin/bash
echo "=== Check setup_pt_cache_backend context in trtllm_attention ===" 
sed -n '1150,1230p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 3421


🏁 Script executed:

#!/bin/bash
echo "=== Check allocate() method that accesses num_blocks ===" 
sed -n '1160,1185p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Check cache_backend.py initialize() that accesses num_blocks ===" 
sed -n '188,210p' tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py
echo ""
echo "=== Check when cache backends' initialize is called in the flow ===" 
rg -n "_allocate_unmanaged_resources\|handler.allocate" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 2 -A 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 2131


🏁 Script executed:

#!/bin/bash
echo "=== Search for all SequenceInfo creations ===" 
rg -n "SequenceInfo(" --type py -B 1 -A 3 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 202


🏁 Script executed:

#!/bin/bash
echo "=== Check _allocate_unmanaged_resources in detail ===" 
sed -n '475,515p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Check the order of calls in __init__ ===" 
rg -n "def __init__" tensorrt_llm/_torch/auto_deploy/shim/interface.py -A 50 | head -70

Repository: NVIDIA/TensorRT-LLM

Length of output: 5344


🏁 Script executed:

#!/bin/bash
echo "=== Check KVPagedResourceHandler.allocate method ===" 
rg -n "class KVPagedResourceHandler" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 100 | grep -A 50 "def allocate"

Repository: NVIDIA/TensorRT-LLM

Length of output: 2409


🏁 Script executed:

#!/bin/bash
echo "=== Full _create_kv_cache_manager method ===" 
sed -n '403,530p' tensorrt_llm/_torch/auto_deploy/shim/interface.py | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 4804


🏁 Script executed:

#!/bin/bash
echo "=== Where is _allocate_unmanaged_resources called ===" 
rg -n "_allocate_unmanaged_resources" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 5 -A 2
echo ""
echo "=== Where is _create_kv_cache_manager called ===" 
rg -n "_create_kv_cache_manager" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 3 -A 3

Repository: NVIDIA/TensorRT-LLM

Length of output: 2906


🏁 Script executed:

#!/bin/bash
echo "=== Check if allocate can be called before estimate_cache_loc_capacity ===" 
rg -n "allocate\|estimate_cache_loc_capacity" tensorrt_llm/_torch/auto_deploy/shim/interface.py | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 137


🏁 Script executed:

#!/bin/bash
echo "=== Check if TRT-LLM flow is independent ===" 
sed -n '1150,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | grep -n "num_blocks\|initialize"

Repository: NVIDIA/TensorRT-LLM

Length of output: 759


🏁 Script executed:

#!/bin/bash
echo "=== Check the complete sequence in interface.py from line 640-660 ===" 
sed -n '640,660p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Look for where estimate_cache_loc_capacity is called ===" 
sed -n '520,530p' tensorrt_llm/_torch/auto_deploy/shim/interface.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1760


🏁 Script executed:

#!/bin/bash
echo "=== Check pt_cache_backend.initialize at line 225 ===" 
sed -n '220,230p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== Check if there's any early initialization ===" 
rg -n "\.initialize\(" tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== Check the test to see the expected order ===" 
sed -n '50,75p' tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_resource_handlers.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1556


🏁 Script executed:

#!/bin/bash
echo "=== Check where _setup_pt_cache_backend is called ===" 
rg -n "_setup_pt_cache_backend\|setup_pt_cache_backend" -g'*.py'
echo ""
echo "=== Check if there are independent calls ===" 
rg -n "\.pt_cache_backend" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 714


🏁 Script executed:

#!/bin/bash
echo "=== Check the property that raises the assertion ===" 
sed -n '617,621p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Check what happens during initialization without estimate_cache_loc_capacity ===" 
python3 << 'PYTHON_CODE'
# Simulate the scenario
class SequenceInfo:
    def __init__(self, max_seq_len, max_batch_size, tokens_per_block=None, max_num_tokens=None):
        self.max_seq_len = max_seq_len
        self.max_batch_size = max_batch_size
        self.tokens_per_block = tokens_per_block or max_seq_len
        self.max_num_tokens = max_num_tokens or (max_seq_len + 1) * max_batch_size
        self._num_blocks = None  # Currently None
    
    @property
    def num_blocks(self):
        assert self._num_blocks is not None, "num_blocks not set yet"
        return self._num_blocks
    
    def estimate_cache_loc_capacity(self, num_blocks):
        self._num_blocks = num_blocks

# Test: can we derive _num_blocks in __init__?
import math
max_num_tokens = 256 * 4  # 256 seq_len, 4 batch
tokens_per_block = 256
derived_blocks = math.ceil(max_num_tokens / tokens_per_block)
print(f"max_num_tokens={max_num_tokens}, tokens_per_block={tokens_per_block}")
print(f"Derived _num_blocks would be: {derived_blocks}")
print(f"This is a reasonable default until estimate_cache_loc_capacity() is called")
PYTHON_CODE

Repository: NVIDIA/TensorRT-LLM

Length of output: 500


🏁 Script executed:

#!/bin/bash
echo "=== Look for test cases that might hit the assertion ===" 
find tests -name "*.py" -type f -exec grep -l "num_blocks\|SequenceInfo" {} \; | head -10 | xargs -I {} sh -c 'echo "=== {} ===" && grep -n "num_blocks\|estimate_cache_loc_capacity" {}'

Repository: NVIDIA/TensorRT-LLM

Length of output: 3960


🏁 Script executed:

#!/bin/bash
echo "=== Check for potential unguarded accesses ===" 
rg -n "sequence_info\.num_blocks|si\.num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 2 -A 1
echo ""
echo "=== Verify the actual initialization order in the TRT-LLM flow ===" 
sed -n '1150,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 3908


Initialize _num_blocks with a computed default in __init__ instead of None.

SequenceInfo.num_blocks asserts _num_blocks is set, but it's only assigned in estimate_cache_loc_capacity(). Code paths that create PTCacheBackend and call initialize() directly (e.g., trtllm_attention.py line 1216) will hit this assertion before estimate_cache_loc_capacity() is called.

Initialize _num_blocks in __init__ using a reasonable estimate based on available parameters (e.g., ceil(max_num_tokens / tokens_per_block)), which will later be refined by estimate_cache_loc_capacity().
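
A minimal sketch of that suggestion, with constructor arguments assumed from the discussion above rather than the exact SequenceInfo signature:

import math
from typing import Optional

class SequenceInfoSketch:
    def __init__(self, max_seq_len: int, max_batch_size: int,
                 tokens_per_block: Optional[int] = None, max_num_tokens: Optional[int] = None):
        self.tokens_per_block = tokens_per_block or max_seq_len
        self.max_num_tokens = max_num_tokens or (max_seq_len + 1) * max_batch_size
        # Conservative default so num_blocks is usable immediately; refined later
        # by estimate_cache_loc_capacity().
        self._num_blocks = math.ceil(self.max_num_tokens / self.tokens_per_block)

    @property
    def num_blocks(self) -> int:
        return self._num_blocks

    def estimate_cache_loc_capacity(self, num_blocks: int) -> None:
        self._num_blocks = num_blocks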

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 434 - 436, SequenceInfo currently leaves self._num_blocks as None (set in
__init__), causing an assertion in SequenceInfo.num_blocks when code paths
(e.g., PTCacheBackend -> initialize()) access it before
estimate_cache_loc_capacity() runs; fix by initializing self._num_blocks in
__init__ to a sensible default (for example use math.ceil(max_num_tokens /
tokens_per_block) or another conservative estimate derived from constructor args
like max_num_tokens and tokens_per_block) so num_blocks is valid immediately and
later refined inside estimate_cache_loc_capacity(); update __init__ where
self._num_blocks is set and leave estimate_cache_loc_capacity() to overwrite
with the accurate value.

Comment on lines +1 to +2
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

⚠️ Potential issue | 🟡 Minor

Update copyright year to 2026.

This file is newly added/modified in 2026 but the header still reads 2025.

Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py` around lines 1 -
2, Update the copyright header in
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py by changing the year
from 2025 to 2026 in the SPDX header lines (the two top-of-file lines starting
with "# SPDX-FileCopyrightText" and "# SPDX-License-Identifier") so the header
reflects the latest modification year.

Comment on lines +321 to 333
    # CACHES - combined KV cache with shape [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]
    kv_cache: torch.Tensor,
    # CONSTANTS
    scale: Optional[float],
    k_scale: float,
    v_scale: float,
) -> torch.Tensor:
    # reshape to standard [b*s, n_heads, head_dim] layout
    head_dim = k_cache.shape[-1]
    # kv_cache shape: [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim] (HND layout)
    head_dim = kv_cache.shape[-1]
    page_size = kv_cache.shape[3]  # tokens_per_block
    q_shape_og = q.shape
    b, s = q_shape_og[:2]


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Verification scripts (rg/sed/grep over flashinfer_attention.py, attention_interface.py, and the FlashInfer attention unit tests) traced how _GlobalFlashInferPlanner.kv_layout is configured, where page_size/tokens_per_block is derived, and how KVPagedResourceHandler allocates the cache for the HND and NHD layouts.


Derive page_size from kv_layout to keep NHD support correct

The page_size calculation at line 331 uses kv_cache.shape[3], which is correct only for HND layout [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]. For NHD layout [num_blocks, tokens_per_block, 2, num_kv_heads, head_dim], tokens_per_block is at dimension 1, not dimension 3, so the plans will be created with an incorrect page size if NHD is ever used.

-    # kv_cache shape: [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim] (HND layout)
+    # kv_cache shape depends on kv_layout (HND or NHD)
     head_dim = kv_cache.shape[-1]
-    page_size = kv_cache.shape[3]  # tokens_per_block
+    if _GlobalFlashInferPlanner.kv_layout == "HND":
+        page_size = kv_cache.shape[3]  # tokens_per_block
+    else:  # NHD
+        page_size = kv_cache.shape[1]  # tokens_per_block
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py` around
lines 321 - 333, The page_size is being taken as kv_cache.shape[3] which assumes
HND layout; update the logic that computes page_size in the function (where
kv_cache, kv_layout and page_size are used) to derive tokens_per_block based on
kv_layout: if kv_layout indicates HND use shape[3], if it indicates NHD use
shape[1]; replace the hardcoded index with this conditional access so plan
creation and downstream uses of page_size are correct for both layouts.

Comment on lines +1 to +2
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

⚠️ Potential issue | 🟡 Minor

Update copyright year to 2026.

Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py` around lines
1 - 2, Update the SPDX header year from 2025 to 2026 at the top of the file;
specifically edit the copyright header lines in
tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (the
SPDX-FileCopyrightText and/or SPDX-License-Identifier header block) to reflect
2026 as the latest meaningful modification year.

Comment on lines 222 to 305
@dataclass
class TrtllmLayerState:
"""Per-layer state for TRT-LLM attention wrapper."""

layer_idx: int
num_heads: int
num_kv_heads: int
head_dim: int
tokens_per_block: int
max_num_requests: int
max_context_length: int
num_layers: int = 0 # Total number of layers for block offset calculation

# Pre-allocated tensors for metadata translation
# Device tensors
sequence_length: torch.Tensor = field(default=None)
context_lengths: torch.Tensor = field(default=None)
kv_cache_block_offsets: torch.Tensor = field(default=None)

# Host tensors (pinned for async H2D)
host_past_key_value_lengths: torch.Tensor = field(default=None)
host_context_lengths: torch.Tensor = field(default=None)
host_request_types: torch.Tensor = field(default=None)
host_total_kv_lens: torch.Tensor = field(default=None)
host_kv_cache_pool_pointers: torch.Tensor = field(default=None)
host_kv_cache_pool_mapping: torch.Tensor = field(default=None)

# Interleaved KV cache buffer for kernel (allocated lazily)
interleaved_kv_cache: torch.Tensor = field(default=None)

def __post_init__(self):
"""Allocate pre-sized tensors."""
if self.sequence_length is None:
device = "cuda"

# Device tensors
self.sequence_length = torch.zeros(
self.max_num_requests, dtype=torch.int32, device=device
)
self.context_lengths = torch.zeros(
self.max_num_requests, dtype=torch.int32, device=device
)

# Pre-allocate kv_cache_block_offsets with MAX size for CUDA graph stability
max_blocks_per_seq = (
self.max_context_length + self.tokens_per_block - 1
) // self.tokens_per_block
self.kv_cache_block_offsets = torch.zeros(
1, # num_pools
self.max_num_requests,
2, # K and V
max_blocks_per_seq,
dtype=torch.int32,
device=device,
)

# Host tensors (pinned memory for async transfers)
self.host_past_key_value_lengths = torch.zeros(
self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
)
self.host_context_lengths = torch.zeros(
self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
)
self.host_request_types = torch.zeros(
self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
)
self.host_total_kv_lens = torch.zeros(
2, dtype=torch.int64, device="cpu", pin_memory=True
)
# Pool pointers: [num_pools, 2] where each row is [k_cache_ptr, v_cache_ptr]
# thop.attention expects 2D tensor: [num_pools, 2]
self.host_kv_cache_pool_pointers = torch.zeros(
1, 2, dtype=torch.int64, device="cpu", pin_memory=True
)
# Pool mapping: 2D [num_layers, 2] format expected by thop.attention
# pool_mapping[layer, 0] = pool_idx (0 for single pool)
# pool_mapping[layer, 1] = layer_offset (0 when using per-layer pointers)
# Use max 256 layers to cover most models
max_layers = 256
self.host_kv_cache_pool_mapping = torch.zeros(
max_layers, 2, dtype=torch.int32, device="cpu", pin_memory=True
)



⚠️ Potential issue | 🟠 Major

TrtllmLayerState hard-codes device to "cuda".

This will allocate on the default CUDA device even if the model runs on a different GPU (or CPU in tests). Please pass the target device into TrtllmLayerState (e.g., via get_or_create_layer_state using kv_cache.device or SequenceInfo.device) and allocate on that device instead.

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py` around lines
222 - 305, TrtllmLayerState currently hardcodes device="cuda" in __post_init__,
causing allocations on the wrong GPU; add a device field to the TrtllmLayerState
dataclass (e.g., device: torch.device) and use that field instead of the string
"cuda" when allocating device tensors in __post_init__, keeping host/pinned
tensors on cpu as before; update callers (e.g., get_or_create_layer_state) to
pass the correct device (kv_cache.device or SequenceInfo.device) when
constructing TrtllmLayerState so allocations follow the model/kv cache device.
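A sketch of the suggested direction, assuming the dataclass otherwise keeps its current fields; only two tensors are shown, and the caller wiring (passing kv_cache.device) is illustrative:

from dataclasses import dataclass

import torch


@dataclass
class TrtllmLayerState:
    layer_idx: int
    max_num_requests: int
    # Explicit target device instead of a hard-coded "cuda" string; callers
    # pass kv_cache.device (or SequenceInfo.device) when constructing the state.
    device: torch.device = torch.device("cuda")
    sequence_length: torch.Tensor = None
    context_lengths: torch.Tensor = None

    def __post_init__(self) -> None:
        if self.sequence_length is None:
            self.sequence_length = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device=self.device
            )
            self.context_lengths = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device=self.device
            )
            # Host/pinned tensors remain on CPU exactly as before.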

Comment on lines 475 to 626
def _prepare_trtllm_metadata(
batch_info_host: torch.Tensor,
cu_seqlen_host: torch.Tensor,
cu_num_pages: torch.Tensor,
cu_num_pages_host: torch.Tensor,
cache_loc: torch.Tensor,
last_page_len: torch.Tensor,
last_page_len_host: torch.Tensor,
seq_len_with_cache_host: torch.Tensor,
state: TrtllmLayerState,
kv_cache: torch.Tensor,
ad_pool_pointers: Optional[torch.Tensor] = None,
ad_pool_mapping: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, ...]:
"""Prepare TRT-LLM metadata from AD metadata.

For CUDA graph support (like pt_cache_backend):
- During capture: Set host tensors to MAX, skip device operations
- Outside capture: Normal operation

Args:
batch_info_host: [num_prefill, num_prefill_tokens, num_decode]
cu_seqlen_host: Cumulative sequence lengths [num_seq + 1]
cu_num_pages: Cumulative page counts [num_seq + 1]
cu_num_pages_host: Same as cu_num_pages but on host
cache_loc: Flat page indices for all sequences
last_page_len: Tokens in last page per sequence
last_page_len_host: Same on host
seq_len_with_cache_host: Total seq length including cached tokens
state: Per-layer TRT-LLM state
kv_cache: Unified KV cache tensor [num_blocks, kv_factor=2, num_kv_heads, tokens_per_block, head_dim]
ad_pool_pointers: Optional AD pool pointers from KVCacheManager (shape: [num_pools, 2])
ad_pool_mapping: Optional AD pool mapping from KVCacheManager (shape: [num_layers, 2])

Returns:
Tuple of tensors needed by thop.attention
"""
num_prefill, num_prefill_tokens, num_decode = batch_info_host.tolist()
num_seq = num_prefill + num_decode

# Check if in CUDA graph capture mode
is_capturing = torch.cuda.is_current_stream_capturing()

# Compute input sequence lengths from cumulative sums
input_seq_lens = (cu_seqlen_host[1 : num_seq + 1] - cu_seqlen_host[:num_seq]).int()
seq_len_with_cache = seq_len_with_cache_host[:num_seq].int()
past_kv_lens = seq_len_with_cache - input_seq_lens.cpu()

# CUDA GRAPH FIX: Set host tensors to MAX during capture (like pt_cache_backend)
if is_capturing:
max_seq = state.max_context_length
state.host_past_key_value_lengths[:num_seq].fill_(max_seq)
state.host_context_lengths[:num_seq].fill_(max_seq)
state.host_request_types[:num_seq].fill_(1)
state.host_total_kv_lens[0] = 0
state.host_total_kv_lens[1] = max_seq * num_seq
else:
# Normal operation: fill host tensors
state.host_past_key_value_lengths[:num_seq].copy_(past_kv_lens)
state.host_context_lengths[:num_seq].copy_(input_seq_lens.cpu())
state.host_request_types[:num_prefill].fill_(0)
state.host_request_types[num_prefill:num_seq].fill_(1)
context_total_kv = seq_len_with_cache[:num_prefill].sum().item() if num_prefill > 0 else 0
gen_total_kv = seq_len_with_cache[num_prefill:num_seq].sum().item() if num_decode > 0 else 0
state.host_total_kv_lens[0] = context_total_kv
state.host_total_kv_lens[1] = gen_total_kv

# Device operations - skip during capture (like pt_cache_backend's skip_device_ops)
if not is_capturing:
# Sync before copy to catch any previous async errors
torch.cuda.synchronize()

# Copy to pre-allocated tensors
state.sequence_length[:num_seq].copy_(seq_len_with_cache.cuda())
state.context_lengths[:num_seq].copy_(input_seq_lens.cuda())

# Validate kv_cache shape (safe during capture - no device ops)
if len(kv_cache.shape) != 5 or kv_cache.shape[1] != 2:
raise RuntimeError(
f"Expected kv_cache shape [pages, 2, heads, tokens, dim], got {kv_cache.shape}"
)

num_layers = state.num_layers if state.num_layers > 0 else 32

# Pool pointer and block offset setup - skip during capture (contains .item() calls)
if not is_capturing:
# Set up KV cache pool pointers
use_ad_pool = (
ad_pool_pointers is not None
and ad_pool_mapping is not None
and ad_pool_pointers.numel() > 0
and ad_pool_pointers[0, 0].item() != 0
)

if not use_ad_pool:
raise RuntimeError(
f"AD pool not available. ad_pool_pointers={ad_pool_pointers}, "
f"ad_pool_mapping={ad_pool_mapping}"
)

# Use AD's pool pointers directly
state.host_kv_cache_pool_pointers[0, 0] = ad_pool_pointers[0, 0].item()
state.host_kv_cache_pool_pointers[0, 1] = 0

# Use AD's pool mapping directly
for layer_i in range(min(num_layers, ad_pool_mapping.shape[0])):
state.host_kv_cache_pool_mapping[layer_i, 0] = ad_pool_mapping[layer_i, 0].item()
state.host_kv_cache_pool_mapping[layer_i, 1] = ad_pool_mapping[layer_i, 1].item()

# Log pool setup for debugging (only once)
if state.layer_idx == 0 and not hasattr(state, "_pool_logged"):
state._pool_logged = True
ad_logger.debug(
f"[TRT-LLM Attention] Using AD pool directly: "
f"pool_ptr={state.host_kv_cache_pool_pointers[0, 0]}"
)

# Block offsets: convert flat cache_loc to per-sequence block indices
pages_per_seq = (cu_num_pages_host[1 : num_seq + 1] - cu_num_pages_host[:num_seq]).int()
max_blocks = pages_per_seq.max().item() if num_seq > 0 else 1
_global_state.set_max_blocks_per_seq(max_blocks)

# kv_cache_block_offsets is pre-allocated in __post_init__, don't reallocate

# Fill block offsets
kv_factor = 2
multiplier = num_layers * kv_factor
state.kv_cache_block_offsets.zero_()
offset = 0
for i in range(num_seq):
n_pages = pages_per_seq[i].item()
if n_pages > 0:
base_offsets = cache_loc[offset : offset + n_pages] * multiplier
state.kv_cache_block_offsets[0, i, 0, :n_pages] = base_offsets
state.kv_cache_block_offsets[0, i, 1, :n_pages] = base_offsets + 1
offset += n_pages

# Return tensors
# Use pre-allocated tensor size for block offsets (CUDA graph compatibility)
max_blocks_per_seq = state.kv_cache_block_offsets.shape[3]

return (
state.sequence_length[:num_seq],
state.host_past_key_value_lengths[:num_seq],
state.host_total_kv_lens,
state.context_lengths[:num_seq],
state.host_context_lengths[:num_seq],
state.host_request_types[:num_seq],
state.kv_cache_block_offsets[:, :num_seq, :, :max_blocks_per_seq],
state.host_kv_cache_pool_pointers,
state.host_kv_cache_pool_mapping,
)

⚠️ Potential issue | 🟠 Major

Fallback path still requires AD pool pointers.

_prepare_trtllm_metadata raises if ad_pool_pointers/ad_pool_mapping are missing, but the KV cache handler includes a fallback allocation path. If that fallback path is ever used (e.g., no KVCacheManager), this will crash at runtime. Consider failing fast earlier (during allocation) with a clearer error, or implementing a non-pool-pointer metadata path for the fallback cache.

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 478-478: Unused function argument: cu_num_pages

(ARG001)


[warning] 481-481: Unused function argument: last_page_len

(ARG001)


[warning] 482-482: Unused function argument: last_page_len_host

(ARG001)


[warning] 512-512: Unpacked variable num_prefill_tokens is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 553-555: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 570-573: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py` around lines
475 - 626, The function _prepare_trtllm_metadata currently raises if
ad_pool_pointers/ad_pool_mapping are missing which breaks the KV cache handler's
fallback allocation path; change the logic so when use_ad_pool is False you do
not raise but instead initialize predictable defaults for pool pointers/mapping
(e.g., zeros or sentinel values) and proceed to compute kv_cache_block_offsets
from cache_loc/pages_per_seq for the fallback cache layout (ensure
state.host_kv_cache_pool_pointers and state.host_kv_cache_pool_mapping are set
to valid defaults and any logging reflects fallback use), or alternatively move
the RuntimeError to an earlier allocation phase so missing AD pool pointers fail
fast during allocation rather than here (update checks around use_ad_pool,
state.host_kv_cache_pool_pointers, state.host_kv_cache_pool_mapping, and the
block-offset filling loop accordingly).
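A small sketch of the fail-fast alternative mentioned above, checking the pool handles at allocation time rather than inside _prepare_trtllm_metadata; the function name validate_ad_pool is hypothetical:

from typing import Optional

import torch


def validate_ad_pool(
    ad_pool_pointers: Optional[torch.Tensor],
    ad_pool_mapping: Optional[torch.Tensor],
) -> None:
    """Raise during cache allocation if the KVCacheManager pool is unavailable."""
    if (
        ad_pool_pointers is None
        or ad_pool_mapping is None
        or ad_pool_pointers.numel() == 0
        or int(ad_pool_pointers[0, 0].item()) == 0
    ):
        raise RuntimeError(
            "TRT-LLM attention backend requires a KVCacheManager-backed pool; "
            "the fallback allocation path is not supported with this backend."
        )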

Comment on lines +458 to +467
@field_validator("max_seq_len", mode="before")
@classmethod
def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
if value is None:
# Fallback to the AutoDeployConfig default when not provided
return AutoDeployConfig.model_fields["max_seq_len"].get_default(
call_default_factory=True
)
return value


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verification scripts inspected tensorrt_llm/_torch/auto_deploy/llm_args.py (its field_validator methods and _check_for_default_value_only) and the project's ruff.toml, confirming that the info argument is unused in this validator and that Ruff's unused-argument checks are enabled.


Rename unused info parameter to _info

The info parameter is unused in this validator method. Ruff's ARG rule (flake8-unused-arguments) is enabled in the project configuration and would flag this. Renaming to _info follows Python convention for intentionally unused parameters and silences the warning.

-    def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
+    def ensure_max_seq_len(cls, value: Any, _info: ValidationInfo) -> Any:
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 460-460: Unused class method argument: info

(ARG003)

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/llm_args.py` around lines 458 - 467, The
validator ensure_max_seq_len currently declares an unused parameter named info
which triggers the ARG lint rule; rename that parameter to _info in the method
signature of ensure_max_seq_len (the `@field_validator`("max_seq_len",
mode="before") classmethod) so it becomes unused-by-convention and the linter
warning is silenced, leaving the body unchanged and preserving the return
behavior that falls back to
AutoDeployConfig.model_fields["max_seq_len"].get_default(call_default_factory=True)
when value is None.

Comment on lines +785 to +802
regenerated = 0
# Only regenerate k_cache and v_cache (KV caches that are views)
for name in list(self._caches.keys()):
if "k_cache" in name or "v_cache" in name:
if name in self._cache_initializers:
old_ptr = self._caches[name].data_ptr()
# Re-invoke initializer to get new view
self._caches[name] = self._cache_initializers[name](self.info)
new_ptr = self._caches[name].data_ptr()
regenerated += 1
if regenerated <= 2: # Only log first 2
ad_logger.info(
f"[CachedSequenceInterface] Regenerated {name}: "
f"old_ptr=0x{old_ptr:x}, new_ptr=0x{new_ptr:x}, "
f"shape={self._caches[name].shape}"
)

ad_logger.info(f"[CachedSequenceInterface] Regenerated {regenerated} cache views")

⚠️ Potential issue | 🟠 Major

Update cache-view regeneration for combined kv_cache naming

The regeneration logic only looks for "k_cache"/"v_cache". With the new "kv_cache_*" resources, views won’t be refreshed after a PTCacheBackend pool reallocation, leaving stale pointers.

✅ Suggested fix
-        # Only regenerate k_cache and v_cache (KV caches that are views)
+        # Regenerate combined kv_cache (and legacy k/v cache views) after pool reallocation
         for name in list(self._caches.keys()):
-            if "k_cache" in name or "v_cache" in name:
+            if "kv_cache" in name or "k_cache" in name or "v_cache" in name:
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py` around lines 785 - 802,
The cache-refresh loop in CachedSequenceInterface only checks for "k_cache" or
"v_cache" and therefore misses new combined keys like "kv_cache_*"; update the
condition in the for-loop that iterates self._caches.keys() to also detect
combined kv names (e.g., check for "kv_cache" or name.startswith("kv_cache_") or
a substring match for "kv_cache") so that entries created from
self._cache_initializers are re-invoked and replaced (preserving the existing
logic that calls self._cache_initializers[name](self.info), compares data_ptrs,
increments regenerated, and logs via ad_logger.info).

Comment on lines +244 to +248
@pytest.mark.skipif(not HAS_PT_CACHE_BACKEND, reason="PTCacheBackend not available")
@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
@pytest.mark.skipif(
torch.cuda.get_device_capability()[0] < 8, reason="CUDA graphs require SM 8.0+"
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Verification scripts located the test file and surveyed existing torch.cuda.get_device_capability() usage across the repository; the web query below confirms the call's behavior on CPU-only PyTorch builds.

🌐 Web query:

torch.cuda.get_device_capability() CPU-only builds behavior exception

💡 Result:

In CPU-only PyTorch builds (i.e., PyTorch not built with CUDA support), calling torch.cuda.get_device_capability() will typically raise an error like:

  • AssertionError: Torch not compiled with CUDA enabled

because torch.cuda is lazily initialized and any call that needs CUDA triggers initialization, which fails when CUDA support isn’t compiled in. The docs explicitly note the lazy-init model and that you should use torch.cuda.is_available() to check usability. [1] The get_device_capability API itself is only meaningful when a CUDA device exists. [2]

Robust pattern (works on CPU-only + CUDA builds):

import torch

if torch.backends.cuda.is_built() and torch.cuda.is_available():
    cc = torch.cuda.get_device_capability()
else:
    cc = None  # CPU-only build, or no working CUDA device/driver

torch.backends.cuda.is_built() tells you whether the installed PyTorch binary was built with CUDA support. [3]

If you need “some capability” on CPU-only machines: there is no CUDA compute capability to return—treat it as None (or raise your own clear error) rather than calling torch.cuda.get_device_capability().

Sources: PyTorch CUDA lazy initialization and is_available() guidance [1], torch.cuda.get_device_capability docs [2], torch.backends.cuda.is_built() docs [3], example of the CPU-only build assertion coming from CUDA lazy init [4].

References
[1] https://docs.pytorch.org/docs/stable/cuda.html
[2] https://docs.pytorch.org/docs/2.9/generated/torch.cuda.get_device_capability.html
[3] https://docs.pytorch.org/docs/stable/backends.html
[4] https://bugs.launchpad.net/bugs/2051023


Guard CUDA capability check to prevent import-time failures on CPU-only builds

In @pytest.mark.skipif decorators, each condition expression is evaluated independently at import time. Even though line 245 checks torch.cuda.is_available(), the subsequent decorator on lines 246–248 still evaluates torch.cuda.get_device_capability(), which raises AssertionError on CPU-only PyTorch builds.

Combine the conditions into a single decorator with short-circuit evaluation to prevent the call when CUDA is unavailable:

Suggested fix
 @pytest.mark.skipif(not HAS_PT_CACHE_BACKEND, reason="PTCacheBackend not available")
-@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
-@pytest.mark.skipif(
-    torch.cuda.get_device_capability()[0] < 8, reason="CUDA graphs require SM 8.0+"
+@pytest.mark.skipif(
+    (not torch.cuda.is_available())
+    or (torch.cuda.get_device_capability()[0] < 8),
+    reason="CUDA not available or SM 8.0+ required for CUDA graphs",
 )
🤖 Prompt for AI Agents
In `@tests/test_trtllm_attention_cuda_graph.py` around lines 244 - 248, The three
separate pytest skipif decorators cause torch.cuda.get_device_capability() to be
called at import time even on CPU-only builds; update the decorators so the CUDA
availability and device capability checks are combined into a single skipif
using short-circuit logic (e.g. combine torch.cuda.is_available() and
torch.cuda.get_device_capability()[0] < 8 into one condition) while keeping the
HAS_PT_CACHE_BACKEND check as its own decorator; ensure the combined decorator
uses a clear reason like "CUDA graphs require SM 8.0+ or CUDA not available" so
get_device_capability() is only invoked when CUDA is available.

MrGeva and others added 5 commits February 4, 2026 01:19
Improve performance of thop.attention CUDA graph support by:
- Remove torch.cuda.synchronize() that was blocking CPU
- Replace Python loop with .item() calls with vectorized GPU operations
  using torch.searchsorted and advanced indexing for block offsets
- Compute block offsets once on first layer, copy to remaining 31 layers
- Move GPU tensor creation outside the layer loop
- Use non_blocking=True for device copies

This improves throughput from ~1500 TPS to ~5800 TPS with CUDA graphs.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
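A rough sketch of the vectorization idea from this commit (flat cache_loc scattered into per-sequence block offsets without a Python loop); tensor names follow the commit message, the pool dimension is omitted, and the multiplier handling is simplified:

import torch


def fill_block_offsets(
    cache_loc: torch.Tensor,      # flat page indices on GPU, shape [total_pages]
    cu_num_pages: torch.Tensor,   # cumulative page counts on GPU, shape [num_seq + 1]
    block_offsets: torch.Tensor,  # pre-allocated [num_seq, 2, max_blocks_per_seq], int32
    multiplier: int,              # num_layers * kv_factor
) -> None:
    total_pages = cache_loc.numel()
    positions = torch.arange(total_pages, device=cache_loc.device)
    # Map every flat page to the sequence it belongs to ...
    boundaries = cu_num_pages[1:].to(torch.int64)
    seq_idx = torch.searchsorted(boundaries, positions, right=True)
    # ... and to its position within that sequence.
    page_idx = positions - cu_num_pages[seq_idx].to(torch.int64)
    base = (cache_loc * multiplier).to(block_offsets.dtype)
    # Advanced indexing writes the K and V offsets for all pages at once,
    # replacing the per-sequence Python loop with its .item() calls.
    block_offsets[seq_idx, 0, page_idx] = base
    block_offsets[seq_idx, 1, page_idx] = base + 1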
…utation

Add pre-allocated GPU buffers to TrtllmAttentionGlobalState for
vectorized block offset computation, matching PTCacheBackend's pattern:
- _gpu_cu_pages, _gpu_page_positions, _gpu_seq_idx, _gpu_page_idx,
  _gpu_base_offset buffers allocated once and reused
- Use torch.searchsorted/sub/mul with out= parameter to avoid
  per-call tensor allocations

This eliminates allocation overhead in the hot path.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Since KVCacheManager uses a single unified pool for all layers, most
metadata tensors are identical across layers. This change shares them:

- Add shared tensors to TrtllmAttentionGlobalState (sequence_length,
  context_lengths, kv_cache_block_offsets, host tensors)
- TrtllmLayerState now references shared tensors via init_from_shared()
- Only host_kv_cache_pool_mapping remains per-layer (layer offsets)
- host_prepare_fn updates shared tensors ONCE instead of 32x

This eliminates 32x redundant tensor updates per forward pass,
improving throughput from ~5840 to ~6233 TPS (closer to PTCacheBackend).

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
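An illustrative sketch of the sharing pattern described above (the global state owns one copy of the metadata tensors and each layer keeps references); init_from_shared matches the commit message, while the fields shown are a simplified subset:

import torch


class TrtllmAttentionGlobalState:
    """Owns metadata tensors that are identical across layers."""

    def __init__(self, max_num_requests: int, device: str = "cuda") -> None:
        self.sequence_length = torch.zeros(max_num_requests, dtype=torch.int32, device=device)
        self.context_lengths = torch.zeros(max_num_requests, dtype=torch.int32, device=device)


class TrtllmLayerState:
    def __init__(self, layer_idx: int) -> None:
        self.layer_idx = layer_idx
        self.sequence_length: torch.Tensor = None
        self.context_lengths: torch.Tensor = None

    def init_from_shared(self, shared: TrtllmAttentionGlobalState) -> None:
        # References, not copies: host_prepare_fn fills the shared tensors once
        # per forward pass and every layer observes the same data.
        self.sequence_length = shared.sequence_length
        self.context_lengths = shared.context_lengths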
Optimize thop.attention metadata preparation to outperform PTCacheBackend:

- Add FAST PATH in _prepare_trtllm_metadata: after host_prepare_fn runs,
  each layer's call just returns pre-computed tensors (almost zero work)
- Track host_prepare_called flag to enable fast path during replay
- Cache current_num_seq to avoid parsing batch_info during fast path
- Move pool pointer initialization to be done once in host_prepare_fn

Performance results:
- Optimized non-PTCacheBackend: ~6600 TPS
- PTCacheBackend: ~6528 TPS
- Improvement: ~1.1% faster than PTCacheBackend

The key insight is that during CUDA graph replay, host_prepare_fn runs
once per forward pass and fills all shared tensors. The 32 per-layer
_prepare_trtllm_metadata calls should do almost nothing - just return
pre-computed tensor slices.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
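A sketch of the fast-path flag described in this commit; host_prepare_called and current_num_seq come from the commit message, while the surrounding class and return shape are illustrative only:

from typing import Tuple

import torch


class _GlobalState:
    def __init__(self, max_num_requests: int) -> None:
        self.host_prepare_called = False
        self.current_num_seq = 0
        self.sequence_length = torch.zeros(max_num_requests, dtype=torch.int32)
        self.context_lengths = torch.zeros(max_num_requests, dtype=torch.int32)


def prepare_layer_metadata(gs: _GlobalState) -> Tuple[torch.Tensor, torch.Tensor]:
    # Fast path: host_prepare_fn already filled the shared tensors once for
    # this forward pass, so each per-layer call only returns slices of them.
    if gs.host_prepare_called:
        n = gs.current_num_seq
        return gs.sequence_length[:n], gs.context_lengths[:n]
    raise RuntimeError("host_prepare_fn must run before per-layer metadata preparation")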
Optimize host_prepare_fn to reduce tensor allocation overhead:

- Add pre-allocated pinned CPU buffers for intermediate computations:
  - _cpu_input_seq_lens, _cpu_seq_len_with_cache, _cpu_past_kv_lens
  - _cpu_cu_num_pages, _cpu_pages_per_seq
- Use torch.sub/copy with out= parameters to avoid tensor allocation
- Replace .item() with int() for faster scalar extraction
- Only zero the slice of block_offsets we need ([:, :num_seq, :, :])

Performance results:
- Optimized non-PTCacheBackend: ~6645 TPS
- PTCacheBackend: ~6527 TPS
- Improvement: ~1.8% faster than PTCacheBackend

Note: The remaining ~6.5ms in ad_prepare_inputs is dominated by
framework code in ad_executor.py (Python list operations for request
processing), which is common to both backends.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
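A small sketch of the out=-style buffer reuse from this commit; buffer names follow the commit message and the host tensors are assumed to be int32:

import torch

MAX_NUM_REQUESTS = 64

# Pinned CPU buffers allocated once and reused every forward pass.
_cpu_input_seq_lens = torch.zeros(MAX_NUM_REQUESTS, dtype=torch.int32, pin_memory=True)
_cpu_past_kv_lens = torch.zeros(MAX_NUM_REQUESTS, dtype=torch.int32, pin_memory=True)


def update_host_lengths(
    cu_seqlen_host: torch.Tensor,           # [num_seq + 1], int32, on CPU
    seq_len_with_cache_host: torch.Tensor,  # [num_seq], int32, on CPU
    num_seq: int,
) -> None:
    # input_seq_lens[i] = cu_seqlen[i + 1] - cu_seqlen[i], written in place.
    torch.sub(
        cu_seqlen_host[1 : num_seq + 1],
        cu_seqlen_host[:num_seq],
        out=_cpu_input_seq_lens[:num_seq],
    )
    # past_kv_lens = seq_len_with_cache - input_seq_lens, also in place.
    torch.sub(
        seq_len_with_cache_host[:num_seq],
        _cpu_input_seq_lens[:num_seq],
        out=_cpu_past_kv_lens[:num_seq],
    )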
Clean up code for review by removing the PTCacheBackend alternative:

- Remove PTCacheBackend imports and _HAS_PT_CACHE_BACKEND flag
- Remove use_pt_cache_backend config option and related code
- Remove enable_pt_cache_backend/get_pt_cache_backend/is_pt_cache_backend_enabled
- Remove debug SDPA fallback code
- Remove debug logging statements
- Simplify TrtllmAttentionConfig class
- Clean up related code in kvcache.py and interface.py

The direct AD pool integration (KVCacheManager) is now the only code path,
which is optimized with:
- Pre-allocated CPU/GPU buffers
- Shared tensors across layers
- Vectorized GPU block offset computation
- Host prepare function for CUDA graph support

Performance: ~6650 TPS (1.8% faster than PTCacheBackend baseline)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@MrGeva MrGeva changed the title AutoDeploy trtllm attention backend with trtllm's kv cache manager direct operation Draft: DO NOT REVIEW : AutoDeploy trtllm attention backend with trtllm's kv cache manager direct operation Feb 5, 2026
@MrGeva MrGeva closed this Feb 10, 2026