Draft: DO NOT REVIEW : AutoDeploy trtllm attention backend with trtllm's kv cache manager direct operation #11268

Closed

MrGeva wants to merge 9 commits into NVIDIA:main from nv-auto-deploy:eg/trtllm_attn_v2

Conversation

Collaborator

@MrGeva MrGeva commented Feb 4, 2026

Summary by CodeRabbit

  • New Features

    • Added TRT-LLM attention backend integration with advanced cache management and CUDA graph support.
    • Introduced KV cache pool support and multiple cache backend options (Simple and PT-based).
    • Added comprehensive KV Cache Architecture documentation.
    • Enabled Mermaid diagram support in documentation.
  • Bug Fixes & Improvements

    • Simplified cache resource handlers for better maintainability.
    • Consolidated attention transform base classes.
    • Updated AutoDeploy documentation structure and paths.
  • Documentation

    • Reorganized AutoDeploy documentation; removed obsolete guides.
    • Added detailed KV cache architecture documentation.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

lucaslie and others added 3 commits January 30, 2026 15:02
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
The key fix is to use AD's pool_mapping values directly without
multiplying by kv_factor. AD's pool_mapping already provides the
correct layer offsets (0, 1, 2, ...) because each layer takes
exactly one "block" worth of K+V data in the unified pool,
regardless of dtype.

Previously, the code was multiplying layer_idx by kv_factor=2,
causing the kernel to compute incorrect addresses:
- Expected layer 1 at: pool_ptr + 1 * block_size
- Got layer 1 at: pool_ptr + 2 * block_size (wrong!)

This fix enables accurate thop.attention execution in AutoDeploy
using AD's KVCacheManager pool directly, without needing the
PTCacheBackend or intermediate buffers.

Note: CUDA graph support requires use_pt_cache_backend=true
due to host operations in metadata preparation.
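
For illustration only, a minimal sketch of the offset arithmetic described above (function and parameter names here are hypothetical, not the actual kernel code):

def layer_base_address(pool_ptr: int, layer_offset: int, block_size: int) -> int:
    """Correct: use AD's pool_mapping offset directly (one block of K+V per layer)."""
    return pool_ptr + layer_offset * block_size

def buggy_layer_base_address(pool_ptr: int, layer_idx: int, block_size: int, kv_factor: int = 2) -> int:
    """Previous behavior: multiplying by kv_factor doubled the per-layer stride."""
    return pool_ptr + layer_idx * kv_factor * block_size

# Layer 1 should land at pool_ptr + 1 * block_size, not pool_ptr + 2 * block_size.
assert layer_base_address(0, 1, 4096) == 4096
assert buggy_layer_base_address(0, 1, 4096) == 8192  # the incorrect address the kernel computed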

Signed-off-by: Eli Geva <egeva@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…cheBackend)

Enable CUDA graph support for the thop.attention kernel when
use_pt_cache_backend=False. This allows the torch-cudagraph compile
backend to work correctly with thop.attention.

Key changes:
- Pre-allocate kv_cache_block_offsets with max size in TrtllmLayerState
  to ensure stable tensor addresses for CUDA graphs
- Add is_capturing check in _prepare_trtllm_metadata to set host tensors
  to MAX values and skip device operations during capture
- Add create_host_prepare_function() to TrtllmAttentionGlobalState that
  creates a host_prepare_fn running outside the graph to update tensors
  with current batch values before each forward/replay
- Register host_prepare_fn via get_host_prepare_metadata_function() for
  non-PTCacheBackend mode

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
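
The capture/replay pattern described in the commit above can be sketched roughly as follows. All names and shapes are hypothetical (not the actual TrtllmLayerState/TrtllmAttentionGlobalState code); a CUDA device is assumed and the usual warm-up before capture is omitted for brevity:

import torch

MAX_BATCH = 8
NUM_BLOCKS_PER_SEQ = 16

# Pre-allocated, stable-address buffers that the captured kernels read from.
kv_block_offsets = torch.zeros(MAX_BATCH, NUM_BLOCKS_PER_SEQ, dtype=torch.int32, device="cuda")
seq_lens_device = torch.zeros(MAX_BATCH, dtype=torch.int32, device="cuda")

def host_prepare(seq_lens: list, offsets: torch.Tensor) -> None:
    """Run outside the graph before each replay to refresh metadata in-place."""
    seq_lens_device.zero_()
    seq_lens_device[: len(seq_lens)].copy_(torch.tensor(seq_lens, dtype=torch.int32))
    kv_block_offsets.copy_(offsets)

def forward_step() -> torch.Tensor:
    # Stand-in for the attention op that consumes the pre-allocated metadata.
    return kv_block_offsets.float().sum() + seq_lens_device.float().sum()

# Capture once (with placeholder values), then refresh metadata and replay per batch.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = forward_step()

host_prepare([3, 5], torch.randint(0, 100, (MAX_BATCH, NUM_BLOCKS_PER_SEQ), dtype=torch.int32))
graph.replay()  # replays the captured kernels against the same tensor addresses
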
@MrGeva MrGeva requested review from a team as code owners February 4, 2026 08:42
@MrGeva MrGeva self-assigned this Feb 4, 2026
Contributor

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough


This PR restructures the AutoDeploy KV cache management system by introducing typed resource handlers (KVPagedResourceHandler, SSMResourceHandler, CausalConvResourceHandler), removing legacy Triton-based paged KV cache kernels, introducing cache backend abstractions, adding a TRT-LLM attention backend integration, and refactoring cache indexing across attention operators while reorganizing documentation.

Changes

Cohort / File(s) Summary
Documentation restructuring
README.md, examples/auto_deploy/README.md, docs/requirements.txt, docs/source/conf.py, docs/source/features/auto_deploy/..., docs/source/torch/auto_deploy/...
Enabled Mermaid diagram support in Sphinx, migrated AutoDeploy docs from torch path to features path, added KV Cache Architecture documentation, removed legacy documentation pages (benchmarking, example_run, expert_configurations, logging, serving, workflow).
Resource handler abstraction
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/...
Introduced typed resource handlers (KVPagedResourceHandler for paged KV caches, SSMResourceHandler and CausalConvResourceHandler for state resources) with compatibility checking; added CacheConfig for cache dtype management; extended SequenceInfo with KV cache pool integration.
Cache backend abstraction
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py, tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
Introduced CacheBackend interface with SimpleCacheBackend implementation for basic per-layer caching; added comprehensive PTCacheBackend for C++-backed KVCacheManager integration with pool management, metadata tensors, and synchronization utilities for interleaved/contiguous cache layouts (see the interface sketch after this table).
TRT-LLM attention backend
tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py, tensorrt_llm/_torch/auto_deploy/llm_args.py
Introduced TrtllmAttention backend registered with AttentionRegistry; added TrtllmWorkspaceResourceHandler, TrtllmKVResourceHandler, and global state management; refactored LlmArgs and introduced AutoDeployConfig for streamlined configuration.
Attention operator refactoring
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
Unified KV cache representation from separate k_cache/v_cache to single kv_cache with HND layout; renamed cache_loc parameter to slot_idx; updated metadata handling and cache initializers to use new resource handler types.
Low-level Triton kernel changes
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/attention_with_kv_cache.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
Renamed cache_loc_ptr to slot_idx_ptr; added automatic attention scaling computation; removed entirely: triton_attention_internal.py and attention_with_paged_kv_cache.py (paged KV cache Triton kernels).
Cache management integration
tensorrt_llm/_torch/auto_deploy/shim/interface.py, tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py, tensorrt_llm/_torch/auto_deploy/transform/library/...
Enhanced CachedSequenceInterface with multi-layer state management, KV/state resource separation, hybrid cache manager creation, and PTCacheBackend integration; refactored transform base classes to unified _InsertCachedOperator; added cache_config and use_pt_cache_backend configuration.
Test infrastructure
test_trtllm_attention.py, tests/test_trtllm_attention_cuda_graph.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_*, tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/test_lists/test-db/*.yml
Added TRT-LLM attention tests; updated flashinfer/resource handler/cached_sequence tests to use new KVPagedResourceHandler and slot_idx semantics; added parametrization for attention backend selection; removed comprehensive paged KV cache test suite; updated integration test configurations.
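
To make the cache backend row above more concrete, here is an illustrative sketch of what a per-layer cache backend interface could look like. The class and method names are assumptions for illustration, not the actual CacheBackend/SimpleCacheBackend API:

import abc
from typing import Dict

import torch

class CacheBackendSketch(abc.ABC):
    """Allocates and hands out per-layer KV cache tensors."""

    @abc.abstractmethod
    def initialize(self, num_layers: int) -> None: ...

    @abc.abstractmethod
    def get_layer_cache(self, layer_idx: int) -> Dict[str, torch.Tensor]: ...

class SimpleCacheBackendSketch(CacheBackendSketch):
    """Plain per-layer tensor allocation, with no external pool manager."""

    def __init__(self, num_blocks: int, tokens_per_block: int, num_kv_heads: int,
                 head_dim: int, dtype: torch.dtype = torch.float16, device: str = "cuda"):
        # Combined K+V cache per layer, HND-style layout.
        self._shape = (num_blocks, 2, num_kv_heads, tokens_per_block, head_dim)
        self._dtype, self._device = dtype, device
        self._caches: Dict[int, Dict[str, torch.Tensor]] = {}

    def initialize(self, num_layers: int) -> None:
        for layer_idx in range(num_layers):
            self._caches[layer_idx] = {
                "kv_cache": torch.zeros(self._shape, dtype=self._dtype, device=self._device)
            }

    def get_layer_cache(self, layer_idx: int) -> Dict[str, torch.Tensor]:
        return self._caches[layer_idx]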

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • kaiyux
  • Shixiaowei02
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 77.37%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
  • Description check — ⚠️ Warning: The PR description is empty; it contains only the template, with no explanation of what was changed or why. Resolution: add a description explaining the changes, the motivation, and test coverage, and confirm the PR checklist items.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title is specific and describes the main architectural change: adding a TRT-LLM attention backend with KV cache manager integration.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (2)

1-9: ⚠️ Potential issue | 🟠 Major

Add NVIDIA copyright header for this source file

Source files require an NVIDIA header with the year of the latest meaningful modification.

📌 Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 from dataclasses import dataclass, fields
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

462-469: ⚠️ Potential issue | 🟡 Minor

Silence the unused kv_cache argument in the fake op

Ruff reports kv_cache as unused in the fake implementation. Renaming it avoids lint noise without changing behavior.

🧹 Suggested fix
-    kv_cache: torch.Tensor,
+    _kv_cache: torch.Tensor,
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

1-8: ⚠️ Potential issue | 🟠 Major

Add NVIDIA copyright header for this source file

Source files require an NVIDIA header with the year of the latest meaningful modification.

📌 Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 from importlib.resources import files
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)

1-8: ⚠️ Potential issue | 🟠 Major

Add NVIDIA copyright header for this source file

Source files require an NVIDIA header with the year of the latest meaningful modification.

📌 Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 import copy
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 287-319: The field validator CacheConfig._coerce_dtype should not
use assert for validating string dtype names; instead, when value is a str look
up the attribute on torch (as currently done), and if the lookup yields None or
a non-torch.dtype raise a clear exception (e.g., ValueError) with the invalid
input included; update the validator in CacheConfig (decorated with
`@field_validator`("dtype", "mamba_dtype", "delta_dtype", mode="before")) to
replace the assert with an explicit raise that reports the offending value and
expected torch.dtype so invalid dtype strings are rejected reliably in all
runtime modes.
- Around line 434-436: SequenceInfo currently leaves self._num_blocks as None
(set in __init__), causing an assertion in SequenceInfo.num_blocks when code
paths (e.g., PTCacheBackend -> initialize()) access it before
estimate_cache_loc_capacity() runs; fix by initializing self._num_blocks in
__init__ to a sensible default (for example use math.ceil(max_num_tokens /
tokens_per_block) or another conservative estimate derived from constructor args
like max_num_tokens and tokens_per_block) so num_blocks is valid immediately and
later refined inside estimate_cache_loc_capacity(); update __init__ where
self._num_blocks is set and leave estimate_cache_loc_capacity() to overwrite
with the accurate value.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py`:
- Around line 1-2: Update the copyright header in
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py by changing the year
from 2025 to 2026 in the SPDX header lines (the two top-of-file lines starting
with "# SPDX-FileCopyrightText" and "# SPDX-License-Identifier") so the header
reflects the latest modification year.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py`:
- Around line 321-333: The page_size is being taken as kv_cache.shape[3] which
assumes HND layout; update the logic that computes page_size in the function
(where kv_cache, kv_layout and page_size are used) to derive tokens_per_block
based on kv_layout: if kv_layout indicates HND use shape[3], if it indicates NHD
use shape[1]; replace the hardcoded index with this conditional access so plan
creation and downstream uses of page_size are correct for both layouts (see the layout sketch after this list of fixes).

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py`:
- Around line 417-472: get_contiguous_caches currently assumes a single shared
contiguous buffer and raises if per-layer kv head counts differ; update it to
handle per-layer kv head variance by either allocating per-layer contiguous
buffers or moving the uniformity check to initialization with a clear validation
error. Concretely: in PTCacheBackend.get_contiguous_caches, when
self._shared_contiguous_k_cache/_shared_contiguous_v_cache would be created,
branch on whether max(self._config.num_kv_heads_per_layer) ==
self._config.num_kv_heads_per_layer[layer_idx]; if not, allocate per-layer
buffers (e.g., store dict/list keyed by layer_idx instead of the single
_shared_contiguous_k_cache/_shared_contiguous_v_cache) using pool shape from
self._kv_cache_manager.get_primary_pool_data(layer_idx) and the layer-specific
num_kv_heads, or alternatively add validation in the initializer (check
self._config.num_kv_heads_per_layer uniformity while self._initialized is set)
and raise a clear RuntimeError there; ensure logging (ad_logger.info) reflects
per-layer allocation and keep existing use of self._device and dtype.
- Around line 1-2: Update the SPDX header year from 2025 to 2026 at the top of
the file; specifically edit the copyright header lines in
tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (the
SPDX-FileCopyrightText and/or SPDX-License-Identifier header block) to reflect
2026 as the latest meaningful modification year.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py`:
- Around line 475-626: The function _prepare_trtllm_metadata currently raises if
ad_pool_pointers/ad_pool_mapping are missing which breaks the KV cache handler's
fallback allocation path; change the logic so when use_ad_pool is False you do
not raise but instead initialize predictable defaults for pool pointers/mapping
(e.g., zeros or sentinel values) and proceed to compute kv_cache_block_offsets
from cache_loc/pages_per_seq for the fallback cache layout (ensure
state.host_kv_cache_pool_pointers and state.host_kv_cache_pool_mapping are set
to valid defaults and any logging reflects fallback use), or alternatively move
the RuntimeError to an earlier allocation phase so missing AD pool pointers fail
fast during allocation rather than here (update checks around use_ad_pool,
state.host_kv_cache_pool_pointers, state.host_kv_cache_pool_mapping, and the
block-offset filling loop accordingly).
- Around line 222-305: TrtllmLayerState currently hardcodes device="cuda" in
__post_init__, causing allocations on the wrong GPU; add a device field to the
TrtllmLayerState dataclass (e.g., device: torch.device) and use that field
instead of the string "cuda" when allocating device tensors in __post_init__,
keeping host/pinned tensors on cpu as before; update callers (e.g.,
get_or_create_layer_state) to pass the correct device (kv_cache.device or
SequenceInfo.device) when constructing TrtllmLayerState so allocations follow
the model/kv cache device.
- Around line 1-2: Update the copyright header in trtllm_attention.py to show
the latest modification year 2026 by changing the existing copyright line(s)
that currently show 2025 to 2026; specifically edit the top-of-file SPDX header
lines in trtllm_attention.py (the lines beginning with "#
SPDX-FileCopyrightText" and/or "# SPDX-License-Identifier") so they reference
2026.

In `@tensorrt_llm/_torch/auto_deploy/llm_args.py`:
- Around line 458-467: The validator ensure_max_seq_len currently declares an
unused parameter named info which triggers the ARG lint rule; rename that
parameter to _info in the method signature of ensure_max_seq_len (the
`@field_validator`("max_seq_len", mode="before") classmethod) so it becomes
unused-by-convention and the linter warning is silenced, leaving the body
unchanged and preserving the return behavior that falls back to
AutoDeployConfig.model_fields["max_seq_len"].get_default(call_default_factory=True)
when value is None.

In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py`:
- Around line 785-802: The cache-refresh loop in CachedSequenceInterface only
checks for "k_cache" or "v_cache" and therefore misses new combined keys like
"kv_cache_*"; update the condition in the for-loop that iterates
self._caches.keys() to also detect combined kv names (e.g., check for "kv_cache"
or name.startswith("kv_cache_") or a substring match for "kv_cache") so that
entries created from self._cache_initializers are re-invoked and replaced
(preserving the existing logic that calls
self._cache_initializers[name](self.info), compares data_ptrs, increments
regenerated, and logs via ad_logger.info).

In `@tests/test_trtllm_attention_cuda_graph.py`:
- Around line 244-248: The three separate pytest skipif decorators cause
torch.cuda.get_device_capability() to be called at import time even on CPU-only
builds; update the decorators so the CUDA availability and device capability
checks are combined into a single skipif using short-circuit logic (e.g. combine
torch.cuda.is_available() and torch.cuda.get_device_capability()[0] < 8 into one
condition) while keeping the HAS_PT_CACHE_BACKEND check as its own decorator;
ensure the combined decorator uses a clear reason like "CUDA graphs require SM
8.0+ or CUDA not available" so get_device_capability() is only invoked when CUDA
is available.
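
As a concrete illustration of the flashinfer_attention.py page-size item above, a layout-aware lookup might look like the following. The helper name is hypothetical, and the sketch assumes the combined 5-D cache layout shown in the quoted code; for a separate 4-D k/v cache the indices would shift down by one (e.g. NHD would use shape[1], as the item above suggests):

import torch

def tokens_per_block_from_kv_cache(kv_cache: torch.Tensor, kv_layout: str) -> int:
    if kv_layout == "HND":  # [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]
        return kv_cache.shape[3]
    if kv_layout == "NHD":  # [num_blocks, 2, tokens_per_block, num_kv_heads, head_dim]
        return kv_cache.shape[2]
    raise ValueError(f"Unsupported kv_layout: {kv_layout!r}")

# Example with an HND-layout cache of 64 tokens per block:
hnd_cache = torch.empty(4, 2, 8, 64, 128)
assert tokens_per_block_from_kv_cache(hnd_cache, "HND") == 64
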
🧹 Nitpick comments (23)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

32-44: Consider annotating mutable class attribute with ClassVar.

The ATTN_BACKEND_CONFIGS dictionary is a mutable class attribute. While this works correctly, Python best practices and type checkers recommend annotating it with typing.ClassVar to explicitly indicate it's a class-level attribute not meant to be instance-specific.

💡 Suggested fix
+from typing import ClassVar, Dict, Any
+
 class TestLlama3_1_8B(LlmapiAccuracyTestHarness):
     MODEL_NAME = "meta-llama/Llama-3.1-8B"
     MODEL_PATH = hf_id_to_local_model_dir(MODEL_NAME)

     # Configuration presets for different attention backends
-    ATTN_BACKEND_CONFIGS = {
+    ATTN_BACKEND_CONFIGS: ClassVar[Dict[str, Dict[str, Any]]] = {
         "flashinfer": {
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/attention_with_kv_cache.py (1)

1-1: Consider adding NVIDIA copyright header.

This source file appears to be missing the standard NVIDIA copyright header that the coding guidelines require for all TensorRT-LLM source files (.py, .cpp, .cu, etc.).

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py (1)

23-52: Unused function parameters k and v.

The k and v parameters are passed to this function but never used - the reference computation extracts values directly from kv_cache. This appears intentional since the custom op appends k/v to the cache before this reference function is called, but the unused parameters could be removed for clarity.

💡 Suggested fix
 def _attention_with_fp8_kv_cache(
-    q, k, v, kv_cache, k_scale, v_scale, prefill_seq_len, causal, mask
+    q, kv_cache, k_scale, v_scale, prefill_seq_len, causal, mask
 ):
     """Simulates attention for fp8 kv cache with q,k,v outputs of GEMM in fp16"""
-    batch_size, seq_len, _ = k.shape
+    batch_size, seq_len, _ = q.shape

Note: This would also require updating the caller at line 786-787.

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

12-17: Use module-qualified imports per style guide.

The new from typing import ... and from pydantic import ... imports drop the module namespace. Please switch to module-qualified imports to align with the repo guideline (e.g., import typing, import pydantic and reference typing.Dict, pydantic.BaseModel, etc.).
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.


539-546: Initialize KV cache pool members in __init__, not at class scope.

_kv_cache_pool_pointers / _kv_cache_pool_mapping are class attributes; this risks shared state across instances and violates the constructor-initialization guideline. Set them to None in __init__ and keep only type hints at class scope.

Suggested fix
 class SequenceInfo:
     def __init__(...):
         ...
+        self._kv_cache_pool_pointers: Optional[torch.Tensor] = None
+        self._kv_cache_pool_mapping: Optional[torch.Tensor] = None
-    _kv_cache_pool_pointers: Optional[torch.Tensor] = None
-    _kv_cache_pool_mapping: Optional[torch.Tensor] = None
As per coding guidelines: Initialize all externally visible members of a Python class in the constructor.

Also applies to: 1060-1089

tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py (2)

29-31: Use module-qualified imports per style guide.

Please switch to module-qualified imports (e.g., import abc, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.


151-240: Rename unused layer_idx to avoid lint noise.

_allocate_layer_cache doesn’t use layer_idx. Consider renaming it to _layer_idx (or removing it) to silence the lint warning.

Suggested fix
-    def _allocate_layer_cache(self, layer_idx: int) -> Dict[str, torch.Tensor]:
+    def _allocate_layer_cache(self, _layer_idx: int) -> Dict[str, torch.Tensor]:
tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (1)

42-44: Use module-qualified imports per style guide.

Please switch to module-qualified imports (e.g., import dataclasses, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py (2)

46-47: Use module-qualified imports per style guide.

Please switch to module-qualified imports (e.g., import dataclasses, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.


471-472: Rename module globals to follow the G_ prefix rule.

_global_state and _trtllm_config are module-level globals and should use the G_ prefix (e.g., G_TRTLLM_GLOBAL_STATE, G_TRTLLM_CONFIG).
As per coding guidelines: Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL).

Also applies to: 1231-1232

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)

329-331: Consider renaming cache_loc to slot_idx in fused_mla_ref for consistency.

The fused_mla_ref function (lines 256-386) still uses cache_loc as a parameter name (line 264) and passes it to update_kv_cache. While functionally correct, this inconsistency could cause confusion since update_kv_cache now expects slot_idx. The same applies to the fake registration at lines 397-398 and usages at lines 337-338.


1-9: Missing NVIDIA copyright header.

This file should contain the NVIDIA copyright header as required by the coding guidelines for all source files.

Proposed fix
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Torch reference implementations for attention."""

As per coding guidelines: All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_update_kv_cache.py (1)

1-5: Missing NVIDIA copyright header.

This test file should contain the NVIDIA copyright header as required by the coding guidelines.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_attention_with_kv_cache.py (1)

1-6: Missing NVIDIA copyright header.

This test file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)

1-11: Missing NVIDIA copyright header.

This file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (1)

1-10: Missing NVIDIA copyright header.

This file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (1)

1-1: Update copyright year to include 2026.

The copyright header shows years 2022-2025, but this file has been meaningfully modified. Per coding guidelines, the year should be updated to reflect the latest meaningful modification.

Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)

10-11: Comment on line 10 is misleading.

The comment "Initialize resources first" doesn't accurately describe what this import does. The import simply makes KVPagedResourceHandler available for use in tests below - it doesn't initialize any resources.

📝 Suggested comment fix
-# Initialize resources first (KVPagedResourceHandler is used within tests below)
+# Import KVPagedResourceHandler for paged KV cache resource tests
 from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import KVPagedResourceHandler

295-297: Redundant import - KVPagedResourceHandler already imported at module level.

This import is unnecessary since KVPagedResourceHandler is already imported at line 11.

♻️ Remove redundant import
     # Add a resource to verify initialize_resources is called
-    from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import (
-        KVPagedResourceHandler,
-    )
-
     dummy_cached_interface.add_resource(
test_trtllm_attention.py (2)

1-6: Consider moving test file to the tests directory.

This standalone test script is at the repository root. For consistency with the project structure, consider moving it to tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_attention.py or a similar location.


155-165: Prefix unused variable with underscore.

The unpacked variable host_kv_cache_pool_mapping is never used. Prefix it with an underscore to indicate it's intentionally ignored.

📝 Fix unused variable
         (
             sequence_length,
             host_past_key_value_lengths,
             host_total_kv_lens,
             context_lengths,
             host_context_lengths,
             host_request_types,
             kv_cache_block_offsets,
             host_kv_cache_pool_pointers,
-            host_kv_cache_pool_mapping,
+            _host_kv_cache_pool_mapping,
         ) = result
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (1)

120-140: Unused variable num_prefill_tokens should be prefixed or removed.

Line 137 unpacks num_prefill_tokens from q.shape but it's never used in the function. The function uses len(input_pos) for num_prefill instead.

📝 Fix unused variable
 def _prefill_attention(
     ...
 ) -> None:
     """Handle prefill phase - context attention with variable sequence lengths."""
     # NOTE: num_prefill_tokens == sum(seq_len)
-    num_prefill_tokens, n_heads, q_d_head = q.shape
+    _, n_heads, q_d_head = q.shape
     max_cache_seq_len, n_kv_heads = k_cache.shape[1:3]
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)

529-571: Replace debug print statements with logger calls

Raw prints in library code will spam stdout and are hard to control in CI and CUDA-graph capture. Prefer ad_logger.debug(...) or remove the statements entirely.

♻️ Suggested refactor (apply to the whole block)
-        print("[DEBUG CachedSequenceInterface._init_kv_cache_manager]")
-        print(
-            f"  hasattr kv_cache_pool_pointers: {hasattr(self._kv_cache_manager, 'kv_cache_pool_pointers')}"
-        )
+        ad_logger.debug("[CachedSequenceInterface] init_kv_cache_manager")
+        ad_logger.debug(
+            "  hasattr kv_cache_pool_pointers: %s",
+            hasattr(self._kv_cache_manager, "kv_cache_pool_pointers"),
+        )
         if hasattr(self._kv_cache_manager, "kv_cache_pool_pointers"):
             pool_ptrs = self._kv_cache_manager.kv_cache_pool_pointers
             pool_map = self._kv_cache_manager.kv_cache_pool_mapping
-            print(f"  kv_cache_pool_pointers: {pool_ptrs}")
-            print(
-                f"  kv_cache_pool_mapping.shape: {pool_map.shape if pool_map is not None else None}"
-            )
+            ad_logger.debug("  kv_cache_pool_pointers: %s", pool_ptrs)
+            ad_logger.debug(
+                "  kv_cache_pool_mapping.shape: %s",
+                pool_map.shape if pool_map is not None else None,
+            )

             self.info.set_kv_cache_pool_info(pool_ptrs, pool_map)
-            print("  Set pool info on SequenceInfo")
-            print(f"  self.info.kv_cache_pool_pointers: {self.info.kv_cache_pool_pointers}")
+            ad_logger.debug("  Set pool info on SequenceInfo")
+            ad_logger.debug(
+                "  self.info.kv_cache_pool_pointers: %s",
+                self.info.kv_cache_pool_pointers,
+            )

             try:
                 from ..custom_ops.trtllm_attention import _trtllm_config

-                print(f"  _trtllm_config.is_configured: {_trtllm_config.is_configured}")
+                ad_logger.debug(
+                    "  _trtllm_config.is_configured: %s", _trtllm_config.is_configured
+                )
                 if not _trtllm_config.is_configured:
                     _trtllm_config.configure(self.info)
-                    print("  Configured _trtllm_config with SequenceInfo")
-                    print(f"  _trtllm_config._sequence_info: {_trtllm_config._sequence_info}")
+                    ad_logger.debug("  Configured _trtllm_config with SequenceInfo")
+                    ad_logger.debug(
+                        "  _trtllm_config._sequence_info: %s",
+                        _trtllm_config._sequence_info,
+                    )

                 if _trtllm_config._num_layers == 0 and kv_ref is not None:
                     num_kv_heads_list = [h.num_kv_heads for h in kv_managed.values()]
                     _trtllm_config.set_model_config(
                         num_layers=len(kv_managed),
                         num_kv_heads_per_layer=num_kv_heads_list,
                         head_dim=kv_ref.head_dim,
                         dtype=kv_ref.dtype,
                     )
-                    print(
-                        f"  Set model config: num_layers={len(kv_managed)}, "
-                        f"dtype={kv_ref.dtype}, quant_mode={_trtllm_config._quant_mode}"
-                    )
+                    ad_logger.debug(
+                        "  Set model config: num_layers=%s, dtype=%s, quant_mode=%s",
+                        len(kv_managed),
+                        kv_ref.dtype,
+                        _trtllm_config._quant_mode,
+                    )
             except ImportError:
-                print("  TRT-LLM attention import failed")
+                ad_logger.debug("  TRT-LLM attention import failed")
                 pass

Comment on lines +287 to +319
class CacheConfig(BaseModel):
    """Cache configuration for attention-related dtypes."""

    model_config = ConfigDict(
        arbitrary_types_allowed=True,
        extra="forbid",
    )

    dtype: Optional[torch.dtype] = Field(default=None, description="KV cache dtype.")
    mamba_dtype: Optional[torch.dtype] = Field(default=None, description="Mamba cache dtype.")
    delta_dtype: Optional[torch.dtype] = Field(
        default=torch.float32, description="Delta cache dtype. Defaults to float32."
    )

    @field_validator("dtype", "mamba_dtype", "delta_dtype", mode="before")
    @classmethod
    def _coerce_dtype(cls, value):
        if value is None or isinstance(value, torch.dtype):
            return value
        if isinstance(value, str):
            dtype = getattr(torch, value, None)
            assert isinstance(dtype, torch.dtype), f"Invalid {dtype=}"
            return dtype
        return value

    def __or__(self, other: "CacheConfig") -> "CacheConfig":
        """Combine two CacheConfig objects field-wise using Python's `or` semantics."""
        if not isinstance(other, CacheConfig):
            raise NotImplementedError(f"Cannot combine CacheConfig with {type(other)}")
        merged_kwargs = {}
        for field_name in type(self).model_fields.keys():
            merged_kwargs[field_name] = getattr(self, field_name) or getattr(other, field_name)
        return CacheConfig(**merged_kwargs)

⚠️ Potential issue | 🟡 Minor

Avoid assert for user-facing dtype validation.

_coerce_dtype uses assert, which can be stripped in optimized runs and let invalid values through. Prefer an explicit exception.

Suggested fix
-        if isinstance(value, str):
-            dtype = getattr(torch, value, None)
-            assert isinstance(dtype, torch.dtype), f"Invalid {dtype=}"
-            return dtype
+        if isinstance(value, str):
+            dtype = getattr(torch, value, None)
+            if not isinstance(dtype, torch.dtype):
+                raise ValueError(f"Invalid dtype string: {value!r}")
+            return dtype
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 287 - 319, The field validator CacheConfig._coerce_dtype should not use
assert for validating string dtype names; instead, when value is a str look up
the attribute on torch (as currently done), and if the lookup yields None or a
non-torch.dtype raise a clear exception (e.g., ValueError) with the invalid
input included; update the validator in CacheConfig (decorated with
`@field_validator`("dtype", "mamba_dtype", "delta_dtype", mode="before")) to
replace the assert with an explicit raise that reports the offending value and
expected torch.dtype so invalid dtype strings are rejected reliably in all
runtime modes.

Comment on lines +434 to +436
# will store num_blocks later...
self._num_blocks = None


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, check if the file exists and get its basic structure
file_path="tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py"
if [ -f "$file_path" ]; then
  echo "=== File size ===" 
  wc -l "$file_path"
  echo ""
  echo "=== Lines 430-440 (around line 434) ===" 
  sed -n '430,440p' "$file_path"
  echo ""
  echo "=== Lines 610-640 (around lines 617-636) ===" 
  sed -n '610,640p' "$file_path"
else
  echo "File not found"
fi

Repository: NVIDIA/TensorRT-LLM

Length of output: 2110


🏁 Script executed:

#!/bin/bash
# Find SequenceInfo class definition and num_blocks property
rg -n "class SequenceInfo" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 5
echo "---"
rg -n "def num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -B 2 -A 5
echo "---"
rg -n "@property" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 3 | grep -A 3 "num_blocks"

Repository: NVIDIA/TensorRT-LLM

Length of output: 1047


🏁 Script executed:

#!/bin/bash
# Find all usages of num_blocks and estimate_cache_loc_capacity
echo "=== Usages of num_blocks ===" 
rg -n "\.num_blocks" -g'*.py'
echo ""
echo "=== Calls to estimate_cache_loc_capacity ===" 
rg -n "estimate_cache_loc_capacity" -g'*.py'

Repository: NVIDIA/TensorRT-LLM

Length of output: 9127


🏁 Script executed:

#!/bin/bash
# Check the initialization sequence in interface.py around line 525
echo "=== Interface.py context (line 520-535) ===" 
sed -n '520,535p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Lines 1165-1180 in attention_interface.py ===" 
sed -n '1165,1180p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Lines 1215-1225 in trtllm_attention.py ===" 
sed -n '1215,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2011


🏁 Script executed:

#!/bin/bash
echo "=== cache_backend.py context (line 190-200) ===" 
sed -n '190,200p' tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py
echo ""
echo "=== pt_cache_backend.py context (line 220-240) ===" 
sed -n '220,240p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== pt_cache_backend.py context (line 330-345) ===" 
sed -n '330,345p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2773


🏁 Script executed:

#!/bin/bash
echo "=== Test case (test_resource_handlers.py lines 50-60) ===" 
sed -n '50,60p' tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_resource_handlers.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 481


🏁 Script executed:

#!/bin/bash
echo "=== Looking for SequenceInfo instantiation patterns ===" 
rg -n "SequenceInfo(" -B 2 -A 5 tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | head -40
echo ""
echo "=== Check the full __init__ method of SequenceInfo ===" 
sed -n '322,450p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 7826


🏁 Script executed:

#!/bin/bash
echo "=== Check trtllm_attention.py to see initialize call context ===" 
sed -n '1200,1230p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py
echo ""
echo "=== Check the full initialize method in pt_cache_backend ===" 
sed -n '200,230p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2668


🏁 Script executed:

#!/bin/bash
echo "=== Checking the estimate_cache_tokens_per_forward method ===" 
sed -n '622,631p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Full context of lines 617-645 ===" 
sed -n '617,645p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2142


🏁 Script executed:

#!/bin/bash
echo "=== Check where _setup_pt_cache_backend is called ===" 
rg -n "_setup_pt_cache_backend" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 3 -A 1
echo ""
echo "=== Check the full trtllm_attention initialization flow ===" 
sed -n '1185,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py
echo ""
echo "=== Check if there are any early accesses before estimate_cache_loc_capacity ===" 
rg -n "\.num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 5

Repository: NVIDIA/TensorRT-LLM

Length of output: 2848


🏁 Script executed:

#!/bin/bash
echo "=== Search for where initialize() is called on cache backends ===" 
rg -n "\.initialize\(" tensorrt_llm/_torch/auto_deploy/custom_ops/ -B 3 -A 1

Repository: NVIDIA/TensorRT-LLM

Length of output: 1146


🏁 Script executed:

#!/bin/bash
echo "=== Check CachedSequenceInterface to see full initialization flow ===" 
sed -n '450,550p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Look for where set_model_config is called ===" 
rg -n "set_model_config" tensorrt_llm/_torch/auto_deploy/ -B 2 -A 2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 7632


🏁 Script executed:

#!/bin/bash
echo "=== Check setup_pt_cache_backend context in trtllm_attention ===" 
sed -n '1150,1230p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 3421


🏁 Script executed:

#!/bin/bash
echo "=== Check allocate() method that accesses num_blocks ===" 
sed -n '1160,1185p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Check cache_backend.py initialize() that accesses num_blocks ===" 
sed -n '188,210p' tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py
echo ""
echo "=== Check when cache backends' initialize is called in the flow ===" 
rg -n "_allocate_unmanaged_resources\|handler.allocate" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 2 -A 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 2131


🏁 Script executed:

#!/bin/bash
echo "=== Search for all SequenceInfo creations ===" 
rg -n "SequenceInfo(" --type py -B 1 -A 3 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 202


🏁 Script executed:

#!/bin/bash
echo "=== Check _allocate_unmanaged_resources in detail ===" 
sed -n '475,515p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Check the order of calls in __init__ ===" 
rg -n "def __init__" tensorrt_llm/_torch/auto_deploy/shim/interface.py -A 50 | head -70

Repository: NVIDIA/TensorRT-LLM

Length of output: 5344


🏁 Script executed:

#!/bin/bash
echo "=== Check KVPagedResourceHandler.allocate method ===" 
rg -n "class KVPagedResourceHandler" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 100 | grep -A 50 "def allocate"

Repository: NVIDIA/TensorRT-LLM

Length of output: 2409


🏁 Script executed:

#!/bin/bash
echo "=== Full _create_kv_cache_manager method ===" 
sed -n '403,530p' tensorrt_llm/_torch/auto_deploy/shim/interface.py | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 4804


🏁 Script executed:

#!/bin/bash
echo "=== Where is _allocate_unmanaged_resources called ===" 
rg -n "_allocate_unmanaged_resources" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 5 -A 2
echo ""
echo "=== Where is _create_kv_cache_manager called ===" 
rg -n "_create_kv_cache_manager" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 3 -A 3

Repository: NVIDIA/TensorRT-LLM

Length of output: 2906


🏁 Script executed:

#!/bin/bash
echo "=== Check if allocate can be called before estimate_cache_loc_capacity ===" 
rg -n "allocate\|estimate_cache_loc_capacity" tensorrt_llm/_torch/auto_deploy/shim/interface.py | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 137


🏁 Script executed:

#!/bin/bash
echo "=== Check if TRT-LLM flow is independent ===" 
sed -n '1150,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | grep -n "num_blocks\|initialize"

Repository: NVIDIA/TensorRT-LLM

Length of output: 759


🏁 Script executed:

#!/bin/bash
echo "=== Check the complete sequence in interface.py from line 640-660 ===" 
sed -n '640,660p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Look for where estimate_cache_loc_capacity is called ===" 
sed -n '520,530p' tensorrt_llm/_torch/auto_deploy/shim/interface.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1760


🏁 Script executed:

#!/bin/bash
echo "=== Check pt_cache_backend.initialize at line 225 ===" 
sed -n '220,230p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== Check if there's any early initialization ===" 
rg -n "\.initialize\(" tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== Check the test to see the expected order ===" 
sed -n '50,75p' tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_resource_handlers.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1556


🏁 Script executed:

#!/bin/bash
echo "=== Check where _setup_pt_cache_backend is called ===" 
rg -n "_setup_pt_cache_backend\|setup_pt_cache_backend" -g'*.py'
echo ""
echo "=== Check if there are independent calls ===" 
rg -n "\.pt_cache_backend" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 714


🏁 Script executed:

#!/bin/bash
echo "=== Check the property that raises the assertion ===" 
sed -n '617,621p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Check what happens during initialization without estimate_cache_loc_capacity ===" 
python3 << 'PYTHON_CODE'
# Simulate the scenario
class SequenceInfo:
    def __init__(self, max_seq_len, max_batch_size, tokens_per_block=None, max_num_tokens=None):
        self.max_seq_len = max_seq_len
        self.max_batch_size = max_batch_size
        self.tokens_per_block = tokens_per_block or max_seq_len
        self.max_num_tokens = max_num_tokens or (max_seq_len + 1) * max_batch_size
        self._num_blocks = None  # Currently None
    
    @property
    def num_blocks(self):
        assert self._num_blocks is not None, "num_blocks not set yet"
        return self._num_blocks
    
    def estimate_cache_loc_capacity(self, num_blocks):
        self._num_blocks = num_blocks

# Test: can we derive _num_blocks in __init__?
import math
max_num_tokens = 256 * 4  # 256 seq_len, 4 batch
tokens_per_block = 256
derived_blocks = math.ceil(max_num_tokens / tokens_per_block)
print(f"max_num_tokens={max_num_tokens}, tokens_per_block={tokens_per_block}")
print(f"Derived _num_blocks would be: {derived_blocks}")
print(f"This is a reasonable default until estimate_cache_loc_capacity() is called")
PYTHON_CODE

Repository: NVIDIA/TensorRT-LLM

Length of output: 500


🏁 Script executed:

#!/bin/bash
echo "=== Look for test cases that might hit the assertion ===" 
find tests -name "*.py" -type f -exec grep -l "num_blocks\|SequenceInfo" {} \; | head -10 | xargs -I {} sh -c 'echo "=== {} ===" && grep -n "num_blocks\|estimate_cache_loc_capacity" {}'

Repository: NVIDIA/TensorRT-LLM

Length of output: 3960


🏁 Script executed:

#!/bin/bash
echo "=== Check for potential unguarded accesses ===" 
rg -n "sequence_info\.num_blocks|si\.num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 2 -A 1
echo ""
echo "=== Verify the actual initialization order in the TRT-LLM flow ===" 
sed -n '1150,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 3908


Initialize _num_blocks with a computed default in __init__ instead of None.

SequenceInfo.num_blocks asserts _num_blocks is set, but it's only assigned in estimate_cache_loc_capacity(). Code paths that create PTCacheBackend and call initialize() directly (e.g., trtllm_attention.py line 1216) will hit this assertion before estimate_cache_loc_capacity() is called.

Initialize _num_blocks in __init__ using a reasonable estimate based on available parameters (e.g., ceil(max_num_tokens / tokens_per_block)), which will later be refined by estimate_cache_loc_capacity().
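
A minimal sketch of that suggestion, with constructor arguments assumed from the discussion above rather than the exact SequenceInfo signature:

import math
from typing import Optional

class SequenceInfoSketch:
    def __init__(self, max_seq_len: int, max_batch_size: int,
                 tokens_per_block: Optional[int] = None, max_num_tokens: Optional[int] = None):
        self.tokens_per_block = tokens_per_block or max_seq_len
        self.max_num_tokens = max_num_tokens or (max_seq_len + 1) * max_batch_size
        # Conservative default so num_blocks is usable immediately; refined later
        # by estimate_cache_loc_capacity().
        self._num_blocks = math.ceil(self.max_num_tokens / self.tokens_per_block)

    @property
    def num_blocks(self) -> int:
        return self._num_blocks

    def estimate_cache_loc_capacity(self, num_blocks: int) -> None:
        self._num_blocks = num_blocks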

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 434 - 436, SequenceInfo currently leaves self._num_blocks as None (set in
__init__), causing an assertion in SequenceInfo.num_blocks when code paths
(e.g., PTCacheBackend -> initialize()) access it before
estimate_cache_loc_capacity() runs; fix by initializing self._num_blocks in
__init__ to a sensible default (for example use math.ceil(max_num_tokens /
tokens_per_block) or another conservative estimate derived from constructor args
like max_num_tokens and tokens_per_block) so num_blocks is valid immediately and
later refined inside estimate_cache_loc_capacity(); update __init__ where
self._num_blocks is set and leave estimate_cache_loc_capacity() to overwrite
with the accurate value.

Comment on lines +1 to +2
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

⚠️ Potential issue | 🟡 Minor

Update copyright year to 2026.

This file is newly added/modified in 2026 but the header still reads 2025.

Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py` around lines 1 -
2, Update the copyright header in
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py by changing the year
from 2025 to 2026 in the SPDX header lines (the two top-of-file lines starting
with "# SPDX-FileCopyrightText" and "# SPDX-License-Identifier") so the header
reflects the latest modification year.

Comment on lines +321 to 333
    # CACHES - combined KV cache with shape [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]
    kv_cache: torch.Tensor,
    # CONSTANTS
    scale: Optional[float],
    k_scale: float,
    v_scale: float,
) -> torch.Tensor:
    # reshape to standard [b*s, n_heads, head_dim] layout
    head_dim = k_cache.shape[-1]
    # kv_cache shape: [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim] (HND layout)
    head_dim = kv_cache.shape[-1]
    page_size = kv_cache.shape[3]  # tokens_per_block
    q_shape_og = q.shape
    b, s = q_shape_og[:2]


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Verification scripts (rg/sed/grep over flashinfer_attention.py, attention_interface.py, and the FlashInfer attention unit tests) traced how _GlobalFlashInferPlanner.kv_layout is configured, where page_size/tokens_per_block is derived, and how KVPagedResourceHandler allocates the cache for the HND and NHD layouts.


Derive page_size from kv_layout to keep NHD support correct

The page_size calculation at line 331 uses kv_cache.shape[3], which is correct only for HND layout [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]. For NHD layout [num_blocks, tokens_per_block, 2, num_kv_heads, head_dim], tokens_per_block is at dimension 1, not dimension 3, so the plans will be created with an incorrect page size if NHD is ever used.

-    # kv_cache shape: [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim] (HND layout)
+    # kv_cache shape depends on kv_layout (HND or NHD)
     head_dim = kv_cache.shape[-1]
-    page_size = kv_cache.shape[3]  # tokens_per_block
+    if _GlobalFlashInferPlanner.kv_layout == "HND":
+        page_size = kv_cache.shape[3]  # tokens_per_block
+    else:  # NHD
+        page_size = kv_cache.shape[1]  # tokens_per_block
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py` around
lines 321 - 333, The page_size is being taken as kv_cache.shape[3] which assumes
HND layout; update the logic that computes page_size in the function (where
kv_cache, kv_layout and page_size are used) to derive tokens_per_block based on
kv_layout: if kv_layout indicates HND use shape[3], if it indicates NHD use
shape[1]; replace the hardcoded index with this conditional access so plan
creation and downstream uses of page_size are correct for both layouts.

Comment on lines +1 to +2
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

⚠️ Potential issue | 🟡 Minor

Update copyright year to 2026.

Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py` around lines
1 - 2, Update the SPDX header year from 2025 to 2026 at the top of the file;
specifically edit the copyright header lines in
tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (the
SPDX-FileCopyrightText and/or SPDX-License-Identifier header block) to reflect
2026 as the latest meaningful modification year.

Comment on lines 222 to 305
@dataclass
class TrtllmLayerState:
"""Per-layer state for TRT-LLM attention wrapper."""

layer_idx: int
num_heads: int
num_kv_heads: int
head_dim: int
tokens_per_block: int
max_num_requests: int
max_context_length: int
num_layers: int = 0 # Total number of layers for block offset calculation

# Pre-allocated tensors for metadata translation
# Device tensors
sequence_length: torch.Tensor = field(default=None)
context_lengths: torch.Tensor = field(default=None)
kv_cache_block_offsets: torch.Tensor = field(default=None)

# Host tensors (pinned for async H2D)
host_past_key_value_lengths: torch.Tensor = field(default=None)
host_context_lengths: torch.Tensor = field(default=None)
host_request_types: torch.Tensor = field(default=None)
host_total_kv_lens: torch.Tensor = field(default=None)
host_kv_cache_pool_pointers: torch.Tensor = field(default=None)
host_kv_cache_pool_mapping: torch.Tensor = field(default=None)

# Interleaved KV cache buffer for kernel (allocated lazily)
interleaved_kv_cache: torch.Tensor = field(default=None)

def __post_init__(self):
"""Allocate pre-sized tensors."""
if self.sequence_length is None:
device = "cuda"

# Device tensors
self.sequence_length = torch.zeros(
self.max_num_requests, dtype=torch.int32, device=device
)
self.context_lengths = torch.zeros(
self.max_num_requests, dtype=torch.int32, device=device
)

# Pre-allocate kv_cache_block_offsets with MAX size for CUDA graph stability
max_blocks_per_seq = (
self.max_context_length + self.tokens_per_block - 1
) // self.tokens_per_block
self.kv_cache_block_offsets = torch.zeros(
1, # num_pools
self.max_num_requests,
2, # K and V
max_blocks_per_seq,
dtype=torch.int32,
device=device,
)

# Host tensors (pinned memory for async transfers)
self.host_past_key_value_lengths = torch.zeros(
self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
)
self.host_context_lengths = torch.zeros(
self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
)
self.host_request_types = torch.zeros(
self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
)
self.host_total_kv_lens = torch.zeros(
2, dtype=torch.int64, device="cpu", pin_memory=True
)
# Pool pointers: [num_pools, 2] where each row is [k_cache_ptr, v_cache_ptr]
# thop.attention expects 2D tensor: [num_pools, 2]
self.host_kv_cache_pool_pointers = torch.zeros(
1, 2, dtype=torch.int64, device="cpu", pin_memory=True
)
# Pool mapping: 2D [num_layers, 2] format expected by thop.attention
# pool_mapping[layer, 0] = pool_idx (0 for single pool)
# pool_mapping[layer, 1] = layer_offset (0 when using per-layer pointers)
# Use max 256 layers to cover most models
max_layers = 256
self.host_kv_cache_pool_mapping = torch.zeros(
max_layers, 2, dtype=torch.int32, device="cpu", pin_memory=True
)



⚠️ Potential issue | 🟠 Major

TrtllmLayerState hard-codes device to "cuda".

This will allocate on the default CUDA device even if the model runs on a different GPU (or CPU in tests). Please pass the target device into TrtllmLayerState (e.g., via get_or_create_layer_state using kv_cache.device or SequenceInfo.device) and allocate on that device instead.

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py` around lines
222 - 305, TrtllmLayerState currently hardcodes device="cuda" in __post_init__,
causing allocations on the wrong GPU; add a device field to the TrtllmLayerState
dataclass (e.g., device: torch.device) and use that field instead of the string
"cuda" when allocating device tensors in __post_init__, keeping host/pinned
tensors on cpu as before; update callers (e.g., get_or_create_layer_state) to
pass the correct device (kv_cache.device or SequenceInfo.device) when
constructing TrtllmLayerState so allocations follow the model/kv cache device.
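A sketch of the suggested direction, assuming the dataclass otherwise keeps its current fields; only two tensors are shown, and the caller wiring (passing kv_cache.device) is illustrative:

from dataclasses import dataclass

import torch


@dataclass
class TrtllmLayerState:
    layer_idx: int
    max_num_requests: int
    # Explicit target device instead of a hard-coded "cuda" string; callers
    # pass kv_cache.device (or SequenceInfo.device) when constructing the state.
    device: torch.device = torch.device("cuda")
    sequence_length: torch.Tensor = None
    context_lengths: torch.Tensor = None

    def __post_init__(self) -> None:
        if self.sequence_length is None:
            self.sequence_length = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device=self.device
            )
            self.context_lengths = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device=self.device
            )
            # Host/pinned tensors remain on CPU exactly as before.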

Comment on lines 475 to 626
def _prepare_trtllm_metadata(
batch_info_host: torch.Tensor,
cu_seqlen_host: torch.Tensor,
cu_num_pages: torch.Tensor,
cu_num_pages_host: torch.Tensor,
cache_loc: torch.Tensor,
last_page_len: torch.Tensor,
last_page_len_host: torch.Tensor,
seq_len_with_cache_host: torch.Tensor,
state: TrtllmLayerState,
kv_cache: torch.Tensor,
ad_pool_pointers: Optional[torch.Tensor] = None,
ad_pool_mapping: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, ...]:
"""Prepare TRT-LLM metadata from AD metadata.

For CUDA graph support (like pt_cache_backend):
- During capture: Set host tensors to MAX, skip device operations
- Outside capture: Normal operation

Args:
batch_info_host: [num_prefill, num_prefill_tokens, num_decode]
cu_seqlen_host: Cumulative sequence lengths [num_seq + 1]
cu_num_pages: Cumulative page counts [num_seq + 1]
cu_num_pages_host: Same as cu_num_pages but on host
cache_loc: Flat page indices for all sequences
last_page_len: Tokens in last page per sequence
last_page_len_host: Same on host
seq_len_with_cache_host: Total seq length including cached tokens
state: Per-layer TRT-LLM state
kv_cache: Unified KV cache tensor [num_blocks, kv_factor=2, num_kv_heads, tokens_per_block, head_dim]
ad_pool_pointers: Optional AD pool pointers from KVCacheManager (shape: [num_pools, 2])
ad_pool_mapping: Optional AD pool mapping from KVCacheManager (shape: [num_layers, 2])

Returns:
Tuple of tensors needed by thop.attention
"""
num_prefill, num_prefill_tokens, num_decode = batch_info_host.tolist()
num_seq = num_prefill + num_decode

# Check if in CUDA graph capture mode
is_capturing = torch.cuda.is_current_stream_capturing()

# Compute input sequence lengths from cumulative sums
input_seq_lens = (cu_seqlen_host[1 : num_seq + 1] - cu_seqlen_host[:num_seq]).int()
seq_len_with_cache = seq_len_with_cache_host[:num_seq].int()
past_kv_lens = seq_len_with_cache - input_seq_lens.cpu()

# CUDA GRAPH FIX: Set host tensors to MAX during capture (like pt_cache_backend)
if is_capturing:
max_seq = state.max_context_length
state.host_past_key_value_lengths[:num_seq].fill_(max_seq)
state.host_context_lengths[:num_seq].fill_(max_seq)
state.host_request_types[:num_seq].fill_(1)
state.host_total_kv_lens[0] = 0
state.host_total_kv_lens[1] = max_seq * num_seq
else:
# Normal operation: fill host tensors
state.host_past_key_value_lengths[:num_seq].copy_(past_kv_lens)
state.host_context_lengths[:num_seq].copy_(input_seq_lens.cpu())
state.host_request_types[:num_prefill].fill_(0)
state.host_request_types[num_prefill:num_seq].fill_(1)
context_total_kv = seq_len_with_cache[:num_prefill].sum().item() if num_prefill > 0 else 0
gen_total_kv = seq_len_with_cache[num_prefill:num_seq].sum().item() if num_decode > 0 else 0
state.host_total_kv_lens[0] = context_total_kv
state.host_total_kv_lens[1] = gen_total_kv

# Device operations - skip during capture (like pt_cache_backend's skip_device_ops)
if not is_capturing:
# Sync before copy to catch any previous async errors
torch.cuda.synchronize()

# Copy to pre-allocated tensors
state.sequence_length[:num_seq].copy_(seq_len_with_cache.cuda())
state.context_lengths[:num_seq].copy_(input_seq_lens.cuda())

# Validate kv_cache shape (safe during capture - no device ops)
if len(kv_cache.shape) != 5 or kv_cache.shape[1] != 2:
raise RuntimeError(
f"Expected kv_cache shape [pages, 2, heads, tokens, dim], got {kv_cache.shape}"
)

num_layers = state.num_layers if state.num_layers > 0 else 32

# Pool pointer and block offset setup - skip during capture (contains .item() calls)
if not is_capturing:
# Set up KV cache pool pointers
use_ad_pool = (
ad_pool_pointers is not None
and ad_pool_mapping is not None
and ad_pool_pointers.numel() > 0
and ad_pool_pointers[0, 0].item() != 0
)

if not use_ad_pool:
raise RuntimeError(
f"AD pool not available. ad_pool_pointers={ad_pool_pointers}, "
f"ad_pool_mapping={ad_pool_mapping}"
)

# Use AD's pool pointers directly
state.host_kv_cache_pool_pointers[0, 0] = ad_pool_pointers[0, 0].item()
state.host_kv_cache_pool_pointers[0, 1] = 0

# Use AD's pool mapping directly
for layer_i in range(min(num_layers, ad_pool_mapping.shape[0])):
state.host_kv_cache_pool_mapping[layer_i, 0] = ad_pool_mapping[layer_i, 0].item()
state.host_kv_cache_pool_mapping[layer_i, 1] = ad_pool_mapping[layer_i, 1].item()

# Log pool setup for debugging (only once)
if state.layer_idx == 0 and not hasattr(state, "_pool_logged"):
state._pool_logged = True
ad_logger.debug(
f"[TRT-LLM Attention] Using AD pool directly: "
f"pool_ptr={state.host_kv_cache_pool_pointers[0, 0]}"
)

# Block offsets: convert flat cache_loc to per-sequence block indices
pages_per_seq = (cu_num_pages_host[1 : num_seq + 1] - cu_num_pages_host[:num_seq]).int()
max_blocks = pages_per_seq.max().item() if num_seq > 0 else 1
_global_state.set_max_blocks_per_seq(max_blocks)

# kv_cache_block_offsets is pre-allocated in __post_init__, don't reallocate

# Fill block offsets
kv_factor = 2
multiplier = num_layers * kv_factor
state.kv_cache_block_offsets.zero_()
offset = 0
for i in range(num_seq):
n_pages = pages_per_seq[i].item()
if n_pages > 0:
base_offsets = cache_loc[offset : offset + n_pages] * multiplier
state.kv_cache_block_offsets[0, i, 0, :n_pages] = base_offsets
state.kv_cache_block_offsets[0, i, 1, :n_pages] = base_offsets + 1
offset += n_pages

# Return tensors
# Use pre-allocated tensor size for block offsets (CUDA graph compatibility)
max_blocks_per_seq = state.kv_cache_block_offsets.shape[3]

return (
state.sequence_length[:num_seq],
state.host_past_key_value_lengths[:num_seq],
state.host_total_kv_lens,
state.context_lengths[:num_seq],
state.host_context_lengths[:num_seq],
state.host_request_types[:num_seq],
state.kv_cache_block_offsets[:, :num_seq, :, :max_blocks_per_seq],
state.host_kv_cache_pool_pointers,
state.host_kv_cache_pool_mapping,
)

⚠️ Potential issue | 🟠 Major

Fallback path still requires AD pool pointers.

_prepare_trtllm_metadata raises if ad_pool_pointers/ad_pool_mapping are missing, but the KV cache handler includes a fallback allocation path. If that fallback path is ever used (e.g., no KVCacheManager), this will crash at runtime. Consider failing fast earlier (during allocation) with a clearer error, or implementing a non-pool-pointer metadata path for the fallback cache.

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 478-478: Unused function argument: cu_num_pages

(ARG001)


[warning] 481-481: Unused function argument: last_page_len

(ARG001)


[warning] 482-482: Unused function argument: last_page_len_host

(ARG001)


[warning] 512-512: Unpacked variable num_prefill_tokens is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 553-555: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 570-573: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py` around lines
475 - 626, The function _prepare_trtllm_metadata currently raises if
ad_pool_pointers/ad_pool_mapping are missing which breaks the KV cache handler's
fallback allocation path; change the logic so when use_ad_pool is False you do
not raise but instead initialize predictable defaults for pool pointers/mapping
(e.g., zeros or sentinel values) and proceed to compute kv_cache_block_offsets
from cache_loc/pages_per_seq for the fallback cache layout (ensure
state.host_kv_cache_pool_pointers and state.host_kv_cache_pool_mapping are set
to valid defaults and any logging reflects fallback use), or alternatively move
the RuntimeError to an earlier allocation phase so missing AD pool pointers fail
fast during allocation rather than here (update checks around use_ad_pool,
state.host_kv_cache_pool_pointers, state.host_kv_cache_pool_mapping, and the
block-offset filling loop accordingly).
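A small sketch of the fail-fast alternative mentioned above, checking the pool handles at allocation time rather than inside _prepare_trtllm_metadata; the function name validate_ad_pool is hypothetical:

from typing import Optional

import torch


def validate_ad_pool(
    ad_pool_pointers: Optional[torch.Tensor],
    ad_pool_mapping: Optional[torch.Tensor],
) -> None:
    """Raise during cache allocation if the KVCacheManager pool is unavailable."""
    if (
        ad_pool_pointers is None
        or ad_pool_mapping is None
        or ad_pool_pointers.numel() == 0
        or int(ad_pool_pointers[0, 0].item()) == 0
    ):
        raise RuntimeError(
            "TRT-LLM attention backend requires a KVCacheManager-backed pool; "
            "the fallback allocation path is not supported with this backend."
        )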

Comment on lines +458 to +467
@field_validator("max_seq_len", mode="before")
@classmethod
def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
if value is None:
# Fallback to the AutoDeployConfig default when not provided
return AutoDeployConfig.model_fields["max_seq_len"].get_default(
call_default_factory=True
)
return value


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verification scripts inspected tensorrt_llm/_torch/auto_deploy/llm_args.py (its field_validator methods and _check_for_default_value_only) and the project's ruff.toml, confirming that the info argument is unused in this validator and that Ruff's unused-argument checks are enabled.


Rename unused info parameter to _info

The info parameter is unused in this validator method. Ruff's ARG rule (flake8-unused-arguments) is enabled in the project configuration and would flag this. Renaming to _info follows Python convention for intentionally unused parameters and silences the warning.

-    def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
+    def ensure_max_seq_len(cls, value: Any, _info: ValidationInfo) -> Any:
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 460-460: Unused class method argument: info

(ARG003)

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/llm_args.py` around lines 458 - 467, The
validator ensure_max_seq_len currently declares an unused parameter named info
which triggers the ARG lint rule; rename that parameter to _info in the method
signature of ensure_max_seq_len (the `@field_validator`("max_seq_len",
mode="before") classmethod) so it becomes unused-by-convention and the linter
warning is silenced, leaving the body unchanged and preserving the return
behavior that falls back to
AutoDeployConfig.model_fields["max_seq_len"].get_default(call_default_factory=True)
when value is None.

Comment on lines +785 to +802
regenerated = 0
# Only regenerate k_cache and v_cache (KV caches that are views)
for name in list(self._caches.keys()):
if "k_cache" in name or "v_cache" in name:
if name in self._cache_initializers:
old_ptr = self._caches[name].data_ptr()
# Re-invoke initializer to get new view
self._caches[name] = self._cache_initializers[name](self.info)
new_ptr = self._caches[name].data_ptr()
regenerated += 1
if regenerated <= 2: # Only log first 2
ad_logger.info(
f"[CachedSequenceInterface] Regenerated {name}: "
f"old_ptr=0x{old_ptr:x}, new_ptr=0x{new_ptr:x}, "
f"shape={self._caches[name].shape}"
)

ad_logger.info(f"[CachedSequenceInterface] Regenerated {regenerated} cache views")

⚠️ Potential issue | 🟠 Major

Update cache-view regeneration for combined kv_cache naming

The regeneration logic only looks for "k_cache"/"v_cache". With the new "kv_cache_*" resources, views won’t be refreshed after a PTCacheBackend pool reallocation, leaving stale pointers.

✅ Suggested fix
-        # Only regenerate k_cache and v_cache (KV caches that are views)
+        # Regenerate combined kv_cache (and legacy k/v cache views) after pool reallocation
         for name in list(self._caches.keys()):
-            if "k_cache" in name or "v_cache" in name:
+            if "kv_cache" in name or "k_cache" in name or "v_cache" in name:
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py` around lines 785 - 802,
The cache-refresh loop in CachedSequenceInterface only checks for "k_cache" or
"v_cache" and therefore misses new combined keys like "kv_cache_*"; update the
condition in the for-loop that iterates self._caches.keys() to also detect
combined kv names (e.g., check for "kv_cache" or name.startswith("kv_cache_") or
a substring match for "kv_cache") so that entries created from
self._cache_initializers are re-invoked and replaced (preserving the existing
logic that calls self._cache_initializers[name](self.info), compares data_ptrs,
increments regenerated, and logs via ad_logger.info).

Comment on lines +244 to +248
@pytest.mark.skipif(not HAS_PT_CACHE_BACKEND, reason="PTCacheBackend not available")
@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
@pytest.mark.skipif(
torch.cuda.get_device_capability()[0] < 8, reason="CUDA graphs require SM 8.0+"
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Verification scripts located the test file and surveyed existing torch.cuda.get_device_capability() usage across the repository; the web query below confirms the call's behavior on CPU-only PyTorch builds.

🌐 Web query:

torch.cuda.get_device_capability() CPU-only builds behavior exception

💡 Result:

In CPU-only PyTorch builds (i.e., PyTorch not built with CUDA support), calling torch.cuda.get_device_capability() will typically raise an error like:

  • AssertionError: Torch not compiled with CUDA enabled

because torch.cuda is lazily initialized and any call that needs CUDA triggers initialization, which fails when CUDA support isn’t compiled in. The docs explicitly note the lazy-init model and that you should use torch.cuda.is_available() to check usability. [1] The get_device_capability API itself is only meaningful when a CUDA device exists. [2]

Robust pattern (works on CPU-only + CUDA builds):

import torch

if torch.backends.cuda.is_built() and torch.cuda.is_available():
    cc = torch.cuda.get_device_capability()
else:
    cc = None  # CPU-only build, or no working CUDA device/driver

torch.backends.cuda.is_built() tells you whether the installed PyTorch binary was built with CUDA support. [3]

If you need “some capability” on CPU-only machines: there is no CUDA compute capability to return—treat it as None (or raise your own clear error) rather than calling torch.cuda.get_device_capability().

Sources: PyTorch CUDA lazy initialization and is_available() guidance [1], torch.cuda.get_device_capability docs [2], torch.backends.cuda.is_built() docs [3], example of the CPU-only build assertion coming from CUDA lazy init [4].

References
[1] https://docs.pytorch.org/docs/stable/cuda.html
[2] https://docs.pytorch.org/docs/2.9/generated/torch.cuda.get_device_capability.html
[3] https://docs.pytorch.org/docs/stable/backends.html
[4] https://bugs.launchpad.net/bugs/2051023


Guard CUDA capability check to prevent import-time failures on CPU-only builds

In @pytest.mark.skipif decorators, each condition expression is evaluated independently at import time. Even though line 245 checks torch.cuda.is_available(), the subsequent decorator on lines 246–248 still evaluates torch.cuda.get_device_capability(), which raises AssertionError on CPU-only PyTorch builds.

Combine the conditions into a single decorator with short-circuit evaluation to prevent the call when CUDA is unavailable:

Suggested fix
 @pytest.mark.skipif(not HAS_PT_CACHE_BACKEND, reason="PTCacheBackend not available")
-@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
-@pytest.mark.skipif(
-    torch.cuda.get_device_capability()[0] < 8, reason="CUDA graphs require SM 8.0+"
+@pytest.mark.skipif(
+    (not torch.cuda.is_available())
+    or (torch.cuda.get_device_capability()[0] < 8),
+    reason="CUDA not available or SM 8.0+ required for CUDA graphs",
 )
🤖 Prompt for AI Agents
In `@tests/test_trtllm_attention_cuda_graph.py` around lines 244 - 248, The three
separate pytest skipif decorators cause torch.cuda.get_device_capability() to be
called at import time even on CPU-only builds; update the decorators so the CUDA
availability and device capability checks are combined into a single skipif
using short-circuit logic (e.g. combine torch.cuda.is_available() and
torch.cuda.get_device_capability()[0] < 8 into one condition) while keeping the
HAS_PT_CACHE_BACKEND check as its own decorator; ensure the combined decorator
uses a clear reason like "CUDA graphs require SM 8.0+ or CUDA not available" so
get_device_capability() is only invoked when CUDA is available.

MrGeva and others added 5 commits February 4, 2026 01:19
Improve performance of thop.attention CUDA graph support by:
- Remove torch.cuda.synchronize() that was blocking CPU
- Replace Python loop with .item() calls with vectorized GPU operations
  using torch.searchsorted and advanced indexing for block offsets
- Compute block offsets once on first layer, copy to remaining 31 layers
- Move GPU tensor creation outside the layer loop
- Use non_blocking=True for device copies

This improves throughput from ~1500 TPS to ~5800 TPS with CUDA graphs.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
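A rough sketch of the vectorization idea from this commit (flat cache_loc scattered into per-sequence block offsets without a Python loop); tensor names follow the commit message, the pool dimension is omitted, and the multiplier handling is simplified:

import torch


def fill_block_offsets(
    cache_loc: torch.Tensor,      # flat page indices on GPU, shape [total_pages]
    cu_num_pages: torch.Tensor,   # cumulative page counts on GPU, shape [num_seq + 1]
    block_offsets: torch.Tensor,  # pre-allocated [num_seq, 2, max_blocks_per_seq], int32
    multiplier: int,              # num_layers * kv_factor
) -> None:
    total_pages = cache_loc.numel()
    positions = torch.arange(total_pages, device=cache_loc.device)
    # Map every flat page to the sequence it belongs to ...
    boundaries = cu_num_pages[1:].to(torch.int64)
    seq_idx = torch.searchsorted(boundaries, positions, right=True)
    # ... and to its position within that sequence.
    page_idx = positions - cu_num_pages[seq_idx].to(torch.int64)
    base = (cache_loc * multiplier).to(block_offsets.dtype)
    # Advanced indexing writes the K and V offsets for all pages at once,
    # replacing the per-sequence Python loop with its .item() calls.
    block_offsets[seq_idx, 0, page_idx] = base
    block_offsets[seq_idx, 1, page_idx] = base + 1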
…utation

Add pre-allocated GPU buffers to TrtllmAttentionGlobalState for
vectorized block offset computation, matching PTCacheBackend's pattern:
- _gpu_cu_pages, _gpu_page_positions, _gpu_seq_idx, _gpu_page_idx,
  _gpu_base_offset buffers allocated once and reused
- Use torch.searchsorted/sub/mul with out= parameter to avoid
  per-call tensor allocations

This eliminates allocation overhead in the hot path.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Since KVCacheManager uses a single unified pool for all layers, most
metadata tensors are identical across layers. This change shares them:

- Add shared tensors to TrtllmAttentionGlobalState (sequence_length,
  context_lengths, kv_cache_block_offsets, host tensors)
- TrtllmLayerState now references shared tensors via init_from_shared()
- Only host_kv_cache_pool_mapping remains per-layer (layer offsets)
- host_prepare_fn updates shared tensors ONCE instead of 32x

This eliminates 32x redundant tensor updates per forward pass,
improving throughput from ~5840 to ~6233 TPS (closer to PTCacheBackend).

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
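An illustrative sketch of the sharing pattern described above (the global state owns one copy of the metadata tensors and each layer keeps references); init_from_shared matches the commit message, while the fields shown are a simplified subset:

import torch


class TrtllmAttentionGlobalState:
    """Owns metadata tensors that are identical across layers."""

    def __init__(self, max_num_requests: int, device: str = "cuda") -> None:
        self.sequence_length = torch.zeros(max_num_requests, dtype=torch.int32, device=device)
        self.context_lengths = torch.zeros(max_num_requests, dtype=torch.int32, device=device)


class TrtllmLayerState:
    def __init__(self, layer_idx: int) -> None:
        self.layer_idx = layer_idx
        self.sequence_length: torch.Tensor = None
        self.context_lengths: torch.Tensor = None

    def init_from_shared(self, shared: TrtllmAttentionGlobalState) -> None:
        # References, not copies: host_prepare_fn fills the shared tensors once
        # per forward pass and every layer observes the same data.
        self.sequence_length = shared.sequence_length
        self.context_lengths = shared.context_lengths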
Optimize thop.attention metadata preparation to outperform PTCacheBackend:

- Add FAST PATH in _prepare_trtllm_metadata: after host_prepare_fn runs,
  each layer's call just returns pre-computed tensors (almost zero work)
- Track host_prepare_called flag to enable fast path during replay
- Cache current_num_seq to avoid parsing batch_info during fast path
- Move pool pointer initialization to be done once in host_prepare_fn

Performance results:
- Optimized non-PTCacheBackend: ~6600 TPS
- PTCacheBackend: ~6528 TPS
- Improvement: ~1.1% faster than PTCacheBackend

The key insight is that during CUDA graph replay, host_prepare_fn runs
once per forward pass and fills all shared tensors. The 32 per-layer
_prepare_trtllm_metadata calls should do almost nothing - just return
pre-computed tensor slices.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
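A sketch of the fast-path flag described in this commit; host_prepare_called and current_num_seq come from the commit message, while the surrounding class and return shape are illustrative only:

from typing import Tuple

import torch


class _GlobalState:
    def __init__(self, max_num_requests: int) -> None:
        self.host_prepare_called = False
        self.current_num_seq = 0
        self.sequence_length = torch.zeros(max_num_requests, dtype=torch.int32)
        self.context_lengths = torch.zeros(max_num_requests, dtype=torch.int32)


def prepare_layer_metadata(gs: _GlobalState) -> Tuple[torch.Tensor, torch.Tensor]:
    # Fast path: host_prepare_fn already filled the shared tensors once for
    # this forward pass, so each per-layer call only returns slices of them.
    if gs.host_prepare_called:
        n = gs.current_num_seq
        return gs.sequence_length[:n], gs.context_lengths[:n]
    raise RuntimeError("host_prepare_fn must run before per-layer metadata preparation")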
Optimize host_prepare_fn to reduce tensor allocation overhead:

- Add pre-allocated pinned CPU buffers for intermediate computations:
  - _cpu_input_seq_lens, _cpu_seq_len_with_cache, _cpu_past_kv_lens
  - _cpu_cu_num_pages, _cpu_pages_per_seq
- Use torch.sub/copy with out= parameters to avoid tensor allocation
- Replace .item() with int() for faster scalar extraction
- Only zero the slice of block_offsets we need ([:, :num_seq, :, :])

Performance results:
- Optimized non-PTCacheBackend: ~6645 TPS
- PTCacheBackend: ~6527 TPS
- Improvement: ~1.8% faster than PTCacheBackend

Note: The remaining ~6.5ms in ad_prepare_inputs is dominated by
framework code in ad_executor.py (Python list operations for request
processing), which is common to both backends.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
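A small sketch of the out=-style buffer reuse from this commit; buffer names follow the commit message and the host tensors are assumed to be int32:

import torch

MAX_NUM_REQUESTS = 64

# Pinned CPU buffers allocated once and reused every forward pass.
_cpu_input_seq_lens = torch.zeros(MAX_NUM_REQUESTS, dtype=torch.int32, pin_memory=True)
_cpu_past_kv_lens = torch.zeros(MAX_NUM_REQUESTS, dtype=torch.int32, pin_memory=True)


def update_host_lengths(
    cu_seqlen_host: torch.Tensor,           # [num_seq + 1], int32, on CPU
    seq_len_with_cache_host: torch.Tensor,  # [num_seq], int32, on CPU
    num_seq: int,
) -> None:
    # input_seq_lens[i] = cu_seqlen[i + 1] - cu_seqlen[i], written in place.
    torch.sub(
        cu_seqlen_host[1 : num_seq + 1],
        cu_seqlen_host[:num_seq],
        out=_cpu_input_seq_lens[:num_seq],
    )
    # past_kv_lens = seq_len_with_cache - input_seq_lens, also in place.
    torch.sub(
        seq_len_with_cache_host[:num_seq],
        _cpu_input_seq_lens[:num_seq],
        out=_cpu_past_kv_lens[:num_seq],
    )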
Clean up code for review by removing the PTCacheBackend alternative:

- Remove PTCacheBackend imports and _HAS_PT_CACHE_BACKEND flag
- Remove use_pt_cache_backend config option and related code
- Remove enable_pt_cache_backend/get_pt_cache_backend/is_pt_cache_backend_enabled
- Remove debug SDPA fallback code
- Remove debug logging statements
- Simplify TrtllmAttentionConfig class
- Clean up related code in kvcache.py and interface.py

The direct AD pool integration (KVCacheManager) is now the only code path,
which is optimized with:
- Pre-allocated CPU/GPU buffers
- Shared tensors across layers
- Vectorized GPU block offset computation
- Host prepare function for CUDA graph support

Performance: ~6650 TPS (1.8% faster than PTCacheBackend baseline)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@MrGeva MrGeva changed the title AutoDeploy trtllm attention backend with trtllm's kv cache manager direct operation Draft: DO NOT REVIEW : AutoDeploy trtllm attention backend with trtllm's kv cache manager direct operation Feb 5, 2026
@MrGeva MrGeva closed this Feb 10, 2026