Draft: DO NOT REVIEW : AutoDeploy trtllm attention backend with trtllm's kv cache manager direct operation #11268
MrGeva wants to merge 9 commits into NVIDIA:main from
Conversation
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
The key fix is to use AD's pool_mapping values directly without multiplying by kv_factor. AD's pool_mapping already provides the correct layer offsets (0, 1, 2, ...) because each layer takes exactly one "block" worth of K+V data in the unified pool, regardless of dtype.

Previously, the code was multiplying layer_idx by kv_factor=2, causing the kernel to compute incorrect addresses:
- Expected layer 1 at: pool_ptr + 1 * block_size
- Got layer 1 at: pool_ptr + 2 * block_size (wrong!)

This fix enables accurate thop.attention execution in AutoDeploy using AD's KVCacheManager pool directly, without needing the PTCacheBackend or intermediate buffers.

Note: CUDA graph support requires use_pt_cache_backend=true due to host operations in metadata preparation.

Signed-off-by: Eli Geva <egeva@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
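For illustration only, a small sketch of the addressing difference described in this commit message; the pool layout, names, and numbers are assumptions, not the actual TRT-LLM kernel code:

import math

# Illustrative sketch (not the real kernel): how the layer offset from
# pool_mapping translates into an address in the unified KV pool, assuming
# each layer occupies exactly one block-sized slice of K+V data.
def block_address(pool_ptr: int, layer_offset: int, block_size: int, kv_factor: int = 1) -> int:
    """Compute the address of a layer's block in the unified pool."""
    return pool_ptr + layer_offset * kv_factor * block_size

pool_ptr, block_size = 0x1000, 4096
# Correct: use AD's pool_mapping offset directly (no kv_factor applied).
assert block_address(pool_ptr, layer_offset=1, block_size=block_size) == 0x1000 + 4096
# Buggy: multiplying by kv_factor=2 lands on layer 2's slice instead of layer 1's.
assert block_address(pool_ptr, 1, block_size, kv_factor=2) == 0x1000 + 2 * 4096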
…cheBackend) Enable CUDA graph support for the thop.attention kernel when use_pt_cache_backend=False. This allows the torch-cudagraph compile backend to work correctly with thop.attention.

Key changes:
- Pre-allocate kv_cache_block_offsets with max size in TrtllmLayerState to ensure stable tensor addresses for CUDA graphs
- Add is_capturing check in _prepare_trtllm_metadata to set host tensors to MAX values and skip device operations during capture
- Add create_host_prepare_function() to TrtllmAttentionGlobalState that creates a host_prepare_fn running outside the graph to update tensors with current batch values before each forward/replay
- Register host_prepare_fn via get_host_prepare_metadata_function() for non-PTCacheBackend mode

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
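For context, a hedged sketch of the general "host prepare outside the graph" pattern this commit relies on; the tensor and function names are placeholders and do not reflect the PR's actual API:

import torch

# Tensors read inside the captured graph must keep stable addresses, so they
# are allocated once at max size; a host function rewrites their contents
# before every replay.
max_batch = 8
seq_lens = torch.zeros(max_batch, dtype=torch.int32, device="cuda")

def host_prepare_fn(current_seq_lens: torch.Tensor) -> None:
    # Runs outside the graph: update the pre-allocated tensor in place.
    seq_lens[: current_seq_lens.numel()].copy_(current_seq_lens, non_blocking=True)

def forward_step() -> torch.Tensor:
    # Stand-in for the attention call that reads seq_lens inside the graph.
    return seq_lens.sum()

graph = torch.cuda.CUDAGraph()
host_prepare_fn(torch.tensor([3, 5], dtype=torch.int32))  # warm values before capture
with torch.cuda.graph(graph):
    out = forward_step()

host_prepare_fn(torch.tensor([7, 2], dtype=torch.int32))  # refresh before replay
graph.replay()  # 'out' now reflects the refreshed metadata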
📝 Walkthrough
This PR restructures the AutoDeploy KV cache management system by introducing typed resource handlers (KVPagedResourceHandler, SSMResourceHandler, CausalConvResourceHandler), removing legacy Triton-based paged KV cache kernels, introducing cache backend abstractions, adding a TRT-LLM attention backend integration, and refactoring cache indexing across attention operators while reorganizing documentation.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (2)
1-9: ⚠️ Potential issue | 🟠 Major
Add NVIDIA copyright header for this source file
Source files require an NVIDIA header with the year of the latest meaningful modification.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
📌 Suggested header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from dataclasses import dataclass, fields

462-469: ⚠️ Potential issue | 🟡 Minor
Silence the unused kv_cache argument in the fake op
Ruff reports kv_cache as unused in the fake implementation. Renaming it avoids lint noise without changing behavior.
🧹 Suggested fix
-    kv_cache: torch.Tensor,
+    _kv_cache: torch.Tensor,

tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

1-8: ⚠️ Potential issue | 🟠 Major
Add NVIDIA copyright header for this source file
Source files require an NVIDIA header with the year of the latest meaningful modification.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
📌 Suggested header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from importlib.resources import files

tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)

1-8: ⚠️ Potential issue | 🟠 Major
Add NVIDIA copyright header for this source file
Source files require an NVIDIA header with the year of the latest meaningful modification.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
📌 Suggested header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import copy
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 287-319: The field validator CacheConfig._coerce_dtype should not
use assert for validating string dtype names; instead, when value is a str look
up the attribute on torch (as currently done), and if the lookup yields None or
a non-torch.dtype raise a clear exception (e.g., ValueError) with the invalid
input included; update the validator in CacheConfig (decorated with
`@field_validator`("dtype", "mamba_dtype", "delta_dtype", mode="before")) to
replace the assert with an explicit raise that reports the offending value and
expected torch.dtype so invalid dtype strings are rejected reliably in all
runtime modes.
- Around line 434-436: SequenceInfo currently leaves self._num_blocks as None
(set in __init__), causing an assertion in SequenceInfo.num_blocks when code
paths (e.g., PTCacheBackend -> initialize()) access it before
estimate_cache_loc_capacity() runs; fix by initializing self._num_blocks in
__init__ to a sensible default (for example use math.ceil(max_num_tokens /
tokens_per_block) or another conservative estimate derived from constructor args
like max_num_tokens and tokens_per_block) so num_blocks is valid immediately and
later refined inside estimate_cache_loc_capacity(); update __init__ where
self._num_blocks is set and leave estimate_cache_loc_capacity() to overwrite
with the accurate value.
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py`:
- Around line 1-2: Update the copyright header in
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py by changing the year
from 2025 to 2026 in the SPDX header lines (the two top-of-file lines starting
with "# SPDX-FileCopyrightText" and "# SPDX-License-Identifier") so the header
reflects the latest modification year.
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py`:
- Around line 321-333: The page_size is being taken as kv_cache.shape[3] which
assumes HND layout; update the logic that computes page_size in the function
(where kv_cache, kv_layout and page_size are used) to derive tokens_per_block
based on kv_layout: if kv_layout indicates HND use shape[3], if it indicates NHD
use shape[1]; replace the hardcoded index with this conditional access so plan
creation and downstream uses of page_size are correct for both layouts.
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py`:
- Around line 417-472: get_contiguous_caches currently assumes a single shared
contiguous buffer and raises if per-layer kv head counts differ; update it to
handle per-layer kv head variance by either allocating per-layer contiguous
buffers or moving the uniformity check to initialization with a clear validation
error. Concretely: in PTCacheBackend.get_contiguous_caches, when
self._shared_contiguous_k_cache/_shared_contiguous_v_cache would be created,
branch on whether max(self._config.num_kv_heads_per_layer) ==
self._config.num_kv_heads_per_layer[layer_idx]; if not, allocate per-layer
buffers (e.g., store dict/list keyed by layer_idx instead of the single
_shared_contiguous_k_cache/_shared_contiguous_v_cache) using pool shape from
self._kv_cache_manager.get_primary_pool_data(layer_idx) and the layer-specific
num_kv_heads, or alternatively add validation in the initializer (check
self._config.num_kv_heads_per_layer uniformity while self._initialized is set)
and raise a clear RuntimeError there; ensure logging (ad_logger.info) reflects
per-layer allocation and keep existing use of self._device and dtype.
- Around line 1-2: Update the SPDX header year from 2025 to 2026 at the top of
the file; specifically edit the copyright header lines in
tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (the
SPDX-FileCopyrightText and/or SPDX-License-Identifier header block) to reflect
2026 as the latest meaningful modification year.
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py`:
- Around line 475-626: The function _prepare_trtllm_metadata currently raises if
ad_pool_pointers/ad_pool_mapping are missing which breaks the KV cache handler's
fallback allocation path; change the logic so when use_ad_pool is False you do
not raise but instead initialize predictable defaults for pool pointers/mapping
(e.g., zeros or sentinel values) and proceed to compute kv_cache_block_offsets
from cache_loc/pages_per_seq for the fallback cache layout (ensure
state.host_kv_cache_pool_pointers and state.host_kv_cache_pool_mapping are set
to valid defaults and any logging reflects fallback use), or alternatively move
the RuntimeError to an earlier allocation phase so missing AD pool pointers fail
fast during allocation rather than here (update checks around use_ad_pool,
state.host_kv_cache_pool_pointers, state.host_kv_cache_pool_mapping, and the
block-offset filling loop accordingly).
- Around line 222-305: TrtllmLayerState currently hardcodes device="cuda" in
__post_init__, causing allocations on the wrong GPU; add a device field to the
TrtllmLayerState dataclass (e.g., device: torch.device) and use that field
instead of the string "cuda" when allocating device tensors in __post_init__,
keeping host/pinned tensors on cpu as before; update callers (e.g.,
get_or_create_layer_state) to pass the correct device (kv_cache.device or
SequenceInfo.device) when constructing TrtllmLayerState so allocations follow
the model/kv cache device.
- Around line 1-2: Update the copyright header in trtllm_attention.py to show
the latest modification year 2026 by changing the existing copyright line(s)
that currently show 2025 to 2026; specifically edit the top-of-file SPDX header
lines in trtllm_attention.py (the lines beginning with "#
SPDX-FileCopyrightText" and/or "# SPDX-License-Identifier") so they reference
2026.
In `@tensorrt_llm/_torch/auto_deploy/llm_args.py`:
- Around line 458-467: The validator ensure_max_seq_len currently declares an
unused parameter named info which triggers the ARG lint rule; rename that
parameter to _info in the method signature of ensure_max_seq_len (the
`@field_validator`("max_seq_len", mode="before") classmethod) so it becomes
unused-by-convention and the linter warning is silenced, leaving the body
unchanged and preserving the return behavior that falls back to
AutoDeployConfig.model_fields["max_seq_len"].get_default(call_default_factory=True)
when value is None.
In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py`:
- Around line 785-802: The cache-refresh loop in CachedSequenceInterface only
checks for "k_cache" or "v_cache" and therefore misses new combined keys like
"kv_cache_*"; update the condition in the for-loop that iterates
self._caches.keys() to also detect combined kv names (e.g., check for "kv_cache"
or name.startswith("kv_cache_") or a substring match for "kv_cache") so that
entries created from self._cache_initializers are re-invoked and replaced
(preserving the existing logic that calls
self._cache_initializers[name](self.info), compares data_ptrs, increments
regenerated, and logs via ad_logger.info).
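For illustration, a minimal sketch of the broadened name check this prompt describes, written as a standalone helper; the attribute names (caches, cache_initializers, info) mirror the prompt above and are assumptions, not the actual implementation:

def refresh_kv_caches(caches, cache_initializers, info, logger):
    """Re-run initializers for any cache entry whose name refers to KV storage."""
    regenerated = 0
    for name in list(caches.keys()):
        # Match split k/v caches as well as combined "kv_cache_*" entries.
        if not any(tag in name for tag in ("k_cache", "v_cache", "kv_cache")):
            continue
        if name not in cache_initializers:
            continue
        new_cache = cache_initializers[name](info)
        if new_cache.data_ptr() != caches[name].data_ptr():
            caches[name] = new_cache
            regenerated += 1
    logger.info(f"Regenerated {regenerated} KV cache entries")
    return regenerated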
In `@tests/test_trtllm_attention_cuda_graph.py`:
- Around line 244-248: The three separate pytest skipif decorators cause
torch.cuda.get_device_capability() to be called at import time even on CPU-only
builds; update the decorators so the CUDA availability and device capability
checks are combined into a single skipif using short-circuit logic (e.g. combine
torch.cuda.is_available() and torch.cuda.get_device_capability()[0] < 8 into one
condition) while keeping the HAS_PT_CACHE_BACKEND check as its own decorator;
ensure the combined decorator uses a clear reason like "CUDA graphs require SM
8.0+ or CUDA not available" so get_device_capability() is only invoked when CUDA
is available.
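A sketch of the combined decorator described above; HAS_PT_CACHE_BACKEND is a placeholder here, whereas the real test module derives it from an import guard:

import pytest
import torch

HAS_PT_CACHE_BACKEND = True  # placeholder; real value comes from an import guard

# Short-circuit evaluation: get_device_capability() is only reached when CUDA
# is actually available, so CPU-only builds no longer fail at import time.
@pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.get_device_capability()[0] < 8,
    reason="CUDA graphs require SM 8.0+ or CUDA not available",
)
@pytest.mark.skipif(not HAS_PT_CACHE_BACKEND, reason="PTCacheBackend not available")
def test_cuda_graph_capture():
    """Body elided; the decorators are the point of this sketch."""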
🧹 Nitpick comments (23)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
32-44: Consider annotating mutable class attribute with ClassVar.
The ATTN_BACKEND_CONFIGS dictionary is a mutable class attribute. While this works correctly, Python best practices and type checkers recommend annotating it with typing.ClassVar to explicitly indicate it's a class-level attribute not meant to be instance-specific.
💡 Suggested fix
+from typing import ClassVar, Dict, Any
+
 class TestLlama3_1_8B(LlmapiAccuracyTestHarness):
     MODEL_NAME = "meta-llama/Llama-3.1-8B"
     MODEL_PATH = hf_id_to_local_model_dir(MODEL_NAME)

     # Configuration presets for different attention backends
-    ATTN_BACKEND_CONFIGS = {
+    ATTN_BACKEND_CONFIGS: ClassVar[Dict[str, Dict[str, Any]]] = {
         "flashinfer": {

tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/attention_with_kv_cache.py (1)
1-1: Consider adding NVIDIA copyright header.
This source file appears to be missing the standard NVIDIA copyright header that the coding guidelines require for all TensorRT-LLM source files (.py, .cpp, .cu, etc.).

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py (1)

23-52: Unused function parameters k and v.
The k and v parameters are passed to this function but never used - the reference computation extracts values directly from kv_cache. This appears intentional since the custom op appends k/v to the cache before this reference function is called, but the unused parameters could be removed for clarity.
💡 Suggested fix
 def _attention_with_fp8_kv_cache(
-    q, k, v, kv_cache, k_scale, v_scale, prefill_seq_len, causal, mask
+    q, kv_cache, k_scale, v_scale, prefill_seq_len, causal, mask
 ):
     """Simulates attention for fp8 kv cache with q,k,v outputs of GEMM in fp16"""
-    batch_size, seq_len, _ = k.shape
+    batch_size, seq_len, _ = q.shape
Note: This would also require updating the caller at line 786-787.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
12-17: Use module-qualified imports per style guide.
The new from typing import ... and from pydantic import ... imports drop the module namespace. Please switch to module-qualified imports to align with the repo guideline (e.g., import typing, import pydantic and reference typing.Dict, pydantic.BaseModel, etc.).
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

539-546: Initialize KV cache pool members in __init__, not at class scope.
_kv_cache_pool_pointers / _kv_cache_pool_mapping are class attributes; this risks shared state across instances and violates the constructor-initialization guideline. Set them to None in __init__ and keep only type hints at class scope.
As per coding guidelines: Initialize all externally visible members of a Python class in the constructor.
Suggested fix
 class SequenceInfo:
     def __init__(...):
         ...
+        self._kv_cache_pool_pointers: Optional[torch.Tensor] = None
+        self._kv_cache_pool_mapping: Optional[torch.Tensor] = None

-    _kv_cache_pool_pointers: Optional[torch.Tensor] = None
-    _kv_cache_pool_mapping: Optional[torch.Tensor] = None
Also applies to: 1060-1089
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py (2)
29-31: Use module-qualified imports per style guide.
Please switch to module-qualified imports (e.g., import abc, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

151-240: Rename unused layer_idx to avoid lint noise.
_allocate_layer_cache doesn't use layer_idx. Consider renaming it to _layer_idx (or removing it) to silence the lint warning.
Suggested fix
-    def _allocate_layer_cache(self, layer_idx: int) -> Dict[str, torch.Tensor]:
+    def _allocate_layer_cache(self, _layer_idx: int) -> Dict[str, torch.Tensor]:

tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (1)

42-44: Use module-qualified imports per style guide.
Please switch to module-qualified imports (e.g., import dataclasses, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py (2)
46-47: Use module-qualified imports per style guide.
Please switch to module-qualified imports (e.g., import dataclasses, import typing) instead of from ... import ....
As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

471-472: Rename module globals to follow the G_ prefix rule.
_global_state and _trtllm_config are module-level globals and should use the G_ prefix (e.g., G_TRTLLM_GLOBAL_STATE, G_TRTLLM_CONFIG).
As per coding guidelines: Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL).
Also applies to: 1231-1232

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)

329-331: Consider renaming cache_loc to slot_idx in fused_mla_ref for consistency.
The fused_mla_ref function (lines 256-386) still uses cache_loc as a parameter name (line 264) and passes it to update_kv_cache. While functionally correct, this inconsistency could cause confusion since update_kv_cache now expects slot_idx. The same applies to the fake registration at lines 397-398 and usages at lines 337-338.

1-9: Missing NVIDIA copyright header.
This file should contain the NVIDIA copyright header as required by the coding guidelines for all source files.
Proposed fix
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Torch reference implementations for attention."""
As per coding guidelines: All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_update_kv_cache.py (1)
1-5: Missing NVIDIA copyright header.
This test file should contain the NVIDIA copyright header as required by the coding guidelines.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_attention_with_kv_cache.py (1)

1-6: Missing NVIDIA copyright header.
This test file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)

1-11: Missing NVIDIA copyright header.
This file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (1)

1-10: Missing NVIDIA copyright header.
This file should contain the NVIDIA copyright header as required by the coding guidelines.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (1)

1-1: Update copyright year to include 2026.
The copyright header shows years 2022-2025, but this file has been meaningfully modified. Per coding guidelines, the year should be updated to reflect the latest meaningful modification.
Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)
10-11: Comment on line 10 is misleading.
The comment "Initialize resources first" doesn't accurately describe what this import does. The import simply makes KVPagedResourceHandler available for use in tests below - it doesn't initialize any resources.
📝 Suggested comment fix
-# Initialize resources first (KVPagedResourceHandler is used within tests below)
+# Import KVPagedResourceHandler for paged KV cache resource tests
 from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import KVPagedResourceHandler

295-297: Redundant import - KVPagedResourceHandler already imported at module level.
This import is unnecessary since KVPagedResourceHandler is already imported at line 11.
♻️ Remove redundant import
     # Add a resource to verify initialize_resources is called
-    from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import (
-        KVPagedResourceHandler,
-    )
-
     dummy_cached_interface.add_resource(

test_trtllm_attention.py (2)
1-6: Consider moving test file to the tests directory.
This standalone test script is at the repository root. For consistency with the project structure, consider moving it to tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_attention.py or a similar location.

155-165: Prefix unused variable with underscore.
The unpacked variable host_kv_cache_pool_mapping is never used. Prefix it with an underscore to indicate it's intentionally ignored.
📝 Fix unused variable
 (
     sequence_length,
     host_past_key_value_lengths,
     host_total_kv_lens,
     context_lengths,
     host_context_lengths,
     host_request_types,
     kv_cache_block_offsets,
     host_kv_cache_pool_pointers,
-    host_kv_cache_pool_mapping,
+    _host_kv_cache_pool_mapping,
 ) = result

tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (1)
120-140: Unused variable num_prefill_tokens should be prefixed or removed.
Line 137 unpacks num_prefill_tokens from q.shape but it's never used in the function. The function uses len(input_pos) for num_prefill instead.
📝 Fix unused variable
 def _prefill_attention(
     ...
 ) -> None:
     """Handle prefill phase - context attention with variable sequence lengths."""
     # NOTE: num_prefill_tokens == sum(seq_len)
-    num_prefill_tokens, n_heads, q_d_head = q.shape
+    _, n_heads, q_d_head = q.shape
     max_cache_seq_len, n_kv_heads = k_cache.shape[1:3]

tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)

529-571: Replace debug print statements with logger calls
Raw prints in library code will spam stdout and are hard to control in CI and CUDA-graph capture. Prefer ad_logger.debug(...) or remove the statements entirely.
♻️ Suggested refactor (apply to the whole block)
- print("[DEBUG CachedSequenceInterface._init_kv_cache_manager]") - print( - f" hasattr kv_cache_pool_pointers: {hasattr(self._kv_cache_manager, 'kv_cache_pool_pointers')}" - ) + ad_logger.debug("[CachedSequenceInterface] init_kv_cache_manager") + ad_logger.debug( + " hasattr kv_cache_pool_pointers: %s", + hasattr(self._kv_cache_manager, "kv_cache_pool_pointers"), + ) if hasattr(self._kv_cache_manager, "kv_cache_pool_pointers"): pool_ptrs = self._kv_cache_manager.kv_cache_pool_pointers pool_map = self._kv_cache_manager.kv_cache_pool_mapping - print(f" kv_cache_pool_pointers: {pool_ptrs}") - print( - f" kv_cache_pool_mapping.shape: {pool_map.shape if pool_map is not None else None}" - ) + ad_logger.debug(" kv_cache_pool_pointers: %s", pool_ptrs) + ad_logger.debug( + " kv_cache_pool_mapping.shape: %s", + pool_map.shape if pool_map is not None else None, + ) self.info.set_kv_cache_pool_info(pool_ptrs, pool_map) - print(" Set pool info on SequenceInfo") - print(f" self.info.kv_cache_pool_pointers: {self.info.kv_cache_pool_pointers}") + ad_logger.debug(" Set pool info on SequenceInfo") + ad_logger.debug( + " self.info.kv_cache_pool_pointers: %s", + self.info.kv_cache_pool_pointers, + ) try: from ..custom_ops.trtllm_attention import _trtllm_config - print(f" _trtllm_config.is_configured: {_trtllm_config.is_configured}") + ad_logger.debug( + " _trtllm_config.is_configured: %s", _trtllm_config.is_configured + ) if not _trtllm_config.is_configured: _trtllm_config.configure(self.info) - print(" Configured _trtllm_config with SequenceInfo") - print(f" _trtllm_config._sequence_info: {_trtllm_config._sequence_info}") + ad_logger.debug(" Configured _trtllm_config with SequenceInfo") + ad_logger.debug( + " _trtllm_config._sequence_info: %s", + _trtllm_config._sequence_info, + ) if _trtllm_config._num_layers == 0 and kv_ref is not None: num_kv_heads_list = [h.num_kv_heads for h in kv_managed.values()] _trtllm_config.set_model_config( num_layers=len(kv_managed), num_kv_heads_per_layer=num_kv_heads_list, head_dim=kv_ref.head_dim, dtype=kv_ref.dtype, ) - print( - f" Set model config: num_layers={len(kv_managed)}, " - f"dtype={kv_ref.dtype}, quant_mode={_trtllm_config._quant_mode}" - ) + ad_logger.debug( + " Set model config: num_layers=%s, dtype=%s, quant_mode=%s", + len(kv_managed), + kv_ref.dtype, + _trtllm_config._quant_mode, + ) except ImportError: - print(" TRT-LLM attention import failed") + ad_logger.debug(" TRT-LLM attention import failed") pass
class CacheConfig(BaseModel):
    """Cache configuration for attention-related dtypes."""

    model_config = ConfigDict(
        arbitrary_types_allowed=True,
        extra="forbid",
    )

    dtype: Optional[torch.dtype] = Field(default=None, description="KV cache dtype.")
    mamba_dtype: Optional[torch.dtype] = Field(default=None, description="Mamba cache dtype.")
    delta_dtype: Optional[torch.dtype] = Field(
        default=torch.float32, description="Delta cache dtype. Defaults to float32."
    )

    @field_validator("dtype", "mamba_dtype", "delta_dtype", mode="before")
    @classmethod
    def _coerce_dtype(cls, value):
        if value is None or isinstance(value, torch.dtype):
            return value
        if isinstance(value, str):
            dtype = getattr(torch, value, None)
            assert isinstance(dtype, torch.dtype), f"Invalid {dtype=}"
            return dtype
        return value

    def __or__(self, other: "CacheConfig") -> "CacheConfig":
        """Combine two CacheConfig objects field-wise using Python's `or` semantics."""
        if not isinstance(other, CacheConfig):
            raise NotImplementedError(f"Cannot combine CacheConfig with {type(other)}")
        merged_kwargs = {}
        for field_name in type(self).model_fields.keys():
            merged_kwargs[field_name] = getattr(self, field_name) or getattr(other, field_name)
        return CacheConfig(**merged_kwargs)
Avoid assert for user-facing dtype validation.
_coerce_dtype uses assert, which can be stripped in optimized runs and let invalid values through. Prefer an explicit exception.
Suggested fix
- if isinstance(value, str):
- dtype = getattr(torch, value, None)
- assert isinstance(dtype, torch.dtype), f"Invalid {dtype=}"
- return dtype
+ if isinstance(value, str):
+ dtype = getattr(torch, value, None)
+ if not isinstance(dtype, torch.dtype):
+ raise ValueError(f"Invalid dtype string: {value!r}")
+ return dtype🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 287 - 319, The field validator CacheConfig._coerce_dtype should not use
assert for validating string dtype names; instead, when value is a str look up
the attribute on torch (as currently done), and if the lookup yields None or a
non-torch.dtype raise a clear exception (e.g., ValueError) with the invalid
input included; update the validator in CacheConfig (decorated with
`@field_validator`("dtype", "mamba_dtype", "delta_dtype", mode="before")) to
replace the assert with an explicit raise that reports the offending value and
expected torch.dtype so invalid dtype strings are rejected reliably in all
runtime modes.
        # will store num_blocks later...
        self._num_blocks = None
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, check if the file exists and get its basic structure
file_path="tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py"
if [ -f "$file_path" ]; then
echo "=== File size ==="
wc -l "$file_path"
echo ""
echo "=== Lines 430-440 (around line 434) ==="
sed -n '430,440p' "$file_path"
echo ""
echo "=== Lines 610-640 (around lines 617-636) ==="
sed -n '610,640p' "$file_path"
else
echo "File not found"
fiRepository: NVIDIA/TensorRT-LLM
Length of output: 2110
🏁 Script executed:
#!/bin/bash
# Find SequenceInfo class definition and num_blocks property
rg -n "class SequenceInfo" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 5
echo "---"
rg -n "def num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -B 2 -A 5
echo "---"
rg -n "@property" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 3 | grep -A 3 "num_blocks"Repository: NVIDIA/TensorRT-LLM
Length of output: 1047
🏁 Script executed:
#!/bin/bash
# Find all usages of num_blocks and estimate_cache_loc_capacity
echo "=== Usages of num_blocks ==="
rg -n "\.num_blocks" -g'*.py'
echo ""
echo "=== Calls to estimate_cache_loc_capacity ==="
rg -n "estimate_cache_loc_capacity" -g'*.py'Repository: NVIDIA/TensorRT-LLM
Length of output: 9127
🏁 Script executed:
#!/bin/bash
# Check the initialization sequence in interface.py around line 525
echo "=== Interface.py context (line 520-535) ==="
sed -n '520,535p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Lines 1165-1180 in attention_interface.py ==="
sed -n '1165,1180p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Lines 1215-1225 in trtllm_attention.py ==="
sed -n '1215,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 2011
🏁 Script executed:
#!/bin/bash
echo "=== cache_backend.py context (line 190-200) ==="
sed -n '190,200p' tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py
echo ""
echo "=== pt_cache_backend.py context (line 220-240) ==="
sed -n '220,240p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== pt_cache_backend.py context (line 330-345) ==="
sed -n '330,345p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 2773
🏁 Script executed:
#!/bin/bash
echo "=== Test case (test_resource_handlers.py lines 50-60) ==="
sed -n '50,60p' tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_resource_handlers.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 481
🏁 Script executed:
#!/bin/bash
echo "=== Looking for SequenceInfo instantiation patterns ==="
rg -n "SequenceInfo(" -B 2 -A 5 tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | head -40
echo ""
echo "=== Check the full __init__ method of SequenceInfo ==="
sed -n '322,450p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 7826
🏁 Script executed:
#!/bin/bash
echo "=== Check trtllm_attention.py to see initialize call context ==="
sed -n '1200,1230p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py
echo ""
echo "=== Check the full initialize method in pt_cache_backend ==="
sed -n '200,230p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 2668
🏁 Script executed:
#!/bin/bash
echo "=== Checking the estimate_cache_tokens_per_forward method ==="
sed -n '622,631p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Full context of lines 617-645 ==="
sed -n '617,645p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 2142
🏁 Script executed:
#!/bin/bash
echo "=== Check where _setup_pt_cache_backend is called ==="
rg -n "_setup_pt_cache_backend" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 3 -A 1
echo ""
echo "=== Check the full trtllm_attention initialization flow ==="
sed -n '1185,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py
echo ""
echo "=== Check if there are any early accesses before estimate_cache_loc_capacity ==="
rg -n "\.num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 5Repository: NVIDIA/TensorRT-LLM
Length of output: 2848
🏁 Script executed:
#!/bin/bash
echo "=== Search for where initialize() is called on cache backends ==="
rg -n "\.initialize\(" tensorrt_llm/_torch/auto_deploy/custom_ops/ -B 3 -A 1Repository: NVIDIA/TensorRT-LLM
Length of output: 1146
🏁 Script executed:
#!/bin/bash
echo "=== Check CachedSequenceInterface to see full initialization flow ==="
sed -n '450,550p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Look for where set_model_config is called ==="
rg -n "set_model_config" tensorrt_llm/_torch/auto_deploy/ -B 2 -A 2 | head -50Repository: NVIDIA/TensorRT-LLM
Length of output: 7632
🏁 Script executed:
#!/bin/bash
echo "=== Check setup_pt_cache_backend context in trtllm_attention ==="
sed -n '1150,1230p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 3421
🏁 Script executed:
#!/bin/bash
echo "=== Check allocate() method that accesses num_blocks ==="
sed -n '1160,1185p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Check cache_backend.py initialize() that accesses num_blocks ==="
sed -n '188,210p' tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py
echo ""
echo "=== Check when cache backends' initialize is called in the flow ==="
rg -n "_allocate_unmanaged_resources\|handler.allocate" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 2 -A 2Repository: NVIDIA/TensorRT-LLM
Length of output: 2131
🏁 Script executed:
#!/bin/bash
echo "=== Search for all SequenceInfo creations ==="
rg -n "SequenceInfo(" --type py -B 1 -A 3 | head -80Repository: NVIDIA/TensorRT-LLM
Length of output: 202
🏁 Script executed:
#!/bin/bash
echo "=== Check _allocate_unmanaged_resources in detail ==="
sed -n '475,515p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Check the order of calls in __init__ ==="
rg -n "def __init__" tensorrt_llm/_torch/auto_deploy/shim/interface.py -A 50 | head -70Repository: NVIDIA/TensorRT-LLM
Length of output: 5344
🏁 Script executed:
#!/bin/bash
echo "=== Check KVPagedResourceHandler.allocate method ==="
rg -n "class KVPagedResourceHandler" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 100 | grep -A 50 "def allocate"Repository: NVIDIA/TensorRT-LLM
Length of output: 2409
🏁 Script executed:
#!/bin/bash
echo "=== Full _create_kv_cache_manager method ==="
sed -n '403,530p' tensorrt_llm/_torch/auto_deploy/shim/interface.py | head -100Repository: NVIDIA/TensorRT-LLM
Length of output: 4804
🏁 Script executed:
#!/bin/bash
echo "=== Where is _allocate_unmanaged_resources called ==="
rg -n "_allocate_unmanaged_resources" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 5 -A 2
echo ""
echo "=== Where is _create_kv_cache_manager called ==="
rg -n "_create_kv_cache_manager" tensorrt_llm/_torch/auto_deploy/shim/interface.py -B 3 -A 3Repository: NVIDIA/TensorRT-LLM
Length of output: 2906
🏁 Script executed:
#!/bin/bash
echo "=== Check if allocate can be called before estimate_cache_loc_capacity ==="
rg -n "allocate\|estimate_cache_loc_capacity" tensorrt_llm/_torch/auto_deploy/shim/interface.py | head -20Repository: NVIDIA/TensorRT-LLM
Length of output: 137
🏁 Script executed:
#!/bin/bash
echo "=== Check if TRT-LLM flow is independent ==="
sed -n '1150,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | grep -n "num_blocks\|initialize"Repository: NVIDIA/TensorRT-LLM
Length of output: 759
🏁 Script executed:
#!/bin/bash
echo "=== Check the complete sequence in interface.py from line 640-660 ==="
sed -n '640,660p' tensorrt_llm/_torch/auto_deploy/shim/interface.py
echo ""
echo "=== Look for where estimate_cache_loc_capacity is called ==="
sed -n '520,530p' tensorrt_llm/_torch/auto_deploy/shim/interface.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1760
🏁 Script executed:
#!/bin/bash
echo "=== Check pt_cache_backend.initialize at line 225 ==="
sed -n '220,230p' tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== Check if there's any early initialization ==="
rg -n "\.initialize\(" tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py
echo ""
echo "=== Check the test to see the expected order ==="
sed -n '50,75p' tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_resource_handlers.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1556
🏁 Script executed:
#!/bin/bash
echo "=== Check where _setup_pt_cache_backend is called ==="
rg -n "_setup_pt_cache_backend\|setup_pt_cache_backend" -g'*.py'
echo ""
echo "=== Check if there are independent calls ==="
rg -n "\.pt_cache_backend" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | head -20Repository: NVIDIA/TensorRT-LLM
Length of output: 714
🏁 Script executed:
#!/bin/bash
echo "=== Check the property that raises the assertion ==="
sed -n '617,621p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
echo ""
echo "=== Check what happens during initialization without estimate_cache_loc_capacity ==="
python3 << 'PYTHON_CODE'
# Simulate the scenario
class SequenceInfo:
def __init__(self, max_seq_len, max_batch_size, tokens_per_block=None, max_num_tokens=None):
self.max_seq_len = max_seq_len
self.max_batch_size = max_batch_size
self.tokens_per_block = tokens_per_block or max_seq_len
self.max_num_tokens = max_num_tokens or (max_seq_len + 1) * max_batch_size
self._num_blocks = None # Currently None
`@property`
def num_blocks(self):
assert self._num_blocks is not None, "num_blocks not set yet"
return self._num_blocks
def estimate_cache_loc_capacity(self, num_blocks):
self._num_blocks = num_blocks
# Test: can we derive _num_blocks in __init__?
import math
max_num_tokens = 256 * 4 # 256 seq_len, 4 batch
tokens_per_block = 256
derived_blocks = math.ceil(max_num_tokens / tokens_per_block)
print(f"max_num_tokens={max_num_tokens}, tokens_per_block={tokens_per_block}")
print(f"Derived _num_blocks would be: {derived_blocks}")
print(f"This is a reasonable default until estimate_cache_loc_capacity() is called")
PYTHON_CODERepository: NVIDIA/TensorRT-LLM
Length of output: 500
🏁 Script executed:
#!/bin/bash
echo "=== Look for test cases that might hit the assertion ==="
find tests -name "*.py" -type f -exec grep -l "num_blocks\|SequenceInfo" {} \; | head -10 | xargs -I {} sh -c 'echo "=== {} ===" && grep -n "num_blocks\|estimate_cache_loc_capacity" {}'Repository: NVIDIA/TensorRT-LLM
Length of output: 3960
🏁 Script executed:
#!/bin/bash
echo "=== Check for potential unguarded accesses ==="
rg -n "sequence_info\.num_blocks|si\.num_blocks" tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py -B 2 -A 1
echo ""
echo "=== Verify the actual initialization order in the TRT-LLM flow ==="
sed -n '1150,1225p' tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py | head -80Repository: NVIDIA/TensorRT-LLM
Length of output: 3908
Initialize _num_blocks with a computed default in __init__ instead of None.
SequenceInfo.num_blocks asserts _num_blocks is set, but it's only assigned in estimate_cache_loc_capacity(). Code paths that create PTCacheBackend and call initialize() directly (e.g., trtllm_attention.py line 1216) will hit this assertion before estimate_cache_loc_capacity() is called.
Initialize _num_blocks in __init__ using a reasonable estimate based on available parameters (e.g., ceil(max_num_tokens / tokens_per_block)), which will later be refined by estimate_cache_loc_capacity().
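A minimal sketch of that default, assuming the constructor has max_num_tokens and tokens_per_block available; names follow the discussion above rather than the exact signature:

import math

class SequenceInfo:
    def __init__(self, max_seq_len: int, max_batch_size: int, tokens_per_block: int = 64):
        self.max_seq_len = max_seq_len
        self.max_batch_size = max_batch_size
        self.tokens_per_block = tokens_per_block
        self.max_num_tokens = (max_seq_len + 1) * max_batch_size
        # Conservative default so num_blocks is usable before
        # estimate_cache_loc_capacity() refines it.
        self._num_blocks = math.ceil(self.max_num_tokens / tokens_per_block)

    @property
    def num_blocks(self) -> int:
        return self._num_blocks

    def estimate_cache_loc_capacity(self, num_blocks: int) -> None:
        # Overwrites the constructor estimate with the accurate value.
        self._num_blocks = num_blocks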
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 434 - 436, SequenceInfo currently leaves self._num_blocks as None (set in
__init__), causing an assertion in SequenceInfo.num_blocks when code paths
(e.g., PTCacheBackend -> initialize()) access it before
estimate_cache_loc_capacity() runs; fix by initializing self._num_blocks in
__init__ to a sensible default (for example use math.ceil(max_num_tokens /
tokens_per_block) or another conservative estimate derived from constructor args
like max_num_tokens and tokens_per_block) so num_blocks is valid immediately and
later refined inside estimate_cache_loc_capacity(); update __init__ where
self._num_blocks is set and leave estimate_cache_loc_capacity() to overwrite
with the accurate value.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
Update copyright year to 2026.
This file is newly added/modified in 2026 but the header still reads 2025.
Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py` around lines 1 -
2, Update the copyright header in
tensorrt_llm/_torch/auto_deploy/custom_ops/cache_backend.py by changing the year
from 2025 to 2026 in the SPDX header lines (the two top-of-file lines starting
with "# SPDX-FileCopyrightText" and "# SPDX-License-Identifier") so the header
reflects the latest modification year.
     # CACHES - combined KV cache with shape [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]
     kv_cache: torch.Tensor,
     # CONSTANTS
     scale: Optional[float],
     k_scale: float,
     v_scale: float,
 ) -> torch.Tensor:
-    # reshape to standard [b*s, n_heads, head_dim] layout
-    head_dim = k_cache.shape[-1]
+    # kv_cache shape: [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim] (HND layout)
+    head_dim = kv_cache.shape[-1]
+    page_size = kv_cache.shape[3]  # tokens_per_block
     q_shape_og = q.shape
     b, s = q_shape_og[:2]
|
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the file in question
find . -name "flashinfer_attention.py" -type f | head -20Repository: NVIDIA/TensorRT-LLM
Length of output: 131
🏁 Script executed:
# Also search for _GlobalFlashInferPlanner to understand how kv_layout is managed
rg "_GlobalFlashInferPlanner" -A 5 -B 5Repository: NVIDIA/TensorRT-LLM
Length of output: 22373
🏁 Script executed:
# Search for kv_layout references in the codebase
rg "kv_layout" -t py | head -30Repository: NVIDIA/TensorRT-LLM
Length of output: 4197
🏁 Script executed:
# Look for HND and NHD references to understand these layouts
rg "HND|NHD" -t py | head -40Repository: NVIDIA/TensorRT-LLM
Length of output: 5937
🏁 Script executed:
# Read the flashinfer_attention.py file around lines 321-333
head -n 340 tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py | tail -n +310Repository: NVIDIA/TensorRT-LLM
Length of output: 1228
🏁 Script executed:
# Look for the _FlashInferPlanner class definition and kv_layout attribute
rg "class _FlashInferPlanner" -A 30Repository: NVIDIA/TensorRT-LLM
Length of output: 3350
🏁 Script executed:
# Check how kv_layout is set in the planner
rg "kv_layout\s*=" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py -B 2 -A 2Repository: NVIDIA/TensorRT-LLM
Length of output: 419
🏁 Script executed:
# Look for any existing layout-aware page_size calculations
rg "page_size|tokens_per_block" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py -B 2 -A 2Repository: NVIDIA/TensorRT-LLM
Length of output: 1704
🏁 Script executed:
# Look for NHD layout shape in tests
rg -A 10 "NHD layout" tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py | head -40Repository: NVIDIA/TensorRT-LLM
Length of output: 1651
🏁 Script executed:
# Look for how KV cache is created in NHD layout tests
rg "kv_layout.*NHD" tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/ -B 5 -A 15 | head -60Repository: NVIDIA/TensorRT-LLM
Length of output: 7402
🏁 Script executed:
# Check the KVPagedResourceHandler to understand layout shapes
rg "class KVPagedResourceHandler" -A 50 | head -100Repository: NVIDIA/TensorRT-LLM
Length of output: 5408
🏁 Script executed:
# Look at the allocate method to see NHD shape
rg "def allocate" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 30 | grep -A 30 "KVPagedResourceHandler" | head -50Repository: NVIDIA/TensorRT-LLM
Length of output: 1922
🏁 Script executed:
# Search for NHD shape construction
rg "NHD" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -B 5 -A 5Repository: NVIDIA/TensorRT-LLM
Length of output: 1941
🏁 Script executed:
# Check the function signature and if it can receive NHD layout
rg "flashinfer_attention_prepare_metadata" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py -B 5 -A 40 | head -60Repository: NVIDIA/TensorRT-LLM
Length of output: 2304
🏁 Script executed:
# Check where the kv_cache comes from and what layout it has
rg "flashinfer_attention_prepare_metadata" tests/ -B 10 -A 10 | head -80Repository: NVIDIA/TensorRT-LLM
Length of output: 45
🏁 Script executed:
# Look at the full test to see if NHD is ever used
rg "kv_layout.*=" tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py | head -20Repository: NVIDIA/TensorRT-LLM
Length of output: 45
🏁 Script executed:
# Look for where flashinfer_attention is called and what kv_cache layout is used
rg "@torch.library.custom_op.*flashinfer_attention\"" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py -A 2Repository: NVIDIA/TensorRT-LLM
Length of output: 45
🏁 Script executed:
# Find the actual function that has the page_size issue at lines 321-333
sed -n '315,340p' tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1061
🏁 Script executed:
# Check what layout is always used in this file
rg "_GlobalFlashInferPlanner.kv_layout" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 178
🏁 Script executed:
# Check if kv_layout can ever be NHD for this operation
rg "kv_layout.*=" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py | head -10Repository: NVIDIA/TensorRT-LLM
Length of output: 223
🏁 Script executed:
# Let me check the entire flashinfer_attention_wrapper function definition
rg "@torch.library.custom_op.*flashinfer_attention_wrapper" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py -A 50 | head -80Repository: NVIDIA/TensorRT-LLM
Length of output: 45
🏁 Script executed:
# Check if tests ever use NHD layout with flashinfer_attention_wrapper
rg "flashinfer_attention_wrapper|flashinfer_attention\(" tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py -B 5 -A 5 | head -100Repository: NVIDIA/TensorRT-LLM
Length of output: 45
🏁 Script executed:
# Verify the exact tensor shape for NHD by looking at the test allocation
sed -n '1,100p' tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py | tail -50Repository: NVIDIA/TensorRT-LLM
Length of output: 2021
🏁 Script executed:
# Check what function name contains lines 321-333
sed -n '300,333p' tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py | head -35Repository: NVIDIA/TensorRT-LLM
Length of output: 1152
🏁 Script executed:
# Search for where this function is called to see what layout is passed
rg "flashinfer_attention_wrapper\(" --type py -B 3 -A 3 | head -50Repository: NVIDIA/TensorRT-LLM
Length of output: 45
🏁 Script executed:
# Check the comment at the top of flashinfer_attention.py about layout
head -50 tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1393
🏁 Script executed:
# Check if kv_layout can be changed from default HND
rg "kv_layout\s*=" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py -B 2 -A 2Repository: NVIDIA/TensorRT-LLM
Length of output: 419
🏁 Script executed:
# Check if this file is actually used with NHD layout
rg "flashinfer_mha_with_cache" tests/ --type py -B 5 -A 5 | head -40Repository: NVIDIA/TensorRT-LLM
Length of output: 45
🏁 Script executed:
# Search for where FlashInferAttention is instantiated to see if NHD is ever set
rg "FlashInferAttention\|_GlobalFlashInferPlanner" tests/ -B 3 -A 3 | head -60Repository: NVIDIA/TensorRT-LLM
Length of output: 45
🏁 Script executed:
# Check if there's a way to set kv_layout to NHD for the planner
rg "kv_layout.*NHD" tensorrt_llm/_torch/ --type pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1237
🏁 Script executed:
# Check if flashinfer_mha_with_cache can actually receive NHD layout
rg "flashinfer_mha_with_cache" tensorrt_llm/ --type py -B 5 -A 5Repository: NVIDIA/TensorRT-LLM
Length of output: 2157
🏁 Script executed:
# Look at how kv_cache shape is checked or asserted
rg "kv_cache.shape" tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py -B 2 -A 2Repository: NVIDIA/TensorRT-LLM
Length of output: 333
🏁 Script executed:
# Verify the exact NHD shape one more time from the allocate function
sed -n '180,240p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 2384
🏁 Script executed:
# Let me get the full NHD allocation to triple-check dimensions
sed -n '228,250p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 899
🏁 Script executed:
# Check if there's any test that uses flashinfer_mha_with_cache with NHD
grep -r "flashinfer_mha_with_cache\|flashinfer_attention_mha_with_cache" tests/ --include="*.py"Repository: NVIDIA/TensorRT-LLM
Length of output: 1470
🏁 Script executed:
# Let me get the complete NHD allocation code
sed -n '210,235p' tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1170
🏁 Script executed:
# Find the allocate method in KVPagedResourceHandler
grep -n "def allocate" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 358
🏁 Script executed:
# Get the allocate method with correct line numbers
rg "def allocate.*sequence_info" tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py -A 40Repository: NVIDIA/TensorRT-LLM
Length of output: 5927
Derive page_size from kv_layout to keep NHD support correct
The page_size calculation at line 331 uses kv_cache.shape[3], which is correct only for HND layout [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]. For NHD layout [num_blocks, tokens_per_block, 2, num_kv_heads, head_dim], tokens_per_block is at dimension 1, not dimension 3, so the plans will be created with an incorrect page size if NHD is ever used.
- # kv_cache shape: [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim] (HND layout)
+ # kv_cache shape depends on kv_layout (HND or NHD)
head_dim = kv_cache.shape[-1]
- page_size = kv_cache.shape[3] # tokens_per_block
+ if _GlobalFlashInferPlanner.kv_layout == "HND":
+ page_size = kv_cache.shape[3] # tokens_per_block
+ else: # NHD
+ page_size = kv_cache.shape[1] # tokens_per_block🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py` around
lines 321 - 333, The page_size is being taken as kv_cache.shape[3] which assumes
HND layout; update the logic that computes page_size in the function (where
kv_cache, kv_layout and page_size are used) to derive tokens_per_block based on
kv_layout: if kv_layout indicates HND use shape[3], if it indicates NHD use
shape[1]; replace the hardcoded index with this conditional access so plan
creation and downstream uses of page_size are correct for both layouts.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
Update copyright year to 2026.
Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py` around lines
1 - 2, Update the SPDX header year from 2025 to 2026 at the top of the file;
specifically edit the copyright header lines in
tensorrt_llm/_torch/auto_deploy/custom_ops/pt_cache_backend.py (the
SPDX-FileCopyrightText and/or SPDX-License-Identifier header block) to reflect
2026 as the latest meaningful modification year.
@dataclass
class TrtllmLayerState:
    """Per-layer state for TRT-LLM attention wrapper."""

    layer_idx: int
    num_heads: int
    num_kv_heads: int
    head_dim: int
    tokens_per_block: int
    max_num_requests: int
    max_context_length: int
    num_layers: int = 0  # Total number of layers for block offset calculation

    # Pre-allocated tensors for metadata translation
    # Device tensors
    sequence_length: torch.Tensor = field(default=None)
    context_lengths: torch.Tensor = field(default=None)
    kv_cache_block_offsets: torch.Tensor = field(default=None)

    # Host tensors (pinned for async H2D)
    host_past_key_value_lengths: torch.Tensor = field(default=None)
    host_context_lengths: torch.Tensor = field(default=None)
    host_request_types: torch.Tensor = field(default=None)
    host_total_kv_lens: torch.Tensor = field(default=None)
    host_kv_cache_pool_pointers: torch.Tensor = field(default=None)
    host_kv_cache_pool_mapping: torch.Tensor = field(default=None)

    # Interleaved KV cache buffer for kernel (allocated lazily)
    interleaved_kv_cache: torch.Tensor = field(default=None)

    def __post_init__(self):
        """Allocate pre-sized tensors."""
        if self.sequence_length is None:
            device = "cuda"

            # Device tensors
            self.sequence_length = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device=device
            )
            self.context_lengths = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device=device
            )

            # Pre-allocate kv_cache_block_offsets with MAX size for CUDA graph stability
            max_blocks_per_seq = (
                self.max_context_length + self.tokens_per_block - 1
            ) // self.tokens_per_block
            self.kv_cache_block_offsets = torch.zeros(
                1,  # num_pools
                self.max_num_requests,
                2,  # K and V
                max_blocks_per_seq,
                dtype=torch.int32,
                device=device,
            )

            # Host tensors (pinned memory for async transfers)
            self.host_past_key_value_lengths = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
            )
            self.host_context_lengths = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
            )
            self.host_request_types = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device="cpu", pin_memory=True
            )
            self.host_total_kv_lens = torch.zeros(
                2, dtype=torch.int64, device="cpu", pin_memory=True
            )
            # Pool pointers: [num_pools, 2] where each row is [k_cache_ptr, v_cache_ptr]
            # thop.attention expects 2D tensor: [num_pools, 2]
            self.host_kv_cache_pool_pointers = torch.zeros(
                1, 2, dtype=torch.int64, device="cpu", pin_memory=True
            )
            # Pool mapping: 2D [num_layers, 2] format expected by thop.attention
            # pool_mapping[layer, 0] = pool_idx (0 for single pool)
            # pool_mapping[layer, 1] = layer_offset (0 when using per-layer pointers)
            # Use max 256 layers to cover most models
            max_layers = 256
            self.host_kv_cache_pool_mapping = torch.zeros(
                max_layers, 2, dtype=torch.int32, device="cpu", pin_memory=True
            )
TrtllmLayerState hard-codes device to "cuda".
This will allocate on the default CUDA device even if the model runs on a different GPU (or CPU in tests). Please pass the target device into TrtllmLayerState (e.g., via get_or_create_layer_state using kv_cache.device or SequenceInfo.device) and allocate on that device instead.
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py` around lines
222 - 305, TrtllmLayerState currently hardcodes device="cuda" in __post_init__,
causing allocations on the wrong GPU; add a device field to the TrtllmLayerState
dataclass (e.g., device: torch.device) and use that field instead of the string
"cuda" when allocating device tensors in __post_init__, keeping host/pinned
tensors on cpu as before; update callers (e.g., get_or_create_layer_state) to
pass the correct device (kv_cache.device or SequenceInfo.device) when
constructing TrtllmLayerState so allocations follow the model/kv cache device.
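To make the suggestion concrete, here is a trimmed-down sketch (not the PR's actual dataclass; only one field is shown, and the device field name is an assumption) of threading the target device through instead of hard-coding "cuda":

from dataclasses import dataclass, field
import torch

@dataclass
class LayerStateSketch:
    max_num_requests: int
    # Take the target device from the caller (e.g., kv_cache.device) instead of hard-coding "cuda".
    device: torch.device = torch.device("cpu")
    sequence_length: torch.Tensor = field(default=None)

    def __post_init__(self):
        if self.sequence_length is None:
            self.sequence_length = torch.zeros(
                self.max_num_requests, dtype=torch.int32, device=self.device
            )

# Caller side: allocate on whatever device the KV cache lives on.
kv_cache = torch.zeros(1)  # stand-in; in the PR this would be the real cache tensor
state = LayerStateSketch(max_num_requests=8, device=kv_cache.device)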
def _prepare_trtllm_metadata(
    batch_info_host: torch.Tensor,
    cu_seqlen_host: torch.Tensor,
    cu_num_pages: torch.Tensor,
    cu_num_pages_host: torch.Tensor,
    cache_loc: torch.Tensor,
    last_page_len: torch.Tensor,
    last_page_len_host: torch.Tensor,
    seq_len_with_cache_host: torch.Tensor,
    state: TrtllmLayerState,
    kv_cache: torch.Tensor,
    ad_pool_pointers: Optional[torch.Tensor] = None,
    ad_pool_mapping: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, ...]:
    """Prepare TRT-LLM metadata from AD metadata.

    For CUDA graph support (like pt_cache_backend):
    - During capture: Set host tensors to MAX, skip device operations
    - Outside capture: Normal operation

    Args:
        batch_info_host: [num_prefill, num_prefill_tokens, num_decode]
        cu_seqlen_host: Cumulative sequence lengths [num_seq + 1]
        cu_num_pages: Cumulative page counts [num_seq + 1]
        cu_num_pages_host: Same as cu_num_pages but on host
        cache_loc: Flat page indices for all sequences
        last_page_len: Tokens in last page per sequence
        last_page_len_host: Same on host
        seq_len_with_cache_host: Total seq length including cached tokens
        state: Per-layer TRT-LLM state
        kv_cache: Unified KV cache tensor [num_blocks, kv_factor=2, num_kv_heads, tokens_per_block, head_dim]
        ad_pool_pointers: Optional AD pool pointers from KVCacheManager (shape: [num_pools, 2])
        ad_pool_mapping: Optional AD pool mapping from KVCacheManager (shape: [num_layers, 2])

    Returns:
        Tuple of tensors needed by thop.attention
    """
    num_prefill, num_prefill_tokens, num_decode = batch_info_host.tolist()
    num_seq = num_prefill + num_decode

    # Check if in CUDA graph capture mode
    is_capturing = torch.cuda.is_current_stream_capturing()

    # Compute input sequence lengths from cumulative sums
    input_seq_lens = (cu_seqlen_host[1 : num_seq + 1] - cu_seqlen_host[:num_seq]).int()
    seq_len_with_cache = seq_len_with_cache_host[:num_seq].int()
    past_kv_lens = seq_len_with_cache - input_seq_lens.cpu()

    # CUDA GRAPH FIX: Set host tensors to MAX during capture (like pt_cache_backend)
    if is_capturing:
        max_seq = state.max_context_length
        state.host_past_key_value_lengths[:num_seq].fill_(max_seq)
        state.host_context_lengths[:num_seq].fill_(max_seq)
        state.host_request_types[:num_seq].fill_(1)
        state.host_total_kv_lens[0] = 0
        state.host_total_kv_lens[1] = max_seq * num_seq
    else:
        # Normal operation: fill host tensors
        state.host_past_key_value_lengths[:num_seq].copy_(past_kv_lens)
        state.host_context_lengths[:num_seq].copy_(input_seq_lens.cpu())
        state.host_request_types[:num_prefill].fill_(0)
        state.host_request_types[num_prefill:num_seq].fill_(1)
        context_total_kv = seq_len_with_cache[:num_prefill].sum().item() if num_prefill > 0 else 0
        gen_total_kv = seq_len_with_cache[num_prefill:num_seq].sum().item() if num_decode > 0 else 0
        state.host_total_kv_lens[0] = context_total_kv
        state.host_total_kv_lens[1] = gen_total_kv

    # Device operations - skip during capture (like pt_cache_backend's skip_device_ops)
    if not is_capturing:
        # Sync before copy to catch any previous async errors
        torch.cuda.synchronize()

        # Copy to pre-allocated tensors
        state.sequence_length[:num_seq].copy_(seq_len_with_cache.cuda())
        state.context_lengths[:num_seq].copy_(input_seq_lens.cuda())

    # Validate kv_cache shape (safe during capture - no device ops)
    if len(kv_cache.shape) != 5 or kv_cache.shape[1] != 2:
        raise RuntimeError(
            f"Expected kv_cache shape [pages, 2, heads, tokens, dim], got {kv_cache.shape}"
        )

    num_layers = state.num_layers if state.num_layers > 0 else 32

    # Pool pointer and block offset setup - skip during capture (contains .item() calls)
    if not is_capturing:
        # Set up KV cache pool pointers
        use_ad_pool = (
            ad_pool_pointers is not None
            and ad_pool_mapping is not None
            and ad_pool_pointers.numel() > 0
            and ad_pool_pointers[0, 0].item() != 0
        )

        if not use_ad_pool:
            raise RuntimeError(
                f"AD pool not available. ad_pool_pointers={ad_pool_pointers}, "
                f"ad_pool_mapping={ad_pool_mapping}"
            )

        # Use AD's pool pointers directly
        state.host_kv_cache_pool_pointers[0, 0] = ad_pool_pointers[0, 0].item()
        state.host_kv_cache_pool_pointers[0, 1] = 0

        # Use AD's pool mapping directly
        for layer_i in range(min(num_layers, ad_pool_mapping.shape[0])):
            state.host_kv_cache_pool_mapping[layer_i, 0] = ad_pool_mapping[layer_i, 0].item()
            state.host_kv_cache_pool_mapping[layer_i, 1] = ad_pool_mapping[layer_i, 1].item()

        # Log pool setup for debugging (only once)
        if state.layer_idx == 0 and not hasattr(state, "_pool_logged"):
            state._pool_logged = True
            ad_logger.debug(
                f"[TRT-LLM Attention] Using AD pool directly: "
                f"pool_ptr={state.host_kv_cache_pool_pointers[0, 0]}"
            )

        # Block offsets: convert flat cache_loc to per-sequence block indices
        pages_per_seq = (cu_num_pages_host[1 : num_seq + 1] - cu_num_pages_host[:num_seq]).int()
        max_blocks = pages_per_seq.max().item() if num_seq > 0 else 1
        _global_state.set_max_blocks_per_seq(max_blocks)

        # kv_cache_block_offsets is pre-allocated in __post_init__, don't reallocate

        # Fill block offsets
        kv_factor = 2
        multiplier = num_layers * kv_factor
        state.kv_cache_block_offsets.zero_()
        offset = 0
        for i in range(num_seq):
            n_pages = pages_per_seq[i].item()
            if n_pages > 0:
                base_offsets = cache_loc[offset : offset + n_pages] * multiplier
                state.kv_cache_block_offsets[0, i, 0, :n_pages] = base_offsets
                state.kv_cache_block_offsets[0, i, 1, :n_pages] = base_offsets + 1
                offset += n_pages

    # Return tensors
    # Use pre-allocated tensor size for block offsets (CUDA graph compatibility)
    max_blocks_per_seq = state.kv_cache_block_offsets.shape[3]

    return (
        state.sequence_length[:num_seq],
        state.host_past_key_value_lengths[:num_seq],
        state.host_total_kv_lens,
        state.context_lengths[:num_seq],
        state.host_context_lengths[:num_seq],
        state.host_request_types[:num_seq],
        state.kv_cache_block_offsets[:, :num_seq, :, :max_blocks_per_seq],
        state.host_kv_cache_pool_pointers,
        state.host_kv_cache_pool_mapping,
    )
Fallback path still requires AD pool pointers.
_prepare_trtllm_metadata raises if ad_pool_pointers/ad_pool_mapping are missing, but the KV cache handler includes a fallback allocation path. If that fallback path is ever used (e.g., no KVCacheManager), this will crash at runtime. Consider failing fast earlier (during allocation) with a clearer error, or implement a non-pool-pointer metadata path for the fallback cache.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 478-478: Unused function argument: cu_num_pages
(ARG001)
[warning] 481-481: Unused function argument: last_page_len
(ARG001)
[warning] 482-482: Unused function argument: last_page_len_host
(ARG001)
[warning] 512-512: Unpacked variable num_prefill_tokens is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
[warning] 553-555: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 570-573: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/trtllm_attention.py` around lines
475 - 626, The function _prepare_trtllm_metadata currently raises if
ad_pool_pointers/ad_pool_mapping are missing which breaks the KV cache handler's
fallback allocation path; change the logic so when use_ad_pool is False you do
not raise but instead initialize predictable defaults for pool pointers/mapping
(e.g., zeros or sentinel values) and proceed to compute kv_cache_block_offsets
from cache_loc/pages_per_seq for the fallback cache layout (ensure
state.host_kv_cache_pool_pointers and state.host_kv_cache_pool_mapping are set
to valid defaults and any logging reflects fallback use), or alternatively move
the RuntimeError to an earlier allocation phase so missing AD pool pointers fail
fast during allocation rather than here (update checks around use_ad_pool,
state.host_kv_cache_pool_pointers, state.host_kv_cache_pool_mapping, and the
block-offset filling loop accordingly).
@field_validator("max_seq_len", mode="before")
@classmethod
def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
    if value is None:
        # Fallback to the AutoDeployConfig default when not provided
        return AutoDeployConfig.model_fields["max_seq_len"].get_default(
            call_default_factory=True
        )
    return value
Rename unused info parameter to _info
The info parameter is unused in this validator method. Ruff's ARG rule (flake8-unused-arguments) is enabled in the project configuration and would flag this. Renaming to _info follows Python convention for intentionally unused parameters and silences the warning.
- def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
+ def ensure_max_seq_len(cls, value: Any, _info: ValidationInfo) -> Any:
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 460-460: Unused class method argument: info
(ARG003)
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/llm_args.py` around lines 458 - 467, The
validator ensure_max_seq_len currently declares an unused parameter named info
which triggers the ARG lint rule; rename that parameter to _info in the method
signature of ensure_max_seq_len (the `@field_validator`("max_seq_len",
mode="before") classmethod) so it becomes unused-by-convention and the linter
warning is silenced, leaving the body unchanged and preserving the return
behavior that falls back to
AutoDeployConfig.model_fields["max_seq_len"].get_default(call_default_factory=True)
when value is None.
regenerated = 0
# Only regenerate k_cache and v_cache (KV caches that are views)
for name in list(self._caches.keys()):
    if "k_cache" in name or "v_cache" in name:
        if name in self._cache_initializers:
            old_ptr = self._caches[name].data_ptr()
            # Re-invoke initializer to get new view
            self._caches[name] = self._cache_initializers[name](self.info)
            new_ptr = self._caches[name].data_ptr()
            regenerated += 1
            if regenerated <= 2:  # Only log first 2
                ad_logger.info(
                    f"[CachedSequenceInterface] Regenerated {name}: "
                    f"old_ptr=0x{old_ptr:x}, new_ptr=0x{new_ptr:x}, "
                    f"shape={self._caches[name].shape}"
                )

ad_logger.info(f"[CachedSequenceInterface] Regenerated {regenerated} cache views")
Update cache-view regeneration for combined kv_cache naming
The regeneration logic only looks for "k_cache"/"v_cache". With the new "kv_cache_*" resources, views won’t be refreshed after a PTCacheBackend pool reallocation, leaving stale pointers.
✅ Suggested fix
- # Only regenerate k_cache and v_cache (KV caches that are views)
+ # Regenerate combined kv_cache (and legacy k/v cache views) after pool reallocation
for name in list(self._caches.keys()):
- if "k_cache" in name or "v_cache" in name:
+ if "kv_cache" in name or "k_cache" in name or "v_cache" in name:🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py` around lines 785 - 802,
The cache-refresh loop in CachedSequenceInterface only checks for "k_cache" or
"v_cache" and therefore misses new combined keys like "kv_cache_*"; update the
condition in the for-loop that iterates self._caches.keys() to also detect
combined kv names (e.g., check for "kv_cache" or name.startswith("kv_cache_") or
a substring match for "kv_cache") so that entries created from
self._cache_initializers are re-invoked and replaced (preserving the existing
logic that calls self._cache_initializers[name](self.info), compares data_ptrs,
increments regenerated, and logs via ad_logger.info).
@pytest.mark.skipif(not HAS_PT_CACHE_BACKEND, reason="PTCacheBackend not available")
@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
@pytest.mark.skipif(
    torch.cuda.get_device_capability()[0] < 8, reason="CUDA graphs require SM 8.0+"
)
🧩 Analysis chain
🌐 Web query:
torch.cuda.get_device_capability() CPU-only builds behavior exception
💡 Result:
In CPU-only PyTorch builds (i.e., PyTorch not built with CUDA support), calling torch.cuda.get_device_capability() will typically raise an error like:
AssertionError: Torch not compiled with CUDA enabled
because torch.cuda is lazily initialized and any call that needs CUDA triggers initialization, which fails when CUDA support isn’t compiled in. The docs explicitly note the lazy-init model and that you should use torch.cuda.is_available() to check usability. [1] The get_device_capability API itself is only meaningful when a CUDA device exists. [2]
Robust pattern (works on CPU-only + CUDA builds):
import torch

if torch.backends.cuda.is_built() and torch.cuda.is_available():
    cc = torch.cuda.get_device_capability()
else:
    cc = None  # CPU-only build, or no working CUDA device/driver

torch.backends.cuda.is_built() tells you whether the installed PyTorch binary was built with CUDA support. [3]
If you need “some capability” on CPU-only machines: there is no CUDA compute capability to return—treat it as None (or raise your own clear error) rather than calling torch.cuda.get_device_capability().
Sources: PyTorch CUDA lazy initialization and is_available() guidance [1], torch.cuda.get_device_capability docs [2], torch.backends.cuda.is_built() docs [3], example of the CPU-only build assertion coming from CUDA lazy init [4].
References
[1] https://docs.pytorch.org/docs/stable/cuda.html
[2] https://docs.pytorch.org/docs/2.9/generated/torch.cuda.get_device_capability.html
[3] https://docs.pytorch.org/docs/stable/backends.html
[4] https://bugs.launchpad.net/bugs/2051023
Guard CUDA capability check to prevent import-time failures on CPU-only builds
In @pytest.mark.skipif decorators, each condition expression is evaluated independently at import time. Even though line 245 checks torch.cuda.is_available(), the subsequent decorator on lines 246–248 still evaluates torch.cuda.get_device_capability(), which raises AssertionError on CPU-only PyTorch builds.
Combine the conditions into a single decorator with short-circuit evaluation to prevent the call when CUDA is unavailable:
Suggested fix
 @pytest.mark.skipif(not HAS_PT_CACHE_BACKEND, reason="PTCacheBackend not available")
-@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
-@pytest.mark.skipif(
- torch.cuda.get_device_capability()[0] < 8, reason="CUDA graphs require SM 8.0+"
+@pytest.mark.skipif(
+ (not torch.cuda.is_available())
+ or (torch.cuda.get_device_capability()[0] < 8),
+ reason="CUDA not available or SM 8.0+ required for CUDA graphs",
)🤖 Prompt for AI Agents
In `@tests/test_trtllm_attention_cuda_graph.py` around lines 244 - 248, The three
separate pytest skipif decorators cause torch.cuda.get_device_capability() to be
called at import time even on CPU-only builds; update the decorators so the CUDA
availability and device capability checks are combined into a single skipif
using short-circuit logic (e.g. combine torch.cuda.is_available() and
torch.cuda.get_device_capability()[0] < 8 into one condition) while keeping the
HAS_PT_CACHE_BACKEND check as its own decorator; ensure the combined decorator
uses a clear reason like "CUDA graphs require SM 8.0+ or CUDA not available" so
get_device_capability() is only invoked when CUDA is available.
Improve performance of thop.attention CUDA graph support by:

- Remove torch.cuda.synchronize() that was blocking the CPU
- Replace the Python loop with .item() calls by vectorized GPU operations using torch.searchsorted and advanced indexing for block offsets
- Compute block offsets once on the first layer, copy to the remaining 31 layers
- Move GPU tensor creation outside the layer loop
- Use non_blocking=True for device copies

This improves throughput from ~1500 TPS to ~5800 TPS with CUDA graphs.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
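For reviewers unfamiliar with the trick, here is a small self-contained sketch of the searchsorted-plus-advanced-indexing block-offset fill this commit describes. Tensor names follow the commit message, while shapes and values are invented for illustration (shown on CPU so it runs anywhere); it is not the PR's actual code.

import torch

num_seq = 3
cu_num_pages = torch.tensor([0, 2, 5, 6])     # pages per sequence: 2, 3, 1
cache_loc = torch.tensor([7, 3, 0, 4, 9, 2])  # flat page ids, concatenated per sequence
num_layers, kv_factor = 32, 2
multiplier = num_layers * kv_factor
max_blocks_per_seq = 4
block_offsets = torch.zeros(1, num_seq, 2, max_blocks_per_seq, dtype=torch.int64)

# For every flat page, find the sequence it belongs to ...
flat_idx = torch.arange(cache_loc.numel())
seq_idx = torch.searchsorted(cu_num_pages, flat_idx, right=True) - 1
# ... and its position within that sequence.
page_pos = flat_idx - cu_num_pages[seq_idx]

base = cache_loc * multiplier
block_offsets[0, seq_idx, 0, page_pos] = base      # K blocks
block_offsets[0, seq_idx, 1, page_pos] = base + 1  # V blocks
print(block_offsets[0])

The point of the pattern is that no Python-level loop or per-sequence .item() call is needed; the same indexing works on GPU tensors during the non-capturing prepare step.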
…utation

Add pre-allocated GPU buffers to TrtllmAttentionGlobalState for vectorized block offset computation, matching PTCacheBackend's pattern:

- _gpu_cu_pages, _gpu_page_positions, _gpu_seq_idx, _gpu_page_idx, _gpu_base_offset buffers allocated once and reused
- Use torch.searchsorted/sub/mul with out= parameter to avoid per-call tensor allocations

This eliminates allocation overhead in the hot path.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Since KVCacheManager uses a single unified pool for all layers, most metadata tensors are identical across layers. This change shares them:

- Add shared tensors to TrtllmAttentionGlobalState (sequence_length, context_lengths, kv_cache_block_offsets, host tensors)
- TrtllmLayerState now references shared tensors via init_from_shared()
- Only host_kv_cache_pool_mapping remains per-layer (layer offsets)
- host_prepare_fn updates shared tensors ONCE instead of 32x

This eliminates 32x redundant tensor updates per forward pass, improving throughput from ~5840 to ~6233 TPS (closer to PTCacheBackend).

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
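A rough sketch of the sharing scheme this commit describes, using simplified class names so it is not mistaken for the PR's actual implementation; only init_from_shared() and a couple of fields are mirrored from the message:

from dataclasses import dataclass, field
import torch

@dataclass
class SharedAttentionState:
    max_num_requests: int
    # One copy of the batch-wide metadata, shared by every layer.
    sequence_length: torch.Tensor = field(init=False)
    context_lengths: torch.Tensor = field(init=False)

    def __post_init__(self):
        self.sequence_length = torch.zeros(self.max_num_requests, dtype=torch.int32)
        self.context_lengths = torch.zeros(self.max_num_requests, dtype=torch.int32)

@dataclass
class LayerState:
    layer_idx: int
    # Per-layer data that actually differs across layers (e.g., pool mapping row).
    pool_mapping_row: torch.Tensor = field(default=None)
    # References into the shared state; no per-layer copies.
    sequence_length: torch.Tensor = field(default=None)
    context_lengths: torch.Tensor = field(default=None)

    def init_from_shared(self, shared: SharedAttentionState) -> "LayerState":
        self.sequence_length = shared.sequence_length
        self.context_lengths = shared.context_lengths
        return self

shared = SharedAttentionState(max_num_requests=8)
layers = [LayerState(i).init_from_shared(shared) for i in range(4)]
# Updating the shared tensor once is visible to every layer.
shared.sequence_length[:2] = torch.tensor([5, 7], dtype=torch.int32)
assert layers[3].sequence_length[1] == 7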
Optimize thop.attention metadata preparation to outperform PTCacheBackend:

- Add FAST PATH in _prepare_trtllm_metadata: after host_prepare_fn runs, each layer's call just returns pre-computed tensors (almost zero work)
- Track host_prepare_called flag to enable fast path during replay
- Cache current_num_seq to avoid parsing batch_info during fast path
- Move pool pointer initialization to be done once in host_prepare_fn

Performance results:

- Optimized non-PTCacheBackend: ~6600 TPS
- PTCacheBackend: ~6528 TPS
- Improvement: ~1.1% faster than PTCacheBackend

The key insight is that during CUDA graph replay, host_prepare_fn runs once per forward pass and fills all shared tensors. The 32 per-layer _prepare_trtllm_metadata calls should do almost nothing - just return pre-computed tensor slices.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
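The fast-path idea can be illustrated with a toy version (the names host_prepare_called and current_num_seq come from the commit message; everything else is simplified and hypothetical, not the PR's code):

import torch

class GlobalState:
    def __init__(self, max_requests: int):
        self.sequence_length = torch.zeros(max_requests, dtype=torch.int32)
        self.host_prepare_called = False
        self.current_num_seq = 0

    def host_prepare(self, seq_lens: torch.Tensor) -> None:
        """Runs once per forward pass, outside the captured graph."""
        n = seq_lens.numel()
        self.sequence_length[:n].copy_(seq_lens)
        self.current_num_seq = n
        self.host_prepare_called = True

    def per_layer_metadata(self):
        """Called once per layer (32x per forward); near-zero work on the fast path."""
        if self.host_prepare_called:
            return (self.sequence_length[: self.current_num_seq],)
        raise RuntimeError("host_prepare() must run before per-layer calls")

state = GlobalState(max_requests=8)
state.host_prepare(torch.tensor([4, 9], dtype=torch.int32))
for _ in range(32):  # every layer reuses the same pre-computed slice
    (seq_len_slice,) = state.per_layer_metadata()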
Optimize host_prepare_fn to reduce tensor allocation overhead:

- Add pre-allocated pinned CPU buffers for intermediate computations:
  - _cpu_input_seq_lens, _cpu_seq_len_with_cache, _cpu_past_kv_lens
  - _cpu_cu_num_pages, _cpu_pages_per_seq
- Use torch.sub/copy with out= parameters to avoid tensor allocation
- Replace .item() with int() for faster scalar extraction
- Only zero the slice of block_offsets we need ([:, :num_seq, :, :])

Performance results:

- Optimized non-PTCacheBackend: ~6645 TPS
- PTCacheBackend: ~6527 TPS
- Improvement: ~1.8% faster than PTCacheBackend

Note: The remaining ~6.5ms in ad_prepare_inputs is dominated by framework code in ad_executor.py (Python list operations for request processing), which is common to both backends.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
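A minimal sketch of the pre-allocated pinned-buffer pattern with out= writes, assuming illustrative buffer sizes; only two of the buffers listed above are shown and the helper function is hypothetical:

import torch

MAX_REQUESTS = 64
_pin = torch.cuda.is_available()  # pinned host memory needs a CUDA-enabled setup

# Allocated once and reused every forward pass.
_cpu_input_seq_lens = torch.zeros(MAX_REQUESTS, dtype=torch.int32, pin_memory=_pin)
_cpu_past_kv_lens = torch.zeros(MAX_REQUESTS, dtype=torch.int32, pin_memory=_pin)

def fill_past_kv_lens(batch_info_host: torch.Tensor,
                      seq_len_with_cache: torch.Tensor,
                      input_seq_lens: torch.Tensor) -> int:
    n = int(batch_info_host[0])  # int() on a host scalar instead of .item()
    _cpu_input_seq_lens[:n].copy_(input_seq_lens[:n])
    # out= writes straight into the pre-allocated buffer; no temporary tensor is created.
    torch.sub(seq_len_with_cache[:n], _cpu_input_seq_lens[:n], out=_cpu_past_kv_lens[:n])
    return n

n = fill_past_kv_lens(torch.tensor([3]),
                      torch.tensor([10, 12, 7], dtype=torch.int32),
                      torch.tensor([1, 1, 2], dtype=torch.int32))
print(_cpu_past_kv_lens[:n])  # tensor([ 9, 11,  5], dtype=torch.int32)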
Clean up code for review by removing the PTCacheBackend alternative:

- Remove PTCacheBackend imports and _HAS_PT_CACHE_BACKEND flag
- Remove use_pt_cache_backend config option and related code
- Remove enable_pt_cache_backend/get_pt_cache_backend/is_pt_cache_backend_enabled
- Remove debug SDPA fallback code
- Remove debug logging statements
- Simplify TrtllmAttentionConfig class
- Clean up related code in kvcache.py and interface.py

The direct AD pool integration (KVCacheManager) is now the only code path, which is optimized with:

- Pre-allocated CPU/GPU buffers
- Shared tensors across layers
- Vectorized GPU block offset computation
- Host prepare function for CUDA graph support

Performance: ~6650 TPS (1.8% faster than PTCacheBackend baseline)

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary by CodeRabbit
New Features
Bug Fixes & Improvements
Documentation
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.