feat(annotation): Torch-TensorRT annotation layer — custom_plugin (QDP/Triton/CuTile/CuTeDSL) by BowenFu · Pull Request #4147 · pytorch/TensorRT

BowenFu · 2026-03-30T09:51:17Z

Motivation

Torch-TRT compiles PyTorch models to TensorRT engines, but today there is no first-class path for users who want to replace a subgraph with their own Triton, CuTile, or CuTeDSL kernel inside the compiled engine. The typical workaround — writing a C++ TRT plugin and registering it manually — requires leaving the Python ecosystem, managing separate build systems, and wiring up the plugin registry by hand. This is a significant barrier for researchers and ML engineers who already have high-performance Python kernels.

TensorRT 10.x introduced Quick Deployable Plugins (QDP), which support AOT-compiled Python kernels (@trtp.aot_impl) that are embedded directly into the TRT engine with no Python required at runtime. This PR adds the descriptor and registration layer that lets users express a custom QDP plugin as a plain Python object and pass it to Torch-TRT — with no changes to any core compiler files.

What's in this PR

Public API (import torch_tensorrt.annotation as tta):

Factory	Returns	Description
`tta.triton(launch_fn, configs)`	`TritonSpec`	Triton kernel descriptor
`tta.cutile(launch_fn, arch, configs)`	`CuTileSpec`	CuTile kernel descriptor (Blackwell sm_100+)
`tta.cutedsl(launch_fn, configs)`	`CuTeDSLSpec`	CuTeDSL kernel descriptor
`tta.custom_plugin(impl)`	`CustomPluginSpec`	AOT QDP plugin descriptor wrapping one or more kernel specs

QDP registration (_custom_plugin/):

_descriptor.py — CustomPluginSpec dataclass + custom_plugin() factory; computes a deterministic op name from the kernel function identity + config hash; register_custom_plugin() registers @trtp.register / @trtp.autotune / @trtp.aot_impl with TRT's process-global QDP registry using double-checked locking for xdist safety.
_lowering.py — lowers a CustomPluginSpec to a TRT plugin layer via ctx.net.add_plugin(trtp.op.<ns>.<name>(*inputs), aot=True); injects weight tensors as add_constant layers.
_qdp_utils.py — deterministic op-name derivation, tactic table building, meta-tensor helpers for symbolic shape inference.
_symbolic.py — SymbolicTensor abstraction for QDP shape/dtype descriptor registration.

AOT backends (_custom_plugin/_aot/):

_triton.py — Triton → PTX via triton.compile; per-config tactic entries.
_cutile.py — CuTile → cubin via tileiras; sm_100+ only.
_cutedsl.py — CuTeDSL → PTX/cubin via nvidia-cutlass-dsl.

Supporting modules:

_specs.py — TritonSpec, CuTileSpec, CuTeDSLSpec, KernelImplSpec frozen dataclasses; triton() / cutile() / cutedsl() factories; normalize_impl_to_spec().
_layer_metadata.py — set_tta_layer_metadata() helper for stamping TRT layer metadata; encode/decode round-trip for custom plugin attribution.
_recorders.py — launch-parameter recording for Triton/CuTile/CuTeDSL AOT backends.
_validation.py — spec and descriptor validation utilities.
_errors.py — TTADiagnosticError structured error type.

Tests (tests/py/annotation/unit/, CPU-only, 46 tests):

test_specs.py — kernel spec construction, validation, cache-key stability.
test_specs_custom_plugin.py — CustomPluginSpec and custom_plugin() factory.
test_layer_metadata.py — metadata encode/decode round-trip.

Design notes

Descriptor-only, zero core impact. CustomPluginSpec is a plain frozen dataclass. No hooks into _compiler.py, _TRTInterpreter.py, or any other existing file. The integration point (passing a descriptor to a converter) is left for a follow-up PR.
Deterministic op naming. The QDP op_name is derived from a hash of the kernel function identity, config set, and weight count. The same descriptor created in two different processes produces the same name, making engine caching safe.
Process-global registration with xdist safety. TRT's QDP registry is process-global. A double-checked locking pattern over a process-level set prevents duplicate registration when pytest-xdist workers share a process.
Multi-tactic autotuning. Multiple configs dicts produce multiple QDP tactics; TRT's autotuner benchmarks all of them at engine-build time.

trt_plugins.custom_op integration — a torch_tensorrt.dynamo.conversion.plugins API that wires a CustomPluginSpec directly to a registered torch.library custom op, so the TRT lowering path is set up with no manual converter code:

import torch_tensorrt.annotation as tta
from torch_tensorrt.dynamo.conversion import plugins as trt_plugins

def my_op_meta(x, y):
    return x.new_empty(x.shape)

def launch_triton(x, y, out, *, BLOCK: int):
    triton_kernel[...](x, y, out, BLOCK=BLOCK)

def launch_cutile(x, y, out, *, TILE: int):
    cutile_kernel.run(x, y, out, TILE=TILE)

triton_spec = tta.triton(launch_fn=launch_triton, configs=[{"BLOCK": 128}, {"BLOCK": 256}])
cutile_spec = tta.cutile(launch_fn=launch_cutile, configs=[{"TILE": 64}, {"TILE": 128}])

MY_IMPL = tta.custom_plugin(
    kernel=[triton_spec, cutile_spec],
    meta_impl=my_op_meta,
)

trt_plugins.custom_op(
    "torchtrt_ex::my_custom_op",
    impl=MY_IMPL,
)

TRT's autotuner benchmarks all tactics across both backends and selects the fastest for the target GPU.

Future work

This PR establishes the descriptor and registration layer. The follow-up work:

Converter integration — wire CustomPluginSpec into the _TRTInterpreter converter dispatch so that annotated subgraphs are lowered to the registered QDP op during torch_tensorrt.compile.
Region annotation API — a tta.lower_as(impl=..., name=...) context manager that tags subgraph regions during torch.export for targeted lowering to custom plugins. The intended end-to-end usage looks like:

import torch
import torch_tensorrt
import torch_tensorrt.annotation as tta

@triton.jit
def _fused_add_relu(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    i = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = i < n
    tl.store(out_ptr + i, tl.maximum(0, tl.load(x_ptr+i, mask=mask) + tl.load(y_ptr+i, mask=mask)), mask=mask)

def launch_fused_add_relu(x, y, out, stream, BLOCK=256):
    _fused_add_relu[(triton.cdiv(x.numel(), BLOCK),)](x, y, out, x.numel(), BLOCK=BLOCK)

fused_add_relu = tta.custom_plugin(tta.triton(launch_fused_add_relu, configs=[{"BLOCK": 128}, {"BLOCK": 256}]))

class ResidualBlock(nn.Module):
    def forward(self, x, y):
        with tta.lower_as(impl=fused_add_relu):
            return torch.relu(x + y)

model = ResidualBlock().cuda().eval()
x = torch.randn(4, 1024, device="cuda")
trt_model = torch_tensorrt.compile(model, inputs=[x, x])

End-to-end tests — GPU tests exercising the full compile → run path for Triton, CuTile, and CuTeDSL backends on representative kernels (fused add+ReLU, RMSNorm, attention).

Test plan

docker exec torch_tensorrt_dev \
  python -m pytest tests/py/annotation/unit/ -n 4 --tb=short -v
# 46 passed

meta-cla · 2026-03-30T09:51:23Z

Hi @BowenFu!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

narendasan

@BowenFu can we split this out in to a PR stack? lets put tta.custom_plugin at the bottom. What I want to focus on is lets say a user already has implemented a custom operator in PyTorch backed by one of these kernels. We want to enable the AOT QDP launch of that kernel without a bunch of boilerplate: Basically this example https://docs.pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/auto_generate_plugins.html with AOT QDP. There is already facilities for the converter generation and plugin registration from PyTorch Meta Kernel. Once we have that we have a solid base to look at region labeling / manual fusion and other advance usecases.

narendasan · 2026-03-30T17:14:44Z

py/torch_tensorrt/annotation/_custom_plugin/_lowering.py

+# ---------------------------------------------------------------------------
+
+
+def lower_custom_plugin(


Can we merge this stuff with already existing plugin autogeneration in torch_tensorrt.dynamo.conversion.plugin? Like we already automate converter generation key'ed on operator name

I just cleaned up all tta.lower_as/tta.export_as related codes from this PR.

narendasan · 2026-03-30T17:15:08Z

py/torch_tensorrt/annotation/_custom_plugin/_lowering.py

+    return lower_custom_plugin_descriptor(ctx, descriptor, trt_inputs, name)
+
+
+def register_custom_plugin_qdp(


Same here, these facilities already exist, they should be extended not duplicated

narendasan · 2026-03-30T20:55:45Z

Also @BowenFu please follow the instructions I sent for how to get added to the CLA

BowenFu · 2026-03-31T02:36:41Z

@BowenFu can we split this out in to a PR stack? lets put tta.custom_plugin at the bottom. What I want to focus on is lets say a user already has implemented a custom operator in PyTorch backed by one of these kernels. We want to enable the AOT QDP launch of that kernel without a bunch of boilerplate: Basically this example https://docs.pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/auto_generate_plugins.html with AOT QDP. There is already facilities for the converter generation and plugin registration from PyTorch Meta Kernel. Once we have that we have a solid base to look at region labeling / manual fusion and other advance usecases.

Sure. Will include only tta.custom_plugin in this PR.

…OT integration Adds the torch_tensorrt.annotation (tta) module for authoring TensorRT PluginV3 (QDP) backends using Triton, CuTile, and CuTeDSL kernels. Key features ------------ - tta.custom_plugin(kernel_spec, meta_impl, **attrs): factory that records a deterministic AOT fingerprint and stores kernel + attrs. meta_impl is required (ValueError if None). - tta.triton / tta.cutile / tta.cutedsl: per-backend KernelSpec helpers. - trt_plugins.custom_op(impl=tta.custom_plugin(...)): full Dynamo converter + QDP registration in one call. - Dynamic-shape support via ShapeExpr / SymInt32 in all three backends. - Multi-config tactic selection: TRT benchmarks per-config PTX at engine-build time and picks the fastest tactic automatically. - Multi-output plugins: meta_impl returning tuple -> Tuple[TensorDesc,...] Bug fixes --------- - register_dynamo_plugin: include weight count in QDP num_inputs so trt.add_constant weight tensors are counted in the plugin schema. - _build_desc_fn: pass num_dynamic = num_inputs - num_weights to _build_meta_impl_desc_fn so meta_impl only receives activation descs. Tests ----- - 55 E2E tests covering: Triton/CuTile/CuTeDSL backends, dynamic shapes, multi-output, attrs, tensor weights, bf16/fp16, 3D inputs, production scale, cross-backend same engine. - 41 unit tests.

…on belongs to tta-full

- _symbolic.py: guard _ShapeDim.__int__ for dynamic dims; wrap _strides_from_td in try/except AttributeError; fix scalar numel to SymInt32(1) (not 0); fix shape_dim() to use self._shape[dim]; add negative-dim normalisation to stride(); add cdiv divisor > 0 guard - _qdp_utils.py: extend make_qdp_symbol hash from 8 to 16 hex chars (reduces birthday-collision risk); enrich analyze_launch_args error messages with position, role, and total counts - _cutile.py: scope PATH mutation with try/finally restore; cleanup CUBIN tempfile in finally block; scope PTX name replace to .entry directive via re.sub; add 500 KB scan-limit debug log; pass param_binding_indices to _launch_params_from_trt - _layer_metadata.py: add _validate_attr_key() for encoding-time safety; call it in _format_attrs; extract torch_op via raw.find() (handles spaces in path); return None for empty torch_op; add debug log for unexpected inter-token tokens - _specs.py: remove dead _custom_plugin_spec() factory; tighten input_formats/output_formats type to Optional[Sequence[int]] - _recorders.py: remove unused field import; add 3-element grid/block validation in _CuTeDSLLaunchProxy.launch - _custom_op.py: narrow impl type annotation from Any to Optional[CustomPluginSpec] with TYPE_CHECKING guard - tests: add cross-instance op_name test; add fn_specs round-trip test and empty-torch_op None test in test_layer_metadata; add NVBUG comment before expectedFailure; fix _W_COLUMN_SCALE to use _LLM_H; tighten BF16 tolerance 1e-1 → 2e-2, FP16 tolerance 1e-2 → 1e-3

…ser API

- _descriptor.py: remove _build_attr_params (attrs via field params, never called) and _build_identity_desc_fn (meta_impl is required, None path unreachable) - _qdp_utils.py: remove is_cutedsl_compile_fn (heuristic never called), dtype_token (superseded by inline TRT dtype checks), collect_allowed_formats_for_io (informational only, never drove autotune; documented LIMITATION removed), make_td_from_meta (no callers anywhere in the codebase) - _aot/_cutedsl.py: remove _make_compile_wrapper (dead helper, never called) - _aot/_symbolic.py: remove cdiv (users call triton.cdiv / ct.cdiv directly) - _impl.py: delete entire file — CustomPluginTacticManager superseded by _descriptor.py; zero imports found across the whole codebase - conftest.py: remove _CUDNN_TEST_FILES, _ensure_cudnn_on_ld_path, _IS_CUDNN_SUBPROCESS and the CuDNN subprocess block in pytest_sessionfinish; the referenced test files (test_plugin_e2e.py, test_cudnn_plugin_e2e.py) do not exist — the entire CuDNN path was dead - debug_symint.py, repro_myelin_symintexprs.py: delete standalone debug scripts that were never meant to be in the test tree

Add test_recorders.py and test_errors.py (new), extend test_specs.py and test_layer_metadata.py with targeted tests for all previously uncovered branches. Pure-Python files after this commit: _errors.py 100% (TTADiagnosticError with/without leaf_op/impl_id) _recorders.py 100% (all 3 recorder classes: Triton, CuTile, CuTeDSL) _specs.py 100% (AnnotationMetadata helpers, KernelImplSpec list-kernel branch + all __post_init__ guards) _layer_metadata.py 99% (1 logically dead line: tok_idx>=len guard that requires len==3 and len>=4 simultaneously) TRT-dependent files (_aot/*, _descriptor.py, _qdp_utils.py, _symbolic.py, _lowering.py) remain 0–33%: they import tensorrt at module level so the unit-test process cannot load them; covered by e2e tests only.

narendasan

Marked a bunch of immediate stuff that stood out. the TL;DR is theres a ton of re-implementation here and some stuff (particularly the triton stuff) seems hacky.

My general recommendation is focus on adding aot_impl and perhaps autotune (@bowang007 did you look at autotune at all?) to the existing plugin system which should handle the rest, rather than essentially making a whole second version. If there are limitations you are running into with what is there then I believe that is where the technical discussion should be centered. Like perhaps the locking system (cc: @bowang007)

I would also recommend trying to make the systems for defining launch parameters more generically applicable, then we dont need to do as much work to say add support for pallas or nvrtc kernels.

Also I would recommend that all the sort of kernel encapsulation stuff would nicely fit in a namespace called torch_tensorrt.kernels or torch_tensorrt.dynamo.kernels It would be immediately obvious what you would use the namespace for then.

narendasan · 2026-04-02T15:38:50Z

py/torch_tensorrt/dynamo/conversion/plugins/_custom_op.py

-        requires_output_allocator,
-    )
+    if impl is not None:
+        impl.register_dynamo_plugin(


We dont need a complete second code path for this, lets reuse what we have. For example creating the converter is likely the only user of capability_validator, priority, supports_dynamic_shapes, requires_output_allocator. generate_plugin_converter already creates this converter key'ed on name,

Really I would expect the code to look like:

generate_plugin(op_name) # Generates JIT QDP Plugin generate_plugin_converter(op_name, capability_validator, priority, supports_dynamic_shapes, requires_output_allocator) # Generates the converter that inserts the QDP plugin if impl: #this should be kernel_impl probably impl.generate_plugin_aot_impl()

Updated as suggested.

narendasan · 2026-04-02T15:44:11Z

py/torch_tensorrt/annotation/_specs.py

+    return getattr(fn, _ANNOTATION_METADATA_ATTR, None)
+
+
+# ── Custom kernel specs (Triton / CuTile / CuTeDSL) ──────────────────────────


Do we have one for NVRTC?

@narendasan I would expect one TVM path later after TensorRT actually supports that. NVRTC is too flexible which we need to add lots of constraints on it or bridge work to make it work with the existing QDP path.

There are users of the plugin system who are explicitly using NVRTC which is why I ask

narendasan · 2026-04-02T15:44:42Z

py/torch_tensorrt/annotation/_specs.py

+
+
+@dataclass
+class AnnotationMetadata:


Is this needed in this PR?

Pushed the Metadata related changes to a future PR.

narendasan · 2026-04-02T15:49:21Z

py/torch_tensorrt/annotation/_layer_metadata.py

+# ---------------------------------------------------------------------------
+
+
+def set_tta_layer_metadata(


Why do we need this?

narendasan · 2026-04-02T15:50:49Z

py/torch_tensorrt/annotation/_layer_metadata.py

@@ -0,0 +1,546 @@
+"""TTA metadata stored on TensorRT ``ILayer`` objects as a plain string.


Is this needed in this PR or one of the higher level ones in the PR stack?

layer metadata is not necessary for the functionality. Will defer it to a separate PR. It is used to map back TRT engine layers back to torch codes, which would be used by some other high level annotations (which will come later).

narendasan · 2026-04-02T16:18:44Z