Fix TensorRT compatibility with version 10.x #36309

SaiShashank12 · 2025-09-29T04:15:09Z

Perfect — then you should explicitly reference it so the PR auto-closes it. Here’s the finalized PR description with your issue number plugged in:

Description

This PR introduces a new TensorRTEngineHandlerNumPy implementation that supports TensorRT 10.x with the Tensor API (num_io_tensors, get_tensor_name, execute_async_v3, etc).
It resolves runtime errors such as:

AttributeError: 'ICudaEngine' object has no attribute 'num_bindings'
RuntimeError: This handler requires TensorRT 10.x+. Missing attributes on ICudaEngine: [...]

by replacing the legacy TensorRT < 10 API calls with the new Tensor API.

It also fixes shape validation during dynamic batching and ONNX parsing (onnx_path provided but build_on_worker=False) to allow building engines directly on Beam workers if desired.

Addresses: fixes #36306

Checklist

Mentioned the relevant issue (fixes #36306) so it will auto-close once merged.
Added code that checks for TensorRT 10.x compatibility (_require_tensorrt_10) before engine load.
Replaced num_bindings and execute_async_v2 with num_io_tensors and execute_async_v3.
Validated with dummy ResNet-50 ONNX pipeline on DirectRunner.
Validated with dummy ResNet-50 ONNX pipeline on DataRunner.
Update CHANGES.md with a new entry under Python SDK → ML Inference: “Added TensorRT 10.x handler (TensorRTEngineHandlerNumPy).”
Write down the test cases.

gemini-code-assist · 2025-09-29T04:15:22Z

Summary of Changes

Hello @SaiShashank12, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades Apache Beam's machine learning inference capabilities by introducing a dedicated handler for TensorRT 10.x. It modernizes the integration with TensorRT's latest Tensor API, resolving critical compatibility issues and enhancing the flexibility of deploying ONNX models with dynamic batching within Beam pipelines.

Highlights

TensorRT 10.x Compatibility: Introduces a new TensorRTEngineHandlerNumPy specifically designed to support TensorRT 10.x and its modern Tensor API.
API Modernization: Replaces deprecated TensorRT < 10 API calls (e.g., num_bindings, execute_async_v2) with the new Tensor API methods (e.g., num_io_tensors, get_tensor_name, execute_async_v3, set_input_shape, set_tensor_address).
Error Resolution: Fixes common runtime errors encountered with TensorRT 10.x, such as AttributeError: 'ICudaEngine' object has no attribute 'num_bindings' and RuntimeError: This handler requires TensorRT 10.x+.
Enhanced Dynamic Batching and ONNX Parsing: Improves shape validation during dynamic batching and addresses issues with ONNX parsing, enabling the building of TensorRT engines directly on Beam workers if desired.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

SaiShashank12 · 2025-09-29T04:22:22Z

Python SDK

ML Inference

Added support for TensorRT 10.x with the new TensorRTEngineHandlerNumPy.
This replaces legacy TensorRT < 10 APIs (e.g., num_bindings, execute_async_v2)
with the new Tensor API (num_io_tensors, get_tensor_name, execute_async_v3).
This resolves runtime errors when using TensorRT >= 10 in Beam pipelines.
(See #36306)

…che#36306)

github-actions · 2025-09-29T05:37:14Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

SaiShashank12 · 2025-10-05T01:19:00Z

assign set of reviewers

github-actions · 2025-10-05T01:20:05Z

Assigning reviewers:

R: @jrmccluskey for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

jrmccluskey · 2025-10-07T14:11:13Z

Please fix the formatting, linting, and whitespace precommits.

- Fix ASF license header (add missing blank comment line) - Fix 8 lines exceeding 80 character limit - Simplify chained comparison - Remove unused variable in profile shape check - Remove unused imports from test file All precommit checks now pass: - yapf (formatter): No issues - pylint (linter): 10.00/10 score - whitespace: No trailing whitespace

- Fix import ordering (numpy before threading) - Remove unused type: ignore on line 67 - Add type annotations for min/opt/max_shape tuples to fix mypy errors - Remove unused type: ignore on line 224

- Move threading before typing imports - Separate each typing import on its own line - Place numpy after typing imports - Separate base imports onto individual lines

SaiShashank12 · 2025-10-12T02:53:21Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a new TensorRTEngineHandlerNumPy to support TensorRT 10.x, which is a valuable addition. The implementation is well-structured, and the inclusion of tests and dependency updates is appreciated. My review focuses on enhancing the robustness and flexibility of the new handler. I've identified a significant limitation in the handling of multi-input models, which appears to be incomplete. Additionally, I've provided suggestions to improve error handling, make the ONNX building process more configurable, and clean up the test code.

gemini-code-assist · 2025-10-12T02:55:30Z

sdks/python/apache_beam/ml/inference/trt_handler_numpy_compact.py

+    if len(self.input_names) == 1:
+      shapes[self.input_names[0]] = list(batch.shape)
+    else:
+      if not input_shapes:
+        raise RuntimeError(
+            f"Engine expects multiple inputs {self.input_names}; "
+            "provide shapes via "
+            "inference_args={'input_shapes': {name: shape, ...}}")
+      for name in self.input_names:
+        if name not in input_shapes:
+          raise RuntimeError(f"Missing shape for input tensor '{name}'")
+        shapes[name] = list(map(int, input_shapes[name]))
+
+    # Validate and set shapes
+    for name, shp in shapes.items():
+      self._check_shape_in_profile(name, shp)
+      self.context.set_input_shape(name, shp)
+
+    # Allocate first input device buffer (we copy only this from 'batch')
+    in_name = self.input_names[0]
+    in_dtype = self.dtypes[in_name]
+    in_nbytes = int(np.prod(shapes[in_name])) * in_dtype.itemsize
+    if self._device_ptrs.get(in_name) is None or self._in_nbytes != in_nbytes:
+      if self._device_ptrs.get(in_name) is not None:
+        self._free_ptr(self._device_ptrs[in_name])
+      self._device_ptrs[in_name] = _assign_or_fail(cuda.cuMemAlloc(in_nbytes))
+      self._in_nbytes = in_nbytes


The current implementation for multi-input models appears to be incomplete.

While it derives shapes for multiple inputs from inference_args['input_shapes'], it only uses the batch data for the first input tensor (self.input_names[0]). There is no mechanism to provide data for the other input tensors. The _trt10_inference_fn also only copies data for the first input.

The device buffer allocation in ensure_buffers only happens for the first input tensor. It should iterate over all self.input_names and allocate buffers for each.

A common pattern for multi-input models in Beam is to have PCollection elements as dictionaries mapping input names to numpy arrays. The ModelHandler and run_inference function would need to be adapted to handle this. For example, ExampleT could be Dict[str, np.ndarray], and run_inference would receive a batch of such dictionaries.

This is a significant limitation for models with multiple inputs.

gemini-code-assist · 2025-10-12T02:55:30Z

sdks/python/apache_beam/ml/inference/trt_handler_numpy_compact.py

+  # Workbench: ~1GiB workspace (tune for your model/infra)
+  config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)


The workspace size for the ONNX builder is hardcoded to 1GB. While the comment suggests tuning it, it would be more flexible for users if this could be configured through the TensorRTEngineHandlerNumPy constructor. Consider adding a parameter like onnx_builder_config_args to __init__ that can be used to pass arguments to the TensorRT builder config, for example to set the memory pool limit.

gemini-code-assist · 2025-10-12T02:55:30Z

sdks/python/apache_beam/ml/inference/trt_handler_numpy_compact.py

+    try:
+      _assign_or_fail(cuda.cuMemFree(ptr))
+    except Exception:
+      pass


The except Exception: pass is too broad and can hide important errors during memory deallocation. It's better to catch a more specific exception, like RuntimeError which is raised by _assign_or_fail. If the goal is to suppress errors during cleanup, consider logging them as warnings instead of silently ignoring them.

Suggested change

try:

_assign_or_fail(cuda.cuMemFree(ptr))

except Exception:

pass

try:

_assign_or_fail(cuda.cuMemFree(ptr))

except RuntimeError as e:

LOGGER.warning('Failed to free CUDA memory pointer: %s', e)

gemini-code-assist · 2025-10-12T02:55:31Z

sdks/python/apache_beam/ml/inference/trt_handler_numpy_compact_test.py

+SINGLE_FEATURE_EXAMPLES = np.array([[1.0], [5.0], [-3.0], [10.0]],
+                                   dtype=np.float32)
+
+SINGLE_FEATURE_PREDICTIONS = [
+    PredictionResult(
+        SINGLE_FEATURE_EXAMPLES[i], [
+            np.array([SINGLE_FEATURE_EXAMPLES[i][0] * 2.0 + 0.5],
+                     dtype=np.float32)
+        ]) for i in range(len(SINGLE_FEATURE_EXAMPLES))
+]
+
+TWO_FEATURES_EXAMPLES = np.array([[1, 5], [3, 10], [-14, 0], [0.5, 0.5]],
+                                 dtype=np.float32)
+
+TWO_FEATURES_PREDICTIONS = [
+    PredictionResult(
+        TWO_FEATURES_EXAMPLES[i],
+        [
+            np.array([
+                TWO_FEATURES_EXAMPLES[i][0] * 2.0 +
+                TWO_FEATURES_EXAMPLES[i][1] * 3 + 0.5
+            ],
+                     dtype=np.float32)
+        ]) for i in range(len(TWO_FEATURES_EXAMPLES))
+]


The test data variables SINGLE_FEATURE_EXAMPLES, SINGLE_FEATURE_PREDICTIONS, TWO_FEATURES_EXAMPLES, and TWO_FEATURES_PREDICTIONS are defined but not used in any of the tests. Consider removing them to improve code clarity. If they are intended for future tests, they could be moved to those tests or commented out with a TODO.

gemini-code-assist · 2025-10-12T02:55:31Z

sdks/python/apache_beam/ml/inference/trt_handler_numpy_compact_test.py

+        model_copies=3)
+    self.assertEqual(handler.model_copies(), 3)
+
+  @unittest.skipIf(GCSFileSystem is None, 'GCP dependencies are not installed')


The tests test_inference_with_onnx_build_on_worker, test_engine_initialization (line 264), and test_pipeline_with_onnx_model (line 292) are skipped if GCSFileSystem is not available. However, these tests use local temporary files and do not seem to depend on GCS. The FileSystems.open call in the implementation will use LocalFileSystem for local paths. The skipIf condition seems unnecessary and could be removed from these tests.

SaiShashank12 · 2025-10-12T03:08:39Z

Hi @jrmccluskey,
Thanks for the feedback! I've addressed all the formatting, linting, and whitespace issues in my latest commits. The precommit checks should now be passing:

Fixed ASF license header formatting
Corrected lines exceeding 80 characters
Fixed import ordering and separated imports properly
Added proper type annotations for mypy
Cleaned up unused imports and variables

I apologize if the initial submission wasn't quite up to standard—this is my first contribution to an Apache project, and I'm still getting familiar with the code style and precommit requirements. I really appreciate your patience and guidance!
Let me know if there's anything else that needs adjustment.

- Register uses_tensorrt pytest marker in pytest.ini - Add Docker container for TensorRT 10.x GPU testing - Include README with container documentation - Container uses TensorRT 25.01-py3 base image with CUDA 12.x - Supports new TensorRTEngineHandlerNumPy Tensor API

- Fix RAT (Release Audit Tool) pre-commit check failure - Add proper HTML comment format license header to markdown file - Matches existing TensorRT container README format

- Demonstrates ONNX-to-engine on-worker conversion feature - Includes comprehensive CLI with all handler parameters - Shows key advantage over legacy TensorRT handler: eliminates environment alignment issues - Supports both ONNX input (with build_on_worker) and pre-built engines - Includes image preprocessing and top-5 prediction formatting

- Fix code formatting with yapf - Fix import ordering with isort - Ensure compliance with pre-commit hooks

- Adjust line break formatting to match yapf 0.43.0 - Ensures consistency with pre-commit CI checks

- Fix import ordering in trt_handler_numpy_compact_test.py (stdlib before third-party) - Fix line length violations in tensorrt_resnet50_inference.py (lines 40, 151) - All files now comply with pylint 80-character line limit

- Add blank lines after function definitions per yapf style - Combine multi-line @unittest.skipIf decorators to single line - All yapf 0.43.0 formatting now compliant

Remove unused GCSFileSystem import and add pylint disable comment for tensorrt import in availability check function.

jrmccluskey

I'm somewhat confused as to why this exists as a separate slate of files, tests, examples, etc. when the actual structure of the model handler is overwhelmingly the same as the existing TensorRT model handler. This seems like a pretty clear opportunity to either update the existing handler or, if preserving pre-10.x behavior is genuinely valuable, do a refactor to enable better code re-use across different major versions (especially if we'd anticipate having to produce a new TensorRT model handler for each version.) I'm also not sold on the file naming here, trt_handler_numpy_compact.py isn't exactly following our naming convention in the directory. What was the thinking behind this approach?

I have similar thoughts for the dockerfile and readme, I feel like those can live in the existing tensorrt_runinference directory.

SaiShashank12 · 2025-10-16T22:12:53Z

Great questions @jrmccluskey ! Let me explain the architectural reasoning behind the separate implementation and address your concerns about naming and infrastructure.

Note: I actually raised these same architectural concerns on the dev mailing list before implementing this approach - specifically about whether to update the existing handler or create separate files. The feedback I received was to bring this discussion to the PR review, so I'm glad you're raising these points here!

Why Separate Files? The Core Technical Difference

The separation isn't primarily about the API version change (v2 → v3), but rather a fundamental workflow improvement that solves a critical production issue with TensorRT.

The Environment Alignment Problem

TensorRT engine files must be compiled on the exact same hardware, CUDA version, and TensorRT version where inference will run. This is documented by NVIDIA as a hard requirement. Pre-building engines on a dev machine and deploying to Dataflow workers almost always causes compatibility failures.

How the Old Handler Works

# tensorrt_inference.py - Manual workflow
handler = TensorRTEngineHandlerNumPy(...)
network, builder = handler.load_onnx()  # User must call this
engine = handler.build_engine(network, builder)  # Then call this
# Notably: load_onnx() is only used in test files, not production

The New Handler's Key Innovation

# trt_handler_numpy_compact.py - Automatic on-worker building
handler = TensorRTEngineHandlerNumPy(
    onnx_path='gs://bucket/model.onnx',  # Portable ONNX
    build_on_worker=True  # Builds IN the Dataflow worker
)
engine = handler.load_model()  # Engine built automatically on correct hardware!

This eliminates an entire class of deployment failures:

✅ ONNX models are fully portable across environments
✅ Engine built with the exact GPU/CUDA/TensorRT version used for inference
✅ No "works on my machine" issues when deploying to Dataflow
✅ Production-ready workflow (vs test-only load_onnx())

Example Scenario This Solves

Developer machine:          Dataflow worker:
- GPU: RTX 4090            - GPU: Tesla T4
- CUDA: 12.8              - CUDA: 12.6
- TensorRT: 10.7          - TensorRT: 10.5

Pre-built engine → FAILS with cryptic errors ❌
ONNX + on-worker build → Works perfectly ✅

Addressing Your Specific Concerns

1. Naming Inconsistency (`trt_handler_numpy_compact.py`)

You're absolutely right - this doesn't follow the naming convention. I'm happy to rename to maintain consistency:

Option A: tensorrt_inference_v10.py (clear version indicator)
Option B: tensorrt_inference_auto.py (highlights auto-building)
Preferred: tensorrt_inference_v10.py to match tensorrt_inference.py

2. Docker & README Duplication

Agreed - these can absolutely be consolidated into the existing tensorrt_runinference/ directory. I can restructure as:

tensorrt_runinference/
├── README.md (updated with both versions)
├── legacy/
│   └── tensorrt_8x.dockerfile
└── v10/
    └── tensorrt_10x.dockerfile

3. Code Reuse & Future Scalability

There's definitely shared logic (~60-70%) that could be refactored. This was actually one of my key concerns when I reached out to the dev list - whether to prioritize clean separation or code reuse. My concern with merging into the existing handler is:

Risk of breaking existing users who rely on the manual load_onnx()/build_engine() pattern
The API differences are significant enough that conditional logic might make the code harder to follow

Proposed Refactoring Options

Option A: Version-Aware Single Handler (Most maintainable)

# tensorrt_inference.py
class TensorRTEngineHandlerNumPy:
    def __init__(self, ..., tensorrt_version=None, build_on_worker=False):
        if tensorrt_version is None:
            tensorrt_version = self._detect_tensorrt_version()

        if tensorrt_version >= 10:
            self._use_tensor_api = True
        else:
            self._use_tensor_api = False

        self._build_on_worker = build_on_worker

    def load_model(self):
        if self._build_on_worker and self.onnx_path:
            # Automatic ONNX → engine conversion
            return self._build_engine_from_onnx()
        else:
            # Legacy behavior
            return self._load_prebuilt_engine()

Option B: Base Class with Version Subclasses

# tensorrt_inference_base.py - Shared logic
# tensorrt_inference_v9.py - Pre-10.x (legacy buffer API)
# tensorrt_inference_v10.py - 10.x+ (Tensor API + auto-building)

Option C: Keep Separate, Fix Naming (Current + improvements)

Rename to tensorrt_inference_v10.py
Add deprecation notice to old handler encouraging 10.x adoption
Consolidate Docker infrastructure
Extract truly shared code (e.g., metrics, batching) to a utility module

My Recommendation

I'd lean toward Option A if the on-worker building feature is valuable enough to backport to the old handler, or Option C if we want to keep clean separation for stability.

Since the dev list suggested bringing this architectural discussion here, I'm very open to your expertise on what fits best with Beam's patterns. Question for you: Does Beam typically prefer version-aware handlers (like PyTorch's different model loaders) or separate handlers per major version? That would help me align with the project's patterns.

Summary

The architectural decision for separate files was driven by:

Solving a real production problem (environment compatibility via on-worker building)
Enabling a new workflow that wasn't production-ready in the old handler
Avoiding breaking changes to existing users

However, I completely agree with your concerns about:

❌ Naming inconsistency → Will fix
❌ Infrastructure duplication → Will consolidate
❌ Code duplication → Open to refactoring

I'm happy to implement whichever approach aligns best with Beam's architecture patterns. Let me know your preference and I'll refactor accordingly!

github-actions · 2025-10-24T12:16:50Z

Reminder, please take a look at this pr: @jrmccluskey

github-actions · 2025-10-29T12:16:35Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @shunping for label python.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions · 2025-11-05T12:16:40Z

Reminder, please take a look at this pr: @shunping

github-actions · 2025-11-10T12:16:36Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label python.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

damccorm · 2025-11-10T14:19:50Z

R: @jrmccluskey

github-actions · 2025-11-10T14:20:58Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

jrmccluskey · 2025-11-20T16:06:00Z

If the on-worker builds are that big of an improvement then I don't see why you wouldn't want to back-port them (which has the benefit of keeping things simple for end users, we would just need to document the new path or allow the old one if the params to the model handler aren't passed in the same way.)

Fix TensorRT compatibility with version 10.x

52fff47

github-actions bot added the python label Sep 29, 2025

Add ASF license header to TRT handler and fix filename (addresses apa…

9bd38ec

…che#36306)

SaiShashank12 force-pushed the fix-trt-compatibility branch from f9cf133 to 9bd38ec Compare September 30, 2025 06:18

SaiShashank12 added 3 commits October 4, 2025 16:06

Add TensorRT handler test files

ef288d7

Fix formatting in TensorRT handler test file

403c37b

Fix formatting in TensorRT handler implementation

80a546f

github-actions bot added the Next Action: Reviewers label Oct 5, 2025

Add TensorRT 10.x fix to CHANGES.md

b3a640e

SaiShashank12 added 6 commits October 11, 2025 19:15

Fix yapf formatting in TensorRT handler test file

d346dd9

Fix CHANGES.md to use issue link instead of PR link

c7858c6

Merge upstream/master and resolve CHANGES.md conflict

062cca9

Fix PythonLint issues: import sorting and mypy type errors

3201345

- Fix import ordering (numpy before threading) - Remove unused type: ignore on line 67 - Add type annotations for min/opt/max_shape tuples to fix mypy errors - Remove unused type: ignore on line 224

Fix import ordering for isort compliance

1f06128

- Move threading before typing imports - Separate each typing import on its own line - Place numpy after typing imports - Separate base imports onto individual lines

gemini-code-assist bot reviewed Oct 12, 2025

View reviewed changes

SaiShashank12 added 4 commits October 14, 2025 22:29

Add Apache license header to TensorRT 10.x README

5649199

- Fix RAT (Release Audit Tool) pre-commit check failure - Add proper HTML comment format license header to markdown file - Matches existing TensorRT container README format

Apply yapf and isort formatting to TensorRT files

b009ea4

- Fix code formatting with yapf - Fix import ordering with isort - Ensure compliance with pre-commit hooks