Skip to content

Conversation

@SaiShashank12
Copy link

Perfect — then you should explicitly reference it so the PR auto-closes it. Here’s the finalized PR description with your issue number plugged in:


Description

This PR introduces a new TensorRTEngineHandlerNumPy implementation that supports TensorRT 10.x with the Tensor API (num_io_tensors, get_tensor_name, execute_async_v3, etc).
It resolves runtime errors such as:

  • AttributeError: 'ICudaEngine' object has no attribute 'num_bindings'
  • RuntimeError: This handler requires TensorRT 10.x+. Missing attributes on ICudaEngine: [...]

by replacing the legacy TensorRT < 10 API calls with the new Tensor API.

It also fixes shape validation during dynamic batching and ONNX parsing (onnx_path provided but build_on_worker=False) to allow building engines directly on Beam workers if desired.

Addresses: fixes #36306


Checklist

  • Mentioned the relevant issue (fixes #36306) so it will auto-close once merged.
  • Added code that checks for TensorRT 10.x compatibility (_require_tensorrt_10) before engine load.
  • Replaced num_bindings and execute_async_v2 with num_io_tensors and execute_async_v3.
  • Validated with dummy ResNet-50 ONNX pipeline on DirectRunner.
  • Validated with dummy ResNet-50 ONNX pipeline on DataRunner.
  • Update CHANGES.md with a new entry under Python SDK → ML Inference: “Added TensorRT 10.x handler (TensorRTEngineHandlerNumPy).”
  • Write down the test cases.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @SaiShashank12, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades Apache Beam's machine learning inference capabilities by introducing a dedicated handler for TensorRT 10.x. It modernizes the integration with TensorRT's latest Tensor API, resolving critical compatibility issues and enhancing the flexibility of deploying ONNX models with dynamic batching within Beam pipelines.

Highlights

  • TensorRT 10.x Compatibility: Introduces a new TensorRTEngineHandlerNumPy specifically designed to support TensorRT 10.x and its modern Tensor API.
  • API Modernization: Replaces deprecated TensorRT < 10 API calls (e.g., num_bindings, execute_async_v2) with the new Tensor API methods (e.g., num_io_tensors, get_tensor_name, execute_async_v3, set_input_shape, set_tensor_address).
  • Error Resolution: Fixes common runtime errors encountered with TensorRT 10.x, such as AttributeError: 'ICudaEngine' object has no attribute 'num_bindings' and RuntimeError: This handler requires TensorRT 10.x+.
  • Enhanced Dynamic Batching and ONNX Parsing: Improves shape validation during dynamic batching and addresses issues with ONNX parsing, enabling the building of TensorRT engines directly on Beam workers if desired.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@SaiShashank12
Copy link
Author

Python SDK

ML Inference

  • Added support for TensorRT 10.x with the new TensorRTEngineHandlerNumPy.
    This replaces legacy TensorRT < 10 APIs (e.g., num_bindings, execute_async_v2)
    with the new Tensor API (num_io_tensors, get_tensor_name, execute_async_v3).
    This resolves runtime errors when using TensorRT >= 10 in Beam pipelines.
    (See #36306)

@github-actions
Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@SaiShashank12
Copy link
Author

assign set of reviewers

@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2025

Assigning reviewers:

R: @jrmccluskey for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@jrmccluskey
Copy link
Contributor

Please fix the formatting, linting, and whitespace precommits.

- Fix ASF license header (add missing blank comment line)
- Fix 8 lines exceeding 80 character limit
- Simplify chained comparison
- Remove unused variable in profile shape check
- Remove unused imports from test file

All precommit checks now pass:
- yapf (formatter): No issues
- pylint (linter): 10.00/10 score
- whitespace: No trailing whitespace
- Fix import ordering (numpy before threading)
- Remove unused type: ignore on line 67
- Add type annotations for min/opt/max_shape tuples to fix mypy errors
- Remove unused type: ignore on line 224
- Move threading before typing imports
- Separate each typing import on its own line
- Place numpy after typing imports
- Separate base imports onto individual lines
@SaiShashank12
Copy link
Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new TensorRTEngineHandlerNumPy to support TensorRT 10.x, which is a valuable addition. The implementation is well-structured, and the inclusion of tests and dependency updates is appreciated. My review focuses on enhancing the robustness and flexibility of the new handler. I've identified a significant limitation in the handling of multi-input models, which appears to be incomplete. Additionally, I've provided suggestions to improve error handling, make the ONNX building process more configurable, and clean up the test code.

Comment on lines +317 to +343
if len(self.input_names) == 1:
shapes[self.input_names[0]] = list(batch.shape)
else:
if not input_shapes:
raise RuntimeError(
f"Engine expects multiple inputs {self.input_names}; "
"provide shapes via "
"inference_args={'input_shapes': {name: shape, ...}}")
for name in self.input_names:
if name not in input_shapes:
raise RuntimeError(f"Missing shape for input tensor '{name}'")
shapes[name] = list(map(int, input_shapes[name]))

# Validate and set shapes
for name, shp in shapes.items():
self._check_shape_in_profile(name, shp)
self.context.set_input_shape(name, shp)

# Allocate first input device buffer (we copy only this from 'batch')
in_name = self.input_names[0]
in_dtype = self.dtypes[in_name]
in_nbytes = int(np.prod(shapes[in_name])) * in_dtype.itemsize
if self._device_ptrs.get(in_name) is None or self._in_nbytes != in_nbytes:
if self._device_ptrs.get(in_name) is not None:
self._free_ptr(self._device_ptrs[in_name])
self._device_ptrs[in_name] = _assign_or_fail(cuda.cuMemAlloc(in_nbytes))
self._in_nbytes = in_nbytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation for multi-input models appears to be incomplete.

  1. While it derives shapes for multiple inputs from inference_args['input_shapes'], it only uses the batch data for the first input tensor (self.input_names[0]). There is no mechanism to provide data for the other input tensors. The _trt10_inference_fn also only copies data for the first input.
  2. The device buffer allocation in ensure_buffers only happens for the first input tensor. It should iterate over all self.input_names and allocate buffers for each.

A common pattern for multi-input models in Beam is to have PCollection elements as dictionaries mapping input names to numpy arrays. The ModelHandler and run_inference function would need to be adapted to handle this. For example, ExampleT could be Dict[str, np.ndarray], and run_inference would receive a batch of such dictionaries.

This is a significant limitation for models with multiple inputs.

Comment on lines +134 to +135
# Workbench: ~1GiB workspace (tune for your model/infra)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The workspace size for the ONNX builder is hardcoded to 1GB. While the comment suggests tuning it, it would be more flexible for users if this could be configured through the TensorRTEngineHandlerNumPy constructor. Consider adding a parameter like onnx_builder_config_args to __init__ that can be used to pass arguments to the TensorRT builder config, for example to set the memory pool limit.

Comment on lines +266 to +269
try:
_assign_or_fail(cuda.cuMemFree(ptr))
except Exception:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The except Exception: pass is too broad and can hide important errors during memory deallocation. It's better to catch a more specific exception, like RuntimeError which is raised by _assign_or_fail. If the goal is to suppress errors during cleanup, consider logging them as warnings instead of silently ignoring them.

Suggested change
try:
_assign_or_fail(cuda.cuMemFree(ptr))
except Exception:
pass
try:
_assign_or_fail(cuda.cuMemFree(ptr))
except RuntimeError as e:
LOGGER.warning('Failed to free CUDA memory pointer: %s', e)

Comment on lines +48 to +72
SINGLE_FEATURE_EXAMPLES = np.array([[1.0], [5.0], [-3.0], [10.0]],
dtype=np.float32)

SINGLE_FEATURE_PREDICTIONS = [
PredictionResult(
SINGLE_FEATURE_EXAMPLES[i], [
np.array([SINGLE_FEATURE_EXAMPLES[i][0] * 2.0 + 0.5],
dtype=np.float32)
]) for i in range(len(SINGLE_FEATURE_EXAMPLES))
]

TWO_FEATURES_EXAMPLES = np.array([[1, 5], [3, 10], [-14, 0], [0.5, 0.5]],
dtype=np.float32)

TWO_FEATURES_PREDICTIONS = [
PredictionResult(
TWO_FEATURES_EXAMPLES[i],
[
np.array([
TWO_FEATURES_EXAMPLES[i][0] * 2.0 +
TWO_FEATURES_EXAMPLES[i][1] * 3 + 0.5
],
dtype=np.float32)
]) for i in range(len(TWO_FEATURES_EXAMPLES))
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test data variables SINGLE_FEATURE_EXAMPLES, SINGLE_FEATURE_PREDICTIONS, TWO_FEATURES_EXAMPLES, and TWO_FEATURES_PREDICTIONS are defined but not used in any of the tests. Consider removing them to improve code clarity. If they are intended for future tests, they could be moved to those tests or commented out with a TODO.

model_copies=3)
self.assertEqual(handler.model_copies(), 3)

@unittest.skipIf(GCSFileSystem is None, 'GCP dependencies are not installed')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The tests test_inference_with_onnx_build_on_worker, test_engine_initialization (line 264), and test_pipeline_with_onnx_model (line 292) are skipped if GCSFileSystem is not available. However, these tests use local temporary files and do not seem to depend on GCS. The FileSystems.open call in the implementation will use LocalFileSystem for local paths. The skipIf condition seems unnecessary and could be removed from these tests.

@SaiShashank12
Copy link
Author

Hi @jrmccluskey,
Thanks for the feedback! I've addressed all the formatting, linting, and whitespace issues in my latest commits. The precommit checks should now be passing:

  • Fixed ASF license header formatting
  • Corrected lines exceeding 80 characters
  • Fixed import ordering and separated imports properly
  • Added proper type annotations for mypy
  • Cleaned up unused imports and variables

I apologize if the initial submission wasn't quite up to standard—this is my first contribution to an Apache project, and I'm still getting familiar with the code style and precommit requirements. I really appreciate your patience and guidance!
Let me know if there's anything else that needs adjustment.

- Register uses_tensorrt pytest marker in pytest.ini
- Add Docker container for TensorRT 10.x GPU testing
- Include README with container documentation
- Container uses TensorRT 25.01-py3 base image with CUDA 12.x
- Supports new TensorRTEngineHandlerNumPy Tensor API
- Fix RAT (Release Audit Tool) pre-commit check failure
- Add proper HTML comment format license header to markdown file
- Matches existing TensorRT container README format
- Demonstrates ONNX-to-engine on-worker conversion feature
- Includes comprehensive CLI with all handler parameters
- Shows key advantage over legacy TensorRT handler: eliminates environment alignment issues
- Supports both ONNX input (with build_on_worker) and pre-built engines
- Includes image preprocessing and top-5 prediction formatting
- Fix code formatting with yapf
- Fix import ordering with isort
- Ensure compliance with pre-commit hooks
- Adjust line break formatting to match yapf 0.43.0
- Ensures consistency with pre-commit CI checks
- Fix import ordering in trt_handler_numpy_compact_test.py (stdlib before third-party)
- Fix line length violations in tensorrt_resnet50_inference.py (lines 40, 151)
- All files now comply with pylint 80-character line limit
- Add blank lines after function definitions per yapf style
- Combine multi-line @unittest.skipIf decorators to single line
- All yapf 0.43.0 formatting now compliant
Remove unused GCSFileSystem import and add pylint disable comment
for tensorrt import in availability check function.
Copy link
Contributor

@jrmccluskey jrmccluskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat confused as to why this exists as a separate slate of files, tests, examples, etc. when the actual structure of the model handler is overwhelmingly the same as the existing TensorRT model handler. This seems like a pretty clear opportunity to either update the existing handler or, if preserving pre-10.x behavior is genuinely valuable, do a refactor to enable better code re-use across different major versions (especially if we'd anticipate having to produce a new TensorRT model handler for each version.) I'm also not sold on the file naming here, trt_handler_numpy_compact.py isn't exactly following our naming convention in the directory. What was the thinking behind this approach?

I have similar thoughts for the dockerfile and readme, I feel like those can live in the existing tensorrt_runinference directory.

@SaiShashank12
Copy link
Author

Great questions @jrmccluskey ! Let me explain the architectural reasoning behind the separate implementation and address your concerns about naming and infrastructure.

Note: I actually raised these same architectural concerns on the dev mailing list before implementing this approach - specifically about whether to update the existing handler or create separate files. The feedback I received was to bring this discussion to the PR review, so I'm glad you're raising these points here!

Why Separate Files? The Core Technical Difference

The separation isn't primarily about the API version change (v2 → v3), but rather a fundamental workflow improvement that solves a critical production issue with TensorRT.

The Environment Alignment Problem

TensorRT engine files must be compiled on the exact same hardware, CUDA version, and TensorRT version where inference will run. This is documented by NVIDIA as a hard requirement. Pre-building engines on a dev machine and deploying to Dataflow workers almost always causes compatibility failures.

How the Old Handler Works

# tensorrt_inference.py - Manual workflow
handler = TensorRTEngineHandlerNumPy(...)
network, builder = handler.load_onnx()  # User must call this
engine = handler.build_engine(network, builder)  # Then call this
# Notably: load_onnx() is only used in test files, not production

The New Handler's Key Innovation

# trt_handler_numpy_compact.py - Automatic on-worker building
handler = TensorRTEngineHandlerNumPy(
    onnx_path='gs://bucket/model.onnx',  # Portable ONNX
    build_on_worker=True  # Builds IN the Dataflow worker
)
engine = handler.load_model()  # Engine built automatically on correct hardware!

This eliminates an entire class of deployment failures:

  • ✅ ONNX models are fully portable across environments
  • ✅ Engine built with the exact GPU/CUDA/TensorRT version used for inference
  • ✅ No "works on my machine" issues when deploying to Dataflow
  • ✅ Production-ready workflow (vs test-only load_onnx())

Example Scenario This Solves

Developer machine:          Dataflow worker:
- GPU: RTX 4090            - GPU: Tesla T4
- CUDA: 12.8              - CUDA: 12.6
- TensorRT: 10.7          - TensorRT: 10.5

Pre-built engine → FAILS with cryptic errors ❌
ONNX + on-worker build → Works perfectly ✅

Addressing Your Specific Concerns

1. Naming Inconsistency (trt_handler_numpy_compact.py)

You're absolutely right - this doesn't follow the naming convention. I'm happy to rename to maintain consistency:

  • Option A: tensorrt_inference_v10.py (clear version indicator)
  • Option B: tensorrt_inference_auto.py (highlights auto-building)
  • Preferred: tensorrt_inference_v10.py to match tensorrt_inference.py

2. Docker & README Duplication

Agreed - these can absolutely be consolidated into the existing tensorrt_runinference/ directory. I can restructure as:

tensorrt_runinference/
├── README.md (updated with both versions)
├── legacy/
│   └── tensorrt_8x.dockerfile
└── v10/
    └── tensorrt_10x.dockerfile

3. Code Reuse & Future Scalability

There's definitely shared logic (~60-70%) that could be refactored. This was actually one of my key concerns when I reached out to the dev list - whether to prioritize clean separation or code reuse. My concern with merging into the existing handler is:

  • Risk of breaking existing users who rely on the manual load_onnx()/build_engine() pattern
  • The API differences are significant enough that conditional logic might make the code harder to follow

Proposed Refactoring Options

Option A: Version-Aware Single Handler (Most maintainable)

# tensorrt_inference.py
class TensorRTEngineHandlerNumPy:
    def __init__(self, ..., tensorrt_version=None, build_on_worker=False):
        if tensorrt_version is None:
            tensorrt_version = self._detect_tensorrt_version()

        if tensorrt_version >= 10:
            self._use_tensor_api = True
        else:
            self._use_tensor_api = False

        self._build_on_worker = build_on_worker

    def load_model(self):
        if self._build_on_worker and self.onnx_path:
            # Automatic ONNX → engine conversion
            return self._build_engine_from_onnx()
        else:
            # Legacy behavior
            return self._load_prebuilt_engine()

Option B: Base Class with Version Subclasses

# tensorrt_inference_base.py - Shared logic
# tensorrt_inference_v9.py - Pre-10.x (legacy buffer API)
# tensorrt_inference_v10.py - 10.x+ (Tensor API + auto-building)

Option C: Keep Separate, Fix Naming (Current + improvements)

  • Rename to tensorrt_inference_v10.py
  • Add deprecation notice to old handler encouraging 10.x adoption
  • Consolidate Docker infrastructure
  • Extract truly shared code (e.g., metrics, batching) to a utility module

My Recommendation

I'd lean toward Option A if the on-worker building feature is valuable enough to backport to the old handler, or Option C if we want to keep clean separation for stability.

Since the dev list suggested bringing this architectural discussion here, I'm very open to your expertise on what fits best with Beam's patterns. Question for you: Does Beam typically prefer version-aware handlers (like PyTorch's different model loaders) or separate handlers per major version? That would help me align with the project's patterns.

Summary

The architectural decision for separate files was driven by:

  1. Solving a real production problem (environment compatibility via on-worker building)
  2. Enabling a new workflow that wasn't production-ready in the old handler
  3. Avoiding breaking changes to existing users

However, I completely agree with your concerns about:

  • ❌ Naming inconsistency → Will fix
  • ❌ Infrastructure duplication → Will consolidate
  • ❌ Code duplication → Open to refactoring

I'm happy to implement whichever approach aligns best with Beam's architecture patterns. Let me know your preference and I'll refactor accordingly!

@github-actions
Copy link
Contributor

Reminder, please take a look at this pr: @jrmccluskey

@github-actions
Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @shunping for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@github-actions
Copy link
Contributor

github-actions bot commented Nov 5, 2025

Reminder, please take a look at this pr: @shunping

@github-actions
Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@damccorm
Copy link
Contributor

R: @jrmccluskey

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@jrmccluskey
Copy link
Contributor

If the on-worker builds are that big of an improvement then I don't see why you wouldn't want to back-port them (which has the benefit of keeping things simple for end users, we would just need to document the new path or allow the old one if the params to the model handler aren't passed in the same way.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: TensorRT 10.x compatibility issues in TensorRTEngineHandlerNumPy

3 participants