Fix TensorRT compatibility with version 10.x #36309
base: master
Conversation
Summary of Changes

Hello @SaiShashank12, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly upgrades Apache Beam's machine learning inference capabilities by introducing a dedicated handler for TensorRT 10.x. It modernizes the integration with TensorRT's latest Tensor API, resolving critical compatibility issues and enhancing the flexibility of deploying ONNX models with dynamic batching within Beam pipelines.

Highlights
Python SDK, ML Inference
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment
f9cf133 to 9bd38ec
assign set of reviewers
Assigning reviewers: R: @jrmccluskey for label python. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments).
Please fix the formatting, linting, and whitespace precommits.
- Fix ASF license header (add missing blank comment line)
- Fix 8 lines exceeding 80 character limit
- Simplify chained comparison
- Remove unused variable in profile shape check
- Remove unused imports from test file

All precommit checks now pass:
- yapf (formatter): No issues
- pylint (linter): 10.00/10 score
- whitespace: No trailing whitespace

- Fix import ordering (numpy before threading)
- Remove unused type: ignore on line 67
- Add type annotations for min/opt/max_shape tuples to fix mypy errors
- Remove unused type: ignore on line 224

- Move threading before typing imports
- Separate each typing import on its own line
- Place numpy after typing imports
- Separate base imports onto individual lines
/gemini review
Code Review
This pull request introduces a new TensorRTEngineHandlerNumPy to support TensorRT 10.x, which is a valuable addition. The implementation is well-structured, and the inclusion of tests and dependency updates is appreciated. My review focuses on enhancing the robustness and flexibility of the new handler. I've identified a significant limitation in the handling of multi-input models, which appears to be incomplete. Additionally, I've provided suggestions to improve error handling, make the ONNX building process more configurable, and clean up the test code.
```python
if len(self.input_names) == 1:
  shapes[self.input_names[0]] = list(batch.shape)
else:
  if not input_shapes:
    raise RuntimeError(
        f"Engine expects multiple inputs {self.input_names}; "
        "provide shapes via "
        "inference_args={'input_shapes': {name: shape, ...}}")
  for name in self.input_names:
    if name not in input_shapes:
      raise RuntimeError(f"Missing shape for input tensor '{name}'")
    shapes[name] = list(map(int, input_shapes[name]))

# Validate and set shapes
for name, shp in shapes.items():
  self._check_shape_in_profile(name, shp)
  self.context.set_input_shape(name, shp)

# Allocate first input device buffer (we copy only this from 'batch')
in_name = self.input_names[0]
in_dtype = self.dtypes[in_name]
in_nbytes = int(np.prod(shapes[in_name])) * in_dtype.itemsize
if self._device_ptrs.get(in_name) is None or self._in_nbytes != in_nbytes:
  if self._device_ptrs.get(in_name) is not None:
    self._free_ptr(self._device_ptrs[in_name])
  self._device_ptrs[in_name] = _assign_or_fail(cuda.cuMemAlloc(in_nbytes))
  self._in_nbytes = in_nbytes
```
The current implementation for multi-input models appears to be incomplete.

- While it derives shapes for multiple inputs from `inference_args['input_shapes']`, it only uses the `batch` data for the first input tensor (`self.input_names[0]`). There is no mechanism to provide data for the other input tensors. The `_trt10_inference_fn` also only copies data for the first input.
- The device buffer allocation in `ensure_buffers` only happens for the first input tensor. It should iterate over all `self.input_names` and allocate buffers for each.

A common pattern for multi-input models in Beam is to have PCollection elements as dictionaries mapping input names to numpy arrays. The ModelHandler and `run_inference` function would need to be adapted to handle this. For example, `ExampleT` could be `Dict[str, np.ndarray]`, and `run_inference` would receive a batch of such dictionaries.

This is a significant limitation for models with multiple inputs.
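A hedged sketch of that dictionary-keyed pattern (the `stack_multi_input_batch` helper and the `ids`/`mask` input names below are hypothetical illustrations, not part of this PR):

```python
from typing import Dict, Sequence

import numpy as np


def stack_multi_input_batch(
    batch: Sequence[Dict[str, np.ndarray]],
    input_names: Sequence[str]) -> Dict[str, np.ndarray]:
  """Stacks a batch of {input_name: array} examples into per-input arrays."""
  return {
      name: np.stack([example[name] for example in batch])
      for name in input_names
  }


# Example: a batch of four two-input elements.
batch = [{
    'ids': np.zeros((16,), np.int32), 'mask': np.ones((16,), np.int32)
} for _ in range(4)]
per_input = stack_multi_input_batch(batch, ['ids', 'mask'])
shapes = {name: list(arr.shape) for name, arr in per_input.items()}  # [4, 16]
```

Each per-input array (and its shape) could then be copied to its own device buffer instead of only the first input.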
```python
# Workbench: ~1GiB workspace (tune for your model/infra)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
```
The workspace size for the ONNX builder is hardcoded to 1 GiB. While the comment suggests tuning it, it would be more flexible for users if this could be configured through the `TensorRTEngineHandlerNumPy` constructor. Consider adding a parameter like `onnx_builder_config_args` to `__init__` that can be used to pass arguments to the TensorRT builder config, for example to set the memory pool limit.
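One possible shape for that suggestion, assuming a hypothetical `onnx_builder_config_args` dict with keys such as `workspace_size` and `fp16` (not an existing handler parameter):

```python
import tensorrt as trt


def _apply_builder_config_args(config, onnx_builder_config_args=None):
  """Applies optional user overrides to a TensorRT builder config."""
  args = dict(onnx_builder_config_args or {})
  # Keep the current 1 GiB default unless the caller overrides it.
  workspace = int(args.get('workspace_size', 1 << 30))
  config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace)
  if args.get('fp16'):
    config.set_flag(trt.BuilderFlag.FP16)
  return config
```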
```python
try:
  _assign_or_fail(cuda.cuMemFree(ptr))
except Exception:
  pass
```
The `except Exception: pass` is too broad and can hide important errors during memory deallocation. It's better to catch a more specific exception, such as the `RuntimeError` raised by `_assign_or_fail`. If the goal is to suppress errors during cleanup, consider logging them as warnings instead of silently ignoring them.
Suggested change:

```python
try:
  _assign_or_fail(cuda.cuMemFree(ptr))
except RuntimeError as e:
  LOGGER.warning('Failed to free CUDA memory pointer: %s', e)
```
```python
SINGLE_FEATURE_EXAMPLES = np.array([[1.0], [5.0], [-3.0], [10.0]],
                                   dtype=np.float32)

SINGLE_FEATURE_PREDICTIONS = [
    PredictionResult(
        SINGLE_FEATURE_EXAMPLES[i], [
            np.array([SINGLE_FEATURE_EXAMPLES[i][0] * 2.0 + 0.5],
                     dtype=np.float32)
        ]) for i in range(len(SINGLE_FEATURE_EXAMPLES))
]

TWO_FEATURES_EXAMPLES = np.array([[1, 5], [3, 10], [-14, 0], [0.5, 0.5]],
                                 dtype=np.float32)

TWO_FEATURES_PREDICTIONS = [
    PredictionResult(
        TWO_FEATURES_EXAMPLES[i],
        [
            np.array([
                TWO_FEATURES_EXAMPLES[i][0] * 2.0 +
                TWO_FEATURES_EXAMPLES[i][1] * 3 + 0.5
            ],
                     dtype=np.float32)
        ]) for i in range(len(TWO_FEATURES_EXAMPLES))
]
```
The test data variables `SINGLE_FEATURE_EXAMPLES`, `SINGLE_FEATURE_PREDICTIONS`, `TWO_FEATURES_EXAMPLES`, and `TWO_FEATURES_PREDICTIONS` are defined but not used in any of the tests. Consider removing them to improve code clarity. If they are intended for future tests, they could be moved to those tests or commented out with a TODO.
```python
    model_copies=3)
self.assertEqual(handler.model_copies(), 3)


@unittest.skipIf(GCSFileSystem is None, 'GCP dependencies are not installed')
```
The tests `test_inference_with_onnx_build_on_worker`, `test_engine_initialization` (line 264), and `test_pipeline_with_onnx_model` (line 292) are skipped if `GCSFileSystem` is not available. However, these tests use local temporary files and do not seem to depend on GCS. The `FileSystems.open` call in the implementation will use `LocalFileSystem` for local paths. The `skipIf` condition seems unnecessary and could be removed from these tests.
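A small illustration of that point, using a throwaway temp file rather than anything from the test code: a local path resolves to `LocalFileSystem`, so no GCS dependency is exercised.

```python
import tempfile

from apache_beam.io.filesystems import FileSystems

with tempfile.NamedTemporaryFile(suffix='.onnx', delete=False) as f:
  f.write(b'\x00')
  local_path = f.name

# FileSystems picks LocalFileSystem for a plain local path.
with FileSystems.open(local_path) as handle:
  data = handle.read()
```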
Hi @jrmccluskey,
I apologize if the initial submission wasn't quite up to standard—this is my first contribution to an Apache project, and I'm still getting familiar with the code style and precommit requirements. I really appreciate your patience and guidance!
- Register uses_tensorrt pytest marker in pytest.ini
- Add Docker container for TensorRT 10.x GPU testing
- Include README with container documentation
- Container uses TensorRT 25.01-py3 base image with CUDA 12.x
- Supports new TensorRTEngineHandlerNumPy Tensor API

- Fix RAT (Release Audit Tool) pre-commit check failure
- Add proper HTML comment format license header to markdown file
- Matches existing TensorRT container README format

- Demonstrates ONNX-to-engine on-worker conversion feature
- Includes comprehensive CLI with all handler parameters
- Shows key advantage over legacy TensorRT handler: eliminates environment alignment issues
- Supports both ONNX input (with build_on_worker) and pre-built engines
- Includes image preprocessing and top-5 prediction formatting

- Fix code formatting with yapf
- Fix import ordering with isort
- Ensure compliance with pre-commit hooks

- Adjust line break formatting to match yapf 0.43.0
- Ensures consistency with pre-commit CI checks

- Fix import ordering in trt_handler_numpy_compact_test.py (stdlib before third-party)
- Fix line length violations in tensorrt_resnet50_inference.py (lines 40, 151)
- All files now comply with pylint 80-character line limit

- Add blank lines after function definitions per yapf style
- Combine multi-line @unittest.skipIf decorators to single line
- All yapf 0.43.0 formatting now compliant
Remove unused GCSFileSystem import and add pylint disable comment for tensorrt import in availability check function.
jrmccluskey left a comment
I'm somewhat confused as to why this exists as a separate slate of files, tests, examples, etc. when the actual structure of the model handler is overwhelmingly the same as the existing TensorRT model handler. This seems like a pretty clear opportunity to either update the existing handler or, if preserving pre-10.x behavior is genuinely valuable, do a refactor to enable better code re-use across different major versions (especially if we'd anticipate having to produce a new TensorRT model handler for each version.) I'm also not sold on the file naming here, trt_handler_numpy_compact.py isn't exactly following our naming convention in the directory. What was the thinking behind this approach?
I have similar thoughts for the dockerfile and readme, I feel like those can live in the existing tensorrt_runinference directory.
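For context, one common way such a refactor is approached is a single handler that dispatches on the installed TensorRT major version; this is only an illustrative sketch with hypothetical helper names, not something proposed in the PR:

```python
import tensorrt as trt


def _uses_tensor_api() -> bool:
  """True when the installed TensorRT exposes the 10.x Tensor API."""
  return int(trt.__version__.split('.')[0]) >= 10


def _select_inference_fn(trt10_fn, legacy_fn):
  # Pick the execute_async_v3/num_io_tensors path on 10.x, otherwise the
  # execute_async_v2/num_bindings path; both share the surrounding handler.
  return trt10_fn if _uses_tensor_api() else legacy_fn
```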
Great questions @jrmccluskey! Let me explain the architectural reasoning behind the separate implementation and address your concerns about naming and infrastructure.

Note: I actually raised these same architectural concerns on the dev mailing list before implementing this approach - specifically about whether to update the existing handler or create separate files. The feedback I received was to bring this discussion to the PR review, so I'm glad you're raising these points here!

Why Separate Files? The Core Technical Difference

The separation isn't primarily about the API version change (v2 → v3), but rather a fundamental workflow improvement that solves a critical production issue with TensorRT.

The Environment Alignment Problem

TensorRT engine files must be compiled on the exact same hardware, CUDA version, and TensorRT version where inference will run. This is documented by NVIDIA as a hard requirement. Pre-building engines on a dev machine and deploying to Dataflow workers almost always causes compatibility failures.

How the Old Handler Works

```python
# tensorrt_inference.py - Manual workflow
handler = TensorRTEngineHandlerNumPy(...)
network, builder = handler.load_onnx()  # User must call this
engine = handler.build_engine(network, builder)  # Then call this
# Notably: load_onnx() is only used in test files, not production
```

The New Handler's Key Innovation

```python
# trt_handler_numpy_compact.py - Automatic on-worker building
handler = TensorRTEngineHandlerNumPy(
    onnx_path='gs://bucket/model.onnx',  # Portable ONNX
    build_on_worker=True  # Builds IN the Dataflow worker
)
engine = handler.load_model()  # Engine built automatically on correct hardware!
```

This eliminates an entire class of deployment failures:

Example Scenario This Solves

Addressing Your Specific Concerns

1. Naming Inconsistency (
Reminder, please take a look at this pr: @jrmccluskey
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @shunping for label python. Available commands:
Reminder, please take a look at this pr: @shunping
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @damccorm for label python. Available commands:
R: @jrmccluskey
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment
If the on-worker builds are that big of an improvement, then I don't see why you wouldn't want to back-port them (which has the benefit of keeping things simple for end users; we would just need to document the new path, or allow the old one if the params to the model handler aren't passed in the same way).
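A rough sketch of what that parameter-driven dispatch could look like inside a single handler (the function name and return values are illustrative only):

```python
def _select_load_path(engine_path=None, onnx_path=None, build_on_worker=False):
  """Picks the load strategy from the constructor arguments alone."""
  if onnx_path and build_on_worker:
    return 'build_engine_on_worker'  # new on-worker ONNX build flow
  if engine_path:
    return 'load_prebuilt_engine'  # existing pre-built-engine flow
  raise ValueError(
      'Provide engine_path, or onnx_path together with build_on_worker=True')
```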
Perfect — then you should explicitly reference it so the PR auto-closes it. Here’s the finalized PR description with your issue number plugged in:
Description
This PR introduces a new `TensorRTEngineHandlerNumPy` implementation that supports TensorRT 10.x with the Tensor API (`num_io_tensors`, `get_tensor_name`, `execute_async_v3`, etc.).

It resolves runtime errors such as:

- `AttributeError: 'ICudaEngine' object has no attribute 'num_bindings'`
- `RuntimeError: This handler requires TensorRT 10.x+. Missing attributes on ICudaEngine: [...]`

by replacing the legacy TensorRT < 10 API calls with the new Tensor API.

It also fixes shape validation during dynamic batching and ONNX parsing (`onnx_path` provided but `build_on_worker=False`) to allow building engines directly on Beam workers if desired.

Addresses: fixes #36306
Checklist
- Linked the issue (fixes #36306) so it will auto-close once merged.
- Added a TensorRT 10.x check (`_require_tensorrt_10`) before engine load.
- Replaced `num_bindings` and `execute_async_v2` with `num_io_tensors` and `execute_async_v3`.
- Updated `CHANGES.md` with a new entry under Python SDK → ML Inference: "Added TensorRT 10.x handler (TensorRTEngineHandlerNumPy)."
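For readers landing here, a minimal pipeline sketch of the workflow this description refers to; the import path, bucket path, and batch-size arguments are assumptions based on the discussion above rather than a confirmed API:

```python
import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import RunInference
# Assumed import location for the handler added in this PR:
from apache_beam.ml.inference.trt_handler_numpy_compact import (
    TensorRTEngineHandlerNumPy)

handler = TensorRTEngineHandlerNumPy(
    min_batch_size=1,
    max_batch_size=4,
    onnx_path='gs://my-bucket/model.onnx',  # placeholder path
    build_on_worker=True)  # engine is built on the worker's own GPU/TensorRT stack

with beam.Pipeline() as p:
  _ = (
      p
      | beam.Create([np.random.rand(1, 3, 224, 224).astype(np.float32)])
      | RunInference(handler)
      | beam.Map(print))
```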