[6241485] Add support for ONNX Q/DQ node placement for DLA#1661
[6241485] Add support for ONNX Q/DQ node placement for DLA#1661gcunhase wants to merge 3 commits into
Conversation
📝 WalkthroughWalkthroughAdds a ChangesDLA Targeting Feature
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
modelopt/onnx/quantization/int8.py (3)
188-189: 💤 Low valueUpdate condition comment for consistency.
Similar to the GEMV exclusion above, the conv exclusion is now skipped for both
autotuneandtarget_dla. Consider adding a brief comment explaining why both conditions skip this exclusion, for maintainability.📝 Suggested comment
if not (autotune or kwargs.get("target_dla", False)): + # Skip conv exclusion for autotune (runtime-driven) and target_dla (comprehensive Q/DQ coverage) nodes_to_exclude.extend(find_nodes_from_convs_to_exclude(graph, quantize_mode="int8"))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/quantization/int8.py` around lines 188 - 189, The conv-exclusion condition in int8.py currently reads `if not (autotune or kwargs.get("target_dla", False)):` and lacks the explanatory comment present for the GEMV exclusion; add a short comment immediately above this condition explaining that conv nodes are only excluded when neither autotune nor target_dla are active (because autotune/target DLA require keeping convs for their tuning/DLA support), referencing the `find_nodes_from_convs_to_exclude` call and `nodes_to_exclude` list so future readers understand why the exclusion is skipped in both modes.
168-174: 💤 Low valueUpdate comment to mention target_dla.
The comment at line 172 states "this check will be skipped if Autotune is enabled," but after this change, the check is also skipped when
target_dla=True. Please update the comment to reflect both conditions.📝 Suggested comment update
- # Note that this check will be skipped if Autotune is enabled as Q/DQ node placements - # will be decided according to TensorRT's runtime measurements. + # Note that this check will be skipped if Autotune or target_dla is enabled. + # Autotune decides Q/DQ placements via runtime measurements; target_dla requires + # comprehensive Q/DQ coverage for DLA scale requirements.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/quantization/int8.py` around lines 168 - 174, The comment above the conditional that checks enable_gemv_detection_for_trt (the block starting with if enable_gemv_detection_for_trt and not (autotune or kwargs.get("target_dla", False))) is out of date: update the comment to state that the GEMV/TensorCore check is skipped when either Autotune is enabled or when target_dla=True. Locate the block that logs "Detecting GEMV patterns for TRT optimization" and modify the explanatory comment to mention both conditions (autotune and kwargs.get("target_dla", False)) as reasons the check will be skipped.
155-156: ⚡ Quick winConsider documenting the override behavior.
When
target_dla=True, this code replaces any user-providedop_types_to_quantizelist with all node types from the graph. While this is likely intentional for comprehensive DLA scale coverage, the override happens silently and could surprise users who provided a custom list.Consider adding a log message or updating the docstring to make this behavior explicit.
💡 Suggested improvement
if kwargs.get("target_dla", False): + logger.info("target_dla=True: quantizing all op types for DLA scale coverage") op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/quantization/int8.py` around lines 155 - 156, The code silently overrides a user-supplied op_types_to_quantize when kwargs.get("target_dla", False) is true by setting op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node}); update the function handling these kwargs (the block that reads target_dla and sets op_types_to_quantize) to either (a) emit a clear log message (e.g., logger.info/warn) stating that target_dla=True causes op_types_to_quantize to be replaced with all graph node types, or (b) update the function docstring to explicitly document this override behavior; implement the logging approach so callers see the override at runtime and keep the existing assignment to list({node.op_type for node in onnx_model.graph.node}) when target_dla is true.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@modelopt/onnx/quantization/int8.py`:
- Around line 188-189: The conv-exclusion condition in int8.py currently reads
`if not (autotune or kwargs.get("target_dla", False)):` and lacks the
explanatory comment present for the GEMV exclusion; add a short comment
immediately above this condition explaining that conv nodes are only excluded
when neither autotune nor target_dla are active (because autotune/target DLA
require keeping convs for their tuning/DLA support), referencing the
`find_nodes_from_convs_to_exclude` call and `nodes_to_exclude` list so future
readers understand why the exclusion is skipped in both modes.
- Around line 168-174: The comment above the conditional that checks
enable_gemv_detection_for_trt (the block starting with if
enable_gemv_detection_for_trt and not (autotune or kwargs.get("target_dla",
False))) is out of date: update the comment to state that the GEMV/TensorCore
check is skipped when either Autotune is enabled or when target_dla=True. Locate
the block that logs "Detecting GEMV patterns for TRT optimization" and modify
the explanatory comment to mention both conditions (autotune and
kwargs.get("target_dla", False)) as reasons the check will be skipped.
- Around line 155-156: The code silently overrides a user-supplied
op_types_to_quantize when kwargs.get("target_dla", False) is true by setting
op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node});
update the function handling these kwargs (the block that reads target_dla and
sets op_types_to_quantize) to either (a) emit a clear log message (e.g.,
logger.info/warn) stating that target_dla=True causes op_types_to_quantize to be
replaced with all graph node types, or (b) update the function docstring to
explicitly document this override behavior; implement the logging approach so
callers see the override at runtime and keep the existing assignment to
list({node.op_type for node in onnx_model.graph.node}) when target_dla is true.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 72715d5e-ad7b-4b45-afdc-fb96b39940a5
📒 Files selected for processing (4)
CHANGELOG.rstmodelopt/onnx/quantization/__main__.pymodelopt/onnx/quantization/int8.pymodelopt/onnx/quantization/quantize.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1661 +/- ##
==========================================
+ Coverage 67.73% 76.53% +8.80%
==========================================
Files 511 511
Lines 56169 56174 +5
==========================================
+ Hits 38044 42993 +4949
+ Misses 18125 13181 -4944
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
/claude review |
b04c9e2 to
3086897
Compare
There was a problem hiding this comment.
Claude review — --target_dla for ONNX Q/DQ placement.
Findings: CRITICAL: 0, IMPORTANT: 2, SUGGESTION: 3.
Most impactful:
- [IMPORTANT Compatibility] in
quantize.py:628—find_nodes_from_mha_to_excludeis skipped solely ontarget_dla, regardless ofquantize_mode. The docstring and CLI help both promise "INT8 only", so combining--target_dlawith--quantize_mode=fp8silently changes FP8 MHA Q/DQ placement (head_size/fMHA-v2 exclusions are bypassed). Either gate the skip onint8or document the cross-mode effect. - [IMPORTANT Compatibility] in
int8.py:155— whentarget_dla=True, any user-suppliedop_types_to_quantizeis silently overwritten with everyop_typein the graph, including non-quantizable types (Reshape,Identity,Constant,Cast, …). Suggest logging a warning on override and intersecting with the quantizable set. - Three suggestions: (1)
kwargs[\"target_dla\"]propagation is asymmetric vs. the int4 path, (2) test models use unseeded RNG, (3) test only covers conv-exclusion skip — GEMV/MHA skip branches are uncovered.
Risk: low-medium. Feature is opt-in and backward-compatible by default. The two IMPORTANT items are about scope creep and silent overrides under the new flag, both fixable with small guards/log messages.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_rules_int8.py (1)
288-315: 💤 Low valueConsider clarifying the comment on line 313.
The comment states "only the 1st Conv is quantized" but the test only checks that the first Conv is quantized, without explicitly verifying that the other Conv nodes (dw_conv1, dw_conv2, dw_conv3) are not quantized. If other Convs can also be quantized when
target_dla=False, consider updating the comment to be more precise (e.g., "Check that at least the 1st Conv is quantized"). If only the first Conv should be quantized, consider adding assertions to verify the other Convs are not quantized.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/onnx/quantization/test_qdq_rules_int8.py` around lines 288 - 315, The comment "only the 1st Conv is quantized" is ambiguous because the test only asserts the first Conv in conv_nodes is quantized; either update that comment to "Check that at least the 1st Conv is quantized" or add explicit negative assertions to ensure the remaining Conv nodes are not quantized (e.g., use assert_nodes_are_not_quantized on conv_nodes[1:] or by referencing the specific depthwise conv nodes if available) inside test_target_dla so the test clearly enforces the intended behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/onnx/quantization/test_qdq_rules_int8.py`:
- Around line 288-315: The comment "only the 1st Conv is quantized" is ambiguous
because the test only asserts the first Conv in conv_nodes is quantized; either
update that comment to "Check that at least the 1st Conv is quantized" or add
explicit negative assertions to ensure the remaining Conv nodes are not
quantized (e.g., use assert_nodes_are_not_quantized on conv_nodes[1:] or by
referencing the specific depthwise conv nodes if available) inside
test_target_dla so the test clearly enforces the intended behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ea1160ae-9324-42a2-9710-41e3faa9442f
📒 Files selected for processing (6)
CHANGELOG.rstmodelopt/onnx/quantization/__main__.pymodelopt/onnx/quantization/int8.pymodelopt/onnx/quantization/quantize.pytests/_test_utils/onnx/lib_test_models.pytests/unit/onnx/quantization/test_qdq_rules_int8.py
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (3)
- modelopt/onnx/quantization/main.py
- modelopt/onnx/quantization/int8.py
- modelopt/onnx/quantization/quantize.py
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
b1a924e to
9460382
Compare
All resolved. |
|
/claude review |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
modelopt/onnx/quantization/quantize.py (1)
673-673:target_dlaforwarding into FP8 is redundant (FP8 ignores it).
modelopt/onnx/quantization/fp8.py::quantize()accepts**kwargsbut only readsenable_gemv_detection_for_trt,op_types_needing_output_quant, andno_quantize_inputs; it does not referencetarget_dla. So the docstring (“only has effect in INT8 quantization”) matches behavior. Optional: gatekwargs["target_dla"]to onlyquantize_mode == "int8"for clarity.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/quantization/quantize.py` at line 673, The assignment of kwargs["target_dla"] is redundant for FP8 paths because fp8.py::quantize() does not use target_dla; restrict forwarding by only setting kwargs["target_dla"] when quantize_mode == "int8" (or the INT8-specific code path) so INT8 quantization receives it but FP8 does not; update the code around the kwargs population in quantize.py to gate the target_dla assignment on quantize_mode == "int8".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@modelopt/onnx/quantization/quantize.py`:
- Line 673: The assignment of kwargs["target_dla"] is redundant for FP8 paths
because fp8.py::quantize() does not use target_dla; restrict forwarding by only
setting kwargs["target_dla"] when quantize_mode == "int8" (or the INT8-specific
code path) so INT8 quantization receives it but FP8 does not; update the code
around the kwargs population in quantize.py to gate the target_dla assignment on
quantize_mode == "int8".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 80acee62-eda4-432e-a9ee-7244d22aa3d3
📒 Files selected for processing (6)
CHANGELOG.rstmodelopt/onnx/quantization/__main__.pymodelopt/onnx/quantization/int8.pymodelopt/onnx/quantization/quantize.pytests/_test_utils/onnx/lib_test_models.pytests/unit/onnx/quantization/test_qdq_rules_int8.py
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (3)
- modelopt/onnx/quantization/main.py
- modelopt/onnx/quantization/int8.py
- tests/_test_utils/onnx/lib_test_models.py
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Adds a --target_dla option to the ONNX quantization path that places Q/DQ nodes on all tensors for optimal DLA INT8 deployment. The change is small (+301/-12), cohesive, and backward compatible: target_dla defaults to False, preserving existing behavior, and the three guarded branches (GEMV/MatMul exclusion in int8.py, conv-exclusion, and MHA exclusion in quantize.py) are only bypassed when target_dla=True and quantize_mode == "int8".
Correctness checks:
target_dlais plumbed throughquantize.py→quantize_int8via kwargs and read withkwargs.get("target_dla", False). It's also injected into the fp8 path's kwargs, butfp8.quantizeaccepts**kwargsand ignores it, and the docstring/help correctly state the flag only affects INT8 — so no functional impact there.- When
target_dla=Trueand noop_types_to_quantizeis supplied, the code sets it to the full set of model op types. This correctly interacts withconfigure_ort, which only removes non-quantizable ops (Relu, Sigmoid, etc.) from the QDQ registry if they're NOT inop_types_to_quantize— so including all op types is consistent with DLA's "quantize everything" intent.
Tests: Two new parametrized tests (target_dla=False/True) cover both the Conv/Mul quantization expansion and the GEMV (MatMul m=1) exclusion bypass, with dedicated model builders. Assertions are meaningful and verify the QDQ placement differences between modes.
Docstring, --target_dla CLI help, and CHANGELOG are all updated. No licensing files or headers touched. No prompt-injection attempts in the PR content.
|
Could you check what is the accuracy impact if we enable quantizers for all layers? |
What does this PR do?
Type of change: New feature
On DLA, the whole DLA-eligible region is compiled as one node, which runs in INT8 or FP16, and it expects scales to be present throughout. A tensor without a usable scale typically forces either that region to run in FP16 or a GPU fallback (if enabled) — otherwise the build fails.
With IQ (implicit quantization) being deprecated in TensorRT, users are migrating to ModelOpt for quantization/calibration. However, this breaks the DLA workflow since DLA still only supports IQ. The suggested workflow is then to:
calib.cacheandlayer_arg.txtfiles, which can be used with the non-quantized model to generate a DLA loadable.A study on Yolov5 has shown that EQ can achieve perf parity with IQ on DLA if Q/DQ nodes are inserted at every layer, making sure all tensors have INT8 scales. From the study: "With this option, all layers’ scales can be obtained during model fine-tuning. However, this method may potentially disrupt TensorRT fusion strategy with Q/DQ layers when running inference on GPU and lead to higher latency on the GPU. For DLA, on the other hand, the rule of thumb with PTQ scales is, “The more available scales, the lower the latency.” "
This PR aims to enable a quantization path targeting DLA.
Usage
Testing
Before your PR is "Ready for review"
CONTRIBUTING.md: N/AAdditional info
Related blogpost: https://developer.nvidia.com/blog/deploying-yolov5-on-nvidia-jetson-orin-with-cudla-quantization-aware-training-to-inference/#adding_qdq_nodes
Summary by CodeRabbit
New Features
Behavior Changes
Tests
Documentation